Fix out of sync subtitles with Python!

This is an update of my old post from a couple of years ago.

[Edit – October 2015]: I created an in-browser version of subslider, called subslider.js. Just visit this page and follow the instructions, or read this blog post if you want to know more about it!

After using that script quite a few times, and loving it, I decided to give it a facelift and add the one feature that I’ve been wishing it had for all this time: the option to just tell it the timestamp of the first dialog without performing any math 🙂

Yeah, it’s simple math, but having to use base 60 means more brain CPU time wasted (and above all, it means more time separating me from my movie!).

I also moved the code to GitHub, so you can find it here: SubSlider. And this is the direct link to the python script, for the impatient.

The old way of specifying offsets using +/- has been replaced my a more argparse-standard system of flags. Also, the new feature I mentioned above can be used by running the script like this:

python subslider.py -s 1:23,450 MySubFile.srt

assuming your subtitles file is called MySubFile.srt and assuming that the first dialog in the movie takes place at 1:23,450. This time, there’s an “interactive” dialog that asks you to choose the first line among the first 10 lines in the .srt file. I added it because sometimes you get the equivalent of opening titles in the .srt, and that doesn’t help when you’re synchronizing.

If you want to get a different number of lines, you can simply change the LINES_TO_SHOW variable at line 43 to whatever number you prefer.

As always, feel free to contribute 🙂

Advertisements

Fix subtitles offset with python!

[UPDATE – May 25, 2014] I revamped this script, moved it to GitHub, and wrote a new post about it!
[UPDATE – May 19, 2013] Script updated to support Python 3!

One of the most common problems with subtitle files, especially with TV series subtitles, is that they often start all too late because you have a version of the video file containing opening titles (or ‘previously on MyFavoriteSeries’ sequences) and the subtitles don’t account for them, or the other way around.

Of course, once you’ve fixed this offset the subtitles are fine, as the movie is played at the same rate in all versions.

My beloved XBMC has a function to sync subtitles, but it’s more of a fine-tuning thing, you can’t specify a very large offset (last time I checked) and it takes some time to actually reload the subtitles and show you the results.

I developed a small script in python to do just that, as I thought that it would have been quicker to write it than to look for it (and it was… at least the quick&dirty version :D). To use it, just open the subtitles with any text editor you like, look for the first dialog and take note of when that dialog takes place in the movie: your offset is the difference between the time in the movie and the one you found in the file. So if the .srt file states that Renly Baratheon says “Do you swear it?” at 00:02:08,883 but in the .avi file it’s actually at roughly 00:03:43,500, your offset is 3:43,5 - 2:08,883 = 94,617 = 1:34,617. Then, you run the script calling

python subslider.py MySubs.srt offset

and your new subs are in MySubs_offset.srt. That’s it!

You can specify positive offsets –like e.g. +15— for when subtitles should be delayed, or negative offsets –like e.g. -30— in case it’s the movie that should be delayed (and subs anticipated).

Offsets can be specified both with decimal notation (as in +94,617, subs delayed by 94.617 seconds) and with time notation (as in -5:07,324, video delayed by 5 minutes 7 seconds 324 milliseconds). Time notation follows the one used in .srt files, so you get a comma as decimal separator.

Here it is, you can save it to a file named subslider.py and run it with python 2.7 ([Update – May 19, 2013] or python 3!).

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# SubSlider - a simple script to apply offsets to subtitles
#
# Copyright May 2nd 2012 - MB <https://somethingididnotknow.wordpress.com>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>
from __future__ import print_function
from datetime import timedelta, datetime
import os
import re
import sys

class SubSlider:
    """A simple script to apply offsets to subtitles.

    Subtitles can be delayed by specifying a positive offset (e.g. +12 or simply 12), or video can be delayed by specifying a negative offset (e.g. -12)"""

    def __init__(self, argv):
        if len(argv) < 2:
            self.usage()
        else:
            self.first_valid = 0
            self.parse_args(argv)
            self.parse_subs()
            self.fix_file()
            os.remove(self.output_temp)
            print('Success! Offset subs have been written to %s' % os.path.abspath(self.output_subs))

    def usage(self):
        print("""usage: subslider.py [-h] subs_file offset

Applies an offset to a subtitles file

positional arguments:
  subs_file             The input subtitles file, the one to which the offset
                        is to be applied
  offset                The offset to be applied to the input subtitles file.
                        Format is [+/-][MM:]SS[,sss] like +1:23,456 (new subs
                        will be displayed with a delay of 1 minute, 23 seconds,
                        456 milliseconds) or -100 (subs 100
                        seconds earlier) or +12,43 (subs delayed of 12 seconds
                        43 milliseconds)""")
        sys.exit(1)


    def parse_args(self, args):
        error = None
        if not os.path.isfile(args[0]):
            print('%s does not exist' % args[0])
            error = True
        else:
            self.input_subs = args[0]
            self.output_subs = '%s_offset.srt' % os.path.splitext(self.input_subs)[0]
            self.output_temp = '%s_temp.srt' % os.path.splitext(self.input_subs)[0]
        offset_ok = re.match('[\+\-]?(\d{1,2}\:)?\d+(\,\d{1,3})?$', args[1])
        if not offset_ok:
            print('%s is not a valid offset, format is [+/-][MM:]SS[,sss], see help dialog for some examples' % args[1])
            error = True
        else:
            offset = re.search('([\+\-])?((\d{1,2})\:)?(\d+)(\,(\d{1,3}))?', args[1])
            self.direction, self.minutes, self.seconds, self.millis = (offset.group(1), offset.group(3), offset.group(4), offset.group(6))
        if error:
            self.usage()

    def parse_subs(self):
        with open(self.input_subs, 'r') as input:
            with open(self.output_temp, 'w') as output:
                nsafe = lambda s: int(s) if s else 0 
                block = 0
                date_zero = datetime.strptime('00/1/1','%y/%m/%d')
                for line in input:
                    parsed = re.search('(\d{2}:\d{2}:\d{2},\d{3}) \-\-> (\d{2}:\d{2}:\d{2},\d{3})', line)
                    if parsed:
                        block += 1
                        start, end = (self.parse_time(parsed.group(1)), self.parse_time(parsed.group(2)))
                        offset = timedelta(minutes=nsafe(self.minutes), seconds=nsafe(self.seconds), microseconds=nsafe(self.millis) * 1000)
                        if '-' == self.direction:
                            start -= offset
                            end -= offset
                        else:
                            start += offset
                            end += offset
                        offset_start, offset_end = (self.format_time(start), self.format_time(end))
                        if not self.first_valid:
                            if end > date_zero:
                                self.first_valid = block
                                if start < date_zero:
                                    offset_start = '00:00:00,000'
                        output.write('%s --> %s\n' % (offset_start, offset_end))
                    else:
                        output.write(line)

    def fix_file(self):
        with open(self.output_temp, 'r') as input:
            with open(self.output_subs, 'w') as output:
                start_output = False
                for line in input:
                    if re.match('\d+$', line.strip()):
                        block_num = int(line.strip())
                        if block_num >= self.first_valid:
                            if not start_output:
                                start_output = True
                            output.write('%d\r\n' % (block_num - self.first_valid + 1))
                    elif start_output:
                        output.write(line)

    def format_time(self, value):
        formatted = datetime.strftime(value, '%H:%M:%S,%f')
        return formatted[:-3]

    def parse_time(self, time):
        parsed = datetime.strptime(time, '%H:%M:%S,%f')
        return parsed.replace(year=2000)

if __name__ == '__main__':
    SubSlider(sys.argv[1:])

as always, the same script is also on pastebin.

Whenever applying the offset moves some dialogs before 0:00:00,000 I decided to drop them altogether, starting with the first dialog ending after time 0, making it start at time 0 if start is negative.
The renumbering of dialogs (see fix_file) is something that is not needed, at least by VLC (which I used to test the script). You can have dialogs starting at, say, 42 and VLC is fine with that.

I was a little disappointed with the datetime.strptime function, in that it has no built-in support for milliseconds (only microseconds, and even that only on python2.7+!). The whole date/time/datetime system is not as pythonic as it seems at first sight, so I had to do a couple of little ugly things (as in parse_time and format_time).