Fix out of sync subtitles with Python!

This is an update of my old post from a couple of years ago.

[Edit – October 2015]: I created an in-browser version of subslider, called subslider.js. Just visit this page and follow the instructions, or read this blog post if you want to know more about it!

After using that script quite a few times, and loving it, I decided to give it a facelift and add the one feature that I’ve been wishing it had for all this time: the option to just tell it the timestamp of the first dialog without performing any math 🙂

Yeah, it’s simple math, but having to use base 60 means more brain CPU time wasted (and above all, it means more time separating me from my movie!).

I also moved the code to GitHub, so you can find it here: SubSlider. And this is the direct link to the python script, for the impatient.

The old way of specifying offsets using +/- has been replaced by a more argparse-standard system of flags. Also, the new feature I mentioned above can be used by running the script like this:

python subslider.py -s 1:23,450 MySubFile.srt

assuming your subtitles file is called MySubFile.srt and assuming that the first dialog in the movie takes place at 1:23,450. This time, there’s an “interactive” dialog that asks you to choose the first line among the first 10 lines in the .srt file. I added it because sometimes you get the equivalent of opening titles in the .srt, and that doesn’t help when you’re synchronizing.

If you want to get a different number of lines, you can simply change the LINES_TO_SHOW variable at line 43 to whatever number you prefer.
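To give an idea of what the -s option has to do, here is a rough sketch. It is not the code from the repository: first_cues, offset_ms and the sample subtitles are made up for illustration, and the target timestamp is assumed to always come in M:SS,mmm form. It lists the first few cues, then subtracts the chosen cue's start from the timestamp you passed.

```python
import re

# Matches the start time of an .srt cue line such as
# "00:02:08,883 --> 00:02:10,500".
CUE_START = re.compile(r'(\d{2}):(\d{2}):(\d{2}),(\d{3}) -->')


def to_millis(hours, minutes, seconds, millis):
    """Convert clock fields (strings or ints) to one millisecond count."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def first_cues(srt_text, limit=10):
    """Return up to `limit` (start_ms, first_text_line) pairs."""
    cues = []
    lines = srt_text.splitlines()
    for index, line in enumerate(lines):
        match = CUE_START.match(line)
        if match:
            text = lines[index + 1] if index + 1 < len(lines) else ''
            cues.append((to_millis(*match.groups()), text))
            if len(cues) == limit:
                break
    return cues


def offset_ms(target, cue_start_ms):
    """Signed offset that moves a cue to `target`, given as M:SS,mmm."""
    minutes, rest = target.split(':')
    seconds, millis = rest.split(',')
    return to_millis(0, minutes, seconds, millis) - cue_start_ms


# Hypothetical subtitles: the first cue is just the opening titles.
SAMPLE = """1
00:00:01,000 --> 00:00:04,000
[Opening titles]

2
00:01:10,000 --> 00:01:12,500
Do you swear it?
"""

cues = first_cues(SAMPLE)
# Pretend the user picked the second cue as the first real dialog.
print(offset_ms('1:23,450', cues[1][0]))
```

A positive result means the subtitles have to be delayed, a negative one that they have to be shown earlier.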

As always, feel free to contribute 🙂

Create a list of movies to watch with Python and Urlist

I usually like to keep lists of movies to watch, books to read, games to play on Google Keep, mainly because it comes with a widget that looks nice on my phone. Its sharing capabilities though are, well, nonexistent. Yes, you can email a list with a bullet point for each entry but it ends there.

Since I wanted to share a list of movies to watch with my wife, I resorted to Urlist. It’s a neat, straight-to-the-point tool that’s good for sharing links with friends, collaborators, anybody.

I wish they had an API available, so that this post could have been about a tool that automatically creates lists for you (heck, I could even write a simple chrome extension!), but so far there’s none.

Our list is going to have, for every entry, the vote that the movie got on IMDB, a brief summary of its plot and cast. If any of them attracts your SO’s attention, (s)he can just click to see further info about the movie 🙂

The script requires BeautifulSoup and Requests, 2 awesome libraries to scrape the web.

To install them, you can use either pip:

sudo pip install beautifulsoup4 requests

or easy_install:

sudo easy_install beautifulsoup4 requests

Create the list on Urlist, launch the script:

python scrape_IMDB.py

and for every movie you want to add:

  1. search for it on IMDB
  2. copy the URL
  3. paste the URL on Urlist to add an entry
  4. paste the URL on the console where the script is running
  5. copy the output of the script
  6. back to Urlist, hit edit and paste what you copied

Here’s the script (you can download it from pastebin):

from bs4 import BeautifulSoup
import requests

done = False

while not done:
  try:
    url = raw_input("IMDB URL: ")

    # get the IMDB page
    r = requests.get(url)
    data = r.text

    # and parse it with BeautifulSoup
    soup = BeautifulSoup(data, 'html.parser')

    # the td containing what we're looking for
    td = soup.find('td', {'id': 'overview-top'})
    rating = td.find('div', {'class': 'star-box-giga-star'}).string
    plot = td.find('p', {'itemprop': 'description'}).string
    # the div containing the main actors in the cast
    actors = td.find('div', {'itemprop': 'actors'})
    stars = ', '.join([actor.string for actor in actors.find_all('span', {'class': 'itemprop', 'itemprop': 'name'})])

    print '*%s* - %s. %s' % (rating.strip(), stars, plot)
  except KeyboardInterrupt:
    done = True
print
print 'bye!'

It’s super simple! It gets the page, finds the HTML source for what we’re looking for, and prints it out as formatted text that’s good for Urlist.

The way you find items with BeautifulSoup is relatively similar to what you do with jQuery: you look for elements in the DOM that contain what you’re looking for (to find what they are, just use your browser’s inspector… on Chrome, right click on the text and choose “Inspect element…” to see where it is in the DOM), and manipulate them as strings or arrays of strings.

Easy enough!
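If you want to play with the "find elements by tag and attributes" idea without hitting IMDB, here is a small offline sketch. To keep it dependency-free it uses the standard library's html.parser instead of BeautifulSoup, so it is a stand-in for the technique and not the script's actual code. TextFinder, find_text and the sample markup are all made up, and the attribute match is exact (no multi-class handling like BeautifulSoup has).

```python
from html.parser import HTMLParser


class TextFinder(HTMLParser):
    """Collect the text inside every <tag> carrying the wanted attributes."""

    def __init__(self, tag, wanted):
        super().__init__()
        self.tag = tag
        self.wanted = wanted
        self.depth = 0      # > 0 while we are inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth:
            self.depth += 1     # same tag nested inside a match
        elif all(dict(attrs).get(k) == v for k, v in self.wanted.items()):
            self.depth = 1
            self.results.append('')

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def find_text(html, tag, wanted):
    """Return the stripped text of every matching element, in page order."""
    finder = TextFinder(tag, wanted)
    finder.feed(html)
    return [text.strip() for text in finder.results]


# Made-up markup, shaped like the page the script above expects.
PAGE = """
<td id="overview-top">
  <div class="star-box-giga-star"> 8.1 </div>
  <p itemprop="description">A knight swears an oath.</p>
  <div itemprop="actors">
    <span class="itemprop" itemprop="name">Alice</span>
    <span class="itemprop" itemprop="name">Bob</span>
  </div>
</td>
"""

rating = find_text(PAGE, 'div', {'class': 'star-box-giga-star'})[0]
plot = find_text(PAGE, 'p', {'itemprop': 'description'})[0]
stars = ', '.join(find_text(PAGE, 'span', {'itemprop': 'name'}))
print('*%s* - %s. %s' % (rating, stars, plot))
```

The last line builds the same Urlist-friendly format as the script: rating between asterisks, then the cast, then the plot.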

HttpPost requests executed multiple times (Apache HttpClient)

This is something I noticed on Android, but from what I read it also involves the desktop Java version.

I was sending POST requests to an API server, and I was getting some random 400 Bad Request responses from time to time. I wish Apache provided an easy way to log the plain text version of HTTP requests, but the best way I could find to see what the app was sending was to re-send the same request to my PC whenever it failed.

So to log requests I start netcat (sudo nc -l 80 on a mac) or a very minimal server in python (it’s more or less the same as the example on Twisted’s front page) and route them there whenever an error occurs.

try {
   response = client.execute(post,
                  new BasicResponseHandler());
} catch (IOException e) {
   if (DEBUG_FAILED_REQUESTS) {
      post.setURI(URI.create(DEBUG_FAILED_REQUESTS_SERVER));
      try {
         client.execute(post, new BasicResponseHandler());
      } catch (IOException e1) {
         e1.printStackTrace();
      }
   }
}

I don’t know if it’s my router, but sometimes connections from the Android device to my PC get blocked: to make them work I just open a browser on the Android, go to some website and then try again with my internal IP (192.168.0.whatever). It always works, no idea why.

Using this code I discovered that my post requests were executed 4 times each, nearly at the same time. I discovered that it’s the default behavior, and you must provide your own RetryHandler if you want the HttpClient to work otherwise.

In my case, my calls are sent to Google’s shortener service, and for some reason sometimes it just rejects requests. If you wait a little bit between attempts you increase your chance of getting valid responses. So this is what I did:

HttpPost post = new HttpPost(SHORTENER_URL);
String shortURL = null;
int tries = 0;
try {
    post.setEntity(new StringEntity(String.format(
            "{\"longUrl\": \"%s\"}",
            getURL(encodedID, encodedAssignedID))));
    post.setHeader("Content-Type", "application/json");
    DefaultHttpClient client = new DefaultHttpClient();
    // disable default behavior of retrying 4 times in a burst
    client.setHttpRequestRetryHandler(new DefaultHttpRequestRetryHandler(
            0, false));
    String response = null;
    while (response == null && tries < RETRY_COUNT) {
        try {
            response = client.execute(post,
                    new BasicResponseHandler());
        } catch (IOException e) {
            // maybe just try again...
            tries++;
            Utils.debug("attempt %d failed... waiting", tries);
            try {
                // life is too short for exponential backoff
                Thread.sleep(RETRY_SLEEP_TIME * tries);
            } catch (InterruptedException e1) {
                e1.printStackTrace();
            }
        }
    }
    Utils.debug("response is %s", response);
    if (response != null) {
        JSONObject jsonResponse = new JSONObject(response);
        shortURL = jsonResponse.getString("id");
    } else if (DEBUG_FAILED_REQUESTS) {
        Utils.debug("attempt %d failed, giving up", RETRY_COUNT);
        debugPost(post, client);
    }
} catch (JSONException e) {
    e.printStackTrace();
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

where debugPost() is a method that calls my PC to log the request, and Utils.debug() is just a small utility method I wrote to log messages with logcat using String.format() if format args are passed to it (it also takes care of splitting messages that would be truncated by logcat itself).

You could choose to implement exponential backoff very easily, but since it’s a blocking operation for the user in my case I preferred not to.
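For comparison, here is a minimal sketch of the two waiting strategies, written in Python rather than Java just to keep it short. The retry helper, its parameters and the flaky function are hypothetical; base_delay plays the role of RETRY_SLEEP_TIME above, and the sleep function is injectable only so the example runs instantly.

```python
import time


def retry(call, attempts, base_delay, exponential=False, sleep=time.sleep):
    """Run `call` until it succeeds or `attempts` tries have failed.

    Returns the call's result, or None when every attempt raised IOError.
    """
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except IOError:
            if exponential:
                delay = base_delay * 2 ** (attempt - 1)   # 1x, 2x, 4x, ...
            else:
                delay = base_delay * attempt              # 1x, 2x, 3x, ...
            sleep(delay)
    return None


# A fake request that is rejected twice before going through.
failures = {'left': 2}


def flaky():
    if failures['left']:
        failures['left'] -= 1
        raise IOError('rejected')
    return 'ok'


waits = []   # record the delays instead of actually sleeping
result = retry(flaky, attempts=5, base_delay=0.5, sleep=waits.append)
print(result, waits)
```

With the linear version the total wait grows slowly, which matters when the user is staring at a spinner; the exponential one backs off much faster after the first couple of failures.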

Create a diff for i18n strings.xml files to manage localization on Android

Keeping all your strings.xml files synchronized in Android projects can be painful, as Eclipse doesn’t tell you which strings have no localized version in which language. Android is perfectly happy with it as well: it just uses the default (usually English) string in the app, much to the joy of your non-English users.

I came up with a simple Python script that just scans your res/values-** folders for strings.xml files and, using your default res/values/strings.xml as reference, outputs a list of missing strings for each file, along with the original value set for the key.

So, if for instance your res/values/strings.xml is this:

<?xml version="1.0" encoding="utf-8"?>
<resources>
    <string name="app_name">My App</string>
    <string name="title_activity_main">My Activity</string>
    <string name="hello_world">Hello, World!</string>
</resources>

and your, say, res/values-it/strings.xml is this:

<?xml version="1.0" encoding="utf-8"?>
<resources>
    <string name="app_name">La mia App</string>
    <string name="title_activity_main">La mia Activity</string>
</resources>

and your… res/values-fr/strings.xml? is this:

<?xml version="1.0" encoding="utf-8"?>
<resources>
    <string name="app_name">Mon App</string>
    <string name="hello_world">Bonjour, Monde!</string>
</resources>

the script would output:

Missing in /home/whatever/wherever/.../App/res/values-it/strings.xml:
<string name="hello_world">Hello, World!</string>

Missing in /home/whatever/wherever/.../App/res/values-fr/strings.xml:
<string name="title_activity_main">My Activity</string>

So the idea is that you can cut and paste those lines in the appropriate files to translate them.

The script also outputs some warnings in case it finds duplicate keys in any of the strings.xml files.
Your localized strings.xml files may have more <string> items than the default, as no check is performed against that.
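The core of the comparison can be sketched in a few lines. This is not the actual i18n.py, just a minimal illustration of the idea using xml.etree.ElementTree; read_strings and missing_strings are names I made up here, the XML is passed as text instead of being read from res/values-** folders, and strings containing nested markup or escaped entities are not handled.

```python
import xml.etree.ElementTree as ET


def read_strings(xml_text):
    """Return ({name: text}, [duplicate names]) for every <string> element."""
    values, duplicates = {}, []
    for node in ET.fromstring(xml_text).iter('string'):
        name = node.get('name')
        if name in values:
            duplicates.append(name)
        values[name] = node.text or ''
    return values, duplicates


def missing_strings(default_xml, localized_xml):
    """List the default entries that have no counterpart in the localized file."""
    default, _ = read_strings(default_xml)
    localized, _ = read_strings(localized_xml)
    return ['<string name="%s">%s</string>' % (name, text)
            for name, text in default.items() if name not in localized]


DEFAULT = """<resources>
    <string name="app_name">My App</string>
    <string name="title_activity_main">My Activity</string>
    <string name="hello_world">Hello, World!</string>
</resources>"""

ITALIAN = """<resources>
    <string name="app_name">La mia App</string>
    <string name="title_activity_main">La mia Activity</string>
</resources>"""

for line in missing_strings(DEFAULT, ITALIAN):
    print(line)
```

Extra keys in the localized file are silently ignored here too, since only the default file's keys are walked.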

I put the script in a folder within my Android projects that is simply ignored by Android (I usually call it not_in_apk or something like that), so if you put it elsewhere remember to change the path at line 23

path_to_default = '../res/values/strings.xml'

to the path to your default strings.xml file (absolute or relative, it should work anyway).

I didn’t do much testing, so it may not work for you… The worst thing that can happen is… it doesn’t work 🙂
It won’t mess with your files, I promise you that.

Here’s the script! Run it with python i18n.py.

Last note: this script only takes strings.xml files into account; you should run Android Lint to check for strings to be translated in other XML files (stringarrays.xml and other files).

Fix subtitles offset with python!

[UPDATE – May 25, 2014] I revamped this script, moved it to GitHub, and wrote a new post about it!
[UPDATE – May 19, 2013] Script updated to support Python 3!

One of the most common problems with subtitle files, especially with TV series subtitles, is that they often start all too late because you have a version of the video file containing opening titles (or ‘previously on MyFavoriteSeries’ sequences) and the subtitles don’t account for them, or the other way around.

Of course, once you’ve fixed this offset the subtitles are fine, as the movie is played at the same rate in all versions.

My beloved XBMC has a function to sync subtitles, but it’s more of a fine-tuning thing: you can’t specify a very large offset (last time I checked), and it takes some time to actually reload the subtitles and show you the results.

I developed a small script in python to do just that, as I thought that it would have been quicker to write it than to look for it (and it was… at least the quick&dirty version :D). To use it, just open the subtitles with any text editor you like, look for the first dialog and take note of when that dialog takes place in the movie: your offset is the difference between the time in the movie and the one you found in the file. So if the .srt file states that Renly Baratheon says “Do you swear it?” at 00:02:08,883 but in the .avi file it’s actually at roughly 00:03:43,500, your offset is 3:43,5 - 2:08,883 = 94,617 = 1:34,617. Then, you run the script calling

python subslider.py MySubs.srt offset

and your new subs are in MySubs_offset.srt. That’s it!

You can specify positive offsets (e.g. +15) for when subtitles should be delayed, or negative offsets (e.g. -30) in case it’s the movie that should be delayed (and the subs shown earlier).

Offsets can be specified both with decimal notation (as in +94,617, subs delayed by 94.617 seconds) and with time notation (as in -5:07,324, video delayed by 5 minutes 7 seconds 324 milliseconds). Time notation follows the one used in .srt files, so you get a comma as decimal separator.
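To double-check that the two notations really describe the same thing, here is a compact sketch that turns an offset string into a signed timedelta. It is a simplified stand-in and not the parsing code from the script; parse_offset is a name made up for this example. As in the script's help text, the digits after the comma are read as a millisecond count (so ",43" means 43 ms).

```python
import re
from datetime import timedelta

# [+/-][MM:]SS[,sss]
OFFSET = re.compile(r'([+-])?(?:(\d{1,2}):)?(\d+)(?:,(\d{1,3}))?$')


def parse_offset(text):
    """Turn an offset such as '+94,617' or '-5:07,324' into a timedelta."""
    match = OFFSET.match(text)
    if not match:
        raise ValueError('%r is not a valid offset' % text)
    sign, minutes, seconds, millis = match.groups()
    delta = timedelta(minutes=int(minutes or 0),
                      seconds=int(seconds),
                      milliseconds=int(millis or 0))
    return -delta if sign == '-' else delta


# Decimal and time notation give the same offset.
print(parse_offset('+94,617') == parse_offset('1:34,617'))

# The worked example: movie time minus the time found in the .srt file.
in_movie = timedelta(minutes=3, seconds=43, milliseconds=500)
in_file = timedelta(minutes=2, seconds=8, milliseconds=883)
print(in_movie - in_file == parse_offset('94,617'))

# A negative offset: the video is delayed instead.
print(parse_offset('-5:07,324').total_seconds())
```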

Here it is, you can save it to a file named subslider.py and run it with python 2.7 ([Update – May 19, 2013] or python 3!).

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# SubSlider - a simple script to apply offsets to subtitles
#
# Copyright May 2nd 2012 - MB <https://somethingididnotknow.wordpress.com>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>
from __future__ import print_function
from datetime import timedelta, datetime
import os
import re
import sys

class SubSlider:
    """A simple script to apply offsets to subtitles.

    Subtitles can be delayed by specifying a positive offset (e.g. +12 or simply 12), or video can be delayed by specifying a negative offset (e.g. -12)"""

    def __init__(self, argv):
        if len(argv) < 2:
            self.usage()
        else:
            self.first_valid = 0
            self.parse_args(argv)
            self.parse_subs()
            self.fix_file()
            os.remove(self.output_temp)
            print('Success! Offset subs have been written to %s' % os.path.abspath(self.output_subs))

    def usage(self):
        print("""usage: subslider.py [-h] subs_file offset

Applies an offset to a subtitles file

positional arguments:
  subs_file             The input subtitles file, the one to which the offset
                        is to be applied
  offset                The offset to be applied to the input subtitles file.
                        Format is [+/-][MM:]SS[,sss] like +1:23,456 (new subs
                        will be displayed with a delay of 1 minute, 23 seconds,
                        456 milliseconds) or -100 (subs 100
                        seconds earlier) or +12,43 (subs delayed by 12 seconds
                        43 milliseconds)""")
        sys.exit(1)


    def parse_args(self, args):
        error = None
        if not os.path.isfile(args[0]):
            print('%s does not exist' % args[0])
            error = True
        else:
            self.input_subs = args[0]
            self.output_subs = '%s_offset.srt' % os.path.splitext(self.input_subs)[0]
            self.output_temp = '%s_temp.srt' % os.path.splitext(self.input_subs)[0]
        offset_ok = re.match(r'[+-]?(\d{1,2}:)?\d+(,\d{1,3})?$', args[1])
        if not offset_ok:
            print('%s is not a valid offset, format is [+/-][MM:]SS[,sss], see help dialog for some examples' % args[1])
            error = True
        else:
            offset = re.search(r'([+-])?((\d{1,2}):)?(\d+)(,(\d{1,3}))?', args[1])
            self.direction, self.minutes, self.seconds, self.millis = (offset.group(1), offset.group(3), offset.group(4), offset.group(6))
        if error:
            self.usage()

    def parse_subs(self):
        with open(self.input_subs, 'r') as input:
            with open(self.output_temp, 'w') as output:
                nsafe = lambda s: int(s) if s else 0 
                block = 0
                date_zero = datetime.strptime('00/1/1','%y/%m/%d')
                for line in input:
                    parsed = re.search(r'(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})', line)
                    if parsed:
                        block += 1
                        start, end = (self.parse_time(parsed.group(1)), self.parse_time(parsed.group(2)))
                        offset = timedelta(minutes=nsafe(self.minutes), seconds=nsafe(self.seconds), microseconds=nsafe(self.millis) * 1000)
                        if '-' == self.direction:
                            start -= offset
                            end -= offset
                        else:
                            start += offset
                            end += offset
                        offset_start, offset_end = (self.format_time(start), self.format_time(end))
                        if not self.first_valid:
                            if end > date_zero:
                                self.first_valid = block
                                if start < date_zero:
                                    offset_start = '00:00:00,000'
                        output.write('%s --> %s\n' % (offset_start, offset_end))
                    else:
                        output.write(line)

    def fix_file(self):
        with open(self.output_temp, 'r') as input:
            with open(self.output_subs, 'w') as output:
                start_output = False
                for line in input:
                    if re.match(r'\d+$', line.strip()):
                        block_num = int(line.strip())
                        if block_num >= self.first_valid:
                            if not start_output:
                                start_output = True
                            output.write('%d\r\n' % (block_num - self.first_valid + 1))
                    elif start_output:
                        output.write(line)

    def format_time(self, value):
        formatted = datetime.strftime(value, '%H:%M:%S,%f')
        return formatted[:-3]

    def parse_time(self, time):
        parsed = datetime.strptime(time, '%H:%M:%S,%f')
        return parsed.replace(year=2000)

if __name__ == '__main__':
    SubSlider(sys.argv[1:])

As always, the same script is also on pastebin.

Whenever applying the offset moves some dialogs before 0:00:00,000, I decided to drop them altogether: the output starts with the first dialog that ends after time 0, and that dialog is made to start at time 0 if its start would be negative.
The renumbering of dialogs (see fix_file) is something that is not needed, at least by VLC (which I used to test the script). You can have dialogs starting at, say, 42 and VLC is fine with that.

I was a little disappointed with the datetime.strptime function, in that it has no built-in support for milliseconds (only microseconds via %f, and even that only on python 2.6+!). The whole date/time/datetime system is not as pythonic as it seems at first sight, so I had to do a couple of little ugly things (as in parse_time and format_time).

Syntax highlighted code in Gmail with Vim

I used to paste code in emails by simply choosing the monospaced font in gmail’s web client (I don’t like native clients… well, except Android’s), in the hope that the (usually) small width of the paragraph didn’t mess up the code.

Whenever I had the need to paste snippets larger than, say, 4-5 LOC I always ended up using some external website like pastebin; this however resembled attaching files to the email too much, in that you can’t really talk about snippets in the email without having your addressees jump back and forth between tabs, like they would between windows in the attached-file case. In those cases I relied on commenting the code on pastebin, so my emails were actually cut in two (or three, or four,… depending on the number of snippets in the email).

My new process goes like this: I write the code in the editor (Eclipse or Geany), so I have autocompletion, autoformatting and tabs converted into spaces. Then, I copy and paste the code into a new file in Vim and save it as an HTML file using the

:TOhtml

command as I learned reading this article. I open the file with Chromium (any browser will do, of course) and I copy and paste the code right into Gmail’s editor.
This way Gmail nicely inserts the HTML code inside the email, using the styles I chose in Vim, background color included!

You don’t like Vim’s color scheme? You’ll find a lot of color schemes in this wonderful website 🙂

Change color scheme in Geany

After trying Sublime for a while, and quite liking it, I found myself in the middle of a deep customization of the editor… to make it work like Geany!

For quick editing of local bash/python scripts or configuration files there’s no editor that meets my taste better (I said local because when dealing with remote files my all-time favorite is vim).

The one thing that I like more in Sublime than in Geany is its look: it’s very elegant, but in the end what I really missed in Geany was a dark editor theme.

I discovered this project on GitHub with several available themes, very easy to install (as in execute-install-script-with-no-options) but… not that easy to choose from Geany’s interface!

[Update – Mar13] – unjordi posted a nice command line one-liner to get and install the geany-dark color schemes, here it is:

wget -qO- http://geany-dark-scheme.googlecode.com/files/geany_dark_filedefs_20100304_190847.tar.bz2 | tar jxv -C ~/.config/geany/filedefs/

After you’ve downloaded the color schemes, and before editing the configuration file as described below, try to restart Geany and check if there’s an entry under View > Editor > Color Schemes; if it’s there you can choose among all installed color schemes from a nice list! 🙂

If you’re out of luck (no list for you) you must edit Geany’s configuration file ~/.config/geany/geany.conf and find the color_scheme line. You must specify the whole file name of the color scheme you wish to use, without its path (it must be in the ~/.config/geany/colorschemes folder anyway).

So, to set your theme to tango-dark you need this line in your geany.conf file:

color_scheme=tango-dark.conf

Restart Geany and there you have your nice dark theme 🙂

[Update – Nov13] – a reader had trouble with the configuration file (always reverting to its original state, or not being read correctly by Geany); scroll down to November 5 2013 in the comments if you have the same issue!