Filter out spammers and click bait from Google Analytics

During the last few months, a wonderful new type of spam has become part of my life: Google Analytics spam.

As this article describes, what happens is that you start seeing some blatantly bogus traffic coming from a bunch of websites like semalt.com, buttons-for-website.com, darodar.com, or ilovevitaly.com.

Google announced an automatic bot and spider filtering option, but as some users on Hacker News reported, it doesn’t work reliably.

So far, the only solution to this problem that has worked for me is setting up a filter and adding spammers to it as they come. There don’t seem to be that many as of today, so this approach is still manageable.

[Update – July 2015]: if you have a public HTTP/PHP server available, and are willing to invest half a day to install it, Piwik is a nice, free, open-source alternative to Google Analytics. Piwik uses a community-maintained list of spammers that can also be used in Google Analytics. They wrote a blog post about it, too.

I’ve been using Piwik for a few weeks now, and I’m happy with it so far. The nice thing is that updates are very easy to apply, and they include the most recent list of spammers available. The thing that could be improved is the installation process: it’s not as easy as it could be (at least if you’re using Nginx as your web server). They also have a cloud-hosted version, but I guess that if you’re using Google Analytics for free, you’re more interested in free alternatives!

To add a filter in Google Analytics:

  1. go to your Administration page (last tab on your home page)
  2. All filters (on the leftmost column)
  3. New filter
  4. Choose Filter type “Custom” > “Exclude”
  5. Choose “Referral” from the Filter Field menu
  6. Set this as Filter pattern:
    semalt\.com|ilovevitaly\.co|priceg\.com|forum\..*darodar\.com|blackhatworth\.com|hulfingtonpost\.com|buttons-for-website\.com
  7. Select the views that you want to be filtered (I chose “All web site data”)
  8. Save

The filter pattern is a regular expression, so every time you find a new source of spam, simply add another “|spammersite\.com” (remember to escape dots with a backslash, since an unescaped dot means “any character”).
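If you want to sanity-check the pattern before saving the filter, you can test it against a referrer string with the same regular expression in your browser’s console. This is just a quick sketch to verify the regex does what you think; Google Analytics applies the pattern on its side:

var spamReferrers = /semalt\.com|ilovevitaly\.co|priceg\.com|forum\..*darodar\.com|blackhatworth\.com|hulfingtonpost\.com|buttons-for-website\.com/;

// a couple of quick checks
console.log(spamReferrers.test('forum.topic1234.darodar.com')); // true
console.log(spamReferrers.test('news.ycombinator.com'));        // false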

It’s playing catch-up with spammers, but as long as Google doesn’t find a way to reliably detect them, it’s the only way to get rid of them. I’ve collected those 7 websites over a couple of months, and I’ve seen them reported by other users as well. Since I’m no longer getting any bogus traffic after setting the filter, it looks like the problem is still relatively small and can be patched on a case-by-case basis.

Create a list of movies to watch with Python and Urlist

I usually like to keep lists of movies to watch, books to read, and games to play on Google Keep, mainly because it comes with a widget that looks nice on my phone. Its sharing capabilities, though, are, well, nonexistent. Yes, you can email a list with a bullet point for each entry, but it ends there.

Since I wanted to share a list of movies to watch with my wife, I resorted to Urlist. It’s a neat, straight-to-the-point tool that’s good for sharing links with friends, collaborators, anybody.

I wish they had an API available, so that this post could have been about a tool that automatically creates lists for you (heck, I could even write a simple Chrome extension!), but so far there’s none.

Our list is going to have, for every entry, the rating the movie got on IMDB, a brief summary of its plot, and the main cast. If any of them catches your SO’s attention, (s)he can just click through for further info about the movie 🙂

The script requires BeautifulSoup and Requests, two awesome libraries for scraping the web.

To install them, you can use either pip:

sudo pip install beautifulsoup4 requests

or easy_install:

sudo easy_install beautifulsoup4 requests

Create the list on Urlist, then launch the script:

python scrape_IMDB.py

and for every movie you want to add:

  1. search it on IMDB
  2. copy the URL
  3. paste the URL on Urlist to add an entry
  4. paste the URL on the console where the script is running
  5. copy the output of the script
  6. back to Urlist, hit edit and paste what you copied

Here’s the script (you can download it from pastebin):

from bs4 import BeautifulSoup
import requests

done = False

while not done:
  try:
    url = raw_input("IMDB URL: ")

    # get the IMDB page
    r = requests.get(url)
    data = r.text

    # and parse it with BeautifulSoup
    soup = BeautifulSoup(data)

    # the td containing what we're looking for
    td = soup.find('td', {'id': 'overview-top'})
    rating = td.find('div', {'class': 'star-box-giga-star'}).string
    plot = td.find('p', {'itemprop': 'description'}).string
    # the div containing the main actors in the cast
    actors = td.find('div', {'itemprop': 'actors'})
    stars = ', '.join([actor.string for actor in actors.find_all('span', {'class': 'itemprop', 'itemprop': 'name'})])

    print '*%s* - %s. %s' % (rating.strip(), stars, plot)
  except KeyboardInterrupt:
    done = True
print
print 'bye!'

It’s super simple! It gets the page, finds the HTML source for what we’re looking for, and prints it out as formatted text that’s good for Urlist.

The way you find items with BeautifulSoup is quite similar to what you do with jQuery: you look for the elements in the DOM that contain what you’re after (to find out what they are, just use your browser’s inspector… in Chrome, right-click on the text and choose “Inspect element…” to see where it is in the DOM), and manipulate them as strings or lists of strings.

Easy enough!

Logging in Javascript and filtering by tag

I’m not a fan of loggers in general when it comes to simplicity. They all require you to spend (waste) some time understanding their configuration syntax, maybe dealing with XML files (what am I, a caveman? :P), and they often require you to adapt to their philosophy. By that I mean, there are plenty of questions on StackOverflow similar to this one, and I definitely share this guy’s feelings about the issue.

In Javascript, I’ve seen some libraries for logging, and they all look quite complicated to set up. Or, quite complicated considering it’s Javascript we’re talking about. JSLog doesn’t require that much configuration, but most of the time I don’t even care about log levels in JS “apps”: I just want to filter messages according to their context. So I may be interested in all ajax-related logs at some point, but not in DOM manipulation messages. At some other time, I may be interested in DOM manipulation logs, but not in ajax-related messages. These can be thought of as log tags (the “DOM” tag, the “ajax” tag, the “user-input” tag, and so on).

A quick and dirty way to deal with all of this is to define some Log functions named after the “context” in which they’re called, and assign them to a no-op function when you want to filter them out.

In Chrome, these could be the log functions (let’s say they’re stored in log.js):

function Log() {
}

Log.message = function(message) {
    console.log(message);
};

Log.dom = Log.ajax = Log.message;

Then, throughout your code, you may have calls like:

Log.dom('adding sidebar');
// in some other place...
Log.ajax('new msg received from server');

All you need to do to disable, say, all ajax logs is change the Log.ajax function to a no-op in log.js, like this:

Log.ajax = function(){};

You can even do it live using the Chrome Developer Tools!

So a question could be: aren’t those calls to a no-op function expensive? Javascript function calls are expensive, right? Why not add a flag like this:

if (logAjaxEnabled) {
   Log.ajax('new msg received from server');
}

to all log calls?

Well, it’s a pain to add all those checks, that’s why 🙂

And your mileage may vary, but in all the browsers I’ve tested, calling a no-op is not that bad. It’s way worse than using the flag, but it’s in the same ballpark (assuming you’re not writing some very CPU-intensive code), and it’s WAY less painful to use, because you don’t need to write all those if statements.

I put together a dumb benchmark on JSPerf to see what the performance drawback of using this method is; you can try it!
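If you’d rather get a rough local number without JSPerf, a crude sketch like this one (paste it in the console; the absolute numbers mean very little, only the ratio does) gives you an idea:

// crude timing sketch: no-op calls vs. flag checks (results vary a lot per browser)
var noop = function() {};
var logAjaxEnabled = false;
var N = 10000000, i, t0;

t0 = Date.now();
for (i = 0; i < N; i++) { noop('new msg received from server'); }
console.log('no-op calls: ' + (Date.now() - t0) + ' ms');

t0 = Date.now();
for (i = 0; i < N; i++) { if (logAjaxEnabled) { noop('new msg received from server'); } }
console.log('flag checks: ' + (Date.now() - t0) + ' ms');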

It’s also interesting to note how wildly browser performance varies for something like this… ah, the JS world!

I’m sure I’m not discovering anything here (I mean, just look at the first few links for a simple Google query like this), but I haven’t seen this approach (filtering by context rather than by log level) used that much. It’s very close to what Android’s LogCat does, and I really like it!

Could not open the requested socket: Address already in use. Restart Jetty from Eclipse on Mac OSX

I tend to hammer the Play button in Eclipse like hell when developing server apps, so I ran into this issue pretty quickly.

When you’re working with Google App Engine on Mac OSX, pressing that familiar green button after having deployed the app once makes Eclipse complain as in the title. The stop button is grayed out (as it’s controlling the latest instance of Jetty, which didn’t start), and you can’t launch your app without restarting Eclipse.

So, to kill the old Jetty instance you just open a terminal and type:

lsof -i TCP:8888 | grep java | grep LISTEN

Where 8888 is the port on which Jetty is listening (it could be 8080 or something else depending on your configuration), and the first grep is just to stay on the safe side (you don’t want to kill something else). If you’re sure that there’s nothing else listening on that port, just omit it.

The output will be something like:

java    33873 myusername   68u  IPv6 0xffffff801a2c1510      0t0  TCP localhost:ddi-tcp-1 (LISTEN)

Then, just type

kill -15 33873

where 33873 is the number in the second column in the output of the previous command.

You can then run the project from Eclipse.

My routine is to keep a terminal window open and just run this one-liner when I run into the error:

kill -15 $(lsof -i TCP:8888 | grep java | grep LISTEN | awk '{ print $2 }')

which does exactly the same thing, but in an automated fashion… it’s just an arrow-up keystroke away! 🙂

Change page styles with Greasemonkey/Tampermonkey

This is not a guide, as you can find plenty of them on the web (well, at least for Greasemonkey)…

This is more of a quick and dirty solution to the problem “I just want this thing to be bigger/smaller/a different color” for some web page, and I’m highlighting the word “dirty” here 🙂

I’ll take feedly as an example.

I’m enjoying feedly as a replacement for Google Reader, but I can’t stand its oh-so-narrow central frame when I’m on a 1080p 24” screen.

This is how I “fixed” that.

First, you’ll want to produce the final result you’re aiming for with the Chrome/Firefox developer tools; in Chrome:

  1. right-click on the element whose look you want to change and choose “Inspect element”
  2. move the mouse pointer up and down in the developer tools frame until you see a blueish highlight over the element you want to edit
  3. take note of the element’s type (a <div>, a <p>, a <span>, an <img>, whatever it is), id, or class
  4. use the developer tools to change its looks (just add a custom style on the right under element.style)

Google’s official tutorial on the subject is here.

Once you’ve got a decent-looking page (in my case I changed some width and max-width attributes), you’re ready to create a Greasemonkey/Tampermonkey script that automatically applies those changes for you when you visit that page.

First, install Greasemonkey or Tampermonkey.

Then, create a new script (Tampermonkey has a small icon with a page and a green plus for that). In the header, the important tag is @match, which tells Tampermonkey which pages this script must apply to (in my case, http://cloud.feedly.com/*).
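A minimal header could look something like this (only @match really matters for our purposes; the name, description and the rest are up to you):

// ==UserScript==
// @name         feedly width tweaks
// @namespace    http://example.com/
// @version      0.1
// @description  widen feedly's central frame
// @match        http://cloud.feedly.com/*
// @grant        none
// ==/UserScript==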

Then, you can copy & paste this piece of code (which I found on the web; it’s used by many user scripts):

function addGlobalStyle(css) {
    var head, style;
    head = document.getElementsByTagName('head')[0];
    if (!head) { return; }
    style = document.createElement('style');
    style.type = 'text/css';
    style.innerHTML = css;
    head.appendChild(style);
}

It’s a function you can call to add CSS rules to the page’s final CSS style.

Then, just call the function for every style you want to change, using the selectors and rules you found with the developer tools earlier; in my case:

addGlobalStyle('.entryBody { max-width: 900px !important; }');
addGlobalStyle('#feedlyFrame { width: 1230px !important; }');
addGlobalStyle('#feedlyPage { width: 900px !important; }');
addGlobalStyle('.entryBody .content img { max-width: 850px !important; width: auto !important; height: auto !important; max-height: 600px !important;}');

All the !important markers are the dirty part: unless the page’s author used those herself (bad, bad author! :P), that flag ensures that your styles are applied, no matter what. The great thing about !important (which is also the very bad thing) is that it makes your rules override definitions even if they’re specified within a style attribute on the element itself!
For example, in:

<div class="wide" style="width: 800px;">

the width value is always overridden by a CSS rule like this:

.wide {
  width: 900px !important;
}

which is both awesome and awful depending on the context 🙂
Feedly has some style definitions like that, and that’s why I needed the !important flags.

Then, save the script and test it!

You can do a lot more than just stuffing in your filthy CSS code, of course. I found this great userscript that makes feedly look like Google Reader (isn’t that what we all want from an RSS reader?), and if you look at the code, the author works around the style problem by adding event listeners for DOMNodeInserted, because feedly’s page is built through DOM manipulation performed by javascript. Much more sophisticated 🙂
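I haven’t needed that for my simple width tweaks, but if you wanted to react to elements feedly adds after the initial load, a sketch of that listener-based approach (the selector and style here are just examples, not taken from that userscript) could be:

// re-apply a tweak whenever new nodes are inserted into the page
// (DOMNodeInserted is an old mutation event, but it's the one that userscript relies on)
document.addEventListener('DOMNodeInserted', function(event) {
    var node = event.target;
    // only element nodes have a classList
    if (node.nodeType !== 1) { return; }
    if (node.classList.contains('entryBody')) {
        node.style.maxWidth = '900px';
    }
}, false);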

The lesson here is: search userscripts.org first, and only then create your hacks!

Change default date range in Google Analytics with a Chrome Extension

[Update – May 2015] Updated description so that it matches the latest version of the extension on GitHub

[Update – September 2014] I moved the extension project to GitHub, and updated this post accordingly

This is a continuation from my previous post on the same subject.

I promised a Chrome Extension that opens the Google Analytics page and sets today’s date as the default date range. You can grab it from GitHub.

To add it to your Chrome:

  1. download and unzip the extension to some folder
  2. open your Chrome Extensions page (type chrome://extensions in the address bar or press the Settings button (top right) then Tools/Extensions)
  3. drag and drop the extracted ganalytics-lastDay.crx file onto the Extensions page; a “Drop to install” message should appear
  4. confirm the dialog

To configure the extension, simply open it and follow the instructions (which are the same as in my previous post). If you need to change the Analytics code at any point, just go back to the chrome://extensions page, find the extension, and click the (ridiculously small) Options button.

In case you want to play with date ranges, follow the instructions on the GitHub page.

The files you want to play with are background.js and conf.js, which both contain the getURL() function (duplicated, because using shared JS files in Chrome Extensions turned out to be a bit tricky). That function takes the portion of the URL manually pasted by the user and builds the full Analytics URL with it. As you can see, there are two variables involved: today and yesterday. You can change these dates using Date‘s functions, like this:

var date = new Date(), today = '', oneMonthAgo = '';

today += date.getFullYear();
today += pad2(date.getMonth() + 1);
today += pad2(date.getDate());

date.setMonth(date.getMonth() - 1);
oneMonthAgo += date.getFullYear();
oneMonthAgo += pad2(date.getMonth() + 1);
oneMonthAgo += pad2(date.getDate());

return 'https://www.google.com/analytics/web/?#home/' + code +'/%3F_u.date00%3D' + oneMonthAgo + '%26_u.date01%3D' + today +'/=';
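pad2() isn’t shown in this excerpt: it’s just a helper that zero-pads a number to two digits, so that dates come out as YYYYMMDD; it’s the same trick the d() function does in the bookmarklet from my previous post. Something like this (my sketch, the version in the repo may differ slightly):

// zero-pad month and day numbers so dates come out as YYYYMMDD
function pad2(n) {
    n = String(n);
    return n.length < 2 ? '0' + n : n;
}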

Of course, you may also want to change the default landing page: just go to that page in Google Analytics and replace the ?#home part of the URL with whatever you want, for example:

return 'https://www.google.com/analytics/web/?#report/app-visitors-overview/' + code +'/%3F_u.date00%3D' + oneMonthAgo + '%26_u.date01%3D' + today +'/=';

Something that can also be useful for bookmarklets: if you monitor more than one website/app with Analytics, you may want to have a bookmark for each of them (or you may want the extension to open Analytics for a specific website). Each website/app has its own code, so you can either paste the code for the website you want the extension to open, or hardcode the right combination of link and code in different bookmarklets.

Change default date range in Google Analytics

I don’t know why Google chose to set the default date range to the last 30 days excluding the current date (maybe I’m the only one interested in today’s stats), but it’s definitely annoying not to have the option to change that default.

There used to be a bookmarklet to overcome the issue, but I haven’t been able to find an updated version since Google changed how URLs are managed in Analytics.

So, here’s the update 🙂

The “easiest” way I found to make a bookmarklet works like this:

  1. Log in to your Google Analytics account
  2. Look at the URL, it should be something like https://www.google.com/analytics/web/?hl=en&#home/a12345678w12345678p12345678/
  3. Copy the last portion of the URL, in the example it’s a12345678w12345678p12345678
  4. Open a text editor and copy & paste this code into a new file you can call analytics.html
    <html>
    <head></head>
    <body><a target="_blank" href="javascript:(function(){function d(a){a=String(a);a.length<2&&(a='0'+a);return a}var c=new Date,b='';b+=c.getFullYear();b+=d(c.getMonth()+1);b+=d(c.getDate());location.href='https://www.google.com/analytics/web/?#report/visitors-overview/a12345678w12345678p12345678/%3F_u.date00%3D'+b+'%26_u.date01%3D'+b+'/=';})();">Google Analytics</a></body>
    </html>
    
  5. Replace a12345678w12345678p12345678 in the file with the code you copied at step 3
  6. Save the file and open it with your browser (tested with Chrome, Firefox and Safari on Mac OSX)
  7. Drag the link to your bookmarks bar

Don’t delete/move/rename the HTML file if you’re using Chrome or Firefox; for some reason they need it even after you’ve added the bookmarklet.

This bookmarklet sets the date range to today only; you can play with the javascript to change that (right now only b is used, so you’d need to create a second Date and set it at the end of the URL). Also, it takes you to the visitors overview page; you can change that by looking at the other pages’ URLs.
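For example, an un-minified sketch of the same function, changed to cover the last 7 days instead of just today (you’d have to squeeze it back into the href yourself, and a12345678w12345678p12345678 is still a placeholder for your own code), could look like this:

(function() {
    function pad(n) {
        n = String(n);
        return n.length < 2 ? '0' + n : n;
    }
    function fmt(d) {
        return '' + d.getFullYear() + pad(d.getMonth() + 1) + pad(d.getDate());
    }
    var end = new Date();
    var start = new Date();
    start.setDate(start.getDate() - 7);  // a week ago
    location.href = 'https://www.google.com/analytics/web/?#report/visitors-overview/' +
        'a12345678w12345678p12345678' +
        '/%3F_u.date00%3D' + fmt(start) + '%26_u.date01%3D' + fmt(end) + '/=';
})();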

Steps 2, 3 and 5 are not technically needed, in that if you leave the bogus URL I put in the code, Analytics is going to tell you that something’s wrong with your credentials, but will set the date anyway (and update that part of the URL). I guess that’s your session ID, so I’m not sure if it’s better to use an existing one or just leave the dummy and let Analytics generate a new one every time. It’s probably hackish to use an old session ID, but it gets rid of the warning dialog and it works!

Ok, this was to create a bookmarklet, but what I actually did was create a Chrome Extension that does the same thing, but has an icon and, most of all, is listed on the new tab page. If anybody is interested in that, let me know in the comments and I’ll add a new post to explain how it’s done :).

[Edit – June 27]: so here’s the promised extension

HttpPost requests executed multiple times (Apache HttpClient)

This is something I noticed on Android, but from what I read it also involves the desktop Java version.

I was sending POST requests to an API server, and I was getting random 400 Bad Request responses from time to time. I wish Apache provided an easy way to log the plain-text version of HTTP requests, but I couldn’t find a better way to see what the app was sending than re-sending the failing request to my PC.

So, to log requests, I start netcat (sudo nc -l 80 on a Mac) or a very minimal server in Python (it’s more or less the same as the example on Twisted’s front page) and route requests there whenever an error occurs:

try {
   response = client.execute(post,
                  new BasicResponseHandler());
} catch (IOException e) {
   if (DEBUG_FAILED_REQUESTS) {
      // the original request failed: re-send it, unchanged, to my PC so that
      // netcat (or the minimal Python server) can print out what was actually sent
      post.setURI(URI.create(DEBUG_FAILED_REQUESTS_SERVER));
      try {
         client.execute(post, new BasicResponseHandler());
      } catch (IOException e1) {
         e1.printStackTrace();
      }
   }
}
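If you don’t have netcat or Twisted handy, any throwaway request-dumping server does the job; here’s a rough Node sketch (not what I actually used, just an alternative) that prints the method, headers and body of whatever reaches it:

// save as dump_requests.js, run with `node dump_requests.js`, and point
// DEBUG_FAILED_REQUESTS_SERVER at http://<your PC's IP>:8080/
var http = require('http');

http.createServer(function(req, res) {
    console.log(req.method + ' ' + req.url);
    console.log(JSON.stringify(req.headers, null, 2));
    var body = '';
    req.on('data', function(chunk) { body += chunk; });
    req.on('end', function() {
        console.log(body);
        res.end('ok\n');
    });
}).listen(8080);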

I don’t know if it’s my router, but sometimes connections from the Android device to my PC get blocked: to make them work, I just open a browser on the Android device, visit some website, and then try again with my internal IP (192.168.0.whatever). It always works, no idea why.

Using this code I discovered that my POST requests were being executed 4 times each, nearly at the same time. It turns out that’s the default behavior, and you must provide your own retry handler if you want HttpClient to behave otherwise.

In my case, the calls are sent to Google’s URL shortener service, and for some reason it sometimes just rejects requests. If you wait a little bit between attempts, you increase your chances of getting a valid response. So this is what I did:

HttpPost post = new HttpPost(SHORTENER_URL);
String shortURL = null;
int tries = 0;
try {
    post.setEntity(new StringEntity(String.format(
            "{\"longUrl\": \"%s\"}",
            getURL(encodedID, encodedAssignedID))));
    post.setHeader("Content-Type", "application/json");
    DefaultHttpClient client = new DefaultHttpClient();
    // disable default behavior of retrying 4 times in a burst
    client.setHttpRequestRetryHandler(new DefaultHttpRequestRetryHandler(
            0, false));
    String response = null;
    while (response == null && tries < RETRY_COUNT) {
        try {
            response = client.execute(post,
                    new BasicResponseHandler());
        } catch (IOException e) {
            // maybe just try again...
            tries++;
            Utils.debug("attempt %d failed... waiting", tries);
            try {
                // life is too short for exponential backoff
                Thread.sleep(RETRY_SLEEP_TIME * tries);
            } catch (InterruptedException e1) {
                e1.printStackTrace();
            }
        }
    }
    Utils.debug("response is %s", response);
    if (response != null) {
        JSONObject jsonResponse = new JSONObject(response);
        shortURL = jsonResponse.getString("id");
    } else if (DEBUG_FAILED_REQUESTS) {
        Utils.debug("attempt %d failed, giving up", RETRY_COUNT);
        debugPost(post, client);
    }
} catch (JSONException e) {
    e.printStackTrace();
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

where debugPost() is a method that calls my PC to log the request, and Utils.debug() is just a small utility method I wrote to log messages with logcat using String.format() if format args are passed to it (it also takes care of splitting messages that would be truncated by logcat itself).

You could choose to implement exponential backoff very easily, but since it’s a blocking operation for the user in my case I preferred not to.

Testing HTML pages for screens with higher resolutions (locally!)

Sometimes you want to test the web page you’re developing on your small 1440×900 laptop monitor for, say, 1080p screens. There are several Chrome extensions for that (I tried Window Resizer and Resolution Test), but they don’t really seem to let you test for resolutions higher than that of your screen.

What I want is just a view of my web page with scrollbars to pan around, and nothing else.

Turns out it’s very simple to achieve that result, but either my Google-fu is getting worse, or it’s not that easy to find out how on the web.

You just need to create an HTML page that contains an iframe of the correct size that displays your page. That’s it!

Here’s an example page (I saved it as high_res_viewer.html) on pastebin:

High res viewer on pasteBin

And here’s the code. WordPress strips out the iframe code even if it’s within [source][/source] or <pre></pre> tags… meh. If anybody knows how to embed iframe code in WordPress posts, please let us know in the comments.

To display it here, I had to replace the opening iframe tag with a bogus i_frame tag. You need to change it back to iframe, of course, to make it work.

<html>
<body style="margin:0px;">
<i_frame src="file://localhost/Users/myuser/web/my_page.html" style="border:0px #FFFFFF none;" name="myiFrame" scrolling="auto" frameborder="0" marginheight="0px" marginwidth="0px" height="1080px" width="1920px"></iframe>
</body>
</html>

So you need to set the correct path to your HTML page in src, and the resolution you want to test in height and width.

(The source for the iframe is courtesy of the Online iFrame generator.)

You can then use that page to develop for 1080p screens. You could create a page for each resolution you want to test (maybe with names like 1080p.html, 1920x1200.html, and so on). I’m sure you could also generate the iframe dynamically, providing input fields to set the resolution you want and the link to the page under test, but I just needed a quick solution for the time being; a rough sketch of the dynamic version is below, if you want a starting point.
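Something along these lines (to be wrapped in a <script> tag at the end of a local HTML page, just like the static viewer, so that file:// URLs keep working; everything here is just an example):

// build width/height/URL inputs, a button, and the iframe itself
var widthInput = document.createElement('input');
var heightInput = document.createElement('input');
var urlInput = document.createElement('input');
var button = document.createElement('button');
var frame = document.createElement('iframe');

widthInput.value = '1920';
heightInput.value = '1080';
urlInput.value = 'file://localhost/Users/myuser/web/my_page.html';
button.textContent = 'load';
frame.style.border = 'none';

// resize and (re)load the iframe with the requested page
button.onclick = function() {
    frame.width = widthInput.value;
    frame.height = heightInput.value;
    frame.src = urlInput.value;
};

document.body.appendChild(widthInput);
document.body.appendChild(heightInput);
document.body.appendChild(urlInput);
document.body.appendChild(button);
document.body.appendChild(frame);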

As always, feel free to correct or improve the post in the comments!