Filter out spammers and click bait from Google Analytics

Over the last few months, a wonderful new type of spam has become part of my life: Google Analytics spam.

As this article describes, what happens is that you start seeing some blatantly bogus traffic coming from a bunch of websites like semalt.com, buttons-for-website.com, darodar.com, or ilovevitaly.com.

Google announced automatic bot and spider filtering, but as some users on Hacker News reported, it doesn’t work reliably.

So far, the only solution to this problem that has worked for me is to set up a filter and add spammers to it as they show up. There don’t seem to be that many as of today, so this approach is still usable.

[Update – July 2015]: if you have a public HTTP/PHP server available, and are willing to invest half a day to install it, Piwik is a nice free, open-source Google Analytics alternative. Piwik uses a community-maintained list of spammers that can also be used in Google Analytics. They wrote a blog post about it, too.

I’ve been using Piwik for a few weeks now, and I’m happy with it so far. The nice thing is that updates are very easy to apply, and they include the most recent list of spammers available. The installation process could be improved: it’s not as easy as it could be (at least if you’re using Nginx as your web server). They also have a cloud-hosted version, but I guess that if you’re using Google Analytics for free, you’re more interested in free alternatives!

To add a filter in Google Analytics:

  1. go to your Administration page (last tab on your home page)
  2. All filters (on the leftmost column)
  3. New filter
  4. Choose Filter type “Custom” > “Exclude”
  5. Choose “Referral” from the Filter Field menu
  6. Set this as Filter pattern:
    semalt\.com|ilovevitaly\.co|priceg\.com|forum\..*darodar\.com|blackhatworth\.com|hulfingtonpost\.com|buttons-for-website\.com
  7. Select the views that you want to be filtered (I chose “All web site data”)
  8. Save

The filter pattern is a regular expression, so every time you find a new source of spam, simply add another “|spammersite\.com” (remember to escape dots with a backslash, as they mean “any character”).
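Before saving the filter, it can help to try the pattern against a few referrers. Here is a minimal sketch in JavaScript, assuming Google Analytics matches the pattern anywhere in the referral string and that its regex engine behaves like JavaScript's for a pattern this simple. The helper name isSpamReferral and the sample referrers are made up for illustration.

```javascript
// The filter pattern from step 6, as a JavaScript regular expression.
var spamPattern = new RegExp(
    'semalt\\.com|ilovevitaly\\.co|priceg\\.com|forum\\..*darodar\\.com|' +
    'blackhatworth\\.com|hulfingtonpost\\.com|buttons-for-website\\.com'
);

// Hypothetical helper: true when a referral would be excluded by the filter.
function isSpamReferral(referrer) {
    return spamPattern.test(referrer);
}

// Sample referrers; example.com stands in for legitimate traffic, and
// "semaltxcom" shows that the escaped dot no longer means "any character".
['semalt.com', 'forum.topic123.darodar.com', 'example.com', 'semaltxcom'].forEach(function (r) {
    console.log(r + ' -> ' + (isSpamReferral(r) ? 'excluded' : 'kept'));
});
```

If you drop the backslashes, "semaltxcom" would be excluded too, which is why the dots need escaping.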

It’s playing catch-up with spammers, but as long as Google doesn’t find a way to reliably detect them, it’s the only way to get rid of them. I collected those 7 websites over a couple of months, and I’ve seen them being reported by other users as well. Since I’m no longer getting any bogus traffic after setting the filter, it looks like the problem is still relatively small and can be patched on a case-by-case basis.

Create a list of movies to watch with Python and Urlist

I usually like to keep lists of movies to watch, books to read, games to play on Google Keep, mainly because it comes with a widget that looks nice on my phone. Its sharing capabilities though are, well, nonexistent. Yes, you can email a list with a bullet point for each entry but it ends there.

Since I wanted to share a list of movies to watch with my wife, I resorted to Urlist. It’s a neat, straight-to-the-point tool that’s good for sharing links with friends, collaborators, anybody.

I wish they had an API available, so that this post could have been about a tool that automatically creates lists for you (heck, I could even write a simple chrome extension!), but so far there’s none.

Our list is going to have, for every entry, the rating that the movie got on IMDB, a brief summary of its plot, and its cast. If any of them attracts your SO’s attention, (s)he can just click to see further info about the movie 🙂

The script requires BeautifulSoup and Requests, 2 awesome libraries to scrape the web.

To install them, you can use either pip:

sudo pip install beautifulsoup4 requests

or easy_install:

sudo easy_install beautifulsoup4 requests

Create the list on Urlist, launch the script:

python scrape_IMDB.py

and for every movie you want to add:

  1. search for it on IMDB
  2. copy the URL
  3. paste the URL on Urlist to add an entry
  4. paste the URL on the console where the script is running
  5. copy the output of the script
  6. back to Urlist, hit edit and paste what you copied

Here’s the script (you can download it from pastebin):

from bs4 import BeautifulSoup
import requests

done = False

while not done:
  try:
    # Python 3: input() replaces Python 2's raw_input()
    url = input("IMDB URL: ")

    # get the IMDB page
    r = requests.get(url)
    data = r.text

    # and parse it with BeautifulSoup (naming the parser avoids a warning)
    soup = BeautifulSoup(data, 'html.parser')

    # the td containing what we're looking for
    td = soup.find('td', {'id': 'overview-top'})
    rating = td.find('div', {'class': 'star-box-giga-star'}).string
    plot = td.find('p', {'itemprop': 'description'}).string
    # the div containing the main actors in the cast
    actors = td.find('div', {'itemprop': 'actors'})
    stars = ', '.join(actor.string for actor in actors.find_all('span', {'class': 'itemprop', 'itemprop': 'name'}))

    # Urlist-friendly line: *rating* - cast. plot
    print('*%s* - %s. %s' % (rating.strip(), stars, plot))
  except KeyboardInterrupt:
    # Ctrl+C ends the loop
    done = True
print()
print('bye!')

It’s super simple! It gets the page, finds the HTML source for what we’re looking for, and prints it out as formatted text that’s good for Urlist.

The way you find items with BeautifulSoup is relatively similar to what you do with jQuery: you look for elements in the DOM that contain what you’re looking for (to find what they are, just use your browser’s inspector… on Chrome, right click on the text and choose “Inspect element…” to see where it is in the DOM), and manipulate them as strings or arrays of strings.

Easy enough!

Logging in JavaScript and filtering by tag

I’m not a fan of loggers in general when it comes to simplicity. They all require you to spend (waste) some time understanding their configuration syntax, maybe dealing with XML files (what am I, a caveman? :P), and they often require you to adapt to their philosophy. By that I mean, there are plenty of questions on Stack Overflow similar to this, and I definitely share this guy’s feeling about the issue.

In JavaScript, I’ve seen some libraries for logging, and they all look quite complicated to set up. Or, quite complicated considering it’s JavaScript we’re talking about. JSLog doesn’t require that much configuration, but most of the time I don’t even care about log levels in JS “apps”; I just want to filter messages according to their context. So I may be interested in all ajax-related logs at some point, but not in DOM manipulation messages. At some other time, I may be interested in DOM manipulation logs, but not in ajax-related messages. These can be thought of as log tags (the “DOM” tag, the “ajax” tag, the “user-input” tag, and so on).

A quick and dirty way to deal with all of this is to define some Log functions named after the “context” in which they’re called, and assign them to a no-op function when you want to filter them out.

In Chrome, these could be the log functions (let’s say they’re stored in log.js):

function Log() {
}

Log.message = function(message) {
    console.log(message);
};

Log.dom = Log.ajax = Log.message;

Then, throughout your code, you may have calls like:

Log.dom('adding sidebar');
// in some other place...
Log.ajax('new msg received from server');

To disable, say, all ajax logs, all you need to do is change the Log.ajax function to a no-op in log.js, like this:

Log.ajax = function(){};

You can even do it live using the Chrome Developer Tools!
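If you find yourself flipping tags on and off often, the same idea can be wrapped in a tiny helper. This is only a sketch: setTag, makeTagLogger and the sink parameter are names I made up here, they are not part of log.js above, and the version below also prefixes each message with its tag.

```javascript
var Log = {};
var noop = function () {};

// Returns a logger that prefixes every message with its tag.
function makeTagLogger(tag, sink) {
    return function (message) {
        sink('[' + tag + '] ' + message);
    };
}

// Enables or disables a tag by swapping its function for a no-op.
// sink is optional and defaults to console.log.
function setTag(tag, enabled, sink) {
    Log[tag] = enabled ? makeTagLogger(tag, sink || console.log.bind(console)) : noop;
}

setTag('dom', true);
setTag('ajax', false);

Log.dom('adding sidebar');                 // logged, with a [dom] prefix
Log.ajax('new msg received from server');  // silently dropped
```

The calling code stays exactly the same as before; only the place where tags are switched on and off changes.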

So a question could be: aren’t those calls to a no-op function expensive? JavaScript function calls are expensive, right? Why not add a flag like this:

if (logAjaxEnabled) {
   Log.ajax('new msg received from server');
}

to all log calls?

Well, it’s a pain to add all those checks, that’s why 🙂

And your mileage may vary, but in all the browsers I’ve tested, calling a no-op is not that bad. It’s slower than using the flag, but it’s in the same ballpark (assuming you’re not writing some very CPU-intensive code), and it’s WAY less painful to use, because you don’t need to write all those if statements.

I put together a dumb benchmark on JSPerf to see what the performance drawback of this method is. You can try it!
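If you'd rather poke at it locally, a rough comparison could look like the sketch below. It is not the JSPerf test itself: the timeIt helper, the iteration count and the message strings are arbitrary choices of mine, and a loop this naive can be distorted by the JIT, so take the numbers with a grain of salt.

```javascript
var Log = { ajax: function () {} };   // logging disabled via a no-op
var logAjaxEnabled = false;           // logging disabled via a flag

// Runs fn(i) the given number of times and returns the elapsed milliseconds.
function timeIt(fn, iterations) {
    var start = Date.now();
    for (var i = 0; i < iterations; i++) {
        fn(i);
    }
    return Date.now() - start;
}

var N = 1e6;
var noopMs = timeIt(function (i) { Log.ajax('msg ' + i); }, N);
var flagMs = timeIt(function (i) { if (logAjaxEnabled) { Log.ajax('msg ' + i); } }, N);

console.log('no-op call: ' + noopMs + ' ms, flag check: ' + flagMs + ' ms');
```

One thing the sketch makes visible: with the no-op, the argument ('msg ' + i) is still built on every call, while the flag skips it entirely. If your log messages are expensive to build, that matters more than the function call itself.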

It’s also interesting to note how wildly browser performance varies for something like this... ah, the JS world!

I’m sure I’m not discovering anything here (I mean, just look at the first few links for a simple Google query like this), but I haven’t seen this approach being used that much (filtering by context, and not by log level). It’s something very close to what Android’s LogCat does, and I really like it!

Could not open the requested socket: Address already in use. Restart Jetty from Eclipse on Mac OSX

I’m used to hitting the Play button in Eclipse like hell when developing server apps, so I ran into this issue pretty quickly.

When you’re working with Google App Engine on Mac OSX, pressing that familiar green button after having deployed the app once makes Eclipse complain as in the title. The stop button is grayed out (as it’s controlling the latest instance of Jetty, which didn’t start) and you can’t launch your app without restarting Eclipse.

So, to kill the old Jetty instance you just open a terminal and type:

lsof -i TCP:8888 | grep java | grep LISTEN

Where 8888 is the port on which Jetty is listening (it could be 8080 or something else depending on your configuration), and the first grep is just to stay on the safe side (you don’t want to kill something else). If you’re sure that there’s nothing else listening on that port, just omit it.

The output will be something like

java    33873 myusername   68u  IPv6 0xffffff801a2c1510      0t0  TCP localhost:ddi-tcp-1 (LISTEN)

Then, just type

kill -15 33873

where 33873 is the number in the second column in the output of the previous command.

You can then run the project from Eclipse.

My routine is to keep a terminal window open and just run this one-liner when I run into the error:

kill -15 $(lsof -i TCP:8888 | grep java | grep LISTEN | awk '{ print $2 }')

which does exactly the same thing, but in an automated fashion… it’s just an arrow_up away! 🙂

Change page styles with Greasemonkey/Tampermonkey

This is not a guide, as you can find plenty of them on the web (well, at least for Greasemonkey)…

This is more of a quick and dirty solution to the problem “I just want this thing to be bigger/smaller/a different color” for some web page, and I’m highlighting the word “dirty” here 🙂

I’ll take feedly as an example.

I’m enjoying feedly as a replacement for Google Reader, but I can’t stand its oh-so-narrow central frame when I’m on a 1080p 24” screen.

This is how I “fixed” that.

First, you’ll want to produce the final result you’re aiming for with the Chrome/Firefox developer tools; in Chrome:

  1. right-click on the element whose look you want to change and choose “Inspect element”
  2. move the mouse pointer up and down in the developer tools frame until you see a blueish highlight over the element you want to edit
  3. take note of the element’s type (a <div>, a <p>, a <span>, an <img>, whatever it is), id or class
  4. use the developer tools to change its looks (just add a custom style on the right under element.style)

Google’s official tutorial on the subject is here.

Once you’ve got a decent-looking page (in my case I changed some width and max-width properties) you’re ready to create a Greasemonkey/Tampermonkey script that automatically applies those changes for you when you visit that page.

First, install Greasemonkey or Tampermonkey.

Then, create a new script (Tampermonkey has a small icon with a page and a green plus for that). In the header, the important tag is @match, which tells Tampermonkey which pages this script must apply to (in my case, http://cloud.feedly.com/*).

Then, you can copy & paste this piece of code (that I found on the web, it’s used by many user scripts):

function addGlobalStyle(css) {
    var head, style;
    head = document.getElementsByTagName('head')[0];
    if (!head) { return; }
    style = document.createElement('style');
    style.type = 'text/css';
    style.innerHTML = css;
    head.appendChild(style);
}

It’s a function you can call to add CSS rules to the page’s final CSS style.

Then, just call the function for all the styles you want to change using the CSS rules you found at step 3 before, in my case:

addGlobalStyle('.entryBody { max-width: 900px !important; }');
addGlobalStyle('#feedlyFrame { width: 1230px !important; }');
addGlobalStyle('#feedlyPage { width: 900px !important; }');
addGlobalStyle('.entryBody .content img { max-width: 850px !important; width: auto !important; height: auto !important; max-height: 600px !important;}');

All the !important markers are the dirty part: unless the page’s author used them herself (bad, bad author! :P), that flag ensures that your styles are applied, no matter what. The great thing about !important (which is also the very bad thing) is that it makes your styles override declarations even when they’re specified in a style attribute on the element itself!
For example, in:

<div class="wide" style="width: 800px;">

the width value is always overridden by a CSS rule like this:

.wide {
  width: 900px !important;
}

which is both awesome and awful depending on the context 🙂
Feedly has some style definitions like that, and that’s why I needed the !important flags.

Then, save the script and test it!
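Putting the pieces together, the whole userscript could look something like the sketch below. The script name is made up, @grant none is my assumption for a script that only injects CSS, and buildImportantCss is a hypothetical helper I wrote so that I don't have to type !important on every line. It is a variation on addGlobalStyle above, not a replacement for it.

```javascript
// ==UserScript==
// @name        Wider feedly (hypothetical name)
// @match       http://cloud.feedly.com/*
// @grant       none
// ==/UserScript==

// Turns { selector: { property: value } } into a CSS string,
// appending !important to every declaration.
function buildImportantCss(rules) {
    return Object.keys(rules).map(function (selector) {
        var decls = Object.keys(rules[selector]).map(function (prop) {
            return prop + ': ' + rules[selector][prop] + ' !important;';
        });
        return selector + ' { ' + decls.join(' ') + ' }';
    }).join('\n');
}

var css = buildImportantCss({
    '.entryBody': { 'max-width': '900px' },
    '#feedlyFrame': { 'width': '1230px' },
    '#feedlyPage': { 'width': '900px' }
});

// Only touch the DOM when actually running inside a page.
if (typeof document !== 'undefined' && document.head) {
    var style = document.createElement('style');
    style.textContent = css;
    document.head.appendChild(style);
}
```

Keeping the rules in one object makes it a bit easier to tweak values later without hunting through a pile of string literals.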

You can do a lot more than just stuffing your filthy CSS code in, of course. I found this great userscript that makes feedly look like Google Reader (isn’t that what we all want from an RSS reader?), and if you look at the code, the author works around the style problem by adding event listeners for DOMNodeInserted, because feedly builds its page with DOM manipulation performed by JavaScript. Much more sophisticated 🙂

The lesson here is: search userscripts.org first, and only then create your hacks!

Change default date range in Google Analytics with a Chrome Extension

[Update – May 2015] Updated description so that it matches the latest version of the extension on GitHub

[Update – September 2014] I moved the extension project to GitHub, and updated this post accordingly

This is a continuation from my previous post on the same subject.

I promised a Chrome Extension that opens the Google Analytics page and sets today’s date as the default date range. You can grab it from GitHub.

To add it to your Chrome:

  1. download and unzip the extension to some folder
  2. open your Chrome Extensions page (type chrome://extensions in the address bar or press the Settings button (top right) then Tools/Extensions)
  3. drag and drop the extracted ganalytics-lastDay.crx file to the Extensions page (a Drop to install message should appear)
  4. confirm the dialog

To configure the extension, simply open it and follow the instructions (which are the same as in my previous post). If you need to change the Analytics code at any moment, you just go back to the chrome://extensions page, find the extension and click on (the ridiculously small) Options button.

In case you want to play with date ranges, follow the instructions on the GitHub page.

The files you want to play with are background.js and conf.js, which both contain the getURL() function (duplicated, because using shared JS files in Chrome Extensions turned out to be a bit tricky). That function takes the portion of the URL manually pasted by the user and builds the full Analytics URL with it. As you can see, there are 2 variables involved: today and yesterday. You can change these dates using Date’s functions, like this:

var date = new Date(), today = '', oneMonthAgo = '';

today += date.getFullYear();
today += pad2(date.getMonth() + 1);
today += pad2(date.getDate());

date.setMonth(date.getMonth() - 1);
oneMonthAgo += date.getFullYear();
oneMonthAgo += pad2(date.getMonth() + 1);
oneMonthAgo += pad2(date.getDate());

return 'https://www.google.com/analytics/web/?#home/' + code +'/%3F_u.date00%3D' + oneMonthAgo + '%26_u.date01%3D' + today +'/=';
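One caveat with the snippet above: setMonth() rolls over when the target month is shorter, so on March 31 "one month ago" ends up in early March instead of late February. Here is a self-contained sketch that clamps the day instead. The pad2 below is my guess at what the extension's helper does (zero-padding to two digits), and formatDate, oneMonthBefore, buildUrl and the sample code value are names I made up for the example.

```javascript
// Zero-pads a number to two digits (assumed behaviour of the extension's pad2).
function pad2(n) {
    return (n < 10 ? '0' : '') + n;
}

// Formats a Date as YYYYMMDD, the format used by _u.date00 and _u.date01.
function formatDate(date) {
    return '' + date.getFullYear() + pad2(date.getMonth() + 1) + pad2(date.getDate());
}

// Returns a new Date one month before the given one, clamping the day so
// that March 31 becomes the last day of February instead of rolling over.
function oneMonthBefore(date) {
    var result = new Date(date.getTime());
    var day = result.getDate();
    result.setDate(1);                        // avoid overflow while changing month
    result.setMonth(result.getMonth() - 1);
    var lastDay = new Date(result.getFullYear(), result.getMonth() + 1, 0).getDate();
    result.setDate(Math.min(day, lastDay));
    return result;
}

// Hypothetical wrapper around the URL format shown above;
// code is the Analytics code pasted by the user.
function buildUrl(code, now) {
    var today = formatDate(now);
    var oneMonthAgo = formatDate(oneMonthBefore(now));
    return 'https://www.google.com/analytics/web/?#home/' + code +
        '/%3F_u.date00%3D' + oneMonthAgo + '%26_u.date01%3D' + today + '/=';
}

// 'a1w2p3' is a placeholder, not a real Analytics code.
console.log(buildUrl('a1w2p3', new Date(2015, 2, 31)));
```

The same helpers work for other ranges too: call setDate() with a different offset instead of oneMonthBefore() if you want, say, the last 7 days.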

Of course, you may also want to change the default landing page: just go to that page in Google Analytics and replace the ?#home part of the URL with whatever you want, for example:

return 'https://www.google.com/analytics/web/?#report/app-visitors-overview/' + code +'/%3F_u.date00%3D' + oneMonthAgo + '%26_u.date01%3D' + today +'/=';

Something that can also be useful for bookmarklets: if you monitor more than one website/app with Analytics, you may want to have a bookmark for each of them (or you may want to have the extension open Analytics for a specific webpage). Each webpage/app has its own code, so you can either paste the code for the webpage you want the extension to open, or maybe hardcode the correct combination of link and code on different bookmarklets.