Recently, I joined the Subkismet project, an open source, stand-alone comment spam filtering library for ASP.NET web applications founded by Phil Haack. My task is to write mechanisms for fighting trackback and pingback spam. More precisely, I will be writing two base classes for handling trackbacks and pingbacks that anyone can use in their own projects.

Before I got actively involved in Subkismet, I wrote a short paper on the principles of trackback spam fighting. These principles were originally used for BlogEngine.NET and are now also a part of Subkismet. When the classes are done, I will port the updated code back to BlogEngine.NET.

I thought that others might be able to make use of these principles and decided to share. Here it is:

Fight trackback spam

A trackback request is a standard POST request sent to a web server. It is similar to posting back a form on a webpage in that it also sends parameters with the request. These parameters are used by the receiver to handle the request and register the trackback. The parameters are:

id – the id of the post the request tries to send a trackback to
title – the title of the trackback
excerpt – the message the sender wants to send to the receiver
blog_name – the name of the sending blog
url – the url of the sender’s webpage containing the trackback link
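
As a rough illustration, the parameters above could be gathered into one small object before any analysis takes place. The class name TrackbackRequest and its members are hypothetical and not part of Subkismet. The sketch assumes the posted values arrive in a NameValueCollection, which is the type ASP.NET uses for Request.Form.

```csharp
using System;
using System.Collections.Specialized;

// Hypothetical container for the five trackback parameters listed above.
public sealed class TrackbackRequest
{
    public string Id;
    public string Title;
    public string Excerpt;
    public string BlogName;
    public string Url;

    // Reads the parameters from posted form values; missing values become empty strings.
    public static TrackbackRequest FromForm(NameValueCollection form)
    {
        TrackbackRequest request = new TrackbackRequest();
        request.Id = form["id"] ?? string.Empty;
        request.Title = form["title"] ?? string.Empty;
        request.Excerpt = form["excerpt"] ?? string.Empty;
        request.BlogName = form["blog_name"] ?? string.Empty;
        request.Url = form["url"] ?? string.Empty;
        return request;
    }

    // A sender that gives no absolute http(s) URL cannot be checked at all.
    public bool HasSenderUrl()
    {
        Uri parsed;
        return Uri.TryCreate(Url, UriKind.Absolute, out parsed)
            && (parsed.Scheme == Uri.UriSchemeHttp || parsed.Scheme == Uri.UriSchemeHttps);
    }
}
```

A request for which HasSenderUrl() returns false can be rejected straight away, because none of the checks described below can be carried out without the sender's URL.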

To fight spammers, we can analyse the information received in the request parameters above in many different ways. This document gives a basic introduction to that analysis and to the measures to take when the sender is a spammer.

Confirm the sender

When a trackback request is sent to a trackback-enabled website, the website can validate the sender before accepting the request. The sending website has to have a link to your website; otherwise it is not a valid trackback according to the specifications. To make sure that it does, you can follow these steps.

1: Trackback request received
2: Check the sending website for a link to your website
3: If the link is confirmed, register the trackback
4: If the link is NOT confirmed, end the response and send HTTP status code 404

The response has to end when the sender is not confirmed because there is no point in telling the spammer whether or not we actually support trackbacks. The clever solution is to send status code 404 back to the spammer, indicating that there is no point in trying again because the trackback handler does not exist.

Here is an example in C# 2.0 that shows how to examine the sender’s webpage:

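// Requires: using System.Net;
// Downloads the sender's page and checks that it mentions the receiving URL.
private bool IsSenderConfirmed(string sendingUrl, string receivingUrl)
{
  try
  {
    using (WebClient client = new WebClient())
    {
      string html = client.DownloadString(sendingUrl);
      // Case-insensitive search for our URL anywhere in the page source.
      return html.ToLowerInvariant().Contains(receivingUrl.ToLowerInvariant());
    }
  }
  catch (WebException)
  {
    // The page could not be fetched, so the sender cannot be confirmed.
    return false;
  }
}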

This technique is very basic, but it may be the most important factor in fighting spammers. However, link farms exist with the sole purpose of beating this approach, so there is a need to be even stricter.
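
One small way to tighten the check itself, offered only as a sketch: instead of searching the whole page source for the receiving URL, look only at the values of href attributes, so that a URL that merely appears as text does not count as a link. The class LinkCheck and the method HtmlLinksTo are hypothetical names, and the regular expression is a deliberately simple assumption that only handles quoted href values.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkCheck
{
    // Matches href="..." or href='...' and captures the attribute value.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase);

    // Returns true if any href in the HTML equals receivingUrl,
    // compared case-insensitively and ignoring a trailing slash.
    public static bool HtmlLinksTo(string html, string receivingUrl)
    {
        string target = receivingUrl.TrimEnd('/');
        foreach (Match match in HrefPattern.Matches(html))
        {
            string href = match.Groups[1].Value.TrimEnd('/');
            if (string.Equals(href, target, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
        return false;
    }
}
```

The html argument would be the string downloaded from the sender's page. This does not defeat link farms, which contain real links, but it does reject pages that only mention the URL in plain text.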

Restrict the number of allowed trackbacks

When a spammer finds a website that allows him to create trackback spam, he will keep on doing so with as many trackbacks as possible – maybe over time so you won’t notice it right away. That’s why it is very important to only allow one trackback per sender per post.

After the sender has been confirmed, the trackback handler must check whether another trackback from this sender has already been registered. If so, the sender must be rejected nicely, because he might not be a spammer.

Because a trackback spammer uses multiple websites, user agents and IP addresses to bypass spam filters, the handler must use all the information available and check each piece individually. Two different spam requests might come from the same IP address, but with different referring websites. Make sure to check both.
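
A minimal sketch of such a check, assuming an in-memory store and assuming that "sender" means the host name of the sending URL: the class TrackbackRegistry and its method TryRegister are hypothetical names. The website and the IP address are recorded separately for each post, so a repeat of either one is enough to decline the request.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory registry; a real handler would store this with the post.
public sealed class TrackbackRegistry
{
    private readonly HashSet<string> seen =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Returns true and records the sender if neither its host nor its IP
    // address has been seen for this post; returns false otherwise.
    // Assumes senderUrl is a valid absolute URL (new Uri throws if not).
    public bool TryRegister(string postId, string senderUrl, string senderIp)
    {
        string hostKey = postId + "|host|" + new Uri(senderUrl).Host;
        string ipKey = postId + "|ip|" + senderIp;

        if (seen.Contains(hostKey) || seen.Contains(ipKey))
        {
            return false;
        }

        seen.Add(hostKey);
        seen.Add(ipKey);
        return true;
    }
}
```

Keying on the host rather than the full URL is a design choice: it stops one website from sending a different page for every trackback, at the cost of also declining a second legitimate post from the same blog.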

Now the flow looks like this:

1: Trackback request received
2: Check the sending website for a link
3: If the link is NOT confirmed, end the response and send HTTP status code 404
4: If the link is confirmed, check whether the sender has been registered before
5: If the sender has been registered before, nicely decline the request according to the specs
6: If the sender has NOT been registered before, register the trackback according to the specs
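
For reference, registering or declining "according to the specs" means answering with a small XML document. As I understand the TrackBack specification, an error value of 0 signals success, and an error value of 1 is accompanied by a message. The following is a sketch of building both responses; the class TrackbackResponse and its method names are hypothetical.

```csharp
using System;
using System.Security;

public static class TrackbackResponse
{
    private const string Header = "<?xml version=\"1.0\" encoding=\"utf-8\"?>";

    // Success response: error code 0 and no message.
    public static string Success()
    {
        return Header + "<response><error>0</error></response>";
    }

    // Failure response: error code 1 plus an XML-escaped, human-readable message.
    public static string Failure(string message)
    {
        return Header
            + "<response><error>1</error><message>"
            + SecurityElement.Escape(message)
            + "</message></response>";
    }
}
```

A polite decline for a sender that is already registered could then be TrackbackResponse.Failure("A trackback from this site is already registered for this post").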

Check for URLs

The request’s excerpt – the trackback message – has to be checked for suspicious content. A spammer always tries to send URLs so that your visitors might click on them. That’s the purpose of trackback spam. If the handler receives an excerpt with a URL, it raises the chances of the sender being a spammer, but it is not a certainty. If it receives 2 or more URLs, then the sender is almost certainly a spammer and should be rejected.

You can use this method to determine how many URLs the excerpt contains:

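// Requires: using System.Text.RegularExpressions;
// Counts URLs in the excerpt that start with http://, https:// or www.
private static int UrlCount(string excerpt)
{
  string pattern = "((https?://|www\\.)([A-Z0-9.-]{1,})\\.[0-9A-Z?&=\\-_\\./]{2,})";
  Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
  return regex.Matches(excerpt).Count;
}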

If a URL is embedded in an HTML link tag (<a href="example.com">link text</a>), the sender is certainly a spammer. No blog engine sends HTML in the trackback message, so this is a clear indication that it was sent by a spammer.

To find out if the excerpt contains HTML, you can use this method:

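// Requires: using System.Text.RegularExpressions;
// Returns true if the excerpt contains anything that looks like an HTML tag.
private static bool ContainHtml(string excerpt)
{
  string pattern = @"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>";
  Regex regex = new Regex(pattern, RegexOptions.Singleline);
  return regex.IsMatch(excerpt);
}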

The flow now looks like this:

1: Trackback request received
2: Check the sending website for a link
3: If the link is NOT confirmed, end the response and send HTTP status code 404
4: If the link is confirmed, check whether the sender has been registered before
5: If the sender has been registered before, nicely decline the request according to the specs
6: If the sender has NOT been registered before, check the excerpt for URLs and HTML
7: If the excerpt contains 2+ URLs or any HTML, end the response and send HTTP status code 404
8: Otherwise, register the trackback according to the specs
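
The steps above can be sketched as a single decision function. This reads the flow as a chain of checks in which the trackback is registered only after every check has passed, which is my interpretation of the list. The enum TrackbackDecision and the class TrackbackFlow are hypothetical names; the four arguments stand for the results of the link check, the duplicate check, the URL count and the HTML check described earlier.

```csharp
using System;

// The three possible outcomes of handling a trackback request.
public enum TrackbackDecision
{
    Register,  // accept and store the trackback
    Decline,   // answer politely with an error response
    NotFound   // end the response with HTTP status code 404
}

public static class TrackbackFlow
{
    public static TrackbackDecision Decide(
        bool linkConfirmed, bool alreadyRegistered, int urlCount, bool containsHtml)
    {
        // No link back to us: pretend the handler does not exist.
        if (!linkConfirmed) return TrackbackDecision.NotFound;

        // A confirmed sender that repeats itself may be honest, so decline nicely.
        if (alreadyRegistered) return TrackbackDecision.Decline;

        // Two or more URLs, or any HTML, in the excerpt is treated as spam.
        if (urlCount >= 2 || containsHtml) return TrackbackDecision.NotFound;

        return TrackbackDecision.Register;
    }
}
```

An excerpt with exactly one URL is still registered in this sketch, because a single URL only raises suspicion and is not a certainty.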

If you have any other ideas for fighting trackback spam, please tell me so we can make Subkismet as bulletproof as possible.
