
Detecting spam from http headers?

Greg Linden describes a paper about finding spam simply by inspecting the returned http headers:
In our proposed approach, the [crawler] only reads the response line and HTTP session headers ... then ... employs a classifier to evaluate the headers ... If the headers are classified as spam, the [crawler] closes the connection ... [and] ignores the [content] ... saving valuable bandwidth and storage.

We were able to detect 88.2% of the Web spam pages with a false positive rate of only 0.4% ... while only adding an average of 101 [microseconds] to each HTTP retrieval operation .... [and saving] an average of 15.4K of bandwidth and storage.
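The mechanics they describe are simple enough to sketch. This is my paraphrase of the idea, not the authors' code, and `looks_spammy` is a placeholder where their trained classifier would go:

```python
# Sketch of the paper's header-only filtering idea (my paraphrase, not the
# authors' code): read just the status line and headers, classify, and close
# the connection before the body is fetched -- saving that bandwidth.
import http.client

def looks_spammy(headers):
    # Placeholder; the paper trains a real classifier on labeled data.
    return False

def fetch_if_clean(host, path="/"):
    conn = http.client.HTTPConnection(host, timeout=10)
    conn.request("GET", path)
    resp = conn.getresponse()            # status line + headers parsed here
    headers = dict(resp.getheaders())
    if looks_spammy(headers):
        conn.close()                     # ...and the body is never read
        return None
    body = resp.read()                   # only clean responses cost us bandwidth
    conn.close()
    return body
```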

After running web crawls for the past year and finding all manner of spam, I have to say I'm skeptical this technique would really catch much spam on the actual web. Among the top 10 http header features they identify as spam-predictors are:

  • Accept-Ranges: bytes
  • Content-Type: text/html; charset=iso-8859-1
  • Server: Fedora
  • X-powered-by: php/4
  • 64.225.154.135

These are pretty standard-looking headers. Let's look at some actual spam though and see if we can see anything funny.

$ curl -I http://www.fancieface.com/
HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 19:13:11 GMT
Server: Apache/1.3.26 (Unix) mod_ssl/2.8.12 OpenSSL/0.9.6b
Last-Modified: Tue, 21 Oct 2008 11:51:10 GMT
ETag: "2081cc-ba62-48fdc22e"
Accept-Ranges: bytes
Content-Length: 47714
Content-Type: text/html

Very spammy site, but totally vanilla headers. How about some rolex watch spam:

$ curl -I http://superjewelryguide.com/300.html
HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 17:48:26 GMT
Server: Apache
X-Powered-By: PHP/5.2.6
Content-Type: text/html

Again, pretty vanilla. Plus this technique isn't going to work at all for spam hosted within trusted domains. Here's some cialis spam smeared onto a my.nbc.com page:

$ curl -I http://my.nbc.com/blogs/GaryRobinson/main/2008/10/13/cialis-cheapest-cialis-pills-here
HTTP/1.1 200 OK
Server: Apache/2.2.0 (Unix) DAV/2 PHP/5.1.6
X-Powered-By: PHP/5.1.6
Wirt: (null)
Content-Type: text/html
Expires: Sat, 22 Nov 2008 19:16:33 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Sat, 22 Nov 2008 19:16:33 GMT
Content-Length: 0
Connection: keep-alive
Set-Cookie: pers_cookie_insert_nbc.com_app1_prod_80=1572983360.20480.0000;
        expires=Sat, 22-Nov-2008 23:16:33 GMT; path=/

Trusted domain, but very fishy headers! :-)
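The fishiness here is things like `Wirt: (null)` and a zero Content-Length on a 200 response. One crude tripwire, purely my own illustration and not anything the paper proposes, is to flag header names that fall outside a whitelist of common ones:

```python
# Crude tripwire (my own illustration, not the paper's method): flag any
# header name outside a whitelist of common ones. "Wirt" would be flagged.
COMMON_HEADERS = {
    "date", "server", "last-modified", "etag", "accept-ranges",
    "content-length", "content-type", "x-powered-by", "expires",
    "cache-control", "pragma", "connection", "set-cookie", "keep-alive",
}

def unusual_headers(headers):
    return [name for name in headers if name.lower() not in COMMON_HEADERS]
```

Of course this catches only the sloppiest setups; the fancieface.com and superjewelryguide.com responses above would sail right through.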

It's incredibly difficult to get a high-quality random sample of the web. You can't factor crawler-strategy bias out of the sample, and any small sample is not necessarily going to be very representative.

If the researchers did get good coverage from quirky headers and even individual IP addresses, I suspect that the crawl they're using may be over-weighted with pages from a few servers that spewed out a lot of URLs/virtual hosts.

Comments (4)

Pete:

Choice of headers sounds pretty shifty to me.

Al:

Classifying spam based on HTTP headers seems like a short-term solution to the problem; the day spammers identify what makes their headers spammy, they'll change their headers to conform with the majority.

What about the millions of pages of spam on the internet published through vanilla cookie-cutter CMSes such as WordPress, Drupal, MT, or the like? It'd hardly seem fair to classify this site as spam simply because it publishes the same HTTP headers as some spammy MT site.

Gom:

They used the document header, and not just the server header.

That's not what the paper says or describes.


About

This page contains a single entry from the blog posted on November 22, 2008 11:22 AM.

The previous post in this blog was Thank heaven for tax refunds.

The next post in this blog is The news medium has a message: "Goodbye".

Many more can be found on the main index page or by looking through the archives.

Powered by
Movable Type 3.33