In our proposed approach, the [crawler] only reads the response line and HTTP session headers ... then ... employs a classifier to evaluate the headers ... If the headers are classified as spam, the [crawler] closes the connection ... [and] ignores the [content] ... saving valuable bandwidth and storage.
We were able to detect 88.2% of the Web spam pages with a false positive rate of only 0.4% ... while only adding an average of 101 [microseconds] to each HTTP retrieval operation .... [and saving] an average of 15.4K of bandwidth and storage.
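The mechanics of what they describe are easy to sketch: read only the status line and headers, classify, and drop the connection before the body arrives. Here's a minimal, hypothetical Python version — the paper's classifier is learned from many features, while this stand-in just hard-codes a couple of header values for illustration:

```python
import http.client

# Hypothetical stand-in for the paper's learned classifier. The real
# system scores many header-derived features; this toy just checks a
# couple of literal (name, value) pairs against a blocklist.
def headers_look_spammy(headers):
    suspicious = {("x-powered-by", "php/4"), ("server", "fedora")}
    return any((k.lower(), v.lower()) in suspicious for k, v in headers)

def fetch_unless_spam(host, path="/"):
    conn = http.client.HTTPConnection(host, timeout=10)
    conn.request("GET", path)
    resp = conn.getresponse()          # status line + headers only so far
    if headers_look_spammy(resp.getheaders()):
        conn.close()                   # never read the body
        return None                    # bandwidth and storage "saved"
    body = resp.read()
    conn.close()
    return body
```

The claimed savings come entirely from that early `conn.close()`: the body bytes are never pulled off the socket or written to storage.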
After running web crawls for the past year and finding all manner of spam, I have to say I'm skeptical this technique would really catch much spam on the actual web. Among the top 10 HTTP header features they identify as spam predictors are:
- Accept-Ranges: bytes
- Content-Type: text/html; charset=iso-8859-1
- Server: Fedora
- X-powered-by: php/4
These are pretty standard-looking headers. Let's look at some actual spam though and see if we can see anything funny.
$ curl -I http://www.fancieface.com/
HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 19:13:11 GMT
Server: Apache/1.3.26 (Unix) mod_ssl/2.8.12 OpenSSL/0.9.6b
Last-Modified: Tue, 21 Oct 2008 11:51:10 GMT
ETag: "2081cc-ba62-48fdc22e"
Accept-Ranges: bytes
Content-Length: 47714
Content-Type: text/html
Very spammy site, but totally vanilla headers. How about some Rolex watch spam:
$ curl -I http://superjewelryguide.com/300.html
HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 17:48:26 GMT
Server: Apache
X-Powered-By: PHP/5.2.6
Content-Type: text/html
Again, pretty vanilla. Plus, this technique isn't going to work at all for spam hosted within trusted domains. Here's some cialis spam smeared onto a my.nbc.com page:
$ curl -I http://my.nbc.com/blogs/GaryRobinson/main/2008/10/13/cialis-cheapest-cialis-pills-here
HTTP/1.1 200 OK
Server: Apache/2.2.0 (Unix) DAV/2 PHP/5.1.6
X-Powered-By: PHP/5.1.6
Wirt: (null)
Content-Type: text/html
Expires: Sat, 22 Nov 2008 19:16:33 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Sat, 22 Nov 2008 19:16:33 GMT
Content-Length: 0
Connection: keep-alive
Set-Cookie: pers_cookie_insert_nbc.com_app1_prod_80=1572983360.20480.0000; expires=Sat, 22-Nov-2008 23:16:33 GMT; path=/
Spam on a trusted domain, but very fishy headers! :-)
It's incredibly difficult to get a high-quality random sample of the web. You can't factor crawler-strategy bias out of the sample, and any small sample isn't necessarily going to be very representative.
If the researchers did find good coverage with quirky headers and even individual IP addresses, I suspect that the crawl they're using may be over-weighted with pages from a few servers that spewed out a lot of URLs/virtual hosts.
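One cheap way to check for that kind of skew is to look at how concentrated a crawl sample is per host (or, better, per resolved IP, which also collapses virtual hosts). A toy sketch with made-up URLs standing in for a real crawl log:

```python
from collections import Counter
from urllib.parse import urlsplit

# Made-up crawl sample standing in for a real crawl log.
crawl = [
    "http://a.example.com/page1",
    "http://b.example.com/page1",
    "http://spamfarm.example/host1/p1",
    "http://spamfarm.example/host1/p2",
    "http://spamfarm.example/host2/p1",
]

# Count pages per hostname. If one server dominates the sample, any
# header or IP feature it happens to have will look far more predictive
# than it would be on the web at large.
hosts = Counter(urlsplit(u).hostname for u in crawl)
top_host, top_count = hosts.most_common(1)[0]
share = top_count / len(crawl)
print(top_host, share)  # spamfarm.example alone is 60% of this toy sample
```

In a real audit you'd key on resolved IPs rather than hostnames, since URL farms often spread thousands of virtual hosts across a handful of machines.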