Greg Linden
describes a paper about
finding spam simply by inspecting the returned HTTP headers:
In our proposed approach, the [crawler] only reads the response line and HTTP session headers ... then ... employs a classifier to evaluate the headers ... If the headers are classified as spam, the [crawler] closes the connection ... [and] ignores the [content] ... saving valuable bandwidth and storage.
We were able to detect 88.2% of the Web spam pages with a false positive rate of only 0.4% ... while only adding an average of 101 [microseconds] to each HTTP retrieval operation .... [and saving] an average of 15.4K of bandwidth and storage.
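To make that flow concrete, here's a rough sketch of a header-only fetch in Python. This is my own illustration, not the paper's code: `fetch_unless_spam`, `open_connection`, and the `looks_like_spam` callback are hypothetical names, and the classifier itself is left as a stand-in. It relies on the standard `http.client` behavior that `getresponse()` parses the status line and headers, while the body is only pulled in when `read()` is called.

```python
import http.client


def open_connection(host):
    """Hypothetical helper: open a plain HTTP connection with a timeout."""
    return http.client.HTTPConnection(host, timeout=10)


def fetch_unless_spam(conn, path, looks_like_spam):
    """Fetch `path` over `conn`, skipping the body if the headers look spammy.

    `looks_like_spam(status, headers)` is a stand-in for a classifier;
    the paper's actual classifier is not reproduced here.
    """
    conn.request("GET", path)
    resp = conn.getresponse()  # parses the status line and headers only
    headers = {name.lower(): value for name, value in resp.getheaders()}
    if looks_like_spam(resp.status, headers):
        conn.close()  # drop the connection without reading the body
        return None
    return resp.read()  # headers passed, so download the body


# Example use (needs network access, so it is left commented out):
# conn = open_connection("example.com")
# body = fetch_unless_spam(conn, "/", lambda status, headers: False)
```

One caveat: by the time the headers are parsed, some of the body may already be sitting in the socket buffer, so the actual bandwidth saved by a sketch like this would vary from page to page.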
After running web crawls for the past year and finding all manner of spam,
I have to say I'm skeptical this technique would really catch much spam
on the actual web. Among the top 10 HTTP header features they identify
as spam-predictors are:
- Accept-Ranges: bytes
- Content-Type: text/html; charset=iso-8859-1
- Server: Fedora
- X-powered-by: php/4
- 64.225.154.135
These are pretty standard-looking headers. Let's look at some actual spam, though, and see if anything looks funny.
$ curl -I http://www.fancieface.com/
HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 19:13:11 GMT
Server: Apache/1.3.26 (Unix) mod_ssl/2.8.12 OpenSSL/0.9.6b
Last-Modified: Tue, 21 Oct 2008 11:51:10 GMT
ETag: "2081cc-ba62-48fdc22e"
Accept-Ranges: bytes
Content-Length: 47714
Content-Type: text/html
Very spammy site, but totally vanilla headers. How about some rolex watch spam:
$ curl -I http://superjewelryguide.com/300.html
HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 17:48:26 GMT
Server: Apache
X-Powered-By: PHP/5.2.6
Content-Type: text/html
Again, pretty vanilla. Plus this technique isn't going to work at all for
spam hosted within trusted domains. Here's some cialis spam smeared onto
a my.nbc.com page:
$ curl -I http://my.nbc.com/blogs/GaryRobinson/main/2008/10/13/cialis-cheapest-cialis-pills-here
HTTP/1.1 200 OK
Server: Apache/2.2.0 (Unix) DAV/2 PHP/5.1.6
X-Powered-By: PHP/5.1.6
Wirt: (null)
Content-Type: text/html
Expires: Sat, 22 Nov 2008 19:16:33 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Sat, 22 Nov 2008 19:16:33 GMT
Content-Length: 0
Connection: keep-alive
Set-Cookie: pers_cookie_insert_nbc.com_app1_prod_80=1572983360.20480.0000; expires=Sat, 22-Nov-2008 23:16:33 GMT; path=/
but very fishy headers! :-)
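For a quick tally, here's a small sketch that counts how many of the header features listed earlier show up in the three responses above. The matching rule (a case-insensitive substring match on the header value) is my assumption, not necessarily how the paper defines its features, and the IP address feature is left out because `curl -I` output doesn't show it. The helper names are made up for illustration.

```python
# Header features quoted earlier from the paper's top-10 list.
# Assumption: a feature "matches" if its value occurs, case-insensitively,
# inside the value of the header with the same name.
SPAM_FEATURES = [
    ("accept-ranges", "bytes"),
    ("content-type", "text/html; charset=iso-8859-1"),
    ("server", "fedora"),
    ("x-powered-by", "php/4"),
]

FANCIEFACE = """HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 19:13:11 GMT
Server: Apache/1.3.26 (Unix) mod_ssl/2.8.12 OpenSSL/0.9.6b
Last-Modified: Tue, 21 Oct 2008 11:51:10 GMT
ETag: "2081cc-ba62-48fdc22e"
Accept-Ranges: bytes
Content-Length: 47714
Content-Type: text/html"""

SUPERJEWELRY = """HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 17:48:26 GMT
Server: Apache
X-Powered-By: PHP/5.2.6
Content-Type: text/html"""

# Trimmed to the headers relevant to the feature list.
MY_NBC = """HTTP/1.1 200 OK
Server: Apache/2.2.0 (Unix) DAV/2 PHP/5.1.6
X-Powered-By: PHP/5.1.6
Wirt: (null)
Content-Type: text/html
Content-Length: 0"""


def parse_headers(raw):
    """Parse `curl -I` style output into a {lowercased name: value} dict."""
    headers = {}
    for line in raw.strip().splitlines()[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def matched_features(headers):
    """Return the listed features that match the parsed headers."""
    return [
        (name, value)
        for name, value in SPAM_FEATURES
        if value in headers.get(name, "").lower()
    ]


for label, raw in [
    ("fancieface", FANCIEFACE),
    ("superjewelryguide", SUPERJEWELRY),
    ("my.nbc.com", MY_NBC),
]:
    print(label, len(matched_features(parse_headers(raw))))
# prints:
# fancieface 1
# superjewelryguide 0
# my.nbc.com 0
```

Under that matching rule, only `Accept-Ranges: bytes` fires, and only on the first site.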
It's incredibly difficult to get a high-quality random sample of the web.
You can't factor crawler strategy bias out of the sample, and any small
sample is not necessarily going to be very representative.
If the researchers did find good coverage with quirky headers and even
individual IP addresses, I suspect that the crawl they're using may
be over-weighted in pages from a few servers that spewed out a lot of
URLs/virtual hosts.
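One way to check a crawl for that kind of skew is to group the fetched pages by server IP rather than by hostname, since many virtual hosts can sit on one machine. Here's a minimal sketch, assuming the crawl log is available as `(url, ip)` pairs. That record format and the name `top_server_share` are my own invention, and the addresses in the example are placeholders from the documentation ranges.

```python
from collections import Counter


def top_server_share(records, k=3):
    """Return the fraction of crawled pages served by the k busiest IPs.

    `records` is an iterable of (url, ip) pairs, a hypothetical crawl-log
    format. A value near 1.0 means a few servers dominate the sample.
    """
    per_ip = Counter(ip for _url, ip in records)
    total = sum(per_ip.values())
    if total == 0:
        return 0.0
    return sum(count for _ip, count in per_ip.most_common(k)) / total


# Made-up sample: eight pages from one server, one each from two others.
sample = [(f"http://spam{i}.example.com/", "192.0.2.1") for i in range(8)]
sample += [
    ("http://a.example.org/", "192.0.2.2"),
    ("http://b.example.net/", "192.0.2.3"),
]
print(top_server_share(sample, k=1))  # prints 0.8
```

If the top handful of IPs account for most of the spam-labeled pages, then features like a specific IP address or server banner will look far more predictive than they would on a broader sample.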