
November 2008 Archives

November 2, 2008

Lucy on Elections

It's hard being a campaign worker.
We're completely at the mercy of our candidate.
We do all the work, and the candidate gets all the credit.
We ring doorbells, and make the posters, and build up the candidate's image.
And then he says something stupid, and ruins everything we've done.

The next time I do any campaigning, it's going to be for myself!

      -- Lucy, You're (not) elected, Charlie Brown

November 14, 2008

Cold calls, cold response

Every few days cold-calling salespeople show up at our office unannounced to pitch us on insurance, lease deals, laser toner, office supplies, VoIP plans, bottled water, etc.

We have an open office. So when they enter, 11 people immediately look up at them. This can apparently be somewhat intimidating, based on their flummoxed reactions. They usually ask for a business card so they can call us later. I sometimes offer them mine, since my card doesn't have a phone number on it. Then they beat a hasty retreat.

Lately we've been trying a new tactic - not acking their presence when they come in. There's no receptionist (of course), and it's not clear who they should attempt to speak with. None of us really want to listen to their pitch or take their flier anyway, so playing the game of chicken with the other folks in the office sort of emerged as a default behavior. Who will be the first to crack at their nervousness, make eye contact, and thus become the dupe left holding the flier or handing out their business card?

I almost feel sorry for them. Almost!

November 21, 2008

Thank heaven for tax refunds

In 2000, before the dot-com meltdown, I bought a few cases of French Bordeaux. Even though I like Bordeaux, it half-seemed like a silly purchase at the time. But when the wine arrived I was happy: the Bordeaux had risen in value since I purchased it, while my accounts had gone down in the meantime thanks to the stock market death-spiral. Win, sorta.

Unfortunately there was also a BMW 540 that I decided was too indulgent to buy and passed on. Afterward I kicked myself -- it would have been free. I would have exercised some Netscape options I had to buy it. Instead I held onto them, and they eventually declined in value until they were worthless. I should have bought the car!

I saw a joke circulating at the time that beer would have yielded a better return than some stocks. The beer bottles could be returned for the 5-cent deposit, while the stocks became worthless. Plus you would get to drink the beer.

Now we're going through it again, but even worse. The banker line now is that it's not the return on your capital that you should be worried about, it's the return of your capital.

I just got a state of California tax refund check. Normally it's inefficient to pay too much withholding, since you're essentially lending the government your money interest-free until tax time. In this case though it turned out to be a decent investment. :-|

November 22, 2008

Detecting spam from HTTP headers?

Greg Linden describes a paper about finding spam simply by inspecting the returned HTTP headers:
In our proposed approach, the [crawler] only reads the response line and HTTP session headers ... then ... employs a classifier to evaluate the headers ... If the headers are classified as spam, the [crawler] closes the connection ... [and] ignores the [content] ... saving valuable bandwidth and storage.

We were able to detect 88.2% of the Web spam pages with a false positive rate of only 0.4% ... while only adding an average of 101 [microseconds] to each HTTP retrieval operation .... [and saving] an average of 15.4K of bandwidth and storage.
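Here's a rough Python sketch of the crawler loop the quote describes: read the status line and headers, ask a classifier, and only pull the body if the headers pass. This is an illustration, not the paper's code. The function name `fetch_unless_spam` and the `is_spam` callback are made up for the example, and the classifier itself is left to the caller.

```python
import http.client
from typing import Callable, Dict, Optional

# Hypothetical classifier signature: (status code, lowercased headers) -> bool.
HeaderClassifier = Callable[[int, Dict[str, str]], bool]


def fetch_unless_spam(
    conn: http.client.HTTPConnection,
    path: str,
    is_spam: HeaderClassifier,
) -> Optional[bytes]:
    """Request `path`; return the body, or None if the headers look like spam."""
    try:
        conn.request("GET", path)
        # getresponse() parses the status line and headers. The body is
        # only consumed when read() is called.
        resp = conn.getresponse()
        # Repeated header names collapse to the last value here, which is
        # good enough for a sketch.
        headers = {name.lower(): value for name, value in resp.getheaders()}
        if is_spam(resp.status, headers):
            return None  # skip the body entirely
        return resp.read()
    finally:
        # Closing the connection drops whatever part of the body was not read.
        conn.close()
```

Usage would look something like `fetch_unless_spam(http.client.HTTPConnection(host, timeout=10), "/", my_classifier)`, where `my_classifier` is whatever model you trust to judge headers.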

After running web crawls for the past year and finding all manner of spam, I have to say I'm skeptical this technique would really catch much spam on the actual web. Among the top 10 HTTP header features they identify as spam-predictors are:

  • Accept-Ranges: bytes
  • Content-Type: text/html; charset=iso-8859-1
  • Server: Fedora
  • X-powered-by: php/4

These are pretty standard-looking headers. Let's look at some actual spam though and see if we can see anything funny.

$ curl -I http://www.fancieface.com/
HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 19:13:11 GMT
Server: Apache/1.3.26 (Unix) mod_ssl/2.8.12 OpenSSL/0.9.6b
Last-Modified: Tue, 21 Oct 2008 11:51:10 GMT
ETag: "2081cc-ba62-48fdc22e"
Accept-Ranges: bytes
Content-Length: 47714
Content-Type: text/html

Very spammy site, but totally vanilla headers. How about some Rolex watch spam:

$ curl -I http://superjewelryguide.com/300.html
HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 17:48:26 GMT
Server: Apache
X-Powered-By: PHP/5.2.6
Content-Type: text/html

Again, pretty vanilla. Plus this technique isn't going to work at all for spam hosted within trusted domains. Here's some Cialis spam smeared onto a my.nbc.com page:

$ curl -I http://my.nbc.com/blogs/GaryRobinson/main/2008/10/13/cialis-cheapest-cialis-pills-here
HTTP/1.1 200 OK
Server: Apache/2.2.0 (Unix) DAV/2 PHP/5.1.6
X-Powered-By: PHP/5.1.6
Wirt: (null)
Content-Type: text/html
Expires: Sat, 22 Nov 2008 19:16:33 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Sat, 22 Nov 2008 19:16:33 GMT
Content-Length: 0
Connection: keep-alive
Set-Cookie: pers_cookie_insert_nbc.com_app1_prod_80=1572983360.20480.0000;
        expires=Sat, 22-Nov-2008 23:16:33 GMT; path=/

Trusted domain, but very fishy headers! :-)
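To make the comparison concrete, here's a small Python sketch that checks the three responses above against the four header features listed earlier. The matching rules (exact value, substring, or prefix, all case-insensitive) are my assumption; the quoted list doesn't say how the paper matches them. Only the headers relevant to those four features are copied from the curl output.

```python
from typing import Dict, List


def matching_features(headers: Dict[str, str]) -> List[str]:
    """Return which of the four quoted spam-predictor features are present."""
    h = {name.lower(): value.strip().lower() for name, value in headers.items()}
    hits = []
    if h.get("accept-ranges") == "bytes":
        hits.append("Accept-Ranges: bytes")
    if h.get("content-type") == "text/html; charset=iso-8859-1":
        hits.append("Content-Type: text/html; charset=iso-8859-1")
    # Assumption: "Server: Fedora" means the Server value mentions Fedora.
    if "fedora" in h.get("server", ""):
        hits.append("Server: Fedora")
    # Assumption: "php/4" means any PHP 4.x version string.
    if h.get("x-powered-by", "").startswith("php/4"):
        hits.append("X-powered-by: php/4")
    return hits


fancieface = {
    "Server": "Apache/1.3.26 (Unix) mod_ssl/2.8.12 OpenSSL/0.9.6b",
    "Accept-Ranges": "bytes",
    "Content-Type": "text/html",
}
superjewelry = {
    "Server": "Apache",
    "X-Powered-By": "PHP/5.2.6",
    "Content-Type": "text/html",
}
nbc = {
    "Server": "Apache/2.2.0 (Unix) DAV/2 PHP/5.1.6",
    "X-Powered-By": "PHP/5.1.6",
    "Content-Type": "text/html",
}

for site, headers in [
    ("fancieface.com", fancieface),
    ("superjewelryguide.com", superjewelry),
    ("my.nbc.com", nbc),
]:
    print(f"{site}: {matching_features(headers)}")
```

Under those assumptions, the only hit across all three spam pages is `Accept-Ranges: bytes` on the first one. The other two match none of the four.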

It's incredibly difficult to get a high quality random sample of the web. You can't factor crawler strategy bias out of the sample, and any small sample is not necessarily going to be very representative.

If the researchers did find good coverage with quirky headers and even individual IP addresses, I suspect that the crawl they're using may be over-weighted in pages from a few servers that spewed out a lot of URLs/virtual hosts.
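One way to sanity-check a crawl sample for that kind of skew is to count how much of it comes from each host. A minimal Python sketch follows; the function name `host_shares` and the `.example` URLs are made up for illustration and are not data from the paper or from any real crawl.

```python
from collections import Counter
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def host_shares(urls: Iterable[str], top: int = 3) -> List[Tuple[str, float]]:
    """Return the `top` hostnames with the fraction of the sample each supplies."""
    hosts = Counter(urlsplit(url).hostname for url in urls)
    total = sum(hosts.values())
    return [(host, count / total) for host, count in hosts.most_common(top)]


# Placeholder sample: three URLs from one host, one from another.
sample = [
    "http://farm.example/page1",
    "http://farm.example/page2",
    "http://farm.example/page3",
    "http://other.example/",
]
print(host_shares(sample))
```

If a handful of hosts account for most of the sample, per-server header quirks will look like strong spam signals. Note that grouping by hostname won't catch many virtual hosts sharing one server; for that you'd want to group by resolved IP address instead, which this sketch doesn't do.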

About November 2008

This page contains all entries posted to Skrentablog in November 2008. They are listed from oldest to newest.
