
'tie' considered harmful

Something has always left me uneasy about the 'tie' feature in perl, and I've been trying to reconcile it with my evolving view of programmer-system productivity.

To productively use a feature, like multi-process append to the same file, you have to understand the underlying performance and reliability behavior. Append is going to work great for 50 apache processes appending lines to a common log file without locking, but not for 2 processes appending 25k chunks to the same file, since the large writes will interleave and corrupt each other. If you understand how unix's write-with-append semantics work, you can get away with very fast updates to lots of little files without paying any locking penalties (twitter should probably have done something like this).
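
For the curious, here's a minimal sketch of that append pattern in perl (the log file name is made up). With O_APPEND, each small write lands atomically at the current end of the file, so many processes can share one log without locking, as long as each write is a single short line and not a big chunk:

    use strict;
    use warnings;
    use Fcntl qw(O_WRONLY O_APPEND O_CREAT);

    # Append mode: the kernel positions every write at end-of-file.
    sysopen(my $log, 'app.log', O_WRONLY | O_APPEND | O_CREAT, 0644)
        or die "can't open app.log: $!";
    syswrite($log, "$$ handled a request at " . time() . "\n");
    close($log);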

Similarly, when you see %foo in perl, you instantly know the perf footprint. It's an in-memory hash, it's going to be fast, and you won't get into trouble unless you find a corner like making a zillion hashes-of-hashes and then discover that there's a 200-300 byte overhead for each one.
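
If you want to see that overhead for yourself, a quick sketch with the Devel::Size module from CPAN will show it (exact numbers vary by perl version and build):

    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    # Build 100k tiny inner hashes and divide out the rough per-hash cost.
    my %outer;
    $outer{$_} = { a => 1 } for 1 .. 100_000;
    printf "total %d bytes, roughly %d bytes per inner hash\n",
        total_size(\%outer), total_size(\%outer) / 100_000;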

But tie destroys your knowledge of how the hash works. The perf characteristics become completely different. A simple-minded approach to building a search keyword index with a hash-of-lists, which might work acceptably well with in-memory hashes, suddenly becomes a disaster when you tie it to berkeley-db. Because you're not using an in-memory hash anymore, you're using a disguised call to berkeley-db.
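
To make that concrete, here's a sketch of the trap (the file name and loop are invented for illustration). It reads exactly like in-memory hash code, but every access is now a berkeley-db fetch or store hitting disk, and the values have to be flat strings, so the "list" gets faked with string concatenation and each append rewrites the whole posting list:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    tie my %index, 'DB_File', 'index.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "tie index.db: $!";

    for my $doc_id (1 .. 1_000_000) {
        my $existing = $index{perl};             # a disk read in disguise
        $index{perl} = defined $existing
            ? "$existing $doc_id"                # a disk write, O(n) every time
            : $doc_id;
    }
    untie %index;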

I don't think the syntactic-sugar win of the notational convenience trumps the potential confusion for those who will read the code later, or even the confusingly overloaded semantics for the original programmer. I'd rather just know that %foo is an in-memory perl hash, and if I'm going to stuff something in a berkeley-db it's going to be with an explicit API.
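
One way to do that explicitly is the BerkeleyDB module from CPAN (again a sketch, with a made-up file name), where every disk access is a visible method call instead of something hiding behind hash syntax:

    use strict;
    use warnings;
    use BerkeleyDB;

    my $db = BerkeleyDB::Hash->new(-Filename => 'index.db', -Flags => DB_CREATE)
        or die "open index.db: $BerkeleyDB::Error";

    $db->db_put('perl', '42 99 137');            # obviously a database write
    my $val;
    print "perl => $val\n" if $db->db_get('perl', $val) == 0;   # obviously a read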

As an aside, when I say 'productive', I'm trying to envision the entire life of the code and the product. Not just getting it written and working, but the lifetime maintenance load of the code: will people in ops need to monkey with the system to keep it healthy, have pitfalls been left for new programmers inheriting the code, will it gracefully scale and degrade, and so on.

This is related to an evolving philosophy of programmer-system productivity that I've been developing, which I plan to write more about later.

Comments (3)

That exact same criticism is true of operator overloading in other languages. In Python it's common to overload the [] operator, which can make any object use the same syntax as Python dictionaries. Of course, that means that when you see [] you can't assume dictionary performance without knowing what data structure you're actually using.

Thinking back over my tie() uses, I've come to realize that I've only used it in very specific, low-volume situations that just aren't performance critical.

When I'm dealing with something where performance is a concern, I'm generally using a more specific interface than tie().

But Simon is right, too. This is a problem with abstractions in general and operator overloading specifically. When someone hands you an object in Perl, you really have no idea how it's doing things under the hood (and you're not supposed to, in theory) unless you wrote it. That works perfectly well until it doesn't.

Yeah, the 'ideal' is that you don't care about how it works.

But of course you must... since any library or object class is an engineering component, and in order to take one off the shelf and use it in a particular application, you need to know the engineering specs on the thing.

What's the call latency... what could make it get slow... what's the memory footprint... are there cases where it can return unexpected values or hang... does it use a lot of cpu or a little?

For data structures, you want to know memory footprint, access times, traversal times. For other calls, you want to know how many you can do per second, and if there are irregularities (like a thumbnail function that can do 1000/sec usually but degrades to only 1/sec if it sees a funny jpg).
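
A sketch of getting those numbers empirically with the core Benchmark module, comparing stores into a plain hash against stores into a hash tied to DB_File (the file name is made up):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    my %mem;
    tie my %disk, 'DB_File', 'bench.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "tie bench.db: $!";

    # Run each store loop for about 2 CPU seconds and compare the rates.
    cmpthese(-2, {
        in_memory => sub { $mem{ int rand 10_000 }  = 'x' },
        tied_db   => sub { $disk{ int rand 10_000 } = 'x' },
    });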

A huge part of what makes programming challenging is working around performance issues. This used to be standard programming instruction, because the machines were more limited in the past. You learned to do qsort instead of writing a bubble sort because you had to, in your limited cpu world.

Now libraries are chock-full of qsort-like goodies, and the machine is fast enough to gloss over all kinds of stuff. But when your app outgrows its memory or the confines of a single machine, you're not working on app logic details, you're dealing with performance issues imposed by resource constraints.

Then things get interesting again. :)

