Die ugly database, die

Posted by Antonio 1 year, 3 months ago (Oct. 5, 2007)

Last week's awesome encounter with Alan Kay left me with quite a few anecdotes; during one particular moment in the conversation, he reminded me in tone of the caught villains on Scooby Doo who always end the episode by claiming that they would have gotten away with it if it wasn't for "those meddling kids." Except of course that he was talking about the stagnant state of the software industry and the meddling kids are all of us failing to read the literature.


I am as guilty of this as the next guy which is why I decided to catch up on my list of queued papers on a recent cross country flight. Oracle HQ The theme of this bunch: the limitations and fallacies of the current web application development stack, and specifically the relational database. In particular, I wanted to point out two interesting articles in the "your architecture sucks" department:

1. In "The End of an Architectural Era" Stonebreaker et al. argue that using relational databases for most of today's web-scale information processing tasks is like throwing a buggy-whip under the hood of the car and expecting it to go fast. It's a good read (at least right up to the point which they start talking about their plan for a distributed transaction processing system) because it covers a lot of the history and design motivations behind the original RDBMS (back in the day of some very IBM sounding thing called "System R") and makes interesting points about places where web scale computing that is done on the back on an RDBMs is taking on unnecessary overhead. My only criticism of the paper is that in the proposed alternative the authors are willing to just toss away one of the most important benefits of a relational database— the ability for developers/administrators/whoever to perform ad-hoc queries on the dataset without having to drop to writing code.

2. Not having been crazy about the proposed alternative in the Stonebreaker paper, I was excited to come across "Dynamo: Amazon's Highly Available Key-value Store" by a bunch of smart dudes at Amazon, because of how well it was written, how simple and elegant the solution they propose seems to be, and most importantly, because it describes a system that is in actual production use every day. The basic concept is BerkleyDB on steroids, distributed across an arbitrary number of nodes in a fault-tolerant and self-healing design. In English: you can have persistent, reliable hashtables even in the midst of a semi-reliable infrastructure (node outages and network partitions), all at web scale. After pooh-poohing the relational database (sorry RDBMS, you are just not having a good month!), they go on to describe a system that supports an "always writeable" datastore and fails only in the rarest of cases. Best concept of the whole paper: the "vector clock" that follows each write operation around the network of nodes (a "vector clock" is a pairing of a write's version # with the node that initially takes that write) to solve conflicts across the system. I wish everything I ever did on any computing device came with its own vector clock!

Alan is right— anyone building a website that could someday live at web-scale (millions of users, billions of transactions) should be reading these papers rather than simply taking on the intellectual challenge of re-inventing some of these schemes from scratch (we were guilty of some of this at Tabblo). Fortunately we've got a lot of resources around to help the cause; if you haven't seen the always-entertaining, and frequently excellent "High Scalability" blog (source for both of these papers), you should subscribe.

Tags:

Comments

Post a comment

(Please use only plain text. Though I will escape all of your HTML, URLs will be clickable)

Your name:

Comment: