The simplicity of consistent hashing is pretty mind-blowing. Here you have a number of nodes in a cluster of databases, or in a cluster of web caches. How do you figure out where the data for a particular key goes in that cluster?

You apply a hash function to the key. That's it? Yeah, that's the whole deal of consistent hashing. It's in the name, isn't it?

The same key will always return the same hash code (hopefully), so once you've figured out how you spread out a range of keys across the nodes available, you can always find the right node by looking at the hash code for a key.

It's pretty ingenious, if you ask me. It was cooked up in the lab chambers at Akamai, back in the late nineties. You should go and read the original paper right after we're done here.

Consistent hashing solves, pretty nicely and elegantly, the problem people used to attack with desperate sharding schemes. I'm not going to bore you with the details on how exactly consistent hashing works. Mike Perham does a pretty good job at that already, and there are many more blog posts explaining implementations and the theory behind it. Also, that little upcoming book of mine has a full-length explanation too. Here's a graphic showing the basic idea of consistent hashing, courtesy of Basho.

Consistent Hashing

Instead I want to look at the practical implications of consistent hashing in distributed databases and cache farms.

Easier to Avoid Hotspots

When you put data on nodes based on a random value, which is what the hash function calculates for you, and which is a lot more random than the key it's derived from, it's easier to avoid hotspots. Why?

Assume keys based on an increasing value, or a simple range of keys based on the hour of the day, like 2011-12-11-13. You add new hours and therefore new data as time passes, and keys are stored based on the range they fall into. For example, the keys 2011-12-11-18 through 2011-12-11-23 are stored on the same node, with the rest of the keys stored on other nodes, simply because the ranges or the partitioning scheme happen to be set up that way.

For a consumer-facing site, the evening hours are usually the busiest time of the day. They create more data, more writes, and possibly more reads too. For the hours between 18:00 and 23:00, all the load goes to the single node that carries all the relevant data.

But when you determine the location in the cluster based solely on the hash of the key, chances are much higher that two keys lexicographically close to each other end up on different nodes. Thus, the load is shared more evenly. The disadvantage is that you lose the order of keys.
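
To make that concrete, here's a tiny Ruby example using the standard library's Digest::SHA1: two keys for adjacent hours are almost identical lexicographically, but their hashes land far apart in the hash space.

    require 'digest/sha1'

    # Two keys for adjacent hours, lexicographically almost identical.
    %w(2011-12-11-18 2011-12-11-19).each do |key|
      # SHA-1 turns them into wildly different positions on the ring.
      puts "#{key} => #{Digest::SHA1.hexdigest(key)}"
    end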

There are partitioning schemes that can work around this, even with a range-based key location. HBase (and Google's BigTable, for that matter) stores ranges of data in separate tablets. As tablets grow beyond their maximum size, they're split up and the resulting parts re-distributed. The advantage of this is that the original range, and with it the key ordering, is kept even as you scale up.

Consistent Hashing Enables Partitioning

When you have a consistent hash, everything looks like a partition. The idea is simple. Consistent hashing forms a keyspace, also called a continuum, as presented in the illustration. As a node joins the cluster, it picks a random number, and that number determines the data it's going to be responsible for. Everything between this number and the next one on the ring, previously picked by a different node, now belongs to this node. The resulting partition could theoretically be of any size. It could be a tiny slice, or a large one.
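
Here's a minimal Ruby sketch of that idea (node names and the choice of hash are just for illustration): every node claims a point on the ring, and a key belongs to the node with the first point at or after the key's hash, wrapping around at the end.

    require 'digest/sha1'

    # Map any value onto the ring, here simply the SHA-1 space as an integer.
    def ring_position(value)
      Digest::SHA1.hexdigest(value.to_s).to_i(16)
    end

    # Every node picks a (pseudo-random) point on the ring.
    nodes = %w(node-a node-b node-c)
    ring  = nodes.map { |name| [ring_position(name), name] }.sort

    # A key belongs to the node with the first point at or after the key's
    # position, wrapping around to the smallest point if none is larger.
    def node_for(ring, key)
      pos   = ring_position(key)
      point = ring.find { |node_pos, _| node_pos >= pos } || ring.first
      point.last
    end

    puts node_for(ring, "2011-12-11-18")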

Early implementations of consistent hashing still had the problem that a node picking a random range of keys could end up carrying a larger keyspace than the others, therefore still creating hotspots.

But the improvement was as simple as it was ingenious. A hash function has a fixed output range; SHA-1, for example, has a bit space of 2^160. You do the math. Instead of picking a random point on the ring, a node could choose from a fixed set of partitions, like equally sized pizza slices. But instead of someone grabbing the slice with the most cheese on it, everyone gets an equally large slice. The number of partitions is picked up front, and practically never changes over the lifetime of the cluster.
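
A rough sketch of that fixed-partition scheme (the numbers are illustrative; Riak, for example, defaults to 64 partitions on a SHA-1 ring): cut the ring into equally sized slices up front, and a key's hash directly tells you which slice it falls into.

    require 'digest/sha1'

    RING_SIZE      = 2**160     # SHA-1's bit space
    NUM_PARTITIONS = 64         # fixed up front, for the cluster's lifetime
    PARTITION_SIZE = RING_SIZE / NUM_PARTITIONS

    # Which of the equally sized slices does this key fall into?
    def partition_for(key)
      Digest::SHA1.hexdigest(key).to_i(16) / PARTITION_SIZE
    end

    puts partition_for("2011-12-11-18")  # => a number between 0 and 63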

For good measure, here's a picture of a sliced pizza.

Consistent Pizza

Partitioning Makes Scaling Up and Down More Predictable

With a fixed number of partitions of the same size, adding new nodes becomes even less of a burden than with plain consistent hashing. With the latter, it was still unpredictable how much data had to be moved around to transfer ownership of all the data in the range of the new node. One thing's for sure, it already involved a lot less work than previous methods of sharding data.

With partitioning, a node simply claims partitions, and either explicitly or implicitly asks the current owners to hand the data off to it. As a partition can only contain so many keys, and randomness ensures a somewhat even spread of data, there's a lot less unpredictability about the data that needs to be transferred.

If a partition just so happens to carry the largest object by far in your whole cluster, that's something even consistent hashing can't solve. It only cares about keys.

Going back to HBase, it cares about keys and the size of the tablet the data is stored in, as it breaks up tablets once they reach a threshold. Breaking up and reassigning a tablet requires coordination, which is not an easy thing to do in a distributed system.

Consistent Hashing and Partitioning Enable Replication

Consistent hashing made one thing a lot easier: replicating data across several nodes. The primary purpose of replication is to ensure data survives single or multiple machine failures. The more replicas you have, the more likely your data is to survive one or more hardware crashes. With three replicas, you can afford to lose two nodes and still serve the data.

With a fixed set of partitions, a new node can just pick the ones it's responsible for, and another stack of partitions it's going to be a replica for. When you really think about it, both processes are actually the same. The beauty of consistent hashing is that there doesn't need to be a master for any piece of data. Every node is simply a replica of a number of partitions.
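
To illustrate the idea (a sketch, not how any particular database implements it): with a fixed number of partitions, the replicas for a partition can simply be the owners of the next few partitions along the ring.

    NUM_PARTITIONS = 64
    N_REPLICAS     = 3

    # A made-up, round-robin assignment of partitions to nodes.
    nodes  = %w(node-a node-b node-c node-d)
    owners = (0...NUM_PARTITIONS).map { |p| nodes[p % nodes.size] }

    # Data in partition p lives on its owner and on the owners of the
    # following partitions, wrapping around the ring.
    def replicas_for(partition, owners)
      (0...N_REPLICAS).map { |i| owners[(partition + i) % owners.size] }.uniq
    end

    p replicas_for(42, owners)  # => ["node-c", "node-d", "node-a"]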

But replication has another purpose besides ensuring data availability.

Replication Reduces Hotspots (Even More!!!)

Having more than one replica of a single piece of data means you can spread out the request load even more. With three replicas of that data, residing on three different nodes, you can now load-balance between them. Neat!

With that, consistent hashing enables a pretty linear increase in capacity as you add more nodes to a cluster.

Consistent Hashing Enables Scalability and Availability

Consistent hashing makes it easier to scale up and down, and easier to ensure availability. Easier ways to replicate data allow for better availability and fault-tolerance. Easier ways to reshuffle data when nodes come and go mean simpler ways to scale up and down.

It's an ingenious invention, one that has had a great impact. Look at the likes of Memcached, Amazon's Dynamo, Cassandra, or Riak. They all adopted consistent hashing in one way or the other to ensure scalability and availability.

Want to know more about distributed databases in general and Riak in particular? You'll like the Riak Handbook, a hands-on guide full of practical examples and advice on how to use Riak to ensure scalability and availability for your data.

In the next installment we're looking at the consequences and implications of losing key ordering in a Riak cluster.

Tags: nosql, riak

Key rack

One of the most common questions to ask about Riak is: how do I get a list of all the keys in my bucket, in the cluster, or that have an attribute that matches my query using MapReduce?

The motivation behind it is simple: you want to delete all the keys in a bucket, count the number of keys stored in your cluster, clear out your cluster entirely, or run ad-hoc queries on the data stored in a bucket.

All valid in their own right.

But things are not so simple with Riak. To understand why, let's take a quick look under the covers.

What's in a bucket?

A bucket is a namespace in Riak. It's not a physically distinct entity like a table in a relational database. You can set some properties on it, things like replication levels, commit hooks, quorums, but that's it. Those are stored in the cluster's configuration, which is gossiped around the cluster just like the data that identifies which partition goes on which machine.

In fact, when you specify a bucket and a key to fetch or write some data, the two are stuck together to find the location in the cluster. Consider a bucket-key combination like users/roidrage. To find the location, Riak hashes both, not just the key. Together, bucket and key uniquely identify a piece of data, allowing you to have multiple objects with the same key, but in different buckets.
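
As a rough illustration (Riak actually hashes the bucket/key pair as an Erlang term, so this is the idea rather than the exact bytes): the same key in two different buckets hashes to two different ring positions.

    require 'digest/sha1'

    # Bucket and key together determine the position on the ring, so
    # users/roidrage and tweets/roidrage land in different places.
    def ring_position(bucket, key)
      Digest::SHA1.hexdigest("#{bucket}/#{key}").to_i(16)
    end

    puts ring_position("users",  "roidrage")
    puts ring_position("tweets", "roidrage")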

When an object is stored in Riak's storage backends, it uses both bucket and key name to identify it. What you get as a result are files that contain an abundance of different bucket-key combinations and their respective objects, sometimes not even in any order. The only physical distinction Riak has for data on disk is the partition they belong to. Everything else is up to the storage backend. There's no distinction between buckets on disk.

One reason for this is consistent hashing. If you remember the last installment of this series, I mentioned that consistent hashing's downside is that you lose key ordering. Keys are randomly spread out through the cluster. Some ordering still exists depending on the backend, but in general, ordering is lost.

Listing all of the keys

So to list keys in a bucket, Riak has to go through all of the keys in every partition, and I mean ALL OF THEM. Here's a picture of keys and an impersonation of Riak, having to take care of all of them.

No big deal, right? Unless, of course, you store millions and millions of keys and want to find all the keys in, say, the bucket users, which may hold no more than 1000 of them. To do that, Riak goes through every partition, each partition loads its keys either from memory (Bitcask) or disk (LevelDB) and sifts through them, finding the ones belonging to the users bucket.

All that said, it's certainly not impossible to do, if you have some time to wait, depending on the amount of data stored.

$ curl 'localhost:8098/buckets/users/keys?keys=true'

But wait, don't do that. Do this instead, streaming the keys instead of waiting for them all to arrive and then having them dumped at once.

$ curl 'localhost:8098/buckets/users/keys?keys=stream'

That's much better.

Listing keys has an impact on your Riak nodes, so if you can avoid it, don't do it!

So how do I really get all of the keys?

If select * from riak is not a great option, then what is?

Instead of relying on Riak, build an index on the keys. Thanks to Riak 2i (Secondary Indexes), this is easy. In fact, you get indexing of keys for free when using the LevelDB backend, just use the index $key. This takes advantage of LevelDB's sorted file structure. Neat!

But, and here's the kicker, you can only fetch ranges of keys. So instead of asking for all the keys, you ask for a range large enough to fit all the keys.

$ curl 'localhost:8098/buckets/users/index/$key/0/zzzz'

This finds all the keys that start with something lexicographically larger than 0 and less than zzzz and returns them to you in a list. Now there's a slim chance you'll get users with names like that, but I'll leave that exercise, or proper validations, up to you.

Using that list, you can count the number of keys in that bucket, or you can delete them one by one.
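
For illustration, here's a hedged Ruby sketch doing just that over the same HTTP interface, using only the standard library; it assumes the default port and the JSON response format with a "keys" field that the index endpoint returns.

    require 'net/http'
    require 'json'

    # Fetch all keys in the users bucket via the $key index range query.
    uri  = URI("http://localhost:8098/buckets/users/index/$key/0/zzzz")
    keys = JSON.parse(Net::HTTP.get(uri))["keys"]

    puts "#{keys.size} keys in the users bucket"

    # Or delete them, one HTTP request per key.
    Net::HTTP.start(uri.host, uri.port) do |http|
      keys.each { |key| http.delete("/buckets/users/keys/#{key}") }
    end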

Ideally...

In an ideal world, listing keys in a bucket would be possible and not an expensive operation. Riak could for example allow users to store buckets in separate files. The downside is that with a lot of buckets, you'll hit the limits of open file descriptors in no time, a bit of a bummer. But until something better comes along, secondary indexes are a nice tool to at least avoid resorting to listing all of the keys.

Curious about other ways to index and query data in Riak? You'll like the Riak Handbook, which will be published later this week. Covers Riak's secondary indexes and other strategies to query, inspect and analyze data.

Check back in tomorrow for an introduction on storing timelines in Riak.

Tags: riak, nosql

A couple of months ago I set out to write a book on NoSQL. It's about time I give an update on how it's been going, and when you can expect a book in your hands, or rather, on your screen.

After an initial burst of writing, I took somewhat of a break, so please excuse the delay in general. I spent the last weeks writing (a lot), and I'm currently trying to polish and finish up the existing content so that I can throw something out there for the world to peek at. Currently, the book covers MongoDB, Riak and Redis in varying detail, and I'm working on finishing up loose ends to get it into good shape before I start work on more chapters.

A lot of people have asked me about pricing, distribution model, updates, and so on, so I'm following up with an FAQ section. In general, I'm as keen to get something out there as people are to read it, believe me.

Turns out though: writing a book is hard. It takes a lot of work, a lot of discipline and creativity to come up with the right words and code examples. I'm not complaining, it's just something you don't realize from writing even slightly longer blog posts. It's still an incredible learning experience too, because I (and you) get to play with pretty much all of the features the databases covered have to offer.

So bear with me, I'm on it.

The book is not built around the idea of building one big application with each database. I'm not a fan of that approach myself, as it makes it too easy to lose track of details. It's full of small examples, focused on specific features.

How many pages does it have?

As the book is still growing, and I'm still playing with layout details, I can't give you an exact number, but the final book is probably going to have more than 200 pages.

What's the pricing going to be?

I haven't decided yet. It's not going to be in the single digits pricing range, and as the book is pretty dense with content, I don't want to undercharge. I'll keep you posted.

Will there be early access to the book?

Yes, there will be. You'll be able to buy the beta of the book for a reduced price, and follow the updates. Maybe even the commits on GitHub? I don't know. Let me know if that's something you're interested in.

Do you have some samples I can peek at?

Not yet. The layout is still far from final, but I'll throw something out as soon as early access is available.

What databases are being covered?

To reach my goal for a final release, I'm covering Redis, MongoDB, CouchDB, Riak, and Cassandra, all in varying detail. For some it makes more sense to go deeper than for others.

Are future updates included?

Yes, as content gets added, typos get fixed, and new databases pop up, I'll send updates to everyone buying the book. The updates are free. Consider buying the book a subscription for more chapters on other databases.

Are you extending the book with more databases over time?

Yes, I have an insatiable thirst to play with more databases, and I don't want to deprive you of experiencing that too.

Are you covering database SomeDB? I hear it's the next big thing!

For now, the list of databases is fixed. What's coming after that, on the other hand, is not. I'm open to suggestions, but I'd prefer something non-exotic over a very domain-specific database you wrote for a recent project. I'll set up some sort of voting when the final release is done.

What formats will be available?

I'm currently working on PDF and ePub, with Kindle to follow. Gotta have my priorities. A good-looking and readable PDF is my first priority, an ePub after that. Buying the book includes access to all formats.

Is there going to be a print edition?

Print is not a priority right now.

What are you using to write and generate the book?

The book is written in Markdown (I hate LaTeX) and converted to HTML using Redcarpet, with Albino for syntax highlighting. The HTML is then converted to PDF using the awesome Prince XML library. I hope to eventually use DocRaptor to create the final result, as a Prince license is slightly out of budget, but DocRaptor is pretty affordable.

Where can I get updates on progress?

Mostly by following me or the handbook itself on Twitter.

Tags: nosql, books

I'm happy to report that the Riak Handbook has hit a major update, bringing a whopping 43 pages of new content with it. If you already bought the book, this is a free update, and instructions how and where to download it were sent in a separate email.

Here's a run down of what's new:

  • The book was updated for Riak 1.1.

  • The biggest new feature of the Riak Handbook is a section entirely dedicated to use cases, full of examples and code from real time usage scenarios, giving you food for thought when it comes to using Riak in your applications. Rumor has it there's also an introduction on Riak CS in there, the new cloud storage system that's API-compatible with Amazon's S3.

  • Full coverage on pre- and post-commit hooks. Get ready for a thorough run-through of things you can do with your data before and after it's written to Riak. Covers both JavaScript and Erlang examples.

  • To go full circle on deploying Erlang code in a Riak cluster, the book now has a section dedicated to it, so you don't have to manually compile Erlang code every time you update it.

  • More details on secondary indexes, how you can use them to store multiple values for a single document and how you can use them to model object associations.

  • A section on load balancing Riak nodes.

  • An introduction on where on a network you can and should put your Riak nodes.

  • A big, big section on monitoring and an overview of Riak Control, the new cluster management tool introduced in Riak 1.1.

  • Last but not least, the book now comes as a man page. Need to look something up real quick while you're working on the command line? No worries, drop the man page in your $MANPATH and type "man riak". Easy to search when you're working remotely on a Riak node, or if you're generally a fan of spending a lot of time on the command line (just like I am).

To celebrate the new release, the Riak Handbook is 25% off until next Monday, June 4th. It's a release party, and everyone's invited!

Tags: riak, handbook, nosql

The NoSQL landscape is a fickle thing, new tools popping up every week, broadening a spectrum that's already close to being ungraspable, especially when you're totally new to the whole thing. There are a couple of common misconceptions and bad habits that people who've already been playing with the tools tend to pass on to newbies in the landscape.

I'm guilty as charged too, I tend to tell people about the tools I already know. That's a good thing per se, because the recommendation is based more or less on experience, but it leaves out the one thing I find to be the most important philosophy about post-relational (a much nicer term than NoSQL) databases: it's all about your data, about its needs and how your application needs to access it. The times of generic, one-size-fits-all tools like MySQL, PostgreSQL and the like are over. It's well worth knowing how the new tools are different, and which one would be the best partner in crime to get stuff done.

No Size Fits All

While you could throw MySQL at a lot of problems, it was far from being the optimal choice in a lot of cases, but you could somehow bend it to your will. The new generation of tools tends to avoid having to be bent, instead they give you a freedom of choice. The freedom to analyse what your data is like, and what the right tool for your specific use case is.

Does that mean more work on your end? It sure does, but for the love of your data, it will be worth it. If you find the right partner, and it makes your life easier, it's a win on both ends. You'll be a happy developer (well, most of the time), and your data will be able to roam free, running naked across a meadow, hand in hand with the tool you chose.

Now I'm well aware that this all sounds rather rosy, but that's what it boils down to. The choice is now up to you, and that's why it's important to know what's out there, to play with the tools available, to know how and why they're different from each other.

To get you started, have a look at Vineet Gupta's excellent overview of the NoSQL landscape.

Don't Believe Everything You Hear

If someone tells you that you should try a specific tool, ask him why. If the answer is speed, or because it's written in Erlang and scales insanely well, it's time to call bullshit on him. MySQL can be fast too, that's not the issue. It's nice to have a database that you can scale up to hundreds of nodes easily, but while the technology behind it is very interesting, and sometimes mind-blowing, it doesn't help if it's a pain to work with, or if there's no library support yet. Sure, you could write your own, but if you're totally new to the field, you usually just want to play and learn. That's a hard thing to do if all you get is an API and very limited language support.

If someone tells you that e.g. MongoDB is fast, then there's a reason for that, and it's good to be well aware of it and of the consequences it has for operating your application. If someone tells you that CouchDB is awesome for building web applications, because it's built of the web, they're leaving out that a common use case like pagination is still an awful thing to implement with it. If someone tells you that Cassandra scales easily because it was built at Facebook, they're leaving out that its peculiar way of storing and accessing data is very specific to how sites like Facebook need to access their data. I could go on and on about it, but there are always two sides to a story.

Before judging a tool based on just the one side, look at the other side too. It might not be as big of a problem as you thought it would be; either way, you'll know why things are the way they are. Look for tools with sites describing particular use cases, or areas where they're just not a good fit. If the tool builders aren't aware of their use cases, strengths and weaknesses, how will you be?

In the end, even though they can be problems for others, they aren't necessarily problems for you. Your particular use case might be just fine with the downsides while profiting greatly from the upsides. If it's not, you're more than free to go look somewhere else. At least now you have the (free) options to do so.

There are misconceptions out there that come close to urban myths, and we're only two years or so into working with the new generation of tools. The only way you can avoid falling into a trap is to play with what's out there, to know their weaknesses and strengths. The only thing we can do to avoid having people fall into the trap is to better educate them, to give them real-world examples, use cases other than tagging and blog posts. Just saying that it scales better than xyz is not an argument, it's educating people on the wrong end.

It's Not About Speed And Scaling

If speed and scaling were our only problems, we'd be left in a big world of pain. As beautiful as these words are, I'm gonna go out on a limb and say that it's not a problem until it's a problem. Unless you're already Facebook or LinkedIn, you don't need to have that as a main factor when choosing the right tool. Sure, it's better if there's an easy way to scale up in the future, but what's the point if you need days to get a good setup before having written a single line of code?

Most NoSQL tools were built with some sort of scaling in mind, although people tend to easily confuse scaling, distribution, sharding and partitioning, so you're safe in most cases when it comes to the point where your application needs to handle more traffic.

I'm gonna go ahead and venture the guess that if you're deciding solely based on speed and scalability, you're doing it wrong. And I rarely use that phrase. You should be deciding based on the core feature set, why it does the things it does, and what consequences it'd have on your life as a developer.

Don't Compare Apples And Oranges

No tool is like the other. Just comparing e.g. MongoDB, Redis and MySQL is the wrong way to approach your problem, especially if you just look at speed and compare their feature sets. Feature sets and speed are usually different for a reason. Instead you should be comparing every tool with your data. How much do you need to bend the data to store and access it easily? Is it even possible to store it efficiently for your particular use case? Are potential trade-offs (e.g. data duplication to gain speedier access) worth risking? Is it the right fit in the way it handles updates, associations, writes, reads and queries in terms of your data and application? Then go right ahead and use it.

But don't just compare tools with each other whose only feature they have in common is the fact that they can store data, or things that are mostly depending on an application's specific needs. To give you an idea, this guy compares Redis and MongoDB by implementing a particular use case with both of them. That's the way you should be comparing tools.

The Heat Is On

We're going to see more tools popping up left and right, making it harder to keep up, and to make an informed decision. What I consider the best thing about most of them is that they're free. You can grab the source code, improve it or just look at how it handles your data. That's what makes them so awesome, their incentive is not to constrain your data, they're as open as possible about it, some tools even going as far as building solely on open standards to implement their whole stack (that'd be CouchDB if you're curious).

The whole point of this post is that it's up to you to find the perfect tool to hand your data to. I don't know about you, but me being able to find the right fit instead of squeezing my data into a database that tries to solve all problems at once, that's the most exciting prospect of post-relational databases for me. Our common goal should be to help people make that decision without getting too passionate about any particular tool. They all exist to fulfill some purpose, and we should be telling people about them.

There's a couple of sites to keep an eye on, e.g. MyNoSQL by Alex Popescu, he's keen on keeping up-to-date with what's going on in the NoSQL community. Another site with a growing collection of links to articles is nosql-databases.org. EngineYard published a series of blog posts on key-value stores in Ruby, in particular Cassandra, Redis, MongoDB, CouchDB, LDAP that's well worth checking out to get an idea of what's out there.

Tags: nosql

I like to think that there's never been a more exciting time when it comes to playing with new technologies. Sure, that's a bit selfish, but that's just how I feel. Doing Java after I got my diploma was interesting, but it wasn't exciting. Definitely not compared to the tools that keep popping up everywhere.

One "movement" (if you can even call it that) is NoSQL. I've never been particularly happy with relational databases, and I happily dropped MySQL and the like when an opportunity to work with something entirely new came up. Since it's my own project I'm not putting anything at risk, and I don't regret taking that step. We're working with two members of the NoSQL family in particular, CouchDB and Redis.

Last week, people interested in, working with, and working on these new and pretty fascinating tools came together for the first NoSQL meetup in Berlin. I talked about Redis, and before I keep blabbering on about it, here are my slides. The talks have been filmed, so expect an announcement for the videos soon-ish.

I was up against a tough competition, including CouchDB, Riak and MongoDB (but we're all friends, no hard feelings). During my talk, I might've overused the word awesome. But after all the talks were over, it hit me: Redis is awesome. It seriously is. Not because it does a lot of things, is distributed, written in Erlang (it's written in old-school, wicked fast C), has support for JSON (though that's planned), and all that stuff. No, it's awesome because it does only a very small set of work for you, but it does it extremely well, and wicked fast. I don't know about you, but I like tools like that. I took a tour of the C code last week, and even though my skills in that area are a bit rusty, it was quite pleasant to read, and easy to follow the flow.

I like Redis, and while I don't ask you to love it too, do yourself a favor and check it out. It gives Memcached a serious run for its money. Everyone loves benchmarks, and I do too, but I'm careful not to read too much into them. I ran Redis and Memcached through their paces, using the available Ruby libraries. I tried both the C-based and the Ruby-based versions for Memcached, and the canonical Ruby version for Redis. It's like a cache, with sugar sprinkled on top.

Without putting out any numbers, let me just say that, first it's a shame Rails is shipped with the Ruby version of the Memcached library, because it is sloooow. Okay not so slow you should be worried, but slower than the competition. Second, Redis clocks in right in the middle between both Memcached libraries. While it's faster than memcache-client, it's still a bit slower than memcached. Did I mention that the library for Redis is pure Ruby? Pretty impressive, especially considering what you get in return. Sit back for a moment, and think about how much work went into Memcached already, and how young Redis still is. Oh the possibilities.

Redis is more than just a key-value store, it's a lifestyle. No wait, that's something different. But it still requires you to think differently. Shouldn't be a surprise really, most of the new generation of data stores do. It takes any data you give to it, and you're good to go as long as it fits into your memory. Let me tell you, that's still a lot of data. Salvatore is constantly working on new features for Redis, so keep an eye on its GitHub repository. If you thought that pushing and popping elements atomically off lists was cool, there might be a big warm surprise for you in the near future.

I first came across it using Nanite, where it's used to store the state of the daemon cluster. Running it through its paces in preparation for the talk I realized how underused it is. For our use case, Redis is the perfect place to store stuff like history of system data, e.g. CPU usage, load, memory usage and the like. It's also a great fit for a worker queue, but since we have RabbitMQ in place, there's no need for that.

When you look at it closely, there's heaps of uses for Redis. Chris Wanstrath wrote about how he used it writing hurl, and Simon Willison also published a love letter to Redis. There's more info on how to use it with the Ruby library over at the EngineYard blog, and James Edward Gray published a whole series on how to install, set up and use Redis with Ruby. Just like CouchDB, I want to put Redis to more uses in the future. That doesn't mean I'm looking to find a problem for a solution, it just means that when I have a problem I'm gonna consider my options, and Redis is one of them. It's a simple yet insanely speedy data store, with the little twist that is Redis' way of persisting data.

Tags: redis, nosql

June was an exhausting month for me. I spoke at four different conferences, two of which were not in Berlin. I finished the last talk today, so it's time to recap the conferences and talks. In all I had good fun. It was a lot of work to get the presentations done (around 400 single slides altogether), but all in all I would dare say it was more than good practice to work on my presentation skills and to lose a bit of the fear of talking in front of people. But I'll follow up on that stuff in particular in a later post.

RailsWayCon in Berlin

I have to admit that I didn't see much of the conference, I mainly hung around, talked to people, and gave a talk on Redis and how to use it with Ruby. Like last year the conference was mingled in with the International PHP Conference and the German Webinale, a somewhat web-related conference. I made a pretty comprehensive set of slides for Redis, available for your viewing pleasure.

Berlin Buzzwords in Berlin

Hadoop, Lucene, NoSQL, Berlin Buzzwords had it all. I spent most of my time in the talks on the topics around NoSQL, having been given the honor of opening the track with a general introduction on the topic. I can't remember having given a talk in front of this many people. The room took about 250, and it seemed pretty full. Not to toot my own horn here, but I've never been more anxious about how a talk would go. Obviously there were heaps of people in the room who had only heard of the term, and people who work with or on the tools on a daily basis. Feedback was quite positive, so I guess it turned out pretty okay. Rusty Klophaus wrote two very good recaps of the whole event, read on about day one and day two.

The slide set for my talk has some 120 slides in all, trying to give a no-fuss overview of the NoSQL ecosystem and the ideas and inspirations. There's some historical references in the talk, because in general the technologies aren't revolutionary, they use ideas that've been around for a while and combine them with some newer ones. Do check out the slides for some more details on that.

MongoUK in London

10gen is running MongoDB related conferences in a couple of cities, one of them in London, where I was asked to speak on something related to MongoDB. Since I'm all about diversity, that's pretty much what I ended up talking about, with a hint of MongoDB sprinkled on top of it. Document databases, the web, the universe, all the philosophical foundation knowledge you could ask for. I talked about CouchDB, Riak, and about what makes MongoDB stand out from the rest.

Most enjoyable about MongoUK was to hear about real life experiences of MongoDB users, what kind of problems they had and such. Also, I finally got to see some of London and meet friends, but I'll write more about that (and coffee) on my personal blog. Again, the slide set is available for your document database comparison pleasure.

Cloud Expo Europe in Prague

Just 36 hours after I got back from London I jumped on the train to Prague to speak about MongoDB at Cloud Expo Europe. Cloud is something I can get on board with (hint: Scalarium), so why the hell not? It turned out to be a pretty enterprisey conference, but still, I got some new food for thought on cloud computing in general.

I already gave a talk on MongoDB at Berlin's Ruby brigade, but I built a different slide set this time, improving on the details I found to be a bit confusing at first. Do check out the slides, if you don't know anything about MongoDB yet, it should give you a good idea.

Showing off

As you'll surely notice, my slides are all websites, and not on Slideshare. Two months ago I looked into Scott Chacon's Showoff, a tool to build web-based presentations that simply run as tiny JavaScript apps in the browser. I very much like that idea, because even though Keynote is still the king of the crop, it's still awful. Using Markdown, CSS and JavaScript appeals much more to the geek in me. It's so easy to crank out slides as simple text, and worry about the styling later. Plus, I can easily keep my slides in Git, and who doesn't enjoy that? I'd very much recommend giving it a go. If you want to look at some sources, all my talks and their sources are available on the GitHubs: MongoDB, Redis, NoSQL, document databases and again MongoDB.

It's a pleasure to build slides with Showoff, and it has helped me focus my slides on very short phrases and as few bullet points as possible. Sure, it's not Keynote and doesn't have all the fancy features, but I noticed that it forced me to focus more, and that keeping slides short helped me stay focussed, but again, more on that in a follow-up post.

Feel free to use my slides as inspiration to play with Showoff, there's surprisingly little magic involved. Also, if you think I should speak at a conference you know of or that you're organising, do get in touch.

The idea of building and storing user timelines (think Twitter) in Riak confused me at first. It sounds like such a spot-on case for time series databases. Yet Yammer managed to make the idea pretty popular. The whole thing lacked an implementation though, because they kept theirs to themselves, which I don't blame them for at all.

Apple Mac Timeline

So let's have a look at how simple it is to build one. You can see a timeline of all Apple Macintosh products above, but that's not what we want to build.

Instead we want to build something like the Twitter timeline. A user follows many other users, and wants to look at a feed built from their activities, so something like the timeline shown below.

Twitter timeline

How do you model a timeline in Riak?

For every user we store one object in Riak. Every timeline contains a list of tweet ids, or whatever activity you're referencing, or it can contain the whole tweets. Something like this should work:
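
A minimal, illustrative sketch (field names are made up): one object per user, keyed by the user id, holding the tweet ids newest first.

    # One timeline object per user, stored under the user's id.
    # The list holds tweet ids, newest first.
    {
      "user_id" => "roidrage",
      "ids"     => [1347, 1346, 1345, 1344]
    }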

If you want to store more data, turn the list into an array of hashes containing whatever information is necessary to rebuild the timeline later.

Adding entries to the timeline

To add new entries to a timeline, prepend them to the existing list of entries. Here's some Ruby code to show how it's done.
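
A sketch of how that could look with the riak-client gem (the bucket name, the data layout and the not-found handling are assumptions on my part, and details differ between client versions):

    require 'riak'

    client = Riak::Client.new
    bucket = client.bucket('timelines')

    # Prepend a new tweet id to a user's timeline, creating the
    # timeline object if it doesn't exist yet.
    def add_to_timeline(bucket, user_id, tweet_id)
      timeline = begin
        bucket.get(user_id)
      rescue Riak::FailedRequest
        obj = Riak::RObject.new(bucket, user_id)
        obj.content_type = 'application/json'
        obj.data = {'ids' => []}
        obj
      end
      timeline.data['ids'].unshift(tweet_id)
      timeline.store
    end

    add_to_timeline(bucket, 'roidrage', 1348)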

The code assumes you take care of followership somewhere else. You can store that data in Riak too, but the code is oblivious to its whereabouts.

Conflicts, siblings, oh my!

The fun starts when two clients update the same timeline: you get a conflict and siblings. The strength of a simple data structure like the one above is that siblings are easy to merge together while still keeping the ordering based on the ids. The ids happen to be ordered in this example; Twitter makes somewhat sure its ids are too.

When you get a conflict, a smart Riak library like Ripple helps you find out about it. To build on the earlier example, here's a version of add that detects conflicts.
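
A hedged sketch using riak-client's conflict? and siblings (how you write the merged value back varies between client versions, so treat this as the shape of the idea rather than drop-in code; not-found handling omitted for brevity):

    # Like before, but merge siblings left behind by concurrent writes.
    def add_to_timeline(bucket, user_id, tweet_id)
      timeline = bucket.get(user_id)
      if timeline.conflict?
        # Two clients updated the timeline concurrently; fold the sibling
        # values back into one list (merge_timelines is shown below).
        timeline.data = merge_timelines(timeline.siblings.map(&:data))
      end
      timeline.data['ids'].unshift(tweet_id)
      timeline.store
    end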

Suddenly you have two or more objects instead of one, each containing a different timeline. To turn them into a single list, you merge all of them together, discard the duplicates, and restore order based on the id. Here's some Ruby to do that.
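
A sketch of that merge, operating on the siblings' plain data hashes as assumed above:

    # Merge sibling timelines into one list of unique activity ids.
    def merge_timelines(sibling_data)
      ids = sibling_data.flat_map { |data| (data || {}).fetch('ids', []) }
      {'ids' => ids.uniq}
    end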

You iterate over all timeline objects and keep adding unique activities to a new list, returning that when done.

Sort, and done!

All that's left to do is sort the result.
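
Assuming ids that increase over time (like Twitter's roughly do), that's a one-liner:

    # Restore newest-first order after the merge.
    def sort_timeline(data)
      {'ids' => data['ids'].sort.reverse}
    end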

That's the whole code. To spare you the pain of having to write your own library, all this is bundled into a gem called riaktivity. If you're doing Python, Brett Hoerner has got you covered with timak. Be sure to watch the original talk by Yammer, it's well worth it.

There are ups and downs to this approach, and things that need to be taken care of. More on that, and on modeling data for Riak in general, is covered in the Riak Handbook, the definitive guide on Riak.

Tags: riak, nosql

A very valid question is: what's a good use case for Redis? There are quite a few, as Redis isn't your everyday key-value store: it allows you to keep lists and sets in your datastore, and to run atomic operations on them, like pushing and popping elements. All that stuff is incredibly fast, because your data is held in memory and only persisted to disk when necessary, and, to top it off, asynchronously, without reducing the throughput of the server itself.

The simplest and most obvious use case is a cache. Redis clocks in at almost the speed of Memcached, with a couple of features sprinkled on top. If you need a cache, but also have a use case where you want to store data in it that should be persisted, Redis is a decent tool for your caching needs. If you already have a Memcached instance in place, I'd look at my options before adding a new component to my infrastructure though.

Pushing and popping elements atomically, does that ring a bell? Correct, that's what you want from a worker queue. Look at delayed_job and you'll find that it uses a locking column in your jobs table. Some people argue that a database should not be the place where you keep your worker jobs. Up to a certain amount of work I disagree, but at some point the performance costs outweigh the benefits, and it's time to move on. Redis is a perfect fit here. No locking needed, just push onto the list of jobs and pop back off it in your workers, simple as that. It's the GitHub way, and the more I think about it, the more sense it makes.
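
A minimal sketch with the redis gem (queue name and payload made up): producers push jobs onto one end of a list, workers pop them off the other.

    require 'redis'
    require 'json'

    redis = Redis.new

    # Producer: push a job onto the front of the list.
    redis.lpush('jobs', {'id' => 42, 'action' => 'resize_image'}.to_json)

    # Worker: pop jobs off the other end (BRPOP is the blocking variant).
    payload = redis.rpop('jobs')
    if payload
      job = JSON.parse(payload)
      # ... do the actual work with job ...
    end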

For Redis 1.1 Salvatore has been working on a proposal by Ezra from Engine Yard to implement a command that would move items from one list to another in one step, atomically. The idea is to mark a job as in progress, while not removing it entirely from the data storage. Reliable messaging anyone? It's such a simple yet genius idea, and Redis has most of the functionality already in place. There's heaps more planned for future Redis releases, I'd highly recommend keeping an eye on the mailing list and on Salvatore's Twitter stream.

As I'm sure you noticed Redis is used for data storage in hurl, a neat little app to debug HTTP calls. Redis is simply used to store your personalized list of URLs you checked. Should some data be lost in between database dumps, it's not a big deal, it's not great sure, but not a big deal.

The simple answer for when to use Redis is: whenever you want to store data fast that doesn't need to be 100% consistent. In past projects I've worked on, that included classic examples of web application data, especially when there's social stuff sprinkled on top: ratings, comments, views, clicks, all the social stuff you could think of. With Redis, some of it is just a simple increment command or pushing something onto a list. Here's a nice example of affiliate click tracking using Rack and Redis.
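
For example, counting clicks is a single atomic command, and remembering who clicked is just a push onto a list (key names made up, using the redis gem):

    require 'redis'

    redis = Redis.new

    # One atomic increment per click, one counter key per article.
    redis.incr('clicks:article:42')

    # Remembering who clicked is just as simple.
    redis.lpush('clicks:article:42:users', 'roidrage')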

Why is that a good match? Because if some of that data is lost, it doesn't make much of a difference. Throw in all the statistical or historical data you can think of that's accumulated in some way through your application, and could be recalculated if necessary. That data usually just keeps clogging up your database, and is harder and harder to get rid of as it grows.

Same is true for activity streams, logging history, all that stuff that is nonvolatile yet doesn't need to be fully consistent, where some data loss is acceptable. You'd be surprised how much of your data that includes. It does not, and let me be perfectly clear on that, include data that involves any sort of business transaction, be it for a shopping platform or for data involved in transactions for software as a service applications. While I don't insist you store that data in a relational database, at least it needs to go into a reliable and fully recoverable datastore.

One last example, the one that brought me and Redis together is Nanite, a self-assembling fabric of Ruby daemons. The mapper layer in Nanite keeps track of the state of the daemons in the cluster. That state can be kept on each mapper redundantly, but better yet, it should be stored in Redis. I've written a post about that a while back, but it's still another prime use for Redis. State that, should it or part of it get lost, will recover all by itself and automatically (best case scenario, but that's how it works in Nanite).

One thing to be careful about though is that Redis can only take as much data as it has memory available. Especially for data that has the potential to grow exponentially with users and their actions in your application, it's good to keep an eye on it and to do some basic calculations, but you should do that even when using something like MySQL. When in doubt, throw more memory at it. With Redis and its master-slave replication it's very easy to add a new machine with more memory, do one sync and promote the slave to be the new master within a matter of minutes. Try doing that with MySQL.
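
A sketch of that dance (host names made up; the redis gem's slaveof just sends the SLAVEOF command through): point the new, bigger box at the current master, wait for the initial sync, then promote it.

    require 'redis'

    # On the new, bigger box: start replicating from the current master.
    new_box = Redis.new(:host => 'new-box.example.com')
    new_box.slaveof('old-master.example.com', 6379)

    # ...wait for the initial sync to finish, then promote it to master:
    new_box.slaveof('no', 'one')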

For me, this stuff is not about just being awesome. I've had countless situations where I had data that could've been handled more elegantly using something like Redis, or a fully persistent key-value store like Tokyo Tyrant. Now there's really no excuse not to get that pesky data that's clogging up your database out of there. These are just some examples.

By the way, if you want to know what your Redis server is doing, telnet to your Redis instance on port 6379, and just enter "monitor". Watch in awe as all the commands coming in from other clients appear on your screen.

In the next post we'll dig into how you can store data from your objects conveniently into Redis.

Redis, consider us for your next project.

Tags: nosql, redis

Call it NoSQL, call it post-relational, call it what you like, but it's hard to ignore that things are happening in the database world. A paradigm shift is not too far ahead, and it's a big one, and I for one am welcoming our post-relational overlords. Whatever you call them, CouchDB, MongoDB (although you really shouldn't call a database MongoDB), Cassandra, Redis, Tokyo Cabinet, etc. I'm well aware that they're not necessarily all the same, but they do try to fill similar gaps. Making data storage easy as pie, offering data storage fitting with the kind of evolving data we usually find on the web.

The web is slowly moving away from relational databases and with that, SQL. Let me just say it upfront: I hate SQL. I physically hate it. It doesn't fit with my way of thinking about problems, and it doesn't fit with the web. That's my opinion anyway, but I'm sure I'm not alone with it.

There is however one area where object-oriented databases failed, and where the new generation of document databases will have similar problems. You could argue that object-oriented databases are in some way a predecessor to modern post-relational databases. They made storing objects insanely easy, no matter how complex they were, and they made navigating through object trees even easier and insanely fast. Which made them applicable to some problems, but they weren't flexible enough in my opinion. Still, they laid some groundwork.


It's mainly concerning The Enterprise and their giant collection of reporting tools. Everybody loves tools, and The Enterprise especially loves them. The more expensive, the better. Reporting tools are the base for those awesome pie charts they just love to fill entire PowerPoint presentations with. They work on "standardized" interfaces and languages and therefore, with SQL.

I've worked on a project where we switched from an object-oriented to a relational database just because of that. Sure, there are proprietary query languages, or there's JQL when you're into JDO, EJB3 and the like. But they're nowhere near as powerful as SQL is. They're also not as brain-twisting. That should be a good thing really, but there you have it.

NoSQL databases are facing a similar dilemma. Just like object-oriented databases they're awesome for just dumping data in it, more or less structured. It's easy to get them out too, and it's usually easy to aggregate the data in some way. Is it a big deal? Of course not, at least not in my opinion. But if it is some sort of deal, what can you do to work around that?

  • Ignore it. Simple, isn't it? The reporting requirement can usually be solved in a different way. Sure, it can be more work, but usually reporting is less of a killer than some might think. Give the client some way to express a query and let him at it. Give him a spare instance of your replicated database, and let him work off that data. Best thing you could do is pre-aggregate it as much as possible so there's less work for the client.

  • If you really need structured data in a relational database, consider replicating the data into one from your post-relational database of choice. I can hear you say: That guy's crazy, that'd involve so much work keeping the two in sync! No, it wouldn't. Create a fresh dump every time you need a current dataset, and dump it into your SQL database. Simple like that.

  • Put an interface in front of the new database. Yes, it's insane, but I've done it, and it works. It doesn't have to be an SQL interface, just a common interface that works with one set of reporting tools. Yes, it's not ideal, but it's an option.

  • Don't ignore it, keep using a relational database. Yep, not all of us are lucky enough, someone still has to serve the market demands. Legacy projects or clients are forcing us to stick with the old and the dusty model of storing and retrieving data. Quite a lot of people are happy with that, but I'm not.

I'm sure there are other options, these are just off the top of my head, and I can say that I've practiced all of them with more or less good results. I for one am sick of still having to use MySQL on new projects. I've had my fun with it, and sure there's a whole bunch of patches that make it a bit more fun, but it's still MySQL. Yes, I am aware that there's PostgreSQL, but it's the same story. Old, old and old.

Should you still try to get a new generation database into new projects? Yes, yes and yes, you definitely should. Consider yourself lucky if you succeed, because you're still an early awesome adopter. Even use SimpleDB if you must, but maybe reconsider before you really use it, it's not great. But don't lie to your clients, they should be aware what they're getting into. It's no big deal, but the bigger they are the more likely they have administrators not yet familiar with the new tools. But the more people start using them now, the better they'll get before they hit the mainstream. Which they will eventually, rest assured. I'm ready, the web is ready, and the tools are ready. What about you?

Tags: nosql, databases

Last weekend I tweeted two links to two tweets by a poor guy who apparently got his MongoDB database into an unrecoverable state during shutdown whilst upgrading to a newer version. That tweet quickly made the rounds, and the next morning I saw myself staring at replies stating that it was all his fault, because he 1.) used kill -9 to shut it down because apparently the process hung (my guess is it was in the middle of flushing all data to disk) and 2.) didn't have a slave, just one database instance.

Others went as far as indirectly calling him an idiot. Oh interwebs, you make me sad. If you check out the thread on the mailing list, you'll notice a similar pattern in reasoning. The folks over at http://learnmongo.com seem to want to be the wittiest of them all, recommending to always have a recent backup, a slave or replica set and to never kill -9 your database.

While you can argue that the guy should've known better, there's something very much at odds here, and it seems to have become a terrifying meme among fans of MongoDB: the idea that you need to do all of these things to get any assurance that your data is durable. Don't have a replica? Your fault. kill -9 on a database, any database? You mad? Should've read the documentation first, dude. This whole issue goes a bit deeper than just reading documentation, it's the fundamental design decision of how MongoDB treats your data, and it's been my biggest gripe from the get go. I can't help but be horrified by these comments.

I've heard the same reasoning over and over again, and also that it just hasn't happened so far, no one's really lost any considerable data. The problem is, most people never talk about it publicly, because it's embarrassing; the best proof is the poor guy above. This issue is not even specific to MongoDB, it's a general problem.

Memory-Mapped Persistence

But let me start at the beginning, MongoDB's persistence cycle, and then get to what's being done to improve its reliability and your data's durability. At the very heart, MongoDB uses memory-mapped files to store data. A memory-mapped file is a data structure that has the same representation on disk as it has when loaded into memory. When you access a document in MongoDB, loading it from disk is transparent to MongoDB itself, it can just go ahead and write to the address in memory, as every database in MongoDB is mapped to a dynamically allocated set of files on disk. Note that memory-mapped files are something you won't find in a lot of other databases, if any at all. Most do their own house-keeping and use custom data structures for that purpose.

The memory mapping library (in MongoDB's case the POSIX functions, and whatever Windows offers in that area) will take care of handling the flush back to disk every 60 seconds (configurable). Everything in between happens solely in memory. Database crash one second before the flush strikes again? You just lost most of the data that was written in the last 59 seconds. Just to be clear, the flushing cycle is configurable, and you should consider choosing a better value depending on what kind of data you're storing.
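
That interval is mongod's syncdelay setting, 60 seconds by default; lowering it shrinks the window of potential loss at the cost of more frequent disk syncs, for example:

$ mongod --dbpath /data/db --syncdelay 10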

MongoDB's much praised insert speed? This is where it comes from. When you write stuff directly to local memory, it had better be fast. The persistence cycle is simple: accept writes for 60 seconds, then flush the whole thing to disk. Wait for another 60 seconds, then flush again, and so on. Of course MongoDB also flushes the data when you shut it down. But, and here's the kicker, of course that flush will fail when you kill it without mercy, using the KILL signal, just like the poor guy above apparently did. When you kill something that's writing a big set of binary data to disk, all bets are off. One bit landing on the wrong foot and the database can get corrupted.

Database Crashes are Unavoidable

This scenario can and does happen with e.g. MySQL too, it even happens with CouchDB, but the difference is that in MySQL you usually only have a slightly damaged region, which can be fixed by deleting and re-inserting it. In CouchDB, all that happens is that your last writes may be broken, but CouchDB simply walks all the way back to the last successful write and runs happily ever after.

My point here is simple: even when killed using the KILL signal, a database should not be unrecoverable. It simply shouldn't be allowed to happen. You can blame the guy all you want for using kill -9, but consider the fact that it's the process equivalent of a server or even just the database process crashing hard. Which happens, believe it or not.

Yes, you can and probably will have a replica eventually, but it shouldn't be the sole precondition for getting a durable database. And this is what horrifies me: people seem to accept that this is simply one of MongoDB's trade-offs, and that it should just be considered normal. They shouldn't. It needs more guys like the one causing all the stir, bringing up these issues, even though it's partly his fault, to show the world what can happen when worse comes to worst.

People need to ask more questions, and not just accept answers like: don't use kill -9, or always have a replica around. Servers crash, and your database needs to be able to deal with it.

Durability Improvements in MongoDB 1.7/1.8

Now, the MongoDB folks aren't completely deaf, and I'm happy to report they've been working on improvements in the area of data durability for a while, and you can play with the new durability option in the latest builds of the 1.7 branch, and just a couple of hours ago, there was activity in improving the repair tools to better deal with corrupted databases. I welcome these changes, very much so. MongoDB has great traction, a pretty good feature set, and the speed seems to blow peoples' minds. Data durability has not been one of its strengths though, so I'm glad there's been a lot of activity in that area.

If you start the MongoDB server with the new --dur option, it will start keeping a journal. When your database crashes, the journal is simply replayed to restore all changes since the last successful flush. This is not a particularly special idea, because it's how your favorite relational database has been working for ages, and it's not dissimilar to the storage model of other databases in the NoSQL space. It's a good trade-off between keeping good write speed and getting a much more durable dataset.
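
Turning it on is just a startup flag, for example:

$ mongod --dbpath /data/db --dur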

When you kill your database harshly between flushes, with a good pile of writes accumulated, you don't lose a lot of data anymore, maybe a second's worth (just as with MySQL when you use InnoDB's delayed flushing), if any at all, but not much more than that. Note that these are observations based on a build that's now already more than a month old; the situation may have improved since then. Operations are put into a buffer in memory, from where they're logged to disk into the journal and then applied to the dataset. By the time the data is written to memory, it has already been written to the journal. Journals are rotated once they reach a certain size, and it's ensured that all their data has been applied to the dataset.
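If it helps to picture that write path, here's a tiny conceptual sketch in Ruby. It's emphatically not MongoDB's implementation, just the journal-first pattern described above: log the operation, make it durable, then apply it in memory, and replay the journal on recovery.

require "json"

# Conceptual sketch only: an append-only journal in front of an in-memory
# dataset. Writes hit the journal first, then memory; recovery replays the
# journal to restore everything since the last successful flush.
class JournaledStore
  def initialize(journal_path)
    @journal_path = journal_path
    @journal = File.open(journal_path, "a")
    @data = {}
  end

  def write(key, value)
    @journal.puts({key => value}.to_json) # log the operation first
    @journal.fsync                        # make the journal entry durable
    @data[key] = value                    # then apply it in memory
  end

  def recover
    File.foreach(@journal_path) do |line| # replay after a crash
      JSON.parse(line).each { |key, value| @data[key] = value }
    end
  end
end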

A recovery process applies all uncommitted changes from the journal when the database crashes. This way it's ensured that you only lose a minimal set of data, if any at all, when your database server crashes hard. In theory the journal could be used to restore a corrupted database in a scenario as outlined above, so it's pretty neat in my opinion. Either way, the risk of losing data is now pretty low. In case you're curious about the code, the magic happens in this method.

I for one am glad to see improvements in this area of MongoDB, and I'm secretly hoping that durability will become the default mode, though for marketing reasons I don't see that happening anytime soon. Also, be aware that durability brings more overhead. In some initial tests, however, the speed difference between non-durable and durable MongoDB was almost not worth mentioning, though I wouldn't call those tests representative. In general there's really no excuse not to use it.

It's not yet production ready, but nothing should keep you from playing with it to get an idea of what it does.

Bottom Line

It's okay to accept trade-offs with whatever database you choose to your own liking. However, in my opinion, the potential of losing all your data when you use kill -9 to stop it should not be one of them, nor should accepting that you always need a slave to achieve any level of durability. The problem is less that this is MongoDB's current way of doing persistence, and more that people imply it's a good choice. I don't accept it as such. If you can live with that, which hopefully you won't have to for much longer anyway, that's fine with me, it's not my data anyway. Or maybe I'm just too paranoid.

By now it should be obvious that I'm quite fond of alternative data stores (call them NoSQL if you must). I've given quite a few talks on the subject recently, and had the honor of being a guest on the (German) heise Developer Podcast on NoSQL.

There are some comments and questions that pop up every time alternative databases are being talked about, especially from people deeply rooted in relational thinking. I've been there, I know it requires some rethinking, and I'm also quite aware that there are some controversial things that are basically the exact opposite of everything you learned in university.

I'd like to address a couple of those with some commentary and my personal experience (Disclaimer: my experience is not the universal truth, it's simply that: my experience, your mileage may vary). When I speak of things done in practice, I'm talking about how I witnessed things getting done in Real Life™, and how I've done them myself, both good and bad. I'm focussing on document databases, but in general everything below holds true for any other kind of non-relational database.

It's easy to say that all the nice features document databases offer are just aiming for one thing: to scale up. While that may or may not be true, it just doesn't matter for a lot of people. Scaling is awesome, and it's a problem everyone wants to solve, but in reality it's not the main issue, at least not for most. It's also not impossible to do even with MySQL; I've had my fun doing so, and it sure was an experience, but it can be done.

It's about getting stuff done. There's a lot more to alternative databases in general, and document databases in particular, that I like, not just the ability to scale up. They simply can make my life easier, if I let them. If I can gain productivity while still being aware of the potential risks and pitfalls, it's a big win in my book.

What you'll find, when you really think about it, is that everything below holds true no matter what database you're using. Depending on your use case, it can even apply to relational databases.

Relational Databases are all about the Data

Yes, they are. They are about trying to fit your data into a constrained schema, constrained in length, type, and other things if you see fit. They're about building relationships between your data in a strongly coupled way, think foreign key constraints. Whenever you need to store new kinds of data, you need to migrate your schema. That's what they do. They're good at enforcing a set of ground rules on your data.

See where I'm going with this? Even though relational databases tried to be a perfect fit for data, they ended up being a pain once that data needed to evolve. If you haven't felt that pain yet, good for you. I certainly have. Tabular data sounds nice in theory, and is pretty easy to handle in Excel, but in practice, it causes some pain. A lot of that pain stemmed from people using MySQL, yes, but take that argument to the guy who wrote it and sold it to people as the nicest and simplest SQL database out there.

It's easy to get your data into a schema once, but it gets a lot harder to move the schema and the data it holds into a different shape at a later point in time. While data sticks around, the schema evolves constantly. That's something relational databases aren't very good at supporting.

Relational Databases Enforce Data Consistency

They sure do, that's what they were built for. Constraints, foreign keys, all the magic tricks. Take Rails as a counter-example. It fostered the idea that all that stuff is supposed to be part of the application, not the database. Does it have trade-offs? Sure, but it's part of your application. In practice, that was correct, for the most part, although I can hear a thousand Postgres users scream. There's always an area that requires constraints on the database level, otherwise they wouldn't have been created in the first place.

But most web applications can live fine without it, they benefit from being free about their data, to shape it in whichever way they like, adding consistency on the application level. The consistency suddenly lies in your hands, a responsibility not everyone is comfortable with. You're suddenly forced to think more about edge cases. But you sure as hell don't have to live without consistent data, quite the opposite. The difference is that you're taking care of the consistency yourself, in terms of your use case, not using a generic one-fits-all solution.

Relationships between data aren't always strict. They can be loosely linked; what's the point of enforcing consistency when you don't care whether a piece of data still exists or not? And if you do care, you handle it gracefully in your application code.
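To make that concrete, here's a minimal sketch of application-level consistency using ActiveModel validations; the Order model, its attributes and its rules are made up for illustration, but the idea is that the rules live next to your use case instead of in a schema.

require "active_model"

# Consistency rules expressed in the application, tailored to the use case.
class Order
  include ActiveModel::Validations

  attr_accessor :email, :total

  validates :email, presence: true, format: {with: /\A[^@\s]+@[^@\s]+\z/}
  validates :total, numericality: {greater_than_or_equal_to: 0}
end

order = Order.new
order.email = "not-an-email"
order.total = -1
order.valid?                # => false
order.errors.full_messages  # => tells you which rules were violated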

SQL is a Standard

The basics of SQL are similar, if not the same, but under the hood there are subtle differences. Why? Because under the hood, every relational database works differently. Which is exactly what document databases acknowledge. Every database is different; trying to put a common language on top will only get you so far. If you want to get the best out of it, you're going to specialize.

Thinking in Map/Reduce, as CouchDB or Riak force you to, is no piece of cake. It takes a while to get used to the ideas around it and the implications it has for you and your data. It's worth it either way, but sometimes SQL is just a must, no question. Business reporting can be a big issue; if your company relies on standard tools being supported, you're out of luck.
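To give a rough idea of the mindset, here's a plain Ruby sketch, not CouchDB's or Riak's actual API (their map/reduce phases are typically written in JavaScript): a map step emits key/value pairs per document, and a reduce step folds the emitted values per key.

docs = [
  {"type" => "post", "author" => "mathias", "comments" => 3},
  {"type" => "post", "author" => "jan",     "comments" => 5},
  {"type" => "post", "author" => "mathias", "comments" => 1},
]

# map: emit (author, comments) for every document
mapped = docs.flat_map { |doc| [[doc["author"], doc["comments"]]] }

# reduce: sum the emitted values per key
reduced = mapped.group_by(&:first).transform_values { |pairs| pairs.sum(&:last) }

reduced # => {"mathias" => 4, "jan" => 5}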

While standards are important, in the end what matters is what you need to do with your data. If a standard gets in your way, how is that helpful? Don't expect a standard query language for document databases any time soon. They all solve different types of problems in different ways, and they don't intend to hide that from you behind a standard query language. If, on the other hand, all you need is a dynamic language for doing ad-hoc queries, check out MongoDB.

Normalized Data is a Myth

I learned a lot in uni about all the different kinds of normalization. It just sounded so nice in theory. Model your data upfront, then normalize the hell out of it, until it's as DRY as the desert.

So far so good. I noticed one thing in practice: Normalized data almost never worked out. Why? Because you need to duplicate data, even in e-commerce applications, an area that's traditionally mentioned as an example where relational databases are going strong.

Denormalizing data is simply a natural step. Going back to the e-commerce example, you need to store a lot of things separately when someone places an order: shipping and billing address, payment data used, product price and taxes, and so on. Should you duplicate all of that all over the place? Of course not, not even in a document database. Even document databases encourage keeping similar data together to a certain extent, and with some of them it's simply a must. But you're free to make these decisions on your own. They're not implying you need to stop normalizing; it still makes sense, even in a document database.
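Here's a sketch of what such a denormalized order document could look like; every attribute name and value is made up, the point being simply that the order carries its own copy of the data that mattered at the time of purchase.

order = {
  "_id"              => "order-1042",
  "customer_id"      => "customer-77",
  "placed_at"        => "2011-12-11T18:23:00Z",
  "billing_address"  => {"street" => "Gabriel-Max-Str. 3", "city" => "Berlin"},
  "shipping_address" => {"street" => "Gabriel-Max-Str. 3", "city" => "Berlin"},
  "payment"          => {"method" => "credit_card", "last_digits" => "4242"},
  "line_items"       => [
    {"sku" => "book-1", "title" => "NoSQL Handbook", "price" => 19.00, "tax" => 1.33}
  ],
  "total"            => 20.33
}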

Schemaless is not Schemaless

But there's one important thing denormalization is not about, something that's being brought up quite frequently and misunderstood easily. Denormalization doesn't mean you're not thinking about any kind of schema. While the word schemaless is brought up regularly, schemaless is simply not schemaless.

Of course you'll end up having documents of the same type, with a similar set of attributes. Some tools, for instance MongoDB, even encourage (if not force) you to store different types of documents in different collections. But here's the kicker: I deliberately used the word similar. The documents don't all need to be the same. One document can have a specific attribute while another doesn't. If it doesn't, just assume it's empty, it's that easy. If it needs to be filled at some point, write the data lazily, so that your schema eventually is complete again. It evolves naturally, which does sound easy, but in practice requires more logic in your application to catch these corner cases.

So instead of running migrations that add new tables and columns and end up pushing your data around, you migrate the data on the next access; whether that's a read or a write is up to your particular use case. In the end you simply migrate data, not your schema. The schema will evolve eventually, but first and foremost it's about the data, not the constraints it lives in. The funny thing: in larger projects, I ended up doing the same thing with a relational database. It's just easier to do and gentler on the load than running a huge batch job on a production database.
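As a sketch of what migrating on access can look like; the attribute names are made up, and collection stands in for whatever handle your driver gives you:

# If the document predates the display_name attribute, derive it on first
# access and write the document back. The "schema" completes itself over time.
def display_name_for(doc, collection)
  unless doc.key?("display_name")
    doc["display_name"] = [doc["first_name"], doc["last_name"]].compact.join(" ")
    collection.save(doc)
  end
  doc["display_name"]
end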

No Joins, No Dice

No document database supports joins, simple as that. If you need joins, you have two options: use a database that supports joins, or adapt your documents so that they remove the need for joins.

Documents have one powerful advantage: it's easy to embed other documents. If there's data you'd usually fetch using a join, and it's suitable for embedding (and therefore oftentimes: denormalizing), there's your second option. Going back to the e-commerce example: whereas in a relational database you'd need a lot of extra tables to keep that data around (unless you're serializing it into a single column), in a document database you just add it as embedded data to the order document. You have all the important data in one place, and you're able to fetch it in one go. Someone said that relational databases are a perfect fit for e-commerce. Funny, I've worked on a market platform, and I've found that to be a ludicrous statement. I'd have benefited from a looser data store several times, joins be damned.
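Sticking with that example, here's a hedged sketch using the old mongo Ruby gem (1.x-era API) and made-up names: the embedded data comes back with the order in a single lookup, no joins involved.

require "mongo"

orders = Mongo::Connection.new("localhost", 27017).db("shop").collection("orders")

order = orders.find_one("_id" => "order-1042")
if order
  puts order["shipping_address"]["city"]
  order["line_items"].each { |item| puts "#{item["title"]}: #{item["price"]}" }
end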

It's not always viable, sure, and it'd be foolish to stick with a document database if joins are an important criterion for your particular use case. Then no dice: it's relational data storage or bust.

Of course there's secret option number three, which is to just ignore the problem until it's a problem, going with a document database and seeing how it works out, but obviously that doesn't come without risks. It's worth noting though that Riak supports links between documents, and even fetching linked documents together with the parent in one request. In CouchDB, on the other hand, you can emit linked documents in views. You can't be fully selective about the document data you're interested in, but if all you want is to fetch linked documents, there are one or two ways to do that. Also, graph databases have made it their main focus to make traversal of associated documents an incredibly cheap operation. Something your relational database is pretty bad at.

Documents killed my Model

There's this myth that you just stop thinking about how to model your data with document databases or key-value storage. That myth is downright wrong. Just because you're using schemaless storage doesn't mean you stop thinking about your data, quite the opposite, you think even more about it, and in different ways, because you simply have more options to model and store it. Embedding documents is a nice luxury to have, but isn't always the right way to go, just like normalizing the crap out of a schema isn't always the way to go.

It's a matter of discipline, but so is relational modelling. You can make a mess of a document database just like you can make a mess of a relational database. When you migrate data on the fly in a document database, there's more responsibility in your hands, and it requires good care with regards to testing. The same is true for keeping track of data consistency. It's been moved from the database into your application's code. Is that a bad thing? No, it's a sign of the times. You're in charge of your data, it's not your database's task anymore to ensure it's correct and valid, it's yours. With great power comes great responsibility, but I sure like that fact about document databases. It's something I've been missing a lot when working with relational databases: The freedom to do whatever the heck I want with my data.

Read vs. Write Patterns

I just like including this simply because it always holds true, no matter what kind of database you're using. If you're not thinking about how you're going to access your data with both reads and writes, you should do something about that. In the end, your schema should reflect your business use case, but what good is that when it's awkward to access the data, when it takes joins across several tables to fetch the data you're interested in?

If you need to denormalize to improve read access, go for it, but be aware of the consequences. A schema is easy to build up, migrating on the go, but if document databases force you to do one thing, and one thing only, it's to think about how you're reading and writing your data. It's safe to say that you're not going to figure it all out upfront, but you're encouraged to put as much effort into it as you can. When you find out you're wrong down the line, you might be surprised to find that they make it even easier to change paths.

Do your Homework

Someone recently wrote a blog post on why he went back to MySQL from MongoDB, and one of his reasons was that it doesn't support transactions. While this is a stupid argument to bring up in hindsight, it makes one thing clear: you need to do the research yourself, no one's going to do it for you. If you don't want to live up to that, use the tools you're familiar with, no harm done.

It should be pretty clear up front what your business use case requires, and what tools may or may not support you in fulfilling these requirements. Not all tool providers are upfront about all the downsides, but hey, neither was MySQL. Read up, try and learn. That's the only thing you can do, and no one will do it for you. Nothing has changed here, it's simply becoming more obvious, because you suddenly have a lot more options to work with.

Polyglot Data Storage

Which brings me to the most important part of them all: Document databases (and alternative, non-relational data stores in general) are not here to replace relational databases. They're living alongside of them, with both sides hopefully somewhat learning from each other. Your projects won't be about just one database any more, it's not unlikely you're going to end up using two or more, for different use cases.

Polyglot persistence is the future. If there's one thing I'm certain of, this is it. Don't let anyone fool you into thinking that their database is the only one you'll need, they all have their place. The hard part is to figure out what place that is. Again, that's up to you to find out. People ask me for particular use cases for non-relational databases, but honestly, there is no real distinction. Without knowing the tools, you'll never find out what the use cases are. Other people can just give you ideas, or talk about how they're using the tools, they can't draw the line for you.

Back to the Future

You shouldn't think of it as something totally new, document databases just don't hide these things from you. Lots of the things I mentioned here are things you should be doing anyway, no matter if you're using a relational or a non-relational data store. They should be common sense really. We're not trying to repeat what went wrong in history, we're learning from it.

If there's one thing you should do, it's to start playing with one of the new tools immediately. I shouldn't even be telling you this, since you should hone your craft all the time, and that includes playing the field and broadening your personal and professional horizon. Only then will you be able to judge what use case is a good fit for e.g. a document database. I'd highly suggest starting to play with e.g. CouchDB, MongoDB, Riak or Redis.

For an article in a German magazine I've been researching MongoDB over the last week or so. While I didn't need a lot of the information I came across I collected some nicely distilled notes on some of its inner workings. You won't find information on how to get data out of or into MongoDB. The notes deal with the way MongoDB treats and handles your data, a high-low-level view if you will. I tried to keep them as objective as possible, but I added some commentary below.

Most of this is distilled knowledge I gathered from the MongoDB documentation; credit for making such a good resource available for us to read goes to the Mongo team. I added some of my own conclusions where it made sense. They're doing a great job documenting it, and I can highly recommend spending time to go through as much of it as possible to get a good overview of the whys and hows of MongoDB. Also, thanks to Mathias Stearn for hooking me up with some more details on future plans and inner workings in general. If you want to know more about its inner workings, there's a webcast coming up where they're going to explain how it works.

Basics

  • Name stems from humongous, though (fun fact) mongo has some unfortunate meanings in other languages than English (German for example)
  • Written in C++.
  • Lots of language drivers available, pushed and backed by the MongoDB team. Good momentum here.
  • According to The Changelog Show ([1]) MongoDB was originally part of a cloud web development platform, and at some point was extracted from the rest, open sourced and turned into what it is today.

Collections

  • Data in MongoDB is stored in collections, which in turn are stored in databases. Collections are a way of storing related data (think relational tables, but sans the schema). Collections contain documents, which in turn have keys, another name for attributes.
  • Data is limited to around 2 GB on 32-bit systems, because MongoDB uses memory-mapped files, as they're tied to the available memory addressing. (see [2])
  • Documents in collections usually have a similar data structure, but any arbitrary kind of document could be stored; similarity is recommended for index efficiency. Documents can have a maximum size of 4MB.
  • Collections can be namespaced, i.e. logically nested: db.blog.posts, but the collection is still flat as far as MongoDB is concerned, purely an organizational means. Indexes created on a namespaced collection only seem to apply to the namespace they were created on though.
  • A collection is physically created as soon as the first document is created in it.
  • Default limit on number of namespaces per database is 24000 (includes all collections as they're practically the top level namespace in a database), which also includes indexes, so with the maximum of 40 indexes applied to each collection you could have 585 collections in a database. The default can be changed of course, but requires repairing the database if changed on an active instance.
  • While you can put all your data into one single collection, from a performance point of view, it seems to make sense to separate them into different collections, because it allows MongoDB to keep its indexes clean, as they won't index attributes for totally unrelated documents.

Capped Collections

  • Capped collections are fixed-size collections that automatically remove the oldest entries once they're full. Sounds fancier than it probably is; I'm thinking that documents are just appended at the last writing index, which is reset to 0 when the collection's limit is reached. Preferable for insert-only use cases: updates of existing documents fail when the data size is larger than before the update. This makes sense because moving an object would destroy the natural insertion order. Limited to ~1GB on 32-bit systems, sky's the limit on 64-bit.
  • Capped collections seem like a good tool for logging data, knowing full well that old data is purged automatically, replaced with new data when the limit is reached (there's a short usage sketch after this list). Documents can't be deleted, only the entire collection can be dropped. Capped collections have no index on the _id by default, ensuring good write performance. Indexes are generally not recommended, to ensure high write performance. No index on _id means that walking the collection is preferred over looking up documents by a key.
  • Documents fetched from a capped collection are returned in the order of their insertion; reverse the natural order to get the newest first, think log tailing.
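Here's a short sketch of the logging use case with the old mongo Ruby gem (1.x-era API); the collection name and size are made up.

require "mongo"

db = Mongo::Connection.new.db("app")

# Fixed-size collection: once the 50 MB are full, the oldest entries are
# overwritten automatically.
logs = db.create_collection("logs", :capped => true, :size => 50 * 1024 * 1024)

logs.insert("level" => "info", "message" => "user signed in", "at" => Time.now)
logs.find.each { |entry| puts entry["message"] } # walks in insertion order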

Data Format

  • Data is stored and queried in BSON, think binary-serialized JSON-like data. Its features are a superset of JSON, adding support for regular expressions, dates, binary data, and their own object id type. All strings are stored in UTF-8 in BSON; sorting, on the other hand, doesn't take that into account (yet), it uses strcmp, so the order might be different from what you'd expect. There's a sort of specification for BSON, if you're into that kind of stuff: [3] and [4]
  • Documents are not identified by a simple ID, but by an object identifier type, optimized for storage and indexing. Uses machine identifier, timestamp and process id to be reasonably unique. That's the default, and the user is free to assign any value he wishes as a document's ID.
  • MongoDB has a "standard" way of storing references to other documents using the DBRef type, but it doesn't seem to have any advantages (e.g. fetch associated objects with parent) just yet. Some language drivers can take the DBRef object and dereference it.
  • Binary data is serialized in little-endian.
  • BSON being a binary format, MongoDB doesn't have to parse documents the way it would JSON; they're already a valid in-memory representation when they come across the wire.

References

  • Documents can embed a tree of associated data, e.g. tags, comments and the like, instead of storing them in different MongoDB documents. This is not specific to MongoDB but true of document databases in general (see [5]), but when using find you can dereference nested objects with dot notation, e.g. blog.posts.comments.body, and index them with the same notation.
  • It's mostly left to the language drivers to implement automatic dereferencing of associated documents.
  • It's possible to reference documents in other databases.

Indexes

  • Every document gets a default index on the _id attribute, which also enforces uniqueness. It's recommended to index any attribute that's being queried or sorted on.
  • Indexes can be set on any attribute or on embedded attributes and documents. Indexes can also be created on multiple attributes, additionally specifying a sort order (see the sketch after this list).
  • If an array attribute is indexed, MongoDB will index all the values in it (Multikeys).
  • Unique keys are possible; missing attributes are set to null to ensure a document with the same missing attribute can only be stored once.
  • If it can, MongoDB will only update indexes on keys that changed when updating a document, but only if the document hasn't grown so much in size that it must be moved.
  • MongoDB up to 1.2 creates and updates indexes synchronously; 1.3 has support for updating indexes in the background.
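For reference, creating indexes from the old mongo Ruby gem looks roughly like this (a sketch, with made-up collection and attribute names):

require "mongo"

posts = Mongo::Connection.new.db("blog").collection("posts")

posts.create_index("author")                             # single attribute
posts.create_index([["author",     Mongo::ASCENDING],
                    ["created_at", Mongo::DESCENDING]])  # multiple attributes with sort order
posts.create_index("slug", :unique => true)              # unique index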

Updates

  • Updates to documents are in-place, allowing for partial updates and atomic operations on attributes (set for all attributes; incr, decr on numbers; push, pop, pull et al. on arrays), also known as modifier operations (see the sketch after this list). If an object grows out of the space originally allocated for it, it'll be moved, which is obviously a lot slower than updating in-place, since indexes need to be updated as well. MongoDB tries to adapt by allocating based on update history (see [6]). Writes are lazy.
  • Not using any modifier operation will result in the full document being updated.
  • Updates can be done with criteria, affecting a whole bunch of matching documents. Think "update ... where" in SQL. This allows for updating objects based on a particular snapshot, i.e. an update based on the id and some value in the criteria will only go through when the document still has that value. This kind of update is atomic. Reliably updating multiple documents atomically (think transaction) is not possible. There's also findAndModify in 1.3 (see [7]), which allows atomically updating and returning a document.
  • Upserts insert when a record with the given criteria doesn't exist, otherwise updates the found record. They're executed on the collection. A normal save() will do that automatically for any given document. Think find_or_create_by in ActiveRecord.
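Here's what modifier operations and an upsert look like from the old mongo Ruby gem (a sketch; collection and attribute names are made up):

require "mongo"

posts = Mongo::Connection.new.db("blog").collection("posts")

# Atomic, in-place modifiers: increment a counter, push onto an array.
posts.update({"_id" => "post-1"},
             {"$inc"  => {"views" => 1},
              "$push" => {"tags"  => "nosql"}})

# Upsert: update the matching document, or insert it if it doesn't exist yet.
posts.update({"slug" => "consistent-hashing"},
             {"$set" => {"title" => "Consistent Hashing"}},
             :upsert => true)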

Querying

  • Results are returned as cursors, which walk the collection as they advance. That explains why you can potentially get records that needed to be moved: they pop up again in a spot that happens to be after the cursor's current position, even if their old spot was before it. Cursors are fetched in batches of 100 documents or 4 MB of data, whichever is reached first.
  • That's also why it's better to store similar data in a separate collection. Traversing similar data is cheaper than traversing totally unrelated data: the bigger the documents compared to the ones that actually match your find, the more data has to be fetched from the database and skipped when it doesn't match your criteria.
  • Data is returned in natural order which doesn't necessarily relate to insertion order, as data can be moved if it doesn't fit into its old spot anymore when updated. For capped collections, natural order is always insertion order.

Durability

  • By default, data in MongoDB is flushed to disk every 60 seconds. Writes to MongoDB (i.e. document creates, updates and deletes) are not stored on disk until the next sync. It's a tradeoff of high write performance vs. durability; if you need more durability, reduce the sync delay. The closest comparison to this durability behaviour is MySQL's MyISAM.
  • Data is not written transactional, so if the server is killed during a write operation, the data is likely to be inconsistent or even corrupted and needs repair. Think classic file systems like ext2 or MyISAM.
  • In MongoDB 1.3 a database flush to disk can be enforced by sending the fsync command to the server (see the sketch after this list).
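From the Ruby driver, forcing a flush is a one-liner (a sketch; the fsync command itself is the part that matters):

require "mongo"

db = Mongo::Connection.new.db("app")
db.command({"fsync" => 1}) # force a flush of pending writes to disk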

Replication

  • Replication is the recommended way of ensuring data durability and failover in MongoDB. A new (i.e. bare and dataless) instance can be hooked onto another at any time, doing an initial cloning of all data, fetching only updates after that.
  • Replica pairs offer an auto-failover mechanism. Initially both settle on which is master and which is slave, the slave taking over should the master go down. Can be used e.g. in the Ruby driver using :left and :right options. There's an algorithm to handle changes when master and slave get out of sync, but it's not fully obvious to me (see [8]). Replica pairs will be replaced by replica sets, allowing for more than one slave. The slave with the most recent data will be promoted to master in case of the master going down. The slaves agree which one of them is the new master, so a client could ask any one server in the set which one of them is the master.
  • Replication is asynchronous, so updates won't propagate immediately to the slaves. There are ideas to require a write to be propagated to at least N slaves before returning it to the client as successful (similar to the feature in MySQL 5.4). (see [9])
  • A master collects its writes in an oplog on which the slaves simply poll for changes. The oplog is a capped collection and therefore not a fully usable transaction log (not written to disk?), as old data is purged automatically, hence it's not reliable for restoring the database after a crash.
  • After the initial clone, slaves poll once on the full oplog; subsequent polls remember the position where the previous poll ended.
  • Replication is not transactional, so data on the slave is subject to the same durability conditions as on the master. It still increases durability overall, since having a slave allows you to decrease the sync time on it, shortening the timespan during which data isn't written to disk anywhere in the setup.

Caching

  • With the default storage engine, caching is basically handled by the operating system's virtual memory manager, since it uses memory-mapped files. File cache == Database cache
  • Caching behaviour relies on the operating system, and can vary, not necessarily the same on every operating system.

Backup

  • If you can live with a temporary write lock on your database, MongoDB 1.3 offers fsync with lock to take a reliable snapshot of the database's files.
  • Otherwise, take the old school way of dumping the data using mongodump, or snapshotting/dumping from a slave database.

Storage

  • Data is stored in subsequently numbered data files, each new one being larger than the former, 2GB being the maximum size a data file can have.
  • Allocation of new datafiles doesn't seem to be exactly related to the amount of data currently being stored. E.g. storage size returned by MongoDB for a collection was 2874825392 bytes, but it had already created almost six gigabytes worth of database files. Maybe that's the result of padding space for records. I haven't found a clear documentation on this behaviour.
  • When MongoDB moves data into a different spot or deletes documents, it keeps track of the free space to reuse it in the future. The command repairDatabase() can be used to compact it, but that's a slow and blocking operation.

Concurrency

  • MongoDB refrains from using any kind of locking on data; it has no notion of transactions or isolation levels. Concurrent writes will simply overwrite each other's data, as they go straight to memory. Exceptions are modifier operations, which are guaranteed to be atomic. As there is no way to update multiple records in some sort of transaction, optimistic locking is not possible, at least not in a fully reliable way. Since writes are in-place and in-memory first, they're wicked fast.
  • Reads from the database are usually done in cursors, fetching a batch of documents lazily while iterating through it. If records in the cursors are updated while the cursor is being read from, the updated data may or may not show up. There's no kind of isolation level (as there are no locks or snapshotting). Deleted records will be skipped. If a record is updated from another process so that the size increases and the object has to be moved to another spot there's a chance it's returned twice.
  • There are snapshot queries, but even they may or may not return inserted and deleted records. They do ensure that even updated records will be returned only once, but are slower than normal queries.

Memory

  • New data is allocated in memory first; increments seem to be fully related to the amount of data saved.
  • MongoDB seems to be happy to hold on to whatever memory it can get, but at least during fsync it frees as much as possible. Sometimes it just went back to consuming about 512 MB real memory, other times it went down to just a couple of megs, I couldn't for the life of me make out a pattern.
  • When a new database file needs to be created, it looks like MongoDB is forcing all data to be flushed to disk, freeing a dramatic amount of memory. On normal fsyncs, there's no real pattern as to how MongoDB frees memory.
  • It's not obvious how a user can configure how much memory MongoDB can or should use; I guess it's not possible as of now. Memory-mapped files probably just use whatever's available and are cleaned up automatically by the operating system's virtual memory system.
  • The need to add an additional caching layer is reduced, as the object and database representations are the same, and file system and memory cache can work together to speed up access. There's no data conversion involved, at least not on MongoDB's side; data will just be sent serialized and unparsed across the wire. Obviously it depends on the use case whether this is really an advantage or a secondary caching layer is still needed.

GridFS

  • Overcomes the 4MB limit on documents by chunking larger binary objects into smaller bits (see the usage sketch after this list).
  • Can store metadata alongside file data. Metadata can be specified by the user and be arbitrary, e.g. contain access control information, tags, etc.
  • Chunks can be accessed randomly, so it's possible to easily fetch data whose position in the file is well-known. If random access is required, it makes sense to keep chunks small. Default size is 256K.
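Basic usage from the old mongo Ruby gem looks roughly like this (a sketch; the file name and metadata are made up, and the exact options are from memory, so treat them as an assumption):

require "mongo"

db   = Mongo::Connection.new.db("app")
grid = Mongo::Grid.new(db)

# Store a file in chunks, with arbitrary metadata alongside it.
id = grid.put(File.read("avatar.png"),
              :filename => "avatar.png",
              :metadata => {"owner" => "roidrage", "tags" => ["profile"]})

grid.get(id).read # read it back in one go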

Protocol Access

  • MongoDB's protocol is binary and in its own right proprietary, hence they offer a lot of language drivers to take that pain away from developers, but also offer a full specification on both BSON and the protocol.

Sharding

  • MongoDB has alpha support for sharding. Its functionality shouldn't be confused with Riak's way of partitioning, it's a whole different story. The current functionality is far from what is planned for production, so take everything listed here with a grain of salt, it merely presents the current state. The final sharding feature is supposed to be free of the restrictions listed here.
  • A shard ideally (but not necessarily) consists of two servers which form a replica pair, or a replica set in the future.
  • All shards are known to a number of config server instances that also know how and where data is partitioned to.
  • Data can be sharded by a specific key. That key can't be changed afterwards, neither can the key's value.
  • Keys chosen should be granular enough that you don't end up with too many records sharing the same key. Data is split into chunks of 50 MB, so with big documents it's probably better to store them in GridFS, as a chunk can contain as few as ~12 documents when they all take up the available space of 4 MB.
  • Sharding is handled by a number of mongos instances which are connected to the shards which in turn are all known to a number of mongod config server instances. These can run on the same machines as the data-handling mongod instances, with the risk that when the servers go down they also disappear. Having backup services seems to be appropriate in this scenario.
  • Sharding is still in alpha, e.g. currently replicated shards aren't supported in alpha 2, so a reliable sharding setup is currently not possible. If a shard goes down, the data on it is simply unavailable until it's brought back up. Until that happens, all reads will raise an error, even when looking up data that's known to be on the still available shards.
  • There's no auto-balancing to move chunks to new shards, but that can be done manually.

There you go. If you have something to add or to correct, feel free to leave a comment. I'm happy to stand corrected should I have drawn wrong conclusions anywhere.

As a user of CouchDB I gotta say, I was quite sceptical about some of MongoDB's approaches of handling data. Especially durability is something that I was worried about. But while I read through the documentation and played with MongoDB I realized that it's the same story as always: It depends. It's a problem when it's a problem. CouchDB and MongoDB don't necessarily cover the same set of use cases. There's enough use cases where the durability approach of MongoDB is acceptable compared to what you gain, e.g. in development speed, or speed when accessing data, because holy crap, that stuff is fast. There's a good reason for that, as I hope you'll agree after going through these notes. I'm glad I took the time to get to know it better, because the use cases kept popping up in my head where I would prefer it over CouchDB, which isn't always a sweet treat either.

If you haven't already, do give MongoDB a spin, go through their documentation, throw data at it. It's a fun database, and the entrance barrier couldn't be lower. It's a good combination of relational database technologies, with schemaless and JavaScript sprinkled on top.


Amazon recently released DynamoDB, a database whose name is inspired by Dynamo, the distributed key-value store they've been running in production for a good five to six years now. I think it's great they've finally done it, though from my observations there's little resemblance to what the original Dynamo paper describes, but I'm getting ahead of myself. Traditionally Amazon hasn't been very open about how they implement their services, so some of what I'm stating here may be nothing more than an educated guess. Either way, the result is pretty neat.

Time to take a good look at what it has to offer, how that works out in code, and to make some wild guesses as to what's happening under the covers. I'm using fog's master to talk to DynamoDB in the examples below, but the official Ruby SDK also works. Fog is closer to the bare API metal though, which works out well for our purposes.

My goal is not to outline the entire API and its full set of options, but to dig into the bits most interesting to me and to show some examples. A lot of the focus in other posts is on performance, scalability and operational ease. I think that's a great feature of DynamoDB, but it's pretty much the same with all of Amazon's web services. So instead I'm focusing on the effects DynamoDB has on you, the user. We'll look at the API, general usage, the data model and what DynamoDB's feature set generally entails.

The Basics

DynamoDB is a distributed database in the cloud, no surprises, and it's not the first such thing Amazon has in its portfolio. S3, SimpleDB, RDS and DynamoDB all provide redundant ways to store different types of data in Amazon's datacenters. Currently, DynamoDB only supports the us-east region.

An important difference is that data stored in DynamoDB is officially stored on SSDs, which has (or at least should have) the benefit of offering predictable performance and greatly reduced latency across the board. The question that remains is, of course: when can I hook up SSDs to my EC2 instances?

The other big deal is that the read and write capacity available to you is configurable. You can tell Amazon how much capacity, in read and write units per second, you expect for your application, and they make sure that capacity is available to you as long as you need it (and pay for it).

Data Model

Data in DynamoDB is stored in tables, a logical separation of data if you will. Table names are only unique on a per-user basis, not globally like S3 buckets.

Tables store items, think rows of data or documents. Items have attributes and values, similar to SimpleDB. Think of it as a JSON document where you can read and write attributes independent of the entire document. Every row can have different attributes and values, but every item needs to have a uniquely identifying key whose name you decide on upfront, when you create the table.

Let's create a table first, but make sure you wear protection, as this is yet another of Amazon's gross-looking APIs. The relevant API action is CreateTable.

dynamo = Fog::AWS::DynamoDB.new(aws_access_key_id: "YOUR KEY",
  aws_secret_access_key: "YOUR SECRET")

dynamo.create_table("people",
  {HashKeyElement: {AttributeName: "username", AttributeType: "S"}},
  {ReadCapacityUnits: 5, WriteCapacityUnits: 5})

This creates a table called "people" with an attribute "username" as the indexing key. This is the key you're using to reference a row, sorry, an item, in a table. The key is of type string, hence the S, you can alternatively specify an N for numeric values. This pattern will come up with every attribute and value pair you store, you always need to tell DynamoDB if it's a string or a number. You also specify initial values for expected read and write capacity, where 5 is the minimum.

You have to provide a proper value for a key, DynamoDB doesn't offer any automatic generation of keys for you, so choose them wisely to ensure a good distribution. A key that has a lot of data behind it and that's accessed much more frequently than others will not benefit you. Keep your data simple and store it with multiple keys if you can.

Anyhoo, to insert an item, you can use the PutItem API action, to which you issue a HTTP POST (go figure). The DynamoDB API already gives me headache, but more on that later. Thankfully, a good client library hides the terrible interface from us, only leaving us with the hash structures sent to DynamoDB.

dynamo.put_item("people", {username: {"S" => "roidrage"}})

Specify multiple attributes as separate hash entries, each pointing to a separate hash specifying data type and value. I'll leave it to you to think about how backwards this is, though it must be said that JavaScript is also to blame here, not handling 64 bit integers very well.

You can store lists (or rather sets of data) too, using SS as the datatype.

dynamo.put_item("people", {username: {S: "roidrage"},
                               tags: {SS: ["nosql", "cloud"]}})

Note that you can use PutItem on existing items, but you'll always replace all existing attributes and values.

Conflicts and Conditional Writes

All writes can include optional conditions to check before updating an item. That way you can ensure the item you're writing is still in the same state as your local copy of the data. Think: read, update some attribute, then write, expecting and ensuring to not have any conflicting updates from other clients in between.

This is a pretty neat feature: you can base an update of one attribute on whether another attribute exists or has a certain value.

dynamo.put_item("people",
    {username: {S: "roidrage"}, fullname: {S: "Mathias Meyer"}},
    {Expected: {fullname: {Exists: false}}})

This operation is only executed if the item as currently stored in DynamoDB doesn't have the attribute fullname yet. On a subsequent PutItem call, you could also check if the item's fullname attribute still has the same value, e.g. when updating the full name of user "roidrage".

dynamo.put_item("people",
    {username: {S: "roidrage"}, fullname: {S: "Mathias Meyer"}},
    {Expected: {fullname: {Value: {S: "Mathias Meyer"}}}})

It's not the greatest syntax for sure, but it does the job. Which brings me to some musings about consistency. If the condition fails, Amazon returns a 400 response, just like with every other possible error you could cause.

To make proper use of conditional updates in your application and to actually prevent conflicting writes and lost updates, you should use the UpdateItem action instead and only update single attributes, as PutItem always replaces the entire item in the database. But even then, make sure to always reference the right attributes in an update. You could even implement some sort of versioning scheme on top of this, for instance to emulate multi-version concurrency control.
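As a sketch of such a versioning scheme, here with PutItem since replacing the whole item is exactly what the version check is guarding (the version attribute is my own invention, not something DynamoDB provides):

# Write version 2 only if version 1 is still what's stored; if another client
# got there first, the condition fails and Amazon returns a 400.
dynamo.put_item("people",
    {username: {S: "roidrage"},
     fullname: {S: "Mathias Meyer"},
     version:  {N: "2"}},
    {Expected: {version: {Value: {N: "1"}}}})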

Updating single or multiple attributes is done with the UpdateItem action. You can update attributes without affecting others and add new attributes as you see fit. To add a new attribute street, here's the slightly more verbose code.

dynamo.update_item("people", {HashKeyElement: {S: "roidrage"}},
    {street: {Value: {S: "Gabriel-Max-Str. 3"}}})

There are more options involved than with PutItem, and there are several more available for UpdateItem. But they'll have to wait for just a couple of sentences.

Consistency on Writes

Conditional updates are the first hint that DynamoDB is not like Dynamo at all, as they assume some sort of semi-transactional semantics. Wherein a set of nodes agree on a state (the conditional expression) and then all apply the update. The same is true for atomic counters, which we'll look at in just a minute.

From the documentation it's not fully clear how writes without a condition or without an atomic counter increment are handled, or what happens when two clients update the same attribute at the same time, and who wins based on which condition. Facing the user, there is no mechanism to detect conflicting writes. So I can only assume DynamoDB either lets the last write win or has a scheme similar to BigTable, using timestamps for each attribute.

Writes don't allow you to specify something like a quorum, telling DynamoDB how consistent you'd like the write to be; it seems to be up to the system to decide when and how quickly replication to other datacenters is done. Alex Popescu's summary on DynamoDB and Werner Vogels' introduction suggest that writes are replicated across data centers synchronously, but they don't say to how many. On a wild guess, two data centers would make the write durable enough, leaving the others to asynchronous replication.

Consistency on Reads

For reads, on the other hand, you can tell DynamoDB if stale data is acceptable to you, resulting in an eventually consistent read. If you prefer strongly consistent reads, you can specify that on a per-operation basis. What works out well for Amazon is the fact that strongly consistent reads cost twice as much as eventually consistent reads, as more coordination and more I/O are involved. From a strictly calculating view, a strongly consistent read takes up twice as much capacity as an eventually consistent one.

From this I can only assume that writes, unless conditional, will usually be at least partially eventually consistent, or at least not immediately spread out to all replicas. On reads, on the other hand, you can tell DynamoDB to involve more than just one or two nodes and reconcile outdated replicas before returning the data to you.

Reading Data

There's not much to reading single items. The action GetItem allows you to read entire items or select single attributes to read.

dynamo.get_item("people", {HashKeyElement: {S: "roidrage"}})

Optionally, add the consistency option to get strong consistency, with eventual consistency being the default.

dynamo.get_item("people", {HashKeyElement: {S: "roidrage"}},
                          {ConsistentRead: true})

A read without any more options always returns all data, but you can select specific attributes.

dynamo.get_item("people", {HashKeyElement: {S: "roidrage"}},
                          {AttributesToGet: ["street"]})

Atomic Counters

Atomic counters are a pretty big deal. This is what users want out of pretty much every distributed database. Some say they're using it wrong, but I think they're right to ask for it. Distributed counters are hard, but not impossible.

Counters need to be numerical fields, and you can increment and decrement them using the UpdateItem action. To increment a field by a value, specify a numerical attribute, the value to be incremented by, and use the action ADD. Note that the increment value needs to be a string.

dynamo.update_item("people", {HashKeyElement: {S: "roidrage"}},
                             {logins: {Value: {N: "1"}, Action: "ADD"}})

When the counter attribute doesn't exist yet, it's created for you and set to 1. You can also decrement values by specifying a negative value.

dynamo.update_item("people", {HashKeyElement: {S: "roidrage"}},
                             {logins: {Value: {N: "-1"}, Action: "ADD"}})

Can't tell you how cool I think this feature is. Even though we keep telling people that atomic counters in distributed systems are hard, as they involve coordination and increase vulnerability to the failure of a node involved in an operation, systems like Cassandra and HBase show that it's possible to do.

Storing Data in Sets

Other than numerical (which need to be specified as strings nonetheless) and string data types, you can store sets of both too. One member can only exist once in an attribute set. The neat part is that you can atomically add new members to sets using the UpdateItem action. In the JSON format sent to the server, you always reference even just single members to add as a list of items. Here's an example:

dynamo.update_item("people", {HashKeyElement: {S: "roidrage"}},
                             {tags: {Value: {SS: ["nosql"]}}})

That always replaces the existing attribute though. To add a member you need to specify an action. If the member already exists, nothing happens, but the request still succeeds.

dynamo.update_item("people", {HashKeyElement: {S: "roidrage"}},
                             {tags: {Value: {SS: ["cloud"]}, Action: "ADD"}})

You can delete elements in a similar fashion, by using the action DELETE.

dynamo.update_item("people", {HashKeyElement: {S: "roidrage"}},
                             {tags: {Value: {SS: ["cloud"]}, Action: "DELETE"}})

The default action is PUT, which replaces the listed attributes with the values specified. If you run PUT or ADD on a key that doesn't exist yet, the item will be automatically created. This little feature and handling of sets and counters in general sounds a lot like things that made MongoDB so popular, as you can atomically push items onto lists and increment counters.

This is another feature that's got me thinking about whether DynamoDB even includes the tiniest bit of the original Dynamo. You could model counters and sets based on something like Dynamo for sure, based on the ideas behind Commutative Replicated Data Types. But I do wonder if Amazon actually did go through all the trouble of building that system on top of the traditional Dynamo, or if they implemented something entirely new for this purpose. There is no doubt that operations like these are what a lot of users want even from distributed databases, so either way, they've clearly hit a nerve.

Column Store in Disguise?

The fact that you can fetch and update single attributes with consistency guarantees makes me think that DynamoDB is actually more like a wide column store like Cassandra, HBase or, gasp, Google's BigTable. There doesn't seem to be anything left from the original, content-agnostic idea of the Dynamo data store, whose name DynamoDB so nicely piggybacks on.

The bottom line is there's always a schema of attributes and values for a particular item you store. What you store in an attribute is up to you, but there's a limit of 64KB per item.

DynamoDB assumes there's always some structure in the data you're storing. If you need something content-agnostic to store data but with similar benefits (replication, redundancy, fault-tolerance), use S3 and CloudFront. Nothing's keeping you from using several services obviously. If I used DynamoDB for something it'd probably not be my main datastore, but an add-on for data I need to have highly available and stored in a fault-tolerant fashion, but that's a matter of taste.

A Word on Throughput Capacity

Whereas you had to dynamically add capacity in self-hosted database systems, always keeping an eye on current capacity limits, you can add more capacity to handle more DynamoDB requests per second simply by issuing an API call.

Higher capacity means paying more. To save money, you could even adjust the capacity on a time-of-day basis, scaling up and down with your traffic. If you go beyond your configured throughput capacity, operations may be throttled.

Throughput capacity is based on size of items you read and the number of reads and writes per second. Reading one item with a size of 1 KB corresponds to one read capacity unit. Add more required capacity units with increased size and number of operations per second.
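As a rough back-of-the-envelope sketch, using only the rules above (one read unit per strongly consistent read of a 1 KB item per second, eventually consistent reads at half the cost); the exact rounding rules are Amazon's, so treat the numbers as approximations:

item_size_kb     = 4
reads_per_second = 100

strong_read_units   = item_size_kb * reads_per_second # => 400
eventual_read_units = strong_read_units / 2           # => 200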

This is the metric you always need to keep an eye on and constantly measure from your app, not just to validate the invoice Amazon sends you at the end of the month, but also to track your capacity needs at all times. Luckily you can track this using CloudWatch, which can trigger predefined alerts when capacity crosses a certain threshold. Plus, every response to a read includes the consumed read capacity, and the same is true for writes.

Throughput capacity pricing is pretty ingenious on Amazon's end. You pay for what you reserve, not for what you actually use. As you always have to have more capacity available than you currently need, you always need to reserve more in DynamoDB. But if you think about it, this is exactly how you'd work with your own hosted, distributed database. You'll never want to work very close to capacity, unless you're some crazy person.

Of course you can only scale down throughput capacity once per day, but you can scale up as much as you like, and increases need to be at least 10%. I applaud you for exploiting every possible opportunity to make money, Amazon!

Data Partitioning

Amazon's documentation suggests that they don't use random partitioning to spread data across partitions; partitioning is instead done per table. Partitions are created as you add data and as you increase capacity, suggesting either some sort of composite key scheme or a tablet-like partitioning scheme, again similar to what HBase or BigTable do. This is more fiction than fact, it's an assumption on my part. A tablet-like approach certainly would make distributing and splitting up partitions easier than having a fixed partitioning scheme upfront like in Dynamo.

The odd part is that Amazon actually makes you worry about partitions but doesn't seem to offer any way of telling you about them or how your data is partitioned. Amazon seems to handle partitioning mostly automatically and increases capacity by spreading out your data across more partitions as you scale capacity demand up.

Range Keys

Keys can be single values or based on a key-attribute combination. This is a pretty big deal as it effectively gives you composite keys in a distributed database, think Cassandra's data model. This effectively gives you a time series database in the cloud, allowing you to store sorted data.

You can specify a secondary key on which you can query by a range, and which Amazon automatically indexes for you. This is yet another feature that makes DynamoDB closer to a wide column store than the traditional Dynamo data store.

The value of the range key could be an increasing counter (though you'd have to take care of that yourself), a timestamp, or a time-based UUID. Of course it could be anything else entirely, but time series data is just a nice example for range keys. The neat part is that this way you can extract data for specific time ranges, e.g. for logging or activity feeds.

We already looked at how you define a normal hash key, let's look at an example with a more complex key combining a hash key and a range key, using a numerical type to denote a timestamp.

dynamo.create_table("activities", {
      HashKeyElement: {AttributeName: "username", AttributeType: "S"},
      RangeKeyElement: {AttributeName: "created_at", AttributeType: "N"}},
    {ReadCapacityUnits: 5, WriteCapacityUnits: 5})

Now you can insert new items based on both the hash key and a range key. Just specify at least both attributes in a request.

dynamo.put_item("activities", {
    username: {S: "roidrage"}, 
    created_at: {N: Time.now.tv_sec.to_s},
    activity: {S: "Logged in"}})

The idea is simple: a timestamp-based activity feed per user, indexed by the time of the activity. Using the Query action, we can fetch a range of activities, e.g. for the last 24 hours. With just get_item, you always have to specify an exact combination of hash and range key.

To fetch a range, I'll have to resort to using Amazon's Ruby SDK, as fog hasn't implemented the Query action yet. That way you won't see the dirty API stuff for now, but maybe that's a good thing.

dynamo = AWS::DynamoDB.new(access_key_id: "YOUR KEY",
                           secret_access_key: "YOUR SECRET")
activities = dynamo["activities"]

items = activities.items.query(
    hash_key: "roidrage",
    range_greater_than: (Time.now - 86400).tv_sec)

This fetches all activity items for the last 24 hours. You can also fetch more historic items by specifying both ends of the range. This example fetches all items for the last seven days.

items = activities.items.query(
    hash_key: "roidrage",
    range_greater_than: 7.days.ago.tv_sec,
    range_less_than: Time.now.tv_sec)

Note that the Query action is only available for tables with composite keys. If you don't specify a range key, DynamoDB will return all items matching the hash key.
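
For completeness: leaving out the range key condition looks something like the line below, mirroring the SDK calls above (so take the exact option name as an assumption on my part); it returns every activity stored for that user, sorted by the range key.

items = activities.items.query(hash_key: "roidrage")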

Queries using Filters

The only thing that's missing now is a way to do richer queries, which DynamoDB offers by way of the Scan action. fog doesn't have an implementation for this yet either, so we once again turn to the AWS Ruby SDK.

Scanning allows you to specify a filter, which can be an arbitrary number of attribute and value matches. Once again the Ruby SDK abstracts the ugliness of the API underneath into something more readable.

activities.items.where(:activity).begins_with("Logged").each do |item|
  p item.attributes.to_h
end

You can include more than one attribute and build arbitrarily complex queries, in this example fetching only the items where the user "roidrage" logged in.

activities.items.where(:activity).equals("Logged in").
          and(:username).equals("roidrage").each do |item|
  p item.attributes.to_h
end

You can filter on ranges as well, combining the above with fetching only items for the last seven days.

activities.items.where(:activity).equals("Logged in").
    and(:username).equals("roidrage").
    and(:created_at).between(7.days.ago.tv_sec, Time.now.tv_sec).each do |item|
  p item.attributes.to_h
end

Filters fetch a maximum of 1 MB of data. As soon as that's accumulated, the scan returns, but it also includes the last item evaluated, so you can continue where the query left off. Interestingly, you also get the number of items remaining. Something like paginating results dynamically comes to mind. Running a filter is very similar to running a full table scan, though it's rather efficient thanks to SSDs. But don't expect single digit millisecond responses here. As scans eat heavily into your provisioned throughput, you're better off reserving them for ad-hoc queries rather than making them a general part of your application's workflow.
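
To give you an idea of the mechanics, here's roughly what that continuation looks like at the API level, with field names taken from the 2011-12-05 API documentation; the values are made up, so take the exact layout with a grain of salt. A scan that hit the 1 MB limit returns the last key it evaluated along with the items (omitted here):

{"Count": 100,
 "LastEvaluatedKey": {"HashKeyElement": {"S": "roidrage"},
                      "RangeKeyElement": {"N": "1325351331"}}}

The next Scan request then passes that key back as ExclusiveStartKey to pick up where the previous one left off:

{"TableName": "activities",
 "ScanFilter": {"activity": {"AttributeValueList": [{"S": "Logged"}],
                             "ComparisonOperator": "BEGINS_WITH"}},
 "ExclusiveStartKey": {"HashKeyElement": {"S": "roidrage"},
                       "RangeKeyElement": {"N": "1325351331"}}}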

Unlike Query and GetItem, the Scan action doesn't guarantee strong consistency, and there's no flag to explicitly request it.
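
For comparison, GetItem and Query do have that flag: a ConsistentRead parameter in the request body, per the API documentation. A strongly consistent read of one of the activity items would look roughly like this (values made up):

{"TableName": "activities",
 "Key": {"HashKeyElement": {"S": "roidrage"},
         "RangeKeyElement": {"N": "1325351331"}},
 "ConsistentRead": true}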

The API

DynamoDB's HTTP API has got to be the worst ever built at Amazon; you could even think it's the first API ever designed there. You do POST requests even to GET data. Requests and responses can include conditions, validations and their respective errors, the proper Java class name of an error, and other crap. Not to mention that every error caused by any kind of input on your end results in a 400.

Here's an example of the relevant request headers, simplified:

POST / HTTP/1.1
Content-Type: application/x-amz-json-1.0
x-amz-target: DynamoDB_20111205.GetItem

And here's a response body from an erroneous request:

{"__type":"com.amazon.coral.validate#ValidationException",
"message":"One or more parameter values were invalid:
The provided key size does not match with that of the schema"}

Lovely! At least there's a proper error message, though it doesn't always tell you what the real error is. Given that Amazon's documentation is still filled with syntactical errors, that's a bit inconvenient.

The API is some bastard child of SOAP, but with JSON instead of XML. Even using a pretty low level library like fog doesn't hide all the crap from you. Which worked out well in this case, as you see enough of the API to get an idea about its basic workings.

The code examples above don't read very Ruby-like, as I'm sure you'll agree. Though I gotta say, the Ruby SDK provided by AWS feels a lot more like Ruby in its interface.

I don't have very high hopes to see improvements on Amazon's end, but who knows. S3, for example, got a pretty decent REST API eventually.

Pricing

Aside from throughput capacity, pricing is done per GB stored. If you store millions of items, and they all exceed a size of just a few KB, expect to pay Amazon a ton of money, because item size feeds into both the storage bill and, even more so, the throughput bill. Keep the data stored in an item small. Storing only a few large items works too, but you may be better off choosing one of Amazon's other storage options. You do the math. Pricing for storage, and the maximum size of a single item, always includes attribute names, just like with SimpleDB.

To give you an idea of how pricing works out, here's a simple calculation: 100 GB of data, 1000 reads per second, 200 writes per second, with an average item size of 4 KB. That's $1253.15 every month, not including traffic. Add 10 GB of data the second month and you're at $1263.15. You get the idea: pricing is affected much more by read and write operations and item size than by the amount of data stored. Make your items 6 KB in size, and you're already at $1827.68.
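
If you want to play with the numbers yourself, here's a minimal sketch of that calculation in Ruby. It assumes the launch pricing in us-east, i.e. $1.00 per GB-month of storage and $0.01 per hour for every 10 write capacity units and every 50 read capacity units, plus a 30-day month, and it ignores the free tier, so it only lands in the ballpark of the figures above.

hours          = 24 * 30
item_kb        = 4
reads_per_sec  = 1000
writes_per_sec = 200

read_units  = reads_per_sec  * item_kb       # 4000 read capacity units
write_units = writes_per_sec * item_kb       # 800 write capacity units

storage = 100 * 1.00                         # 100 GB at $1.00 per GB-month
reads   = read_units  / 50.0 * 0.01 * hours  # => 576.0
writes  = write_units / 10.0 * 0.01 * hours  # => 576.0

puts storage + reads + writes                # => 1252.0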

Bottom Line

Though Amazon is doing a pretty good job at squeezing the last drop of money out of their customers with DynamoDB, think about what you're getting in return: no operations, no hosting facilities (let alone in three different datacenters), conditional writes and atomic counters, and a database that (I assume) has years of production experience forged into it.

As is usually the case with Amazon's Web Services, using something like DynamoDB is a financial tradeoff. You get a data store that scales up and down with your demand without you having to worry about the details of hardware, operations, replication and performance. In turn you pay for every aspect of the system. The price for storage is likely to go down eventually, making it a more attractive alternative to hosting an open source NoSQL database yourself. Whether this is an option for your specific use case is a decision only you can make.

If you store terabytes of data, and that data is worth paying tens of thousands of dollars per month in return for not having to care about hosting, by all means, go for DynamoDB. But at that size, just one or two months of Amazon's bill would pay for servers and SSDs in several data centers. That obviously doesn't cover operational costs and personnel, but it's something to think about.

Closing Thoughts

Sorted range keys, conditional updates, atomic counters, structured and multi-valued data types, fetching and updating single attributes, strong consistency, and no explicit way to handle and resolve conflicts other than conditions. A lot of the features DynamoDB has to offer remind me of everything that's great about wide column stores like Cassandra, but even more so of HBase. That's a good thing in my opinion, as the original Dynamo would probably not be well-suited for a customer-facing product. And indeed, Werner Vogels' post on DynamoDB seems to suggest that DynamoDB is a bastard child of Dynamo and SimpleDB, though with lots of sugar sprinkled on top.

Note that it's certainly possible, and may actually be the case, that Amazon has built all of the above on top of the basic Dynamo ingredients, with Cassandra being living proof that it can be done. But if Amazon did reuse a lot of the existing Dynamo code base, they hid it really well. All the evidence points to at least heavy use of a sorted storage system under the covers, which works very well with SSDs, as they make sequential reads and writes nice and fast.

No matter what it is, Amazon has done something pretty great here. They hide most of the complexity of a distributed system from the user. The only option you as a user have to worry about is whether or not you want strong consistency. No quorums, no thinking about just how consistent you want a specific write or read operation to be.

I'm looking forward to seeing how DynamoDB evolves. Only time will tell how big an impact Amazon's entry into the NoSQL market is going to have. Give it a whirl to find out more.

Want to know how the original Dynamo system works? Have a look at the Riak Handbook, a comprehensive guide to Riak, a distributed database that implements the ideas of Dynamo and adds lots of sugar on top.

I'm happy to be proven wrong or told otherwise about some of my assumptions here, so feel free to get in touch!

Resources

Be sure to read Werner Vogels' announcement of DynamoDB. Adrian Cockcroft's comments also have some good insights on the evolution of data storage at Netflix and how Cassandra, SimpleDB and DynamoDB compare.

Tags: nosql, dynamo, amazon

Interested in Redis? Have a look at the Redis Handbook I'm currently working on.

I'm gonna eat my own dog food here, and start you off with a collection of links and ideas from people using Redis. Redis' particular way of treating data requires some rethinking of how to store your data to benefit from its speed, atomicity and data types. I've already written about Redis in abundance; this post's purpose is to complement those posts with real-world scenarios. Maybe you can gather some ideas on how to deal with things.

There's a couple of well-known use cases already, the most popular of them being Resque, a worker queue. RestMQ, an HTTP-based worker queue using Redis, was just recently released too. Neither makes use of the rather new blocking pop commands yet, the way Redactor does, so there's still room for improvement, and to make them even more reliable.
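
To illustrate what those blocking pop commands buy you, here's a minimal queue sketch using the redis-rb gem; the queue name and payload format are made up for the example. BLPOP parks the worker until an item shows up, so there's no need to poll.

require 'redis'
require 'json'

redis = Redis.new

# Producer: push a job onto the end of the list.
redis.rpush("queue:thumbnails", {"image_id" => 42}.to_json)

# Worker: block until a job arrives, then process it.
loop do
  _list, payload = redis.blpop("queue:thumbnails", 0)  # 0 = wait forever
  job = JSON.parse(payload)
  puts "resizing image #{job['image_id']}"
end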

Ohm is a library to store objects in Redis. While I'm not sure I'd put this layer of abstraction on top of it, it's well worth looking at the code to get inspiration. The same is true for redis-types.

Redis' simplicity, atomicity and speed make it an excellent tool when tracking things directly from the web, e.g. through WebSockets or Comet. If you can use it asynchronously, all the better.

  • Affiliate Click Tracking with Rack and Redis.

    A simple approach to tracking clicks. I probably wouldn't use a single list for all clicks, but rather one for each path; then again, there are always several ways to get to your goal with Redis. Not exactly the same thing, but Almaz can track URLs visited by users in Rails applications.

    Update: Turns out that in the affiliate click tracking code above, the list is only used to push clicks into a queue, where they're popped off and handled by a worker, as pointed out by Kris in the comments.

  • Building a NLTK FreqDist on Redis

    Calculation of frequency distribution, with data stored in Redis.

  • Gemcutter: Download Statistics

    The RubyGems resource par excellence is going to use Redis' sorted sets to track daily download statistics. While it's just a proposal, the ideas apply well to all sorts of statistics tracked in today's web applications.

  • Usage stats and Redis

    More on tracking view statistics with Redis.

  • Vanity - Experiment Driven Development

    A split testing tool based on Redis that integrates into your Rails application. Another kind of statistics tracking. If you haven't realized it by now, Redis is an excellent tool for this kind of data, the kind you wouldn't want to dump on your main database, because let's face it, it's got enough crap to do already.

  • Flow Analysis & Time-based Bloom Filters

    Streaming data analysis for the masses.

  • Crowdsourced document analysis and MP expenses

    While it's more prose than code, it still shows areas where Redis is a much better choice than e.g. MySQL.

Using Redis to store any suitable kind of statistics is pretty much an immediate use case for a lot of web applications. I can think of several projects I've worked on that could gain something from moving certain parts of their application over to Redis. It's the kind of data you just don't want to clutter your database with. Clicks, views, history and all that stuff put an unnecessary amount of data and load on it, and the more data it accumulates, the harder it is to get rid of, especially in MySQL.
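
As a tiny illustration of how little code this kind of tracking takes, here's a sketch using the redis-rb gem; the key names are made up. A plain counter per day and a sorted set give you cheap time-based stats and rankings.

require 'redis'
redis = Redis.new

day = Time.now.strftime("%Y-%m-%d")

# Count page views per day with a simple counter.
redis.incr("views:/products/42:#{day}")

# Keep a daily download ranking in a sorted set.
redis.zincrby("downloads:#{day}", 1, "redis-handbook")

# Top 10 downloads of the day, highest first.
redis.zrevrange("downloads:#{day}", 0, 9)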

It's not hard to tell that we're still far from having heaps of inspiration and real-life use cases to choose from, but these should give you an idea. It can get a lot simpler, too: when you're already using Redis, it makes sense to use it for storing Rails sessions as well.

Redis is a great way to share data between different processes, be they Ruby or something else. Atomic access to lists, strings and sets, together with blazing speed, means you don't even need to worry about concurrency issues when reading and writing data. On Scalarium, that's mostly what we use it for.

E.g., all communication between our system and clients on the instances we boot for our users is encrypted and signed. To ensure that all processes have access to the keys, they're stored conveniently in Redis. Even though that means the data is duplicated from our main database (which is CouchDB if you must know), access to Redis is a lot faster. We keep statistics about the instances in Redis too, because CouchDB is just not made for writing heaps and heaps of data quickly. Redis also tracks a request token that is used to authenticate internal requests in our asynchronous messaging system, to make sure that they can't be compromised from some external source. Each request gets assigned a unique token. The token is stored in Redis before the message is published and checked before the message is consumed. That way we turned Redis into a trusted source for shared data between web and worker processes.
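
Here's a rough sketch of that token dance with the redis-rb gem. The key naming and the expiry are made up for illustration; the point is simply that the token lives in a shared, trusted place that both ends can reach.

require 'redis'
require 'securerandom'

redis = Redis.new

# Publisher side: store a one-off token before the message goes out,
# with an expiry as a safety net, then send the token along with the payload.
token = SecureRandom.hex(16)
redis.setex("request_tokens:#{token}", 3600, "1")

# Consumer side: only act on messages whose token checks out, and burn
# the token so it can't be replayed.
if redis.get("request_tokens:#{token}")
  redis.del("request_tokens:#{token}")
  # ... handle the message ...
end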

The memodis library makes sharing data incredibly easy: it offers Redis-based memoization. When you assign a memodis'd attribute in your code, it's stored in Redis and can therefore easily be read from other processes.

Redis is incredibly versatile, and if you have a real-life Redis story or usage scenario to share, please do.

Tags: nosql, redis