Fork me on GitHub

Update: Added paragraph on the e-commerce platform I'm using.

Several people have asked me about the publishing tool chain I'm using for the Riak Handbook and the NoSQL Handbook. As Avdi just published his, now's as good a good time as any to throw in mine too. After all, it's always nice to know all the options you have.

All text is written in Markdown. I started out with Textile, but just didn't like Textile's way of breaking down single lines into separate paragraphs, so I switched to Markdown later on. Not a painless process, but sufficiently easy thanks to Pandoc.

The Markdown files, one per chapter, are bundled together into a big file and then fed into GitHub's Redcarpet library. I use Albino and Pygments for syntax highlighting, embedded into a custom renderer for Redcarpet. See the Redcarpet README for an example. Redcarpet also has some MultiMarkdown extensions for tables and other things, so that comes in pretty handy.

To generate the PDF I'm using Prince, a neat little tool that takes HTML and a bunch of CSS and generates a nice PDF. It's a commercial tool, and costs some $500 for a single license. That was what threw me off a tiny bit, but luckily there's a handy service called DocRaptor which is basically a pretty API around Prince, but much much cheaper.

You can use Prince locally to generate infinite test PDFs, and then use DocRaptor to generate the final result. Which is what I do. Before generating the PDF through DocRaptor I upload the generated HTML, which includes all the CSS, to S3, to avoid some issues I've had with Unicode characters in the HTML, and tell DocRaptor to fetch it from a URL instead.

You can use web fonts easily, I went for fonts from Google Web Fonts, which are easily embeddable with Prince and DocRaptor. Prince also allows you to do some customizations specific, stuff like page titles, page numbers, custom settings for left and right pages, footnotes, and the like. See the documentation for a full list.

For ePub generation I generate a stripped-down version of the HTML, not including syntax highlighting, because iBooks doesn't like custom-styled pre or code elements, if you style them iBooks chooses to ignore any font setting and use the default font, which is usually not monospaced.

The HTML then goes through Calibre to generate the ePub file, and then I use kindlegen to generate a Kindle file from that. Calibre has a pretty ugly user interface, but it comes with a set of command line tools you can use to automatically convert ebooks from one format to another.

All of the above is wrapped into a Rakefile and some small Ruby classes. I'll be sure to put it up for your perusal soon, but this list should get you started.

To make the book available for purchase I'm using FastSpring. They're not my favorite choice, as they make things like customizing your shop fairly hard, but I chose them because their way of payout (monthly or bi-monthly), and the fact that they take care of VAT, were both reasons important to me, especially being in Europe, where tax laws throw a lot of stones at you for wanting to try and sell things internationally. Unfortunately FastSpring doesn't have any means of updating products and allowing and notifying users of the updates being available, so I built a small Sinatra app to capture orders and take care of sending out the updates. Not pretty, but it's a small sacrifice compared to the tax hassle I'd have with any other fulfillment provider.

Curious about the result? You should buy the book!

Here's a list of things I've been reading lately or that I'm about to read, and that I found to be worth sharing. If you're looking for something to read over the holidays, I'm happy to give you some suggestions. Books, papers, articles, and videos, something for everyone.

Scalability Rules

A list of 50 rules related to scalability, in an easy to read recipe style. They leave some stuff to the imagination, and I don't agree with every single rule, especially not with the one that demands software should always be easy to rollback, but they give you good food for thought for your own applications.

Time, Clocks, and the Ordering of Events in a Distributed System

The earliest paper (1978!) to mention the notion of clocks as a means to track ordering of events in distributed systems, the predecessor to vector clocks, if you will. A must read.

Harvest, Yield, and Scalable Tolerant Systems

A recap of CAP, making the whole notion of it a bit more flexible by adding tuning knobs for graceful degradation. Hat tip to Coda Hale and his article "You can't sacrifice partition tolerance" for pointing me to this.

Problems with CAP, and Yahoo's little known NoSQL system

Also related to CAP, this article introduces the notion of PACELC, which basically adds latency to the CAP equation. CAP has been criticized quite a few times for being too strict in this regard, and while the name PACELC is a bit odd, the added notion of latency makes a lot of sense.

Replication and the latency-consistency tradeoff

Another one from Daniel Abadi, another one related to CAP, this time talking about replication, consistency, and latency.

It's the latency, stupid!

Going further back in time, this paper talks about latency in all its glory. Sure, it talks about modem speed connections, but extrapolate that into today's network bandwidth and you still have latency. Or you can read the next posts too.

It's Still The Latency, Stupid...pt. 1 and It's Still The Latency, Stupid...pt. 2

A more recent update on latency, because it still matters more than bandwidth.

Crash-only Software

I've been pondering fault-tolerant and cloud-ready systems for a while now, here's one related to the topic, software that crashes as a means to make it more fault-tolerant.

Systems that Never Stop (video)

Great talk by Joe Armstrong, inventor of Erlang, laying down six laws for fault-tolerant systems. All laws lead to Erlang obviously, but it all makes a lot of sense.

Why Do Computers Stop and What Can Be Done About It?

Related to Joe's talk, this paper discusses hardware reduncancy and reliable storage by means of process pairs, modularity and transactions. Have yet to read this one, but going to be interesting thinking about how these ideas, stemming from hardware, apply to software and have been implemented by Erlang.

Working With Unix Processes

A little indie-published ebook on handling Unix processes. Code is focused on Ruby, but most if not all of the book is easily applicable to any other language or a basic Unix environment.

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

SEDA was an idea for web and application server concurrency based on using queues to condition and handle requests. While the idea has not exactly made it through, I found the model to be strikingly similar to the actor model, in a different way, but still very similar.

A Retrospective on SEDA

SEDA, ten years later, by the author of the original paper. I gotta say, he talks a lot what they got wrong, but I for one think SEDA had a pretty big impact on the bigger picture of web application architecture. Probably something worth discussing in a separate post.

Why Events Are A Bad Idea

A paper comparing threads and events for highly concurrent servers. I'd recommend taking this with a grain of salt. A lot has changed since this paper was written, but what I like about reading papers like this is that it gives you a historic perspective, same for SEDA.

Understanding Virtual Memory

Nice summary of how virtual memory works on Linux.

The Declarative Imperative: Experiences and Conjectures in Distributed Logic

To be honest, this is a slightly confusing paper. It starts out modeling things in an oscure language called Datalog, but then dives into making some conjectures about distributed logic, which was to me the more interesting part.

A brief history of Consensus, 2PC and Transaction Commit

This article is full of gold. An extraordinarily compact view on the topic, but with an abundance of links to papers to dive deeper.

Going to keep posting reading lists like this in the future. So much good stuff to read out there. Lots of great knowledge collected in papers.

Last but not least, why not add the Riak Handbook to your reading list as well?

Tags: reading

The idea of building and storing user timelines (think Twitter) in Riak confused me at first. It sounds like such a spot-on case for time series databases. Yet Yammer managed to make the idea pretty popular. The whole thing lacked implementation though, because they kept their to themselves, which I don't blame them for at all.

Apple Mac Timeline

So let's have a look at how simple it is to build one. You can see a timeline of all Apple Macintosh products above, but that's not what we want to build.

Instead we want to build something like the Twitter timeline. A user follows many other users, and wants to look at a feed built from their activities, so something like the timeline shown below.

Twitter timeline

How do you model a timeline in Riak?

For every user we store one object in Riak. Every timeline contains a list of tweet ids, or whatever activity you're referencing, or it can contain the whole tweets. Something like this should work:

If you want to store more data, turn the list into an array of hashes containing whatever information is necessary to rebuild the timeline later.

Adding entries to the timeline

To add new entries to a timeline, prepend them to the existing list of entries, here's some Ruby code to show how it's done.

The code assumes you take care of followership somewhere else. You can store that data in Riak too, but the code is oblivious to its whereabouts.

Conflicts, siblings, oh my!

The fun starts when two clients update the same timeline, you get a conflict and siblings. The strength of a simple data structure like the example above is that they're easy to merge together while still keeping ordering based on the ids. The ids are ordered only in this example, Twitter somewhat makes sure they are.

When you get a conflict, a smart Riak library like Ripple helps you find out about it. To add on the earlier example, here's a version of add that detects conflicts.

Suddenly you have two or more objects instead of one, each containing a different timeline. To turn them into a single list, you merge all of them together, discard the duplicates, and restore order based on the id. Here's some Ruby to do that.

You iterate over all timeline objects and keep adding unique activities to a new list, returning that when done.

Sort, and done!

All that's left to do is sort the result.

There's the whole code. To spare you the pain of having to write your own library, all this is bundled into a gem called riaktivity. If you're doing Python, the Brett Hoerner has got you covered with timak. Be sure to watch the original talk by Yammer, it's well worth it.

There's ups and downs to this approach, and things that need to be taken care of. More on that and modeling data for Riak in general is covered in the Riak Handbook, the definitive guide on Riak.

Comments: (view/add your own) Tags: riak, nosql

Key rack

One of the most common questions to ask about Riak is: how do I get a list of all the keys in my bucket, in the cluster, or that have an attribute that matches my query using MapReduce?

The motivation behind it is simple: you want to delete all the keys in a bucket, count the amount of keys stored in your cluster entirely, you want to clear out your cluster or you want to run ad-hoc queries on the data stored in a bucket.

All valid in their own right.

But things are not so simple with Riak. To understand why, let's take a quick look under the covers.

What's in a bucket?

A bucket is a namespace in Riak. It's not a physically distinctive entity like a table in a relational database. You can set some properties on it, things like replication levels, commit hooks, quorum, but that's it. Those are stored in the cluster's configuration which is gossiped around the cluster just like the data that identifies what partition goes on which machine.

In fact, when you specify a bucket and a key to fetch or write some data, they're stuck together to find the location in the cluster. Consider a bucket-key combination like users/roidrage. To find the location, Riak hashes both, not just the key. Both bucket and key uniquely identify a piece of data, allowing you to have multiple object with the same key, but in different buckets.

When an object is stored in Riak's storage backends, it uses both bucket and key name to identify it. What you get as a result are files that contain an abundance of different bucket-key combinations and their respective objects, sometimes not even in any order. The only physical distinction Riak has for data on disk is the partition they belong to. Everything else is up to the storage backend. There's no distinction between buckets on disk.

One reason for this is consistent hashing. If you remember the last installment of this series, I mentioned that consistent hashing's downside is that you lose key ordering. Keys are randomly spread out through the cluster. Some ordering still exists depending on the backend, but in general, ordering is lost.

Listing all of the keys

So to list keys in a bucket, Riak has to go through all of the keys in every partition, and I mean ALL OF THEM. Here's a picture of keys and an impersonation of Riak, having to take care of all of them.

No big deal, right? Unless of course, you store millions and millions of them, and want to find about all the keys from say, the bucket users, which may not even have to be more than 1000. To do that, Riak goes through every partition, every partition loads the keys either from memory (Bitcask) or disk (LevelDB) and sifts through them, finding the ones belonging to the users bucket.

All that said, it's certainly not impossible to do, if you have some time to wait, depending on the amount of data stored.

$ curl 'localhost:8098/buckets/users/keys?keys=true'

But wait, don't do that. Do this instead, streaming the keys instead of waiting for them all to arrive and then having them dumped at once.

$ curl 'localhost:8098/buckets/users/keys?keys=stream'

That's much better.

Listing keys has an impact on your Riak nodes, so if you can avoid it, don't do it!

So how do I really get all of the keys?

If select * from riak is not a great option, then what is?

Instead of relying on Riak, build an index on the keys. Thanks to Riak 2i (Secondary Indexes), this is easy. In fact, you get indexing of keys for free when using the LevelDB backend, just use the index $key. This takes advantage of LevelDB's sorted file structure. Neat!

But, and here's the kicker, you can only fetch ranges of keys. So instead of asking for all the keys, you ask for a range large enough to fit all the keys.

$ curl 'localhost:8098/buckets/users/index/$key/0/zzzz'

This finds all the keys that start with something lexicographically larger than 0 and less than zzzz and returns them to you in a list. Now there's a slim chance you'll get users with names like that, but I'll leave that exercise, or proper validations, up to you.

Using that list, you can count the number of keys in that bucket, or you can delete them one by one.

Ideally...

In an ideal world, listing keys in a bucket would be possible and not an expensive operation. Riak could for example allow users to store buckets in separate files. The downside is that with a lot of buckets, you'll hit the limits of open file descriptors in no time, a bit of a bummer. But until something better comes along, secondary indexes are a nice tool to at least avoid resorting to listing all of the keys.

Curious about other ways to index and query data in Riak? You'll like the Riak Handbook, which will be published later this week. Covers Riak's secondary indexes and other strategies to query, inspect and analyze data.

Check back in tomorrow for an introduction on storing timelines in Riak.

Tags: riak, nosql

The simplicity of consistent hashing is pretty mind-blowing. Here you have a number of nodes in a cluster of databases, or in a cluster of web caches. How do you figure out where the data for a particular key goes in that cluster?

You apply a hash function to the key. That's it? Yeah, that's the whole deal of consistent hashing. It's in the name, isn't it?

The same key will always return the same hash code (hopefully), so once you've figured out how you spread out a range of keys across the nodes available, you can always find the right node by looking at the hash code for a key.

It's pretty ingenious, if you ask me. It was cooked up in the lab chambers at Akamai, back in the late nineties. You should go and read the original paper right after we're done here.

Consistent hashing solves the problem people desperately tried to apply sharding to pretty nicely and elegantly. I'm not going to bore you with the details on how exactly consistent hashing works. Mike Perham does a pretty good job at that already, and there are many more blog posts explaining implementations and theory behind it. Also, that little upcoming book of mine has a full-length explanation too. Here's a graphic showing the basic idea of consistent hashing, courtesy of Basho.

Consistent Hashing

Instead I want to look at the practical implications of consistent hashing in distributed databases and cache farms.

Easier to Avoid Hotspots

When you put data on nodes based on a random result, which is what the hash function calculates, a value that's a lot more random than the key it's based on, it's easier to avoid hotspots. Why?

Assume a key based on an increasing value, or a simple range of keys, based on the hour of the day, like 2011-12-11-13. You add new hours and therefore new data as time passes, and keys are stored based on the range they fall in. For example, the keys 2011-12-11-18 until 2011-12-11-23 are stored on the same node, with the rest of the keys stored on other nodes, just because the ranges or the partitioning scheme happen to be set up this way.

For a consumer-facing site, the evening hours are usually the busiest time of the day. They create more data, more writes, and possibly more reads too. For the hours between 18:00 and 23:00, all the load goes to the single node that carries all the relevant data.

But when you determine the location in the cluster based solely on the hash of the key, chances are much higher that two keys lexicographically close to each other end up on different nodes. Thus, the load is shared more evenly. The disadvantage is that you lose the order of keys.

There are partitioning schemes that can work around this, even with a range-based key location. HBase (and Google's BigTable, for that matter) stores ranges of data in separate tablets. As tablets grow beyond their maximum size, they're split up and the remaining parts re-distributed. The advantage of this is that the original range is kept, even as you scale up.

Consistent Hashing Enables Partitioning

When you have a consistent hash, everything looks like a partition. The idea is simple. Consistent hashing forms a keyspace, which is also called continuum, as presented in the illustration. As a node joins the cluster, it picks a random number, and that number determines the data it's going to be responsible for. Everything between this number and one that's next in the ring and that has been picked by a different node previously, is now belong to this node. The resulting partition could be of any size theoretically. It could be a tiny slice, or a large one.

First implementations of consistent hashing still had the problem that a node picking a random range of keys resulted in one node potentially carrying a larger keyspace than others, therefore still creating hotspots.

But the improvement was as simple as it was ingenious. A hash function has a maximum result set, a SHA-1 function has a bit space of 2^160. You do the math. Instead of picking a random key, a node could choose from a fixed set of partitions, like equally size pizza slices. But instead of picking the one with the most cheese on, everyone gets an equally large slice. The number of partitions is picked up front, and practically never changes over the lifetime of the cluster.

For good measure, here's a picture of a sliced pizza.

Consistent Pizza

Partitioning Makes Scaling Up and Down More Predictable

With a fixed number of partitions of the same size, adding new nodes becomes even less of a burden than with just consistent hashing. With the former, it was still unpredictable how much data had to be moved around to transfer ownership of all the data in the range of the new node. One thing's for sure, it already involved a lot less work than previous methods of sharding data.

With partitioning, a node simply claims partitions, and either explicitly or implicitly asks the current owners to hand off the data to them. As a partition can only contain so many keys, and randomness ensures a somewhat even spread of data, there's a lot less unpredictability about the data that needs to be transferred.

If that partitions just so happens to carry the largest object by far in you whole cluster, that's something even consistent hashing can't solve. It only cares for keys.

Going back to HBase, it cares for keys and the size of the tablet the data is stored in, as it breaks up tablets once they reach a threshold. Breaking up and reassigning a tablet requires coordination, which is not an easy thing to do in a distributed system.

Consistent Hashing and Partitioning Enable Replication

Consistent hashing made one thing a lot easier: replicating data across several nodes. The primary means for replication is to ensure data survives single or multiple machine failures. The more replicas you have, the more likely is your data to survive one or more hardware crashes. With three replicas, you can afford to lose two nodes and still serve the data.

With a fixed set of partitions, a new node can just pick the ones it's responsible for, and another stack of partitions it's going to be a replica for. When you really think about it, both processes are actually the same. The beauty of consistent hashing is that there doesn't need to be master for any piece of data. Every node is simply a replica of a number of partitions.

But replication has another purpose besides ensuring data availability.

Replication Reduces Hotspots (Even More!!!)

Having more than one replica of a single piece of data means you can spread out the request load even more. With three replicas of that data, residing on three different nodes, you can now load-balance between them. Neat!

With that, consistent hashing enables a pretty linear increase in capacity as you add more nodes to a cluster.

Consistent Hashing Enables Scalability and Availability

Consistent hashing allows you to scale up and down easier, and makes ensuring availability easier. Easier ways to replicate data allows for better availability and fault-tolerance. Easier ways to reshuffle data when nodes come and go means simpler ways to scale up and down.

It's an ingenious invention, one that has had a great impact. Look at the likes of Memcached, Amazon's Dynamo, Cassandra, or Riak. They all adopted consistent hashing in one way or the other to ensure scalability and availability.

Want to know more about distributed databases in general and Riak in particular? You'll like the Riak Handbook, a hands-on guide full of practical examples and advice on how to use Riak to ensure scalability and availability for your data.

In the next installment we're looking at the consequences and implications of losing key ordering in a Riak cluster.

Tags: nosql, riak
<< Archives | Search | RSS Feed