Here's a list of things I've been reading lately or that I'm about to read, and that I found to be worth sharing. If you're looking for something to read over the holidays, I'm happy to give you some suggestions. Books, papers, articles, and videos, something for everyone.

Scalability Rules

A list of 50 rules related to scalability, in an easy to read recipe style. They leave some stuff to the imagination, and I don't agree with every single rule, especially not with the one that demands software should always be easy to rollback, but they give you good food for thought for your own applications.

Time, Clocks, and the Ordering of Events in a Distributed System

The earliest paper (1978!) to mention the notion of clocks as a means to track ordering of events in distributed systems, the predecessor to vector clocks, if you will. A must read.

Harvest, Yield, and Scalable Tolerant Systems

A recap of CAP, making the whole notion of it a bit more flexible by adding tuning knobs for graceful degradation. Hat tip to Coda Hale and his article "You can't sacrifice partition tolerance" for pointing me to this.

Problems with CAP, and Yahoo's little known NoSQL system

Also related to CAP, this article introduces the notion of PACELC, which basically adds latency to the CAP equation. CAP has been criticized quite a few times for being too strict in this regard, and while the name PACELC is a bit odd, the added notion of latency makes a lot of sense.

Replication and the latency-consistency tradeoff

Another one from Daniel Abadi, another one related to CAP, this time talking about replication, consistency, and latency.

It's the latency, stupid!

Going further back in time, this paper talks about latency in all its glory. Sure, it talks about modem speed connections, but extrapolate that into today's network bandwidth and you still have latency. Or you can read the next posts too.

It's Still The Latency, Stupid...pt. 1 and It's Still The Latency, Stupid...pt. 2

A more recent update on latency, because it still matters more than bandwidth.

Crash-only Software

I've been pondering fault-tolerant and cloud-ready systems for a while now, here's one related to the topic, software that crashes as a means to make it more fault-tolerant.

Systems that Never Stop (video)

Great talk by Joe Armstrong, inventor of Erlang, laying down six laws for fault-tolerant systems. All laws lead to Erlang obviously, but it all makes a lot of sense.

Why Do Computers Stop and What Can Be Done About It?

Related to Joe's talk, this paper discusses hardware reduncancy and reliable storage by means of process pairs, modularity and transactions. Have yet to read this one, but going to be interesting thinking about how these ideas, stemming from hardware, apply to software and have been implemented by Erlang.

Working With Unix Processes

A little indie-published ebook on handling Unix processes. Code is focused on Ruby, but most if not all of the book is easily applicable to any other language or a basic Unix environment.

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

SEDA was an idea for web and application server concurrency based on using queues to condition and handle requests. While the idea has not exactly made it through, I found the model to be strikingly similar to the actor model, in a different way, but still very similar.

A Retrospective on SEDA

SEDA, ten years later, by the author of the original paper. I gotta say, he talks a lot what they got wrong, but I for one think SEDA had a pretty big impact on the bigger picture of web application architecture. Probably something worth discussing in a separate post.

Why Events Are A Bad Idea

A paper comparing threads and events for highly concurrent servers. I'd recommend taking this with a grain of salt. A lot has changed since this paper was written, but what I like about reading papers like this is that it gives you a historic perspective, same for SEDA.

Understanding Virtual Memory

Nice summary of how virtual memory works on Linux.

The Declarative Imperative: Experiences and Conjectures in Distributed Logic

To be honest, this is a slightly confusing paper. It starts out modeling things in an oscure language called Datalog, but then dives into making some conjectures about distributed logic, which was to me the more interesting part.

A brief history of Consensus, 2PC and Transaction Commit

This article is full of gold. An extraordinarily compact view on the topic, but with an abundance of links to papers to dive deeper.

Going to keep posting reading lists like this in the future. So much good stuff to read out there. Lots of great knowledge collected in papers.

Last but not least, why not add the Riak Handbook to your reading list as well?

Tags: reading

The idea of building and storing user timelines (think Twitter) in Riak confused me at first. It sounds like such a spot-on case for time series databases. Yet Yammer managed to make the idea pretty popular. The whole thing lacked implementation though, because they kept their to themselves, which I don't blame them for at all.

Apple Mac Timeline

So let's have a look at how simple it is to build one. You can see a timeline of all Apple Macintosh products above, but that's not what we want to build.

Instead we want to build something like the Twitter timeline. A user follows many other users, and wants to look at a feed built from their activities, so something like the timeline shown below.

Twitter timeline

How do you model a timeline in Riak?

For every user we store one object in Riak. Every timeline contains a list of tweet ids, or whatever activity you're referencing, or it can contain the whole tweets. Something like this should work:

If you want to store more data, turn the list into an array of hashes containing whatever information is necessary to rebuild the timeline later.

Adding entries to the timeline

To add new entries to a timeline, prepend them to the existing list of entries, here's some Ruby code to show how it's done.

The code assumes you take care of followership somewhere else. You can store that data in Riak too, but the code is oblivious to its whereabouts.

Conflicts, siblings, oh my!

The fun starts when two clients update the same timeline, you get a conflict and siblings. The strength of a simple data structure like the example above is that they're easy to merge together while still keeping ordering based on the ids. The ids are ordered only in this example, Twitter somewhat makes sure they are.

When you get a conflict, a smart Riak library like Ripple helps you find out about it. To add on the earlier example, here's a version of add that detects conflicts.

Suddenly you have two or more objects instead of one, each containing a different timeline. To turn them into a single list, you merge all of them together, discard the duplicates, and restore order based on the id. Here's some Ruby to do that.

You iterate over all timeline objects and keep adding unique activities to a new list, returning that when done.

Sort, and done!

All that's left to do is sort the result.

There's the whole code. To spare you the pain of having to write your own library, all this is bundled into a gem called riaktivity. If you're doing Python, the Brett Hoerner has got you covered with timak. Be sure to watch the original talk by Yammer, it's well worth it.

There's ups and downs to this approach, and things that need to be taken care of. More on that and modeling data for Riak in general is covered in the Riak Handbook, the definitive guide on Riak.

Tags: riak, nosql

Key rack

One of the most common questions to ask about Riak is: how do I get a list of all the keys in my bucket, in the cluster, or that have an attribute that matches my query using MapReduce?

The motivation behind it is simple: you want to delete all the keys in a bucket, count the amount of keys stored in your cluster entirely, you want to clear out your cluster or you want to run ad-hoc queries on the data stored in a bucket.

All valid in their own right.

But things are not so simple with Riak. To understand why, let's take a quick look under the covers.

What's in a bucket?

A bucket is a namespace in Riak. It's not a physically distinctive entity like a table in a relational database. You can set some properties on it, things like replication levels, commit hooks, quorum, but that's it. Those are stored in the cluster's configuration which is gossiped around the cluster just like the data that identifies what partition goes on which machine.

In fact, when you specify a bucket and a key to fetch or write some data, they're stuck together to find the location in the cluster. Consider a bucket-key combination like users/roidrage. To find the location, Riak hashes both, not just the key. Both bucket and key uniquely identify a piece of data, allowing you to have multiple object with the same key, but in different buckets.

When an object is stored in Riak's storage backends, it uses both bucket and key name to identify it. What you get as a result are files that contain an abundance of different bucket-key combinations and their respective objects, sometimes not even in any order. The only physical distinction Riak has for data on disk is the partition they belong to. Everything else is up to the storage backend. There's no distinction between buckets on disk.

One reason for this is consistent hashing. If you remember the last installment of this series, I mentioned that consistent hashing's downside is that you lose key ordering. Keys are randomly spread out through the cluster. Some ordering still exists depending on the backend, but in general, ordering is lost.

Listing all of the keys

So to list keys in a bucket, Riak has to go through all of the keys in every partition, and I mean ALL OF THEM. Here's a picture of keys and an impersonation of Riak, having to take care of all of them.

No big deal, right? Unless of course, you store millions and millions of them, and want to find about all the keys from say, the bucket users, which may not even have to be more than 1000. To do that, Riak goes through every partition, every partition loads the keys either from memory (Bitcask) or disk (LevelDB) and sifts through them, finding the ones belonging to the users bucket.

All that said, it's certainly not impossible to do, if you have some time to wait, depending on the amount of data stored.

$ curl 'localhost:8098/buckets/users/keys?keys=true'

But wait, don't do that. Do this instead, streaming the keys instead of waiting for them all to arrive and then having them dumped at once.

$ curl 'localhost:8098/buckets/users/keys?keys=stream'

That's much better.

Listing keys has an impact on your Riak nodes, so if you can avoid it, don't do it!

So how do I really get all of the keys?

If select * from riak is not a great option, then what is?

Instead of relying on Riak, build an index on the keys. Thanks to Riak 2i (Secondary Indexes), this is easy. In fact, you get indexing of keys for free when using the LevelDB backend, just use the index $key. This takes advantage of LevelDB's sorted file structure. Neat!

But, and here's the kicker, you can only fetch ranges of keys. So instead of asking for all the keys, you ask for a range large enough to fit all the keys.

$ curl 'localhost:8098/buckets/users/index/$key/0/zzzz'

This finds all the keys that start with something lexicographically larger than 0 and less than zzzz and returns them to you in a list. Now there's a slim chance you'll get users with names like that, but I'll leave that exercise, or proper validations, up to you.

Using that list, you can count the number of keys in that bucket, or you can delete them one by one.

Ideally...

In an ideal world, listing keys in a bucket would be possible and not an expensive operation. Riak could for example allow users to store buckets in separate files. The downside is that with a lot of buckets, you'll hit the limits of open file descriptors in no time, a bit of a bummer. But until something better comes along, secondary indexes are a nice tool to at least avoid resorting to listing all of the keys.

Curious about other ways to index and query data in Riak? You'll like the Riak Handbook, which will be published later this week. Covers Riak's secondary indexes and other strategies to query, inspect and analyze data.

Check back in tomorrow for an introduction on storing timelines in Riak.

Tags: riak, nosql