I always forget what kinds of crazy things you can do with Ruby's blocks and their parameters, so here's a little write-up on them. I regularly forget things I've learned (must be an age thing), and I found that not even books on the Ruby language fully cover all the gory details on block (and method) parameters. So consider this my personal reference of crazy Ruby block syntax features for future use.

The Basics

In its simplest form, a block parameter is a list of names to which the values passed into the block are assigned. The following iterates over all elements in the hash, printing each key and its corresponding value.

blk = ->(key, value) {puts key, value}
{username: "roidrage", fullname: "Mathias Meyer"}.each &blk

Yes, that's the boring bit, bear with me.

Splat Parameters

You can use splat parameters too, catching an array of all arguments.

blk = ->(*args) {p args}
blk.(1, 2, 3) 
# => [1, 2, 3]

Notice that crazy syntax for calling the block too, pretty wild. The fun starts when you combine a splat parameter with one that's fixed.

blk = ->(first, *tail) {puts first}
blk.(1, 2, 3)
# => 1

Why not put another fixed parameter at the end? That'll assign the first element of the arguments to the variable first, the last element to last, and everything in between to middle.

blk = ->(first, *middle, last) {puts last}
blk.(1, 2, 3)
# => 3

This can grow to an arbitrary complexity, adding more fixed parameters before and after a splat. In this example, middle will just be an empty array, as the fixed parameters are greedy and steal all the values they can match.

blk = ->(first, *middle, second_last, last) {puts second_last}
blk.(1, 2, 3)
# => 2

Fortunately you can have only one splat parameter.

Just for fun, you can also just specify the splat operator and nothing else.

blk = ->(*) {puts "lolwut?"}

Default parameters

If you want to save some time, you can assign default values to block parameters.

blk = ->(list = [1, 2, 3]) {list.sample}
blk.()

Again, you knew that already. Here's some craziness though, courtesy of Avdi.

blk = ->list = (default = true; [1, 2, 3]) {puts default}
blk.()
# => true
blk.([4, 5, 6])
# => nil

Referencing Other Parameters

You can reference parameters previously defined in the list. Want to do an impromptu calculation on the list above before even entering the block?

blk = ->(list = [1, 2, 3], product = list.inject(1) {|acc, value| acc * value}) {
  list.sample
}

Don't do that, though. But it's good to know you can. Note that this only works for parameters listed before the one you're assigning to. You can also shorten the example above quite a bit.

blk = ->(list = [1, 2, 3], product = list.inject(:*)) {
  list.sample
}

Block-local parameters

To add more character variety, you can declare variables local to the block by adding a semicolon and another list of parameters. Helps when you want to make sure variables used in the block don't accidentally overwrite or reference variables outside the block's scope. Blocks are closures, so they reference their environment, including variables declared outside the block.

username = "roidrage"
blk = ->(tags; username) {username = "mathias"}
blk.(["nosql", "cloud"])
puts username
# => "roidrage"

You'll be pleased to hear that there's no craziness you can do with block local parameters, like assigning defaults.

Ignoring arguments

This one may look familiar to folks knowledgeable in Erlang. You can ignore specific arguments using _. Combine that with a splat parameter and you can extract the tail of a list while ignoring the first element, then recursively work through the tail.

blk = ->(_, *tail) {blk.(*tail) if tail.size > 0} # note the splat on the call, lambdas don't auto-splat an array into arguments

When is this useful? Ruby is not a pattern-matching language, after all. For instance, imagine an API that expects the blocks handed to a method call to accept a certain number of arguments. Ruby will complain if a lambda's arity doesn't match the number of arguments it's called with. This way you can silently ignore parameters you're not interested in while still conforming to the API.
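
Say, for instance, the API always hands your block an event name and a timestamp, but you only care about the name (a made-up example):

blk = ->(event, _) {puts event}
blk.("login", "2012-01-25")
# => login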

Okay, I lied to you: this is not actually a special operator, it's a simple assignment to a variable called _. It's a neat little trick, though, to make it obvious that you're not interested in a certain parameter. Also note that in irb, _ references the value returned by the last expression executed.

You'll find it used in several places in the Rack source code.

Tuple arguments

This one blew my mind when I found it somewhere in the depths of Rack's source (or somewhere else I don't remember). Think of a hash where each key points to an array of things. Wouldn't it be nice if you could extract them all in one go while iterating over them without having to first iterate over the hash and then over the embedded arrays?

Turns out, the tuple operator is just what we need for this. This is an example from a Chef cookbook I built a while back, specifying some thresholds for an Apache configuration for Monit.

apache_server_status = {
  waitlimit: ["<", "10%"],
  dnslimit: [">", "25%"]
}

apache_server_status.each do |name, (operator, limit)|
  puts "protocol apache-status #{name} #{operator} #{limit}" 
end

Notice the definition of (operator, limit). That little bit of destructuring nicely extracts the array with an operator and a percentage in it into two parameters. Here's another thing that blew my mind: chaining enumerators, collecting key-value pairs and an index from a hash, for example. Note that hashes are ordered by insertion in Ruby 1.9, so this is a perfectly valid thing to do.

names = {username: "roidrage", fullname: "Mathias Meyer"}
names.each.with_index do |(key, value), index|
  puts "##{index}: #{key}:#{value}"
end

Hat tip to Sam Elliott for blowing my mind with this.

The end

Know what the best part is? All this works in method definitions too. The tuple destructuring, using _ to ignore values, the splats, referencing previous parameters.
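
A quick sketch with made-up method names:

def stats(first, *middle, last)
  puts "#{first} .. #{last}, #{middle.size} in between"
end
stats(1, 2, 3, 4)
# => 1 .. 4, 2 in between

def threshold(name, (operator, limit))
  puts "#{name} #{operator} #{limit}"
end
threshold(:waitlimit, ["<", "10%"])
# => waitlimit < 10%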

Some of them even work for multiple variable assignments.

_, *tail = [1, 2, 3]
name, (operator, limit) = [:waitlimit, ["<", "10%"]]

Crazy.

Tags: ruby, blocks

Amazon recently released DynamoDB, a database whose name is inspired by Dynamo, the distributed key-value store they've been running in production for a good five to six years now. I think it's great they've finally done it, though from my observations there's little resemblance to what the original Dynamo paper describes, but I'm getting ahead of myself. Traditionally Amazon hasn't been very open about how they implement their services, so some of what I'm stating here may be nothing more than an educated guess. Either way, the result is pretty neat.

Time to take a good look at what it has to offer, how that works out in code, and to make some wild guesses as to what's happening under the covers. I'm using fog's master to talk to DynamoDB in the examples below, but the official Ruby SDK also works. Fog is closer to the bare API metal though, which works out well for our purposes.

My goal is not to outline the entire API and its full set of options, but to dig into the bits most interesting to me and to show some examples. A lot of the focus in other posts is on performance, scalability and operational ease. I think that's a great feature of DynamoDB, but it's pretty much the same with all of their web services. So instead I'm focusing on the effects DynamoDB has on you, the user. We'll look at the API, general usage, the data model and what DynamoDB's feature set generally entails.

The Basics

DynamoDB is a distributed database in the cloud. No surprises there, it's not the first such thing Amazon has in its portfolio. S3, SimpleDB, RDS and DynamoDB all provide redundant ways to store different types of data in Amazon's datacenters. Currently, DynamoDB only supports the us-east region.

An important difference is that data stored in DynamoDB is officially stored on SSDs, which has (or at least should have) the benefit of offering predictable performance and greatly reduced latency across the board. The question that remains is, of course: when can I hook up SSDs to my EC2 instances?

The other big deal is that the read and write capacity available to you is configurable. You can tell Amazon how much capacity, in read and write units per second, you expect for your application, and they make sure that capacity is available to you as long as you need it (and pay for it).

Data Model

Data in DynamoDB is stored in tables, a logical separation of data if you will. Table names are only unique on a per-user basis, not globally like S3 buckets.

Tables store items, think rows of data or documents. Items have attributes and values, similar to SimpleDB. Think of it as a JSON document where you can read and write attributes independent of the entire document. Every row can have different attributes and values, but every item needs to have a uniquely identifying key whose name you decide on upfront, when you create the table.

Let's create a table first, but make sure you wear protection, as this is yet another of Amazon's gross-looking APIs. The relevant API action is CreateTable.

dynamo = Fog::AWS::DynamoDB.new(aws_access_key_id: "YOUR KEY",
  aws_secret_access_key: "YOUR SECRET")

dynamo.create_table("people",
  {HashKeyElement: {AttributeName: "username", AttributeType: "S"}},
  {ReadCapacityUnits: 5, WriteCapacityUnits: 5})

This creates a table called "people" with an attribute "username" as the indexing key. This is the key you're using to reference a row, sorry, an item, in a table. The key is of type string, hence the S; you can alternatively specify an N for numeric values. This pattern will come up with every attribute and value pair you store: you always need to tell DynamoDB whether it's a string or a number. You also specify initial values for expected read and write capacity, where 5 is the minimum.

You have to provide a proper value for a key; DynamoDB doesn't offer any automatic key generation for you, so choose keys wisely to ensure a good distribution. A key that has a lot of data behind it and is accessed much more frequently than others will not benefit you. Keep your data simple and spread it across multiple keys if you can.

Anyhoo, to insert an item, you can use the PutItem API action, to which you issue an HTTP POST (go figure). The DynamoDB API already gives me a headache, but more on that later. Thankfully, a good client library hides the terrible interface from us, leaving only the hash structures sent to DynamoDB.

dynamo.put_item("people", {username: {"S" => "roidrage"}})

Specify multiple attributes as separate hash entries, each pointing to a separate hash specifying data type and value. I'll leave it to you to think about how backwards this is, though it must be said that JavaScript is also to blame here, not handling 64 bit integers very well.

You can store lists (or rather sets of data) too, using SS as the datatype.

dynamo.put_item("people", {username: {S: "roidrage"},
                               tags: {SS: ["nosql", "cloud"]}})

Note that you can use PutItem on existing items, but you'll always replace all existing attributes and values.
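
For example, writing the item again with just a full name drops the tags attribute stored above:

dynamo.put_item("people", {username: {S: "roidrage"},
                           fullname: {S: "Mathias Meyer"}})
# the tags attribute from the previous PutItem is gone now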

Conflicts and Conditional Writes

All writes can include optional conditions to check before updating an item. That way you can ensure the item you're writing is still in the same state as your local copy of the data. Think: read, update some attribute, then write, expecting and ensuring to not have any conflicting updates from other clients in between.

This is a pretty neat feature: you can make updates to one attribute conditional on whether another attribute exists or has a certain value.

dynamo.put_item("people",
    {username: {S: "roidrage"}, fullname: {S: "Mathias Meyer"}},
    {Expected: {fullname: {Exists: false}}})

This operation is only executed if the item as currently stored in DynamoDB doesn't have the attribute fullname yet. On a subsequent PutItem call, you could also check whether the item's fullname attribute still has the same value, e.g. when updating the full name of user "roidrage".

dynamo.put_item("people",
    {username: {S: "roidrage"}, fullname: {S: "Mathias Meyer"}},
    {Expected: {fullname: {Value: {S: "Mathias Meyer"}}}})

It's not the greatest syntax for sure, but it does the job. Which brings me to some musings about consistency. If the condition fails, Amazon returns a 400 response, just like with every other possible error you could cause.

To make proper use of conditional updates in your application and to actually prevent conflicting writes and lost updates, you should use the UpdateItem action instead and only update single attributes, as PutItem always replaces the entire item in the database. But even then, make sure to always reference the right attributes in an update. You could even implement some sort of versioning scheme on top of this, for instance to emulate multi-version concurrency control.
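
Here's a rough sketch of what such a versioning scheme could look like, assuming fog's update_item accepts an options hash with Expected the same way put_item does; the version attribute is something you'd maintain yourself:

current_version = 3

# only change the full name if nobody else bumped the version in the meantime
dynamo.update_item("people", {HashKeyElement: {S: "roidrage"}},
    {fullname: {Value: {S: "Mathias Meyer"}},
     version:  {Value: {N: (current_version + 1).to_s}}},
    {Expected: {version: {Value: {N: current_version.to_s}}}})
# if the stored version no longer matches, the write fails with a
# conditional check error instead of silently overwriting the item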

Updating single or multiple attributes is done with the UpdateItem action. You can update attributes without affecting others and add new attributes as you see fit. To add a new attribute street, here's the slightly more verbose code.

dynamo.update_item("people", {HashKeyElement: {S: "roidrage"}},
    {street: {Value: {S: "Gabriel-Max-Str. 3"}}})

There are more options involved than with PutItem, and there are several more available for UpdateItem. But they'll have to wait for just a couple of sentences.

Consistency on Writes

Conditional updates are the first hint that DynamoDB is not like Dynamo at all, as they assume some sort of semi-transactional semantics, wherein a set of nodes agrees on a state (the conditional expression) and then all of them apply the update. The same is true for atomic counters, which we'll look at in just a minute.

From the documentation it's not fully clear how writes without a condition or without an atomic counter increment are handled, or what happens when two clients update the same attribute at the same time and who wins based on which condition. There is no user-facing mechanism to detect conflicting writes, so I can only assume DynamoDB either lets the last write win or uses a scheme similar to BigTable's, with timestamps for each attribute.

Writes don't allow you to specify something like a quorum, telling DynamoDB how consistent you'd like the write to be; it seems to be up to the system to decide when and how quickly replication to other datacenters is done. Alex Popescu's summary on DynamoDB and Werner Vogels' introduction suggest that writes are replicated across data centers synchronously, but don't say to how many. On a wild guess, two data centers would make the write durable enough, leaving the rest to asynchronous replication.

Consistency on Reads

For reads, on the other hand, you can tell DynamoDB if stale data is acceptable to you, resulting in an eventually consistent read. If you prefer strongly consistent reads, you can specify that on a per-operation basis. What works out well for Amazon is the fact that strongly consistent reads cost twice as much as eventually consistent reads, as more coordination and more I/O are involved. From a strictly calculating view, a strongly consistent read takes up twice as much capacity as an eventually consistent one.

From this I can only assume that writes, unless conditional, will usually be at least partially eventually consistent, or at least not immediately spread out to all replicas. On reads, on the other hand, you can tell DynamoDB to involve more than just one or two nodes and reconcile outdated replicas before returning a result to you.

Reading Data

There's not much to reading single items. The action GetItem allows you to read entire items or select single attributes to read.

dynamo.get_item("people", {HashKeyElement: {S: "roidrage"}})

Optionally, add the consistency option to get strong consistency, with eventual consistency being the default.

dynamo.get_item("people", {HashKeyElement: {S: "roidrage"}},
                          {ConsistentRead: true})

A read without any more options always returns all data, but you can select specific attributes.

dynamo.get_item("people", {HashKeyElement: {S: "roidrage"}},
                          {AttributesToGet: ["street"]})

Atomic Counters

Atomic counters are a pretty big deal. This is what users want out of pretty much every distributed database. Some say they're using it wrong, but I think they're right to ask for it. Distributed counters are hard, but not impossible.

Counters need to be numerical fields, and you can increment and decrement them using the UpdateItem action. To increment a field by a value, specify a numerical attribute, the value to be incremented by, and use the action ADD. Note that the increment value needs to be a string.

dynamo.update_item("people", {HashKeyElement: {S: "roidrage"}},
                             {logins: {Value: {N: "1"}, Action: "ADD"}})

When the counter attribute doesn't exist yet, it's created for you and set to 1. You can also decrement values by specifying a negative value.

dynamo.update_item("people", {HashKeyElement: {S: "roidrage"}},
                             {logins: {Value: {N: "-1"}, Action: "ADD"}})

Can't tell you how cool I think this feature is. Even though we keep telling people that atomic counters in a distributed system are hard, as they involve coordination and increase vulnerability to the failure of a node involved in an operation, systems like Cassandra and HBase show that it's possible to do.

Storing Data in Sets

Other than numerical values (which need to be specified as strings nonetheless) and strings, you can store sets of both too. A member can only exist once in an attribute set. The neat part is that you can atomically add new members to a set using the UpdateItem action. In the JSON format sent to the server, you always reference members to add as a list, even if it's just a single one. Here's an example:

dynamo.update_item("people", {HashKeyElement: {S: "roidrage"}},
                             {tags: {Value: {SS: ["nosql"]}}})

That always replaces the existing attribute though. To add a member you need to specify an action. If the member already exists, nothing happens, but the request still succeeds.

dynamo.update_item("people", {HashKeyElement: {S: "roidrage"}},
                             {tags: {Value: {SS: ["cloud"]}, Action: "ADD"}})

You can delete elements in a similar fashion, by using the action DELETE.

dynamo.update_item("people", {HashKeyElement: {S: "roidrage"}},
                             {tags: {Value: {SS: ["cloud"]}, Action: "DELETE"}})

The default action is PUT, which replaces the listed attributes with the values specified. If you run PUT or ADD on a key that doesn't exist yet, the item will be automatically created. This little feature and handling of sets and counters in general sounds a lot like things that made MongoDB so popular, as you can atomically push items onto lists and increment counters.

This is another feature that's got me thinking about whether DynamoDB even includes the tiniest bit of the original Dynamo. You could model counters and sets on top of something like Dynamo for sure, based on the ideas behind Commutative Replicated Data Types. But I do wonder if Amazon actually went through all the trouble of building that on top of the traditional Dynamo, or if they implemented something entirely new for this purpose. There is no doubt that operations like these are what a lot of users want, even from distributed databases, so either way, they've clearly hit a nerve.

Column Store in Disguise?

The fact that you can fetch and update single attributes with consistency guarantees makes me think that DynamoDB is actually more like a wide column store like Cassandra, HBase or, gasp, Google's BigTable. There doesn't seem to be anything left from the original, content-agnostic idea of the Dynamo data store, whose name DynamoDB so nicely piggybacks on.

The bottom line is there's always a schema of attributes and values for a particular item you store. What you store in an attribute is up to you, but there's a limit of 64KB per item.

DynamoDB assumes there's always some structure in the data you're storing. If you need something content-agnostic to store data but with similar benefits (replication, redundancy, fault-tolerance), use S3 and CloudFront. Nothing's keeping you from using several services obviously. If I used DynamoDB for something it'd probably not be my main datastore, but an add-on for data I need to have highly available and stored in a fault-tolerant fashion, but that's a matter of taste.

A Word on Throughput Capacity

Whereas you had to dynamically add capacity in self-hosted database systems, always keeping an eye on current capacity limits, you can add more capacity to handle more DynamoDB requests per second simply by issuing an API call.

Higher capacity means paying more. To save money, you could even adjust the capacity on a time-of-day basis, scaling up and down with your traffic. If you go beyond your configured throughput capacity, operations may be throttled.

Throughput capacity is based on the size of the items you read and write and the number of reads and writes per second. Reading one item with a size of up to 1 KB corresponds to one read capacity unit; larger items and more operations per second require more capacity units.
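
As a back-of-the-envelope sketch with made-up workload numbers, assuming one capacity unit per started KB and eventually consistent reads costing half (as mentioned above):

item_size_kb   = 4
reads_per_sec  = 1000
writes_per_sec = 200

units_per_item      = item_size_kb                     # one unit per started KB
read_units_strong   = reads_per_sec * units_per_item   # => 4000
read_units_eventual = read_units_strong / 2            # => 2000
write_units         = writes_per_sec * units_per_item  # => 800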

This is the metric you always need to keep an eye on and constantly measure from your app, not just to validate the invoice Amazon sends you at the end of the month, but also to track your capacity needs at all times. Luckily you can track this using CloudWatch, which can trigger predefined alerts when capacity crosses a certain threshold. Plus, every response to a read includes the consumed read capacity, and the same is true for writes.
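
The raw responses carry that number in a field called ConsumedCapacityUnits. Assuming fog gives you access to the parsed response body, logging it could look roughly like this:

response = dynamo.get_item("people", {HashKeyElement: {S: "roidrage"}})
puts response.body["ConsumedCapacityUnits"]
# => 0.5 for a small, eventually consistent read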

Throughput capacity pricing is pretty ingenious on Amazon's end. You pay for what you reserve, not for what you actually use. As you always have to have more capacity available than you currently need, you always need to reserve more in DynamoDB. But if you think about it, this is exactly how you'd run your own hosted, distributed database. You never want to run very close to capacity, unless you're some crazy person.

Of course you can only scale down throughput capacity once per day, but you can scale up as much as you like, and increases need to be at least 10%. I applaud you for exploiting every possible opportunity to make money, Amazon!

Data Partitioning

Amazon's documentation suggests that Amazon doesn't use random partitioning to spread data across partitions; partitioning is instead done per table. Partitions are created as you add data and as you increase capacity, suggesting either some sort of composite key scheme or a tablet-like partitioning scheme, again similar to what HBase or BigTable do. This is more fiction than fact, it's an assumption on my part. A tablet-like approach certainly would make distributing and splitting up partitions easier than having a fixed partitioning scheme upfront like in Dynamo.

The odd part is that Amazon actually makes you worry about partitions but doesn't seem to offer any way of telling you about them or how your data is partitioned. Amazon seems to handle partitioning mostly automatically and increases capacity by spreading out your data across more partitions as you scale capacity demand up.

Range Keys

Keys can be single values or based on a key-attribute combination. This is a pretty big deal, as it effectively gives you composite keys in a distributed database, think Cassandra's data model. Among other things, it gives you a time series database in the cloud, allowing you to store sorted data.

You can specify a secondary key on which you can query by a range, and which Amazon automatically indexes for you. This is yet another feature that makes DynamoDB closer to a wide column store than the traditional Dynamo data store.

The value of the range key could be an increasing counter (though you'd have to take care of that yourself), a timestamp, or a time-based UUID. Of course it could be anything else entirely, as long as the combination is unique, but time series data is just a nice example for range keys. The neat part is that this way you can extract data for specific time ranges, e.g. for logging or activity feeds.

We already looked at how you define a normal hash key, let's look at an example with a more complex key combining a hash key and a range key, using a numerical type to denote a timestamp.

dynamo.create_table("activities", {
      HashKeyElement: {AttributeName: "username", AttributeType: "S"},
      RangeKeyElement: {AttributeName: "created_at", AttributeType: "N"}},
    {ReadCapacityUnits: 5, WriteCapacityUnits: 5})

Now you can insert new items based on both the hash key and a range key. Just specify at least both attributes in a request.

dynamo.put_item("activities", {
    username: {S: "roidrage"}, 
    created_at: {N: Time.now.tv_sec.to_s},
    activity: {S: "Logged in"}})

The idea is simple, it's a timestamp-based activity feed per user, indexed by the time of the activity. Using the Query action, we can fetch a range of activities, e.g. for the last 24 hours. Just using get_item, you always have to specify a specific combination of hash and range key.
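
Something like this, with a made-up timestamp for the range key:

dynamo.get_item("activities", {HashKeyElement:  {S: "roidrage"},
                               RangeKeyElement: {N: "1327449600"}})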

To fetch a range, I'll have to resort to using Amazon's Ruby SDK, as fog hasn't implemented the Query action yet. That way you won't see the dirty API stuff for now, but maybe that's a good thing.

dynamo = AWS::DynamoDB.new(access_key_id: "YOUR KEY",
                           secret_access_key: "YOUR SECRET")
activities = dynamo["activities"]

items = activities.items.query(
    hash_key: "roidrage",
    range_greater_than: (Time.now - 86400).tv_sec)

This fetches all activity items for the last 24 hours. You can also fetch more historic items by specifying ranges. This example fetches all items for the last seven days.

items = activities.items.query(
    hash_key: "roidrage",
    range_greater_than: 7.days.ago.tv_sec,
    range_less_than: Time.now.tv_sec)

Note that the Query action is only available for tables with composite keys. If you don't specify a range key, DynamoDB will return all items matching the hash key.
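
So a query with just the hash key, following the SDK call style above, would return roidrage's entire activity history:

items = activities.items.query(hash_key: "roidrage")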

Queries using Filters

The only thing that's missing now is a way to do richer queries, which DynamoDB offers by way of the Scan action. fog doesn't have an implementation for this yet, so we once again turn to the AWS Ruby SDK.

Scanning allows you to specify a filter, which can be an arbitrary number of attribute and value matches. Once again the Ruby SDK abstracts the ugliness of the API underneath into something more readable.

activities.items.where(:activity).begins_with("Logged").each do |item|
  p item.attributes.to_h
end

You can include more than one attribute and build arbitrarily complex queries; this example fetches only the items for the user "roidrage" logging in.

activities.items.where(:activity).equals("Logged in").
          and(:username).equals("roidrage").each do |item|
  p item.attributes.to_h
end

You can query for ranges as well, combining the above with getting only items for the last seven days.

activities.items.where(:activity).equals("Logged in").
    and(:username).equals("mathias").
    and(:created_at).between(7.days.ago.tv_sec, Time.now.tv_sec).each {|item|
  p item.attributes.to_h
end

Filters fetch a maximum of 1 MB of data. As soon as that's accumulated, the scan returns, but also includes the last item evaluated, so you can continue where the query left off. Interestingly, you also get the number of items remaining. Something like paginating results dynamically comes to mind. Running a filter is very similar to running a full table scan, though it's rather efficient thanks to SSDs. But don't expect single digit millisecond responses here. As scans heavily affect your capacity throughput, you're better off resorting to their use only for ad-hoc queries, not as a general part of your application's workflow.
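
In terms of the raw API, continuing a scan looks roughly like this. The LastEvaluatedKey and ExclusiveStartKey parameters come from the Scan documentation, while the scan_table helper is hypothetical, since fog doesn't implement Scan yet:

last_key = nil
loop do
  request = {ScanFilter: {activity: {AttributeValueList: [{S: "Logged"}],
                                     ComparisonOperator: "BEGINS_WITH"}}}
  request[:ExclusiveStartKey] = last_key if last_key
  response = scan_table("activities", request) # hypothetical helper wrapping the Scan action

  response["Items"].each {|item| p item}
  last_key = response["LastEvaluatedKey"]
  break unless last_key # no LastEvaluatedKey means the scan is done
end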

Unlike Query and GetItem, the Scan action doesn't guarantee strong consistency, and there's no flag to explicitly request it.

The API

DynamoDB's HTTP API has got to be the worst ever built at Amazon; you could even think it's the first API ever designed at Amazon. You do POST requests even to GET data. Requests and responses can include conditions and validations and their respective errors, the proper Java class name of an error, and other crap. Not to mention that every error caused by any kind of input on your end always causes a 400.

Here's an example of a simplified request header:

POST / HTTP/1.1
Content-Type: application/x-amz-json-1.0
x-amz-target: DynamoDB_20111205.GetItem

And here's a response body from an erroneous request:

{"__type":"com.amazon.coral.validate#ValidationException",
"message":"One or more parameter values were invalid:
The provided key size does not match with that of the schema"}

Lovely! At least there's a proper error message, though it's not always telling you what the real error is. Given that Amazon's documentation is still filled with syntactical errors, this is a bit inconvenient.

The API is some bastard child of SOAP, but with JSON instead of XML. Even using a pretty low level library like fog doesn't hide all the crap from you. Which worked out well in this case, as you see enough of the API to get an idea about its basic workings.

The code examples above don't read very Ruby-like, as I'm sure you'll agree. Though I gotta say, the Ruby SDK provided by AWS feels a lot more like Ruby in its interface.

I don't have very high hopes to see improvements on Amazon's end, but who knows. S3, for example, got a pretty decent REST API eventually.

Pricing

Pricing is done (aside from throughput capacity) per GB stored. If you store millions of items, and they all exceed a size of just a few KB, expect to pay Amazon a ton of money; storage pricing trumps throughput pricing by a lot. Keep the data stored in an item small. If you only store a few large items, that works too, but you may end up being better off choosing one of Amazon's other storage options. You do the math. Pricing for storage and the maximum size for a single item always include attribute names, just like with SimpleDB.

To give you an idea how the pricing works out, here's a simple calculation: 100 GB of data, 1000 reads per second, 200 writes per second, and an item size of 4 KB on average. That's $1253.15 every month, not including traffic. Add 10 GB of data the second month and you're at $1263.15. You get the idea: in this scenario the price is driven much more by read and write operations and item size than by storage. Make your items 6 KB in size, and you're already at $1827.68.

Bottom Line

Though Amazon is doing a pretty good job of squeezing the last drop of money out of their customers with DynamoDB, think about what you're getting in return: no operations, no hosting facilities (let alone in three different datacenters), conditional writes and atomic counters, and a database that (I assume) has years of production experience forged into it.

As is usually the case with Amazon's Web Services, using something like DynamoDB is a financial tradeoff. You get a data store that scales up and down with your demand without you having to worry about the details of hardware, operations, replication and performance. In turn you pay for every aspect of the system. The price for storage is likely to go down eventually, making it a more attractive alternative to hosting an open source NoSQL database yourself. Whether this is an option for your specific use case is a decision only you can make.

If you store terabytes of data, and that data is worth tens of thousands of dollars per month in return for not having to care about hosting, by all means, go for DynamoDB. But at that size, just one or two months of hosting on Amazon pays for buying servers and SSDs for several data centers. That obviously doesn't cover operational costs and personnel, but it's something to think about.

Closing Thoughts

Sorted range keys, conditional updates, atomic counters, structured data and multi-valued data types, fetching and updating single attributes, strong consistency, and no explicit way to handle and resolve conflicts other than conditions. A lot of the features DynamoDB has to offer remind me of everything that's great about wide column stores like Cassandra, but even more so of HBase. This is great in my opinion, as Dynamo would probably not be well-suited for a customer-facing system. And indeed, Werner Vogels' post on DynamoDB seems to suggest DynamoDB is a bastard child of Dynamo and SimpleDB, though with lots of sugar sprinkled on top.

Note that it's certainly possible, and may actually be the case, that Amazon has built all of the above on top of the basic Dynamo ingredients, Cassandra being living proof that it's possible. But if Amazon did reuse a lot of the existing Dynamo code base, they hid it really well. All the evidence points to at least heavy use of a sorted storage system under the covers, which works very well with SSDs, as they make sequential writes and reads nice and fast.

No matter what it is, Amazon has done something pretty great here. They hide most of the complexity of a distributed system from the user. The only option you as a user worry about is whether or not you prefer strong consistency. No quorum, no thinking about just how consistent you want a specific write or read operation to be.

I'm looking forward to seeing how DynamoDB evolves. Only time will tell how big of an impact Amazon's entering the NoSQL market is going to have. Give it a whirl to find out more about it.

Want to know how the original Dynamo system works? Have a look at the Riak Handbook, a comprehensive guide to Riak, a distributed database that implements the ideas of Dynamo and adds lots of sugar on top.

I'm happy to be proven wrong or told otherwise about some of my assumptions here, so feel free to get in touch!

Resources

Be sure to read Werner Vogels' announcement of DynamoDB, and Adrian Cockcroft's comments have some good insights on the evolution of data storage at Netflix and how Cassandra, SimpleDB and DynamoDB compare.

Tags: nosql, dynamo, amazon

Update: Added paragraph on the e-commerce platform I'm using.

Several people have asked me about the publishing tool chain I'm using for the Riak Handbook and the NoSQL Handbook. As Avdi just published his, now's as good a time as any to throw in mine too. After all, it's always nice to know all the options you have.

All text is written in Markdown. I started out with Textile, but just didn't like Textile's way of breaking down single lines into separate paragraphs, so I switched to Markdown later on. Not a painless process, but sufficiently easy thanks to Pandoc.

The Markdown files, one per chapter, are bundled together into a big file and then fed into GitHub's Redcarpet library. I use Albino and Pygments for syntax highlighting, embedded into a custom renderer for Redcarpet. See the Redcarpet README for an example. Redcarpet also has some MultiMarkdown extensions for tables and other things, so that comes in pretty handy.

To generate the PDF I'm using Prince, a neat little tool that takes HTML and a bunch of CSS and generates a nice PDF. It's a commercial tool, and costs some $500 for a single license. That was what threw me off a tiny bit, but luckily there's a handy service called DocRaptor which is basically a pretty API around Prince, but much much cheaper.

You can use Prince locally to generate infinite test PDFs, and then use DocRaptor to generate the final result. Which is what I do. Before generating the PDF through DocRaptor I upload the generated HTML, which includes all the CSS, to S3, to avoid some issues I've had with Unicode characters in the HTML, and tell DocRaptor to fetch it from a URL instead.

You can use web fonts easily; I went for fonts from Google Web Fonts, which are easily embeddable with Prince and DocRaptor. Prince also allows you to do some print-specific customizations, stuff like page titles, page numbers, custom settings for left and right pages, footnotes, and the like. See the documentation for a full list.

For ePub generation I generate a stripped-down version of the HTML, without syntax highlighting, because iBooks doesn't like custom-styled pre or code elements: if you style them, iBooks chooses to ignore any font setting and uses the default font, which is usually not monospaced.

The HTML then goes through Calibre to generate the ePub file, and then I use kindlegen to generate a Kindle file from that. Calibre has a pretty ugly user interface, but it comes with a set of command line tools you can use to automatically convert ebooks from one format to another.

All of the above is wrapped into a Rakefile and some small Ruby classes. I'll be sure to put it up for your perusal soon, but this list should get you started.
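
To give you an idea, here's a stripped-down sketch of what such a Rakefile could look like; the paths, task names and command line flags are simplified examples, not my actual setup:

require 'redcarpet'

desc "Concatenate all chapters and render them to HTML"
task :html do
  mkdir_p "build"
  markdown = Dir["chapters/*.md"].sort.map {|file| File.read(file)}.join("\n\n")
  renderer = Redcarpet::Markdown.new(Redcarpet::Render::HTML,
                                     tables: true, fenced_code_blocks: true)
  File.open("build/book.html", "w") {|file| file.write(renderer.render(markdown))}
end

desc "Generate a local test PDF with Prince"
task :pdf => :html do
  sh "prince build/book.html -o build/book.pdf"
end

desc "Build the ePub with Calibre, then the Kindle file with kindlegen"
task :ebook => :html do
  sh "ebook-convert build/book.html build/book.epub"
  sh "kindlegen build/book.epub -o book.mobi"
end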

To make the book available for purchase I'm using FastSpring. They're not my favorite choice, as they make things like customizing your shop fairly hard, but I chose them because their payout schedule (monthly or bi-monthly) and the fact that they take care of VAT were both important to me, especially being in Europe, where tax laws throw a lot of stones at you for trying to sell things internationally. Unfortunately FastSpring doesn't have any means of updating products and notifying users that updates are available, so I built a small Sinatra app to capture orders and take care of sending out the updates. Not pretty, but it's a small sacrifice compared to the tax hassle I'd have with any other fulfillment provider.

Curious about the result? You should buy the book!