Let me go ahead and say it: I love distributed systems. Why? Simply because they bend my brain. Yesterday I tweeted "Distributed databases are my happy place." One response I got was along the lines of: "then you're probably not running a distributed database in production." Busted! But does it matter? We all love distributed stuff, we love thinking about scaling. They seem like problems everyone wants to have and solve.

But the truth is, I don't, and I can assure you, you don't want to either, sometimes I doubt my brain is even capable properly solving these problems, but that doesn't prevent me from trying. I prefer to work on as small a scale as possible, you could even say I hate distributed systems. Scaling and distribution is a problem most of us don't have, and are probably better of not having.

Truth be told, I'm not highly interested in running highly distributed systems in production, quite the opposite. I prefer maxing out what I have as far up as possible. Sometimes I do take the plunge and just try something new in production, but I'm happy prepared to replace it with something different, even something simpler, if that seems like the better option in the end. Everyone should experiment at some point, but not all the time.

But why then do I love distributed systems? Simply because they make me think about how they could be put to use, what algorithms and the problems involved are, and what implications they would have on a production system, both from an operations and developer perspective. That's where the value is for me, it allows me to simply make informed decisions when the time comes.

Take Riak, for example, on which I gave a shortish talk at yesterday's meet-up of the local Ruby brigade. Riak's distribution model is based on Amazon's Dynamo implementation, with some neat features sprinkled on top. Riak is built by a bunch of really, really smart guys at Basho, whose work I have nothing but respect for, but who also are sane and open enough to tell people when their database may or may not be a good fit (something a certain other database is severly lacking).

Riak is exciting for me because it was the first database that really made me dive into Amazon's Dynamo, and once I started grokking it, it blew my mind. If you haven't read it, please do. It blew my mind simply because it introduced me to a whole new thinking, to heavily distributed storage, with all the potential hot spots, downsides and business use cases for specifics parts of it thrown in. The same is true for Google's BigTable. The technologies involved with both are true mind-benders.

And there's my bottom line. Distributed systems aren't necessarily awesome just because they allow scaling to infinite heights (exaggeration intended), but because they broaden your personal horizon. It's like learning new programming languages. It's about getting new ideas in your head, ideas outside of your everyday working realm. Ideas you can maybe even take back to what you're working on and start applying them where it makes sense, and only if it makes sense. Learning about distributed systems is not just about learning how to use them, but when. Knowing is half the battle.

While you're at it, check out Evan Weaver's "Distributed Systems Primer", a collection of papers on distributed systems, or the papers collection over at NoSQL Summer. Get ready to have your mind blown in whole new ways. Say what you will, that stuff is just fascinating. It appeals to the distributed database lover in me.

Update: Read the comments and below. The issue is not as bad as it used to be in the documentation and the original design, thankfully.

A lot has happened since I've first written about MongoDB back in February. Replica Pairs are going to be deprecated, being replaced by Replica Sets, a working Auto-Sharding implementation, including rebalancing shards, and lots more, all neatly wrapped into the 1.6 release.

The initial draft on how they'd turn out sounded good, but something struck me as odd, and it is once again one of these things that tend to be overlooked in all the excitement about the new features. Before we dive any deeper, make sure you've read the documentation, or check out this rather short introduction on setting up a Replica Set, I won't go into much detail on Replica Sets in general, I just want to point out one major issue I've found with them. Part of the documentation sheds some light on the inner workings of Replica Sets. It's not exhaustive, but to me more interesting than the rest of the documentation.

One part struck me as odd, the paragraph on resyncing data from a new primary (as in master). It's two parts actually, but they pretty much describe the same caveat:

When a secondary connects to a new primary, it must resynchronize its position. It is possible the secondary has operations that were never committed at the primary. In this case, we roll those operations back.


When we become primary, we assume we have the latest data. Any data newer than the new primary's will be discarded.

Did you notice something? MongoDB rolls operations back that were never committed to the primary, discarding the updated data, which is just a fancy term for silently deleting data without further notice. Imagine a situation where you just threw a bunch of new or updated data at your current master, and the data has not yet fully replicated to all slaves, when suddenly your master crashes. According to the protocol the node with the most recent opslog entries takes over the primary's role automatically.

When the old master comes back up, it needs to resynchronize the changes from the current master, before it can play any role in the set again, no matter if it becomes the new primary, or sticks to being a secondary, leaving the new master in place. During that resync it discards data that has not been synchronized to the new master yet. If the opslog on the new master was behind a couple of dozen entries before the old one went down, all that data is lost. I repeat: lost. Think about that.

There's ways to reduce the pain, and I appreciate that they're mentioned appropriately in the documentation. You can tell MongoDB to consider a write successful when it replicated to a certain number of secondaries. But you have to wait until that happened, polling getLastError() for the state of the last operation. Or you could set maxLag accordingly, so that the master will fail or block a write until the secondaries catch up with the replication, though I couldn't for the life of me figure out (using the Googles) where and how to set it.

But I don't approve of this behavior as a default, and the fact that you need to go through the internals to find out about it. Everything else suggests that there's no point of failure in a MongoDB setup using sharding and Replica Sets, even comparing it to the Dynamo way of guaranteeing consistency, which it simply isn't when the client has to poll for a successful write.

It's one of those things that make me reconsider my (already improved) opinions on MongoDB all over again, just when I started to warm up with it. Yes, it's wicked fast, but I simply disagree with their take on durability and consistency. The tradeoff (as in: losing data) is simply too big for me. You could argue that these situations will be quite rare, and I would not disagree with you, but I'm not fond of potentially losing data when they do happen. If this works for you, cool! Just thought you should know.

Update: There's been some helpful comments by the MongoDB folks, and there's good news. Data is not silently discarded in 1.6 anymore, apparently it's stored in some flat file, fixed with this issue, though it's hard for me to say from the commits what exactly happens. The documentation does not at all reflect these changes, but improvements are on the way. I'm still not happy about some of the design decision, but they're rooted in the way MongoDB currently works, and changing that is unlikely to happen, but at least losing data doesn't seem to be an option anymore. If making a bit of a fool out of myself helped to improve on the documentation front, so be it. I can live with that.

Hi, I'm Mathias, and I'm a CouchDB user. I've been using it for almost a year now, and we have a project using it in production, with a side of Redis. I think it's an awesome database, some of its features are simply unrivaled. Offline replication, CouchApps, to name a few. CouchDB just hit version 1.0. It's been a long time coming, with CouchDB having probably one of the longest histories in the non-relational database space. I've heard about it first back in September 2008, when Jan Lehnardt talked about it at a local co-working space. I still blame him for getting me all excited about this whole NoSQL thing. Fun fact: I bookmarked the CouchDB website back in February 2008.

The features being added to it with every release are nothing short of exciting. CouchDB 0.11 got filtered replication, support for URL rewriting and vhosts, amongst other things. But there's still some things that annoy me, that somewhat bug me in my daily work with it.

The following things are not incredible pet-peeves I have with CouchDB. I think CouchDB is pretty awesome, and I really like using it. However, it doesn't come without the occasional odditity that will leave you scratching your head. These probably aren't the only things to be aware of, they're just the most annoying to me. Your mileage may vary. They may or may not be annoying to you, but they're things that are good to know working with CouchDB. Whether CouchDB should or should not have what I'm listing here is a whole different story. It's my wishlist of improvements, if you will.

It's also stuff you're buying into when you move off the beaten path of relational databases. As always, some of these are not hard to find out, some of them do only get really annoying once you're moving into production, or when you get a deeper knowledge of the tool at hand. Nothing specific to CouchDB here, but some of the issues listed below stem from actively using it. Take them with a grain of salt. While they may seem annoying at first, they're things you can live with. Believe me, you can.

Views are updated on read access

You can dump in as many documents as you want, and you can create as many map/reduce views as you want. The truth is, they'll only come all together to slow down your application when you're querying the view. Assume you have a good stash of documents in your database, and you decide you need a new view on your data. Throw in the JavaScript functions and go ahead and query the view. Calling it a slow-down may be a stretch at times though, it really depends on how often your data is updated.

CouchDB will notice that the B-tree for the view doesn't exist yet, so it goes ahead and builds it on the first read. Depending on how many documents you have in your database, that can take a while, putting a good work load on your database.

On every subsequent read, CouchDB will check if documents have changed since the last write, and throw the changed documents at the map and reduce function. So if you only query some views from time to time, but have lots of changes in between, expect some delays on the next read. A way around this would of course be to keep your views warm by reading them regularly, e.g. through a cron job.

When you add new views, be sure to pre-warm them before you first access them in your application. One way would be to add the views at a time where you database isn't accessed as much. It doesn't block all access to the documents, but it sure has a certain impact on your database's performance, and of course the first requests that may time out because CouchDB is building the requested views in the background.

When it comes to just updating a view, and it might take too long, you can set the parameter stale=ok. That way, even if the view data needs to be updated, CouchDB won't update it and just return the last known state of the view's B-tree.

That's all fun and giggles, but when on earth are you supposed to actually update your view? Always reading stale data is not great? I've gotten some odd suggestions when I complained about this elsewhere, but in the end I just want to tell the database that I'm okay with stale data, but that it should update the view in the background.

No automatic compaction

As your database grows and data gets updated, CouchDB leaves old and stale data untouched, appending new data (inserted and updated documents are considered new data) to the end of its database files, a fact that's also true for view files. That has the neat advantage that you can still access old revisions of your documents, but it will also leave your database files growing constantly. Now, depending on the number of documents and updates on them, that might not be a big deal, but it's a good idea to start regular compaction earlier than later.

Riak's Bitcask file backend has a neat way of automatically compacting its files. It appends data in a similar manner as CouchDB, but can determine if a node in the cluster can run compaction on its data, and do so automatically, without much need for human intervention. It'd be nice to have something similar as part of CouchDB without having to run cron jobs to do that.

The append-only mechanism makes CouchDB bullet-proof, no doubt, you'll always have consistent data files on your hard disk, backups are as simple as copying the files elsewhere, or take an EBS volume snapshot at any time. But that level of data consistency comes with a price, and that's an ever-growing data file.

No partial updates

Whenever you update a document in CouchDB, you update it as a whole, there's nothing in between. That kind of makes sense with the way CouchDB works, but as a user it annoys me from time to time. It seems so pointless fetching and sending a whole document when I'm just updating one attribute. There's a neat RFC for the PATCH command in HTTP making the rounds, I'd love to see that end up in CouchDB at some point. No idea how likely that is, the makers of CouchDB have a weird aversion to using diffs to update data.

Note that I'm not talking about the MongoDB way of setting attributes atomically. I don't need that, because it simply doesn't scale well, especially not with the CouchDB storage model, and you're not updating data in-place like MongoDB. It's more about just being able to send a diff or a minor update than a whole document.

You can somewhat fake this using update handlers (look at the view called "in-place") from CouchDB 0.10 on. It's pretty neat, but it's just not the same.

No built-in way to scale up

CouchDB's replication is unrivaled, no doubt. Being able to replicate any database with any other database at any point in time makes CouchDB unique, some say it's the killer feature, and I concur. There's lot of argueing whether or not that defines CouchDB as being distributed. In the most traditional sense, at least to me, it sure does, but I'm not here to nitpick about that. It's easy to scale out by adding more nodes and setting them up to constantly replicate with each other, make anyone a master or slave as you like. But there's no way to distribute write and read access across a cluster of nodes.

CouchDB-lounge has been the traditional way to approaching that, but I never really liked it, because it added more components to the infrastructure. Something like that should really be built in. The good news is that Cloudant is planning on open-sourcing their clustering solution Open Cloudant, which will then hopefully become part of CouchDB. A quorum based system for CouchDB would be neat, and it doesn't seem too far away.

Pagination is awkward

CouchDB's B-tree is a leaky abstraction, that's the conclusion I came to at some point. It has a pretty big impact on your application's code, and that's not necessarily a bad thing. Suddenly you deal with things like conflicts, or simply updating views on reads. But no other part of your web application will make that as obvious as pagination, a pretty common and natural part of a web application.

The path of least resistence to get pagination is to use the skip and limit parameters, but it's not recommended, as you'll still be walking the whole B-tree to determine the number of documents that must be skipped before it can collect the ones you're interested in.

The recommended way to do pagination is a bit awkward if you ask me. There's a good explanation in the CouchDB book, so I'll spare you repeating it here. But be sure to read it, because understanding that takes you half way to understanding the B-tree. It may be awkward, and very different from what you're used to, but that's how the B-tree works. It's not always unicorns and rainbows, sometimes it kinda gets in your way. Trade-offs, meh.

The simpler alternative would of course be to just use endless pagination, where you let the users just click a more button instead of clicking through the pages, because you know the last document displayed in your list, and the key that was used to fetch it. You simply use that key and the last document's id to step directly into the B-tree where you left off. You need to remember to fetch one additional document, as CouchDB will return the last document too, or you can just skip one document, which is acceptable, as skipping just one leaf in a tree is an operation of predictable performance.

Range queries are awkward

To do a range, you have to specify a start and an end key. That's the simple part. It starts getting awkward when your keys get slightly more complex, e.g. when your map function emits arrays. Assume you want to fetch all elements where the first part of the array matches a particular key, and the second part doesn't matter, e.g. when you emitted a timestamp as the second part to keep a natural (in terms of last update for example) order.

Assume your keys look like this: ['123', '2010/07/21'], that's the key format SimplyStored uses to manage associations between documents. To get the range that only matches the first part of the key, your startkey has to look like this: ['123']. This will match all documents having the above key. If you don't specify an endkey, CouchDB will simply return all documents following that key, so you need to specify an endkey. The recommended way to do that is to use the following format: ['123', {}]. That way you'll get all documents matching the first part of the key, because {} is considered to be greater than any string you may have emitted. See the CouchDB wiki on more details on this technique called view collation.

Obviously it's not impossible to do range queries in CouchDB, but it's slightly awkward. It all goes downhill as soon as you want to fetch only a particular subrange of the original one, using startkeydocid or endkeydocid, say for pagination. With the above ranges, they simply don't work. Both need a startkey and endkey that is an exact match. The whole point of the above range query is not to care about the exact start and end key, isn't it?

No CommonJS available in MapReduce functions

With CouchDB 0.11, CommonJS and all its awesomeness became available in view functions. I was pretty excited about it, and I still am. However, map and reduce functions were left out in the cold. Every time I have to write the same piece of JavaScript in a map or reduce function that I've used elsewhere already, I get bitter about that. Sometimes it's just very basic stuff that I could easily solve by throwing an existing library at it, but instead I'm cluttering my view code with it over and over again. And yes, there's the !code placeholder, but it's not about throwing an undebuggable mess of code into my view function, it's about not repeating myself. !code doesn't really solve that problem good enough for me.

Word is that it's got something to do with determining whether files have updated or not, but hey CouchDB, why don't you let me worry about that and let me tell you when I think a file I've included through CommonJS has been updated? I would very much appreciate that.

No link-walking between documents

With CouchDB 0.11, map functions got a way to emit other documents using {_id: doc.other_id}, but that still doesn't allow full access to e.g. attributes of said documents. Sometimes that'd just be handy to have. Sure, you could use embedded documents, but in that case it'd just be a dumb workaround, where I could just have a way to fetch a document by its identifier and throw some of its attributes at the map function.

Say what you will though, just being able to emit other documents is still pretty cool. Makes querying and fetching associated documents a bit easier.

All reads go to disk

CouchDB doesn't cache anything. It does delay commits if you want it to, so that it doesn't hit the disk on every document update, but it sure as heck doesn't cache anything in memory. This is both curse and blessing. It keeps the memory footprint of CouchDB incredibly small, no doubt. Considering they're targeting mobile devices it makes a lot of sense, plus, accessing flash-based storage is a lot cheaper than spinning disks.

But, on the other hand, when I have the memory available, why not use it? I know caching is a hard problem to solve. CouchDB is also made for high concurrency, no doubt, but my disks aren't necessarily. Sure, I could buy faster disks, but if you really think about it, memory is the new disk, plus, tell Amazon to offer faster network storage for EC2, please do, maybe that'd already help. CouchDB somewhat relies on the file system cache doing its magic to speed up things, but I really don't want to rely on magic. You could put an HTTP-level reverse proxy like Varnish in front of CouchDB though, that'd be a feasable option, but that adds another layer to your infrastructure.

In all seriousness, I'd love to see some caching introduced in CouchDB. I won't say it's an easy feature to implement, because it sure isn't, but it doesn't need to be something fancy, I just would like to see CouchDB use some of my memory for data that's read more often than it's written. But until then, Varnish to the rescue!

Error messages are not helping

I'm just gonna post the following snippet from my CouchDB log file, and leave you to it. You tell me how useful it is. Suffice it to say, I just wish CouchDB would not dump all that Erlang trace into my log, but maybe a useful error message for a change. It works in some cases, but a lot of times, when the problem usually is as simple as a permissions problem, you're left scratching your head.


The End

There you go, some annoying things about CouchDB. They're annoying, but I still like CouchDB a lot. It's stuff I can live it, it's stuff I can work around, it's stuff that doesn't have as big an effect in production as it may seem. The bottom line is, as always, evaluate your tools. The above list is not to be taken as a list of arguments purely against using CouchDB. Consider them a list of things you need to be aware of, that may or may not be acceptable compared to what you gain.

In the end, and any way you look at it, CouchDB still kicks butt.

Tags: couchdb