For an article in a German magazine I've been researching MongoDB over the last week or so. While I didn't need a lot of the information I came across, I collected some nicely distilled notes on its inner workings. You won't find information here on how to get data into or out of MongoDB. The notes deal with the way MongoDB treats and handles your data, a high-level view of the low-level details, if you will. I tried to keep them as objective as possible, but I added some commentary below.

Most of this is distilled knowledge I gathered from the MongoDB documentation; credit for making such a good resource available for us to read goes to the Mongo team. I added some of my own conclusions where it made sense. They're doing a great job documenting it, and I can highly recommend spending time going through as much of it as possible to get a good overview of the whys and hows of MongoDB. Also, thanks to Mathias Stearn for hooking me up with some more details on future plans and inner workings in general. If you want to know more about its inner workings, there's a webcast coming up where they're going to explain how it works.

Basics

  • Name stems from humongous, though (fun fact) mongo has some unfortunate meanings in languages other than English (German, for example)
  • Written in C++.
  • Lots of language drivers available, pushed and backed by the MongoDB team. Good momentum here.
  • According to The Changelog Show ([1]) MongoDB was originally part of a cloud web development platform, and at some point was extracted from the rest, open sourced and turned into what it is today.

Collections

  • Data in MongoDB is stored in collections, which in turn are stored in databases. Collections are a way of storing related data (think relational tables, but sans the schema). Collections contain documents, which in turn have keys, another name for attributes.
  • Data is limited to around 2 GB on 32-bit systems, because MongoDB uses memory-mapped files, which are tied to the available address space. (see [2])
  • Documents in collections usually have a similar data structure, but any arbitrary kind of document could be stored; similarity is recommended for index efficiency. Documents can have a maximum size of 4 MB.
  • Collections can be namespaced, i.e. logically nested: db.blog.posts, but the collection is still flat as far as MongoDB is concerned, purely an organizational means. Indexes created on a namespaced collection only seem to apply to the namespace they were created on though.
  • A collection is physically created as soon as the first document is created in it.
  • Default limit on number of namespaces per database is 24000 (includes all collections as they're practically the top level namespace in a database), which also includes indexes, so with the maximum of 40 indexes applied to each collection you could have 585 collections in a database. The default can be changed of course, but requires repairing the database if changed on an active instance.
  • While you can put all your data into one single collection, from a performance point of view it makes sense to separate different kinds of documents into different collections, because it allows MongoDB to keep its indexes clean, as they won't index attributes for totally unrelated documents. A quick shell sketch follows below.
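
To make lazy collection creation and namespacing concrete, here's a minimal mongo shell sketch; the database and collection names (blog, posts, blog.authors) are made up for illustration:

    use blog
    db.posts.insert({title: "Hello", tags: ["mongodb"]})   // the posts collection springs into existence on this first insert
    db.blog.authors.insert({name: "Alex"})                 // namespaced collection, still flat as far as MongoDB is concerned
    show collections                                       // posts, blog.authors, system.indexes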

Capped Collections

  • Capped collections are fixed-size collections that automatically remove the oldest entries as new ones come in. Sounds fancier than it probably is; I'm thinking that documents are just appended at the last writing index, which is reset to 0 when the limit of the collection is reached. Preferable for insert-only use cases: updates of existing documents fail when the data size is larger than before the update, which makes sense because moving an object would destroy the natural insertion order. Limited to ~1 GB on 32-bit systems, sky's the limit on 64-bit.
  • Capped collections seem like a good tool for logging data, knowing full well that old data is purged automatically and replaced with new data when the limit is reached. Documents can't be deleted, only the entire collection can be dropped. Capped collections have no index on _id by default, ensuring good write performance; indexes are generally not recommended on them, to keep write performance high. No index on _id means that walking the collection is preferred over looking up documents by key.
  • Documents fetched from a capped collection are returned in insertion order by default; reverse the natural order to get the newest entries first, think log tailing (sketched below).
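
A minimal sketch in the mongo shell; the collection name and size are made up:

    // Create a capped collection of roughly 10 MB for log entries.
    db.createCollection("log", {capped: true, size: 10 * 1024 * 1024})
    db.log.insert({level: "info", msg: "started", ts: new Date()})
    // Natural order equals insertion order here; reverse it to tail the newest entries first.
    db.log.find().sort({$natural: -1}).limit(10)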

Data Format

  • Data is stored and queried in BSON, think binary-serialized JSON-like data. Its features are a superset of JSON, adding support for regular expressions, dates, binary data, and their own object id type. All strings are stored as UTF-8 in BSON; sorting, on the other hand, is not UTF-8 aware (yet), it uses strcmp, so the order might be different from what you'd expect. There's a sort of specification for BSON, if you're into that kind of stuff: [3] and [4]
  • Documents are not identified by a simple ID, but by an object identifier type optimized for storage and indexing, using a machine identifier, timestamp and process id to be reasonably unique. That's just the default though, the user is free to assign any value as a document's ID (see the sketch after this list).
  • MongoDB has a "standard" way of storing references to other documents using the DBRef type, but it doesn't seem to have any advantages (e.g. fetch associated objects with parent) just yet. Some language drivers can take the DBRef object and dereference it.
  • Binary data is serialized in little-endian.
  • Being a binary format, MongoDB doesn't have to parse documents like it would with JSON; they're already a valid in-memory representation when they come across the wire.
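
A quick shell illustration of the ID bullet above; the ObjectId value shown is of course made up:

    db.posts.insert({title: "Auto-generated ID"})             // _id becomes e.g. ObjectId("4c2209f9f3924d31102bd84a")
    db.posts.insert({_id: "my-custom-id", title: "Custom"})   // but any unique value works as _id
    db.posts.findOne({_id: "my-custom-id"})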

References

  • Documents can embed a tree of associated data, e.g. tags, comments and the like, instead of storing them as separate MongoDB documents. This is not specific to MongoDB, but document databases in general (see [5]), but when using find you can reach into nested objects with dot notation, e.g. comments.body on a posts collection, and index them with the same notation (see the sketch after this list).
  • It's mostly left to the language drivers to implement automatic dereferencing of associated documents.
  • It's possible to reference documents in other databases.
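
A hedged sketch of the dot notation, with made-up collection and field names:

    db.posts.insert({title: "Nested data", comments: [{author: "joe", body: "nice post"}]})
    db.posts.find({"comments.author": "joe"})      // reach into embedded documents with dot notation
    db.posts.ensureIndex({"comments.body": 1})     // and index embedded attributes the same way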

Indexes

  • Every collection gets a default index on the _id attribute (capped collections excepted, see above), which also enforces uniqueness. It's recommended to index any attribute that's being queried or sorted on.
  • Indexes can be set on any attribute or embedded attributes and documents. Indexes can also be created on multiple attributes, additionally specifying a sort order.
  • If an array attribute is indexed, MongoDB will index all the values in it (multikeys).
  • Unique indexes are possible; missing attributes are set to null in the index, so two documents that both lack the attribute count as duplicates and only one of them can be stored.
  • If it can, MongoDB will only update indexes on keys that changed when updating a document, provided the document hasn't grown so much that it had to be moved.
  • MongoDB up to 1.2 creates and updates indexes synchronously, 1.3 has support for building indexes in the background (examples below).
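
A few shell examples of the index variants mentioned above; collection and attribute names are made up:

    db.posts.ensureIndex({author: 1})                       // single attribute
    db.posts.ensureIndex({author: 1, created_at: -1})       // multiple attributes with a sort order
    db.posts.ensureIndex({slug: 1}, {unique: true})         // unique index
    db.posts.ensureIndex({tags: 1})                         // array attribute, becomes a multikey index
    db.posts.ensureIndex({title: 1}, {background: true})    // 1.3+: build the index in the background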

Updates

  • Updates to documents are done in-place, allowing for partial updates and atomic operations on attributes ($set on any attribute, $inc on numbers, $push, $pop, $pull et al. on arrays), also known as modifier operations. If an object grows out of the space originally allocated for it, it'll be moved, which is obviously a lot slower than updating in-place, since indexes need to be updated as well. MongoDB tries to adapt by padding new allocations based on update history (see [6]). Writes are lazy.
  • Not using any modifier operation will result in the full document being updated.
  • Updates can be done with criteria, matching a whole bunch of documents, think "update ... where" in SQL. This also allows for updating objects based on a particular snapshot, i.e. an update based on id plus some value in the criteria will only go through when the document still has that value. Such a single-document update is atomic. Reliably updating multiple documents atomically (think transaction) is not possible. There's also findAndModify in 1.3 (see [7]), which atomically updates and returns a document.
  • Upserts insert when a record with the given criteria doesn't exist, otherwise they update the found record. They're executed on the collection. A normal save() will do that automatically for any given document. Think find_or_create_by in ActiveRecord. Examples of all of this follow below.
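
Some hedged shell sketches of the above; the collection names and the id variable are placeholders:

    // Modifier operations update in-place.
    db.posts.update({_id: id}, {$inc: {views: 1}, $push: {tags: "mongodb"}})
    // Criteria as a snapshot check: only applies if the document still matches.
    db.posts.update({_id: id, state: "draft"}, {$set: {state: "published"}})
    // Upsert: the third argument makes update insert if nothing matches the criteria.
    db.counters.update({name: "hits"}, {$inc: {value: 1}}, true)
    // 1.3+: atomically update and return a document.
    db.runCommand({findandmodify: "queue", query: {state: "new"}, update: {$set: {state: "taken"}}})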

Querying

  • Results are returned as cursors that walk the collection as they advance. That explains why you can potentially see records that had to be moved: they pop up again in a spot that happens to lie after the cursor's current position, even if there would have been space in a spot before it. Cursors are fetched in batches of 100 documents or 4 MB of data, whichever is reached first.
  • That's also why it's better to store similar data in separate collections. Traversing similar data is cheaper than traversing totally unrelated data: the more documents that don't match your find sit next to the ones that do, and the bigger they are, the more data has to be fetched from the database and skipped.
  • Data is returned in natural order, which doesn't necessarily correspond to insertion order, as data can be moved when it no longer fits into its old spot after an update. For capped collections, natural order is always insertion order (see the sketch below).
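
A small shell sketch of cursors and natural order; names are made up:

    var cursor = db.posts.find({author: "joe"})   // returns a cursor, documents are fetched in batches
    while (cursor.hasNext()) { printjson(cursor.next()) }
    db.posts.find().sort({$natural: 1})           // natural order; equals insertion order only for capped collections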

Durability

  • By default, data in MongoDB is flushed to disk every 60 seconds. Writes to MongoDB (i.e. document creates, updates and deletes) are not stored on disk until the next sync. It's a tradeoff of high write performance vs. durability; if you need more durability, reduce the sync delay. The closest comparison for this durability behaviour is MySQL's MyISAM.
  • Data is not written transactionally, so if the server is killed during a write operation, the data is likely to be inconsistent or even corrupted and needs repair. Think classic file systems like ext2, or MyISAM.
  • In MongoDB 1.3 a flush to disk can be forced by sending the fsync command to the server (shown below).
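
From the shell that looks roughly like this; fsync is an admin command:

    use admin
    db.runCommand({fsync: 1})   // force a flush of all pending writes to disk (1.3+)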

Replication

  • Replication is the recommended way of ensuring data durability and failover in MongoDB. A new (i.e. bare and dataless) instance can be hooked onto another at any time, doing an initial cloning of all data, fetching only updates after that.
  • Replica pairs offer an auto-failover mechanism. Initially both settle on which is master and which is slave, the slave taking over should the master go down. Can be used e.g. in the Ruby driver using the :left and :right options. There's an algorithm to handle changes when master and slave get out of sync, but it's not fully obvious to me (see [8]). Replica Pairs will be replaced by Replica Sets, allowing for more than one slave. The slave with the most recent data will be promoted to master should the master go down. The slaves agree among themselves which one is the new master, so a client can ask any server in the set which one of them is the master.
  • Replication is asynchronous, so updates won't propagate to the slaves immediately. There are ideas to require a write to be propagated to at least N slaves before reporting it back to the client as successful (similar to the feature in MySQL 5.4). (see [9])
  • A master collects its writes in an oplog, which the slaves simply poll for changes. The oplog is a capped collection and therefore not a fully usable transaction log (not written to disk?), as old data is purged automatically, hence it's not reliable for restoring the database after a crash (see the sketch after this list).
  • After the initial clone, slaves poll the full oplog once; subsequent polls remember the position where the previous poll ended.
  • Replication is not transactional, so the data on a slave is subject to the same durability conditions as on the master. Still, having a slave increases overall durability: you can lower the sync delay on the slave and thereby shorten the timespan during which data isn't written to disk anywhere in the setup.
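
You can peek at the replication machinery from the mongo shell. This assumes the 1.x master/slave setup, where the master's oplog lives in the capped collection oplog.$main in the local database:

    db.printReplicationInfo()                      // oplog size and the time range it currently covers
    use local
    db.getCollection("oplog.$main").find().sort({$natural: -1}).limit(5)   // the oplog is just a capped collection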

Caching

  • With the default storage engine, caching is basically handled by the operating system's virtual memory manager, since it uses memory-mapped files. File cache == Database cache
  • Caching behaviour therefore relies on the operating system and can vary; it's not necessarily the same on every operating system.

Backup

  • If you can live with a temporary write lock on your database, MongoDB 1.3 offers fsync with lock to take a reliable snapshot of the database's files (see below).
  • Otherwise, take the old school way of dumping the data using mongodump, or snapshotting/dumping from a slave database.
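
Roughly, the fsync-with-lock dance looks like this in the shell; the unlock incantation is the one documented for that era, so treat it as an assumption:

    use admin
    db.runCommand({fsync: 1, lock: 1})   // flush to disk and block further writes
    // ... copy or snapshot the database files on disk ...
    db.$cmd.sys.unlock.findOne()         // release the write lock again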

Storage

  • Data is stored in subsequently numbered data files, each new one being larger than the former, 2GB being the maximum size a data file can have.
  • Allocation of new data files doesn't seem to be exactly related to the amount of data currently being stored. E.g. the storage size returned by MongoDB for a collection was 2874825392 bytes, but it had already created almost six gigabytes worth of database files. Maybe that's the result of padding space for records; I haven't found clear documentation on this behaviour.
  • When MongoDB moves data into a different spot or deletes documents, it keeps track of the free space to reuse it in the future. The command repairDatabase() can be used to compact the database, but that's a slow and blocking operation (see below).
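
From the shell (collection name made up):

    db.posts.validate()     // reports data size and storage details for one collection
    db.repairDatabase()     // rewrites and compacts the database files; slow, and blocks the database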

Concurrency

  • MongoDB refrains from using any kind of locking on data; it has no notion of transactions or isolation levels. Concurrent writes will simply overwrite each other's data, as they go straight to memory. Exceptions are the modifier operations, which are guaranteed to be atomic. As there is no way to update multiple records in some sort of transaction, optimistic locking is not possible, at least not in a fully reliable way. Since writes are in-place and in-memory first, they're wicked fast.
  • Reads from the database are usually done via cursors, fetching a batch of documents lazily while iterating through them. If records in the cursor are updated while the cursor is being read from, the updated data may or may not show up; there's no kind of isolation level (as there are no locks or snapshotting). Deleted records will be skipped. If a record is updated from another process so that it grows and has to be moved to another spot, there's a chance it's returned twice.
  • There are snapshot queries, but even they may or may not return inserted and deleted records. They do ensure that updated records are returned only once, but they're slower than normal queries (see below).
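
In the shell, snapshot mode is just a cursor modifier (names made up):

    db.posts.find({author: "joe"}).snapshot()   // each matching document is returned at most once, even if it's updated meanwhile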

Memory

  • New data is allocated in memory first, increments seem to be fully related to the amount of data saved.
  • MongoDB seems to be happy to hold on to whatever memory it can get, but at least during fsync it frees as much as possible. Sometimes it just went back to consuming about 512 MB real memory, other times it went down to just a couple of megs, I couldn't for the life of me make out a pattern.
  • When a new database file needs to be created, it looks like MongoDB is forcing all data to be flushed to disk, freeing a dramatic amount of memory. On normal fsyncs, there's no real pattern as to how MongoDB frees memory.
  • It's not obvious how a user can configure how much memory MongoDB can or should use, I guess it's not possible as of now. Memory-mapped files probably just use whatever's available and are cleaned up automatically by the operating system's virtual memory system.
  • The need to add an additional caching layer is reduced, since the object and database representation are the same, and the file system cache and memory can work together to speed up access. There's no data conversion involved, at least not on MongoDB's side; data is just sent serialized and unparsed across the wire. Obviously it depends on the use case whether this really is an advantage or a secondary caching layer is still needed.

GridFS

  • Overcomes the 4MB limit on documents by chunking larger binary objects into smaller bits.
  • Can store metadata alongside file data. Metadata can be specified by the user and be arbitrary, e.g. contain access control information, tags, etc.
  • Chunks can be randomly accessed, so it's possible to easily fetch data whose position in the file is well known. If random access is required, it makes sense to keep chunks small. The default chunk size is 256K (see the sketch below).
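
Under the hood GridFS is just two collections, which you can inspect from the shell; someId is a placeholder for a file's _id:

    db.fs.files.findOne()                           // per-file metadata: filename, length, chunkSize, md5, plus custom fields
    db.fs.chunks.find({files_id: someId}).count()   // number of chunks making up that file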

Protocol Access

  • MongoDB's protocol is binary and proprietary in its own right, hence the team offers a lot of language drivers to take that pain away from developers, but also a full specification of both BSON and the protocol.

Sharding

  • MongoDB has alpha support for sharding. Its functionality shouldn't be confused with Riak's way of partitioning, it's a whole different story. The current functionality is far from what is planned for production, so take everything listed here with a grain of salt, it merely presents the current state. The final sharding feature is supposed to be free of the restrictions listed here.
  • A shard ideally (but not necessarily) consists of two servers which form a replica pair, or a replica set in the future.
  • All shards are known to a number of config server instances that also know how and where data is partitioned to.
  • Data can be sharded by a specific key. That key can't be changed afterwards, neither can the key's value.
  • The key chosen should be granular enough that you don't end up with too many records sharing the same key value. Data is split into chunks of 50 MB, so with big documents it's probably better to store them in GridFS, as a chunk can contain as few as ~12 documents when each maxes out the 4 MB document size.
  • Sharding is handled by a number of mongos instances which are connected to the shards, which in turn are all known to a number of mongod config server instances. These can run on the same machines as the data-handling mongod instances, with the risk that they disappear too when those servers go down; having backup servers for them seems appropriate in this scenario. The basic setup commands are sketched after this list.
  • Sharding is still in alpha, e.g. currently replicated shards aren't supported in alpha 2, so a reliable sharding setup is currently not possible. If a shard goes down, the data on it is simply unavailable until it's brought back up. Until that happens, all reads will raise an error, even when looking up data that's known to be on the still available shards.
  • There's no auto-balancing to move chunks to new shards, but that can be done manually.
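
For completeness, roughly what the alpha-stage setup looked like from a mongo shell connected to a mongos; host names, ports, database and shard key are made up, and the exact command names may have shifted since:

    use admin
    db.runCommand({addshard: "shard1.example.com:27018"})
    db.runCommand({enablesharding: "blog"})
    db.runCommand({shardcollection: "blog.posts", key: {author: 1}})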

There you go. If you have something to add or to correct, feel free to leave a comment. I'm happy to stand corrected should I have drawn wrong conclusions anywhere.

As a user of CouchDB I gotta say, I was quite sceptical about some of MongoDB's approaches of handling data. Especially durability is something that I was worried about. But while I read through the documentation and played with MongoDB I realized that it's the same story as always: It depends. It's a problem when it's a problem. CouchDB and MongoDB don't necessarily cover the same set of use cases. There's enough use cases where the durability approach of MongoDB is acceptable compared to what you gain, e.g. in development speed, or speed when accessing data, because holy crap, that stuff is fast. There's a good reason for that, as I hope you'll agree after going through these notes. I'm glad I took the time to get to know it better, because the use cases kept popping up in my head where I would prefer it over CouchDB, which isn't always a sweet treat either.

If you haven't already, do give MongoDB a spin, go through their documentation, throw data at it. It's a fun database, and the entrance barrier couldn't be lower. It's a good combination of relational database technologies, with schemalessness and JavaScript sprinkled on top.


June was an exhausting month for me. I spoke at four different conferences, two of which were not in Berlin. I finished the last talk today, so it's time to recap the conferences and talks. In all I had good fun. It was a lot of work to get the presentations done (around 400 single slides altogether), but I'd dare say it was more than good practice to work on my presentation skills and to lose a bit of the fear of talking in front of people. But I'll follow up on that stuff in particular in a later post.

RailsWayCon in Berlin

I have to admit that I didn't see much of the conference, I mainly hung around, talked to people, and gave a talk on Redis and how to use it with Ruby. Like last year the conference was mingled in with the International PHP Conference and the German Webinale, a somewhat web-related conference. I made a pretty comprehensive set of slides for Redis, available for your viewing pleasure.

Berlin Buzzwords in Berlin

Hadoop, Lucene, NoSQL, Berlin Buzzwords had it all. I spent most of my time in the talks on the topics around NoSQL, having been given the honor of opening the track with a general introduction on the topic. I can't remember having given a talk in front of this many people. The room took about 250, and it seemed pretty full. Not tooting my own horn here, I've never been more anxious before a talk of how it would go. Obviously there were heaps of people in the room who have only heard of the term, and people who work with or on the tools on a daily basis. Feedback was quite positive, so I guess it turned out pretty okay. Rusty Klophaus wrote two very good recaps of the whole event, read on about day one and day two.

The slide set for my talk has some 120 slides in all, trying to give a no-fuss overview of the NoSQL ecosystem and the ideas and inspirations behind it. There are some historical references in the talk, because in general the technologies aren't revolutionary; they use ideas that have been around for a while and combine them with some newer ones. Do check out the slides for some more details on that.

MongoUK in London

10gen is running MongoDB related conferences in a couple of cities, one of them in London, where I was asked to speak on something related to MongoDB. Since I'm all about diversity, that's pretty much what I ended up talking about, with a hint of MongoDB sprinkled on top of it. Document databases, the web, the universe, all the philosophical foundation knowledge you could ask for. I talked about CouchDB, Riak, and about what makes MongoDB stand out from the rest.

Most enjoyable about MongoUK was to hear about real life experiences of MongoDB users, what kind of problems they had and such. Also, I finally got to see some of London and meet friends, but I'll write more about that (and coffee) on my personal blog. Again, the slide set is available for your document database comparison pleasure.

Cloud Expo Europe in Prague

Just 36 hours after I got back from London I jumped on the train to Prague to speak about MongoDB at Cloud Expo Europe. Cloud is something I can get on board with (hint: Scalarium), so why the hell not? It turned out to be a pretty enterprisey conference, but still, I got some new food for thought on cloud computing in general.

I had already given a talk on MongoDB at Berlin's Ruby brigade, but I built a different slide set this time, improving on the details I found to be a bit confusing at first. Do check out the slides; if you don't know anything about MongoDB yet, they should give you a good idea.

Showing off

As you'll surely notice, my slides are all websites, and not on Slideshare. Two months ago I looked into Scott Chacon's Showoff, a tool to build web-based presentations that simply run as tiny JavaScript apps in the browser. I very much like that idea, because even though Keynote is still the king of the crop, it's still awful. Using Markdown, CSS and JavaScript appeals much more to the geek in me. It's so easy to crank out slides as simple text and worry about the styling later. Plus, I can easily keep my slides in Git, and who doesn't enjoy that? I'd very much recommend giving it a go. If you want to look at some sources, all my talks and their sources are available on the GitHubs: MongoDB, Redis, NoSQL, document databases and again MongoDB.

It's a pleasure to build slides with Showoff, and it has helped me focus my slides on very short phrases and as few bullet points as possible. Sure, it's not Keynote and doesn't have all the fancy features, but I noticed that it forced me to focus more, and that keeping slides short helped me stay focussed. But again, more on that in a follow-up post.

Feel free to use my slides as inspiration to play with Showoff, there's surprisingly little magic involved. Also, if you think I should speak at a conference you know of or that you're organising, do get in touch.

Update: Read the comments and the update at the bottom of this post. The issue is not as bad as it used to be in the documentation and the original design, thankfully.

A lot has happened since I first wrote about MongoDB back in February. Replica Pairs are going to be deprecated, replaced by Replica Sets; there's a working Auto-Sharding implementation, including rebalancing of shards; and lots more, all neatly wrapped into the 1.6 release.

The initial draft on how they'd turn out sounded good, but something struck me as odd, and it is once again one of these things that tend to be overlooked in all the excitement about the new features. Before we dive any deeper, make sure you've read the documentation, or check out this rather short introduction on setting up a Replica Set, I won't go into much detail on Replica Sets in general, I just want to point out one major issue I've found with them. Part of the documentation sheds some light on the inner workings of Replica Sets. It's not exhaustive, but to me more interesting than the rest of the documentation.

One part struck me as odd, the paragraph on resyncing data from a new primary (as in master). It's two parts actually, but they pretty much describe the same caveat:

When a secondary connects to a new primary, it must resynchronize its position. It is possible the secondary has operations that were never committed at the primary. In this case, we roll those operations back.

Also:

When we become primary, we assume we have the latest data. Any data newer than the new primary's will be discarded.

Did you notice something? MongoDB rolls back operations that were never committed to the primary, discarding the updated data, which is just a fancy term for silently deleting data without further notice. Imagine a situation where you just threw a bunch of new or updated data at your current master, and the data has not yet fully replicated to all slaves when suddenly your master crashes. According to the protocol, the node with the most recent oplog entries takes over the primary's role automatically.

When the old master comes back up, it needs to resynchronize from the current master before it can play any role in the set again, no matter whether it becomes the new primary or sticks to being a secondary, leaving the new master in place. During that resync it discards any data that was never synchronized to the new master. If the new master's oplog was a couple of dozen entries behind when the old one went down, all of that data is lost. I repeat: lost. Think about that.

There are ways to reduce the pain, and I appreciate that they're mentioned in the documentation. You can tell MongoDB to consider a write successful only once it has replicated to a certain number of secondaries, but you have to wait until that happens, polling getLastError() for the state of the last operation. Or you could set maxLag accordingly, so that the master will fail or block a write until the secondaries catch up with replication, though I couldn't for the life of me figure out (using the Googles) where and how to set it.
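
For reference, the getLastError dance looks roughly like this from the shell; the w and wtimeout values are made up:

    db.posts.save({title: "important"})
    db.runCommand({getlasterror: 1, w: 2, wtimeout: 1000})   // block until 2 replicas acknowledge the write, or 1000 ms pass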

But I don't approve of this behavior as a default, nor of the fact that you need to dig through the internals to find out about it. Everything else suggests that there's no single point of failure in a MongoDB setup using sharding and Replica Sets, even comparing it to the Dynamo way of guaranteeing consistency, which it simply doesn't live up to when the client has to poll for a successful write.

It's one of those things that make me reconsider my (already improved) opinions on MongoDB all over again, just when I started to warm up with it. Yes, it's wicked fast, but I simply disagree with their take on durability and consistency. The tradeoff (as in: losing data) is simply too big for me. You could argue that these situations will be quite rare, and I would not disagree with you, but I'm not fond of potentially losing data when they do happen. If this works for you, cool! Just thought you should know.

Update: There have been some helpful comments from the MongoDB folks, and there's good news. Data is not silently discarded in 1.6 anymore, apparently it's stored in a flat file, fixed with this issue, though it's hard for me to say from the commits what exactly happens. The documentation does not at all reflect these changes, but improvements are on the way. I'm still not happy about some of the design decisions, but they're rooted in the way MongoDB currently works, and changing that is unlikely to happen. At least losing data doesn't seem to be an option anymore. If making a bit of a fool out of myself helped to improve things on the documentation front, so be it. I can live with that.

Last weekend I tweeted two links to two tweets by a poor guy who apparently got his MongoDB database into an unrecoverable state during shutdown whilst upgrading to a newer version. That tweet quickly made the rounds, and the next morning I saw myself staring at replies stating that it was all his fault, because he 1.) used kill -9 to shut it down because apparently the process hung (my guess is it was in the middle of flushing all data to disk) and 2.) didn't have a slave, just one database instance.

Others went as far as indirectly calling him an idiot. Oh interwebs, you make me sad. If you check out the thread on the mailing list, you'll notice a similar pattern in reasoning. The folks over at http://learnmongo.com seem to want to be the wittiest of them all, recommending to always have a recent backup, a slave or replica set and to never kill -9 your database.

While you can argue that the guy should've known better, there's something very much at odds here, and it seems to become a terrifying meme with fans of MongoDB, the idea that you need to do all of these things to get the insurance of your data being durable. Don't have a replica? Your fault. kill -9 on a database, any database? You mad? Should've read the documentation first, dude. This whole issue goes a bit deeper than just reading documentation, it's the fundamental design decision of how MongoDB treats your data, and it's been my biggest gripe from the get go. I can't help but be horrified by these comments.

I've heard the same reasoning over and over again, and also that it just hasn't happened so far, that no one's really lost any considerable data. The problem is, most people never talk about it publicly, because it's embarrassing; the best proof is the poor guy above. This issue is not even specific to MongoDB, it's a general problem.

Memory-Mapped Persistence

But let me start at the beginning, with MongoDB's persistence cycle, and then get to what's being done to improve its reliability and your data's durability. At the very heart, MongoDB uses memory-mapped files to store data. A memory-mapped file is a data structure that has the same representation on disk as it has when loaded into memory. When you access a document in MongoDB, loading it from disk is transparent to MongoDB itself; it can just go ahead and write to the address in memory, as every database in MongoDB is mapped to a dynamically allocated set of files on disk. Note that memory-mapped files are something you won't find in a lot of other databases, if any at all. Most do their own house-keeping and use custom data structures for that purpose.

The memory-mapping library (in MongoDB's case the POSIX functions, and whatever Windows offers in that area) takes care of flushing back to disk every 60 seconds (configurable). Everything in between happens solely in memory. Database crashes one second before the next flush? You just lost most of the data that was written in the last 59 seconds. Just to be clear, the flushing cycle is configurable, and you should consider choosing a better value depending on what kind of data you're storing.
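
If memory serves, the flush interval is a mongod startup option; a hedged example, with a made-up data path:

    # Flush to disk every 10 seconds instead of the default 60.
    mongod --dbpath /data/db --syncdelay 10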

MongoDB's much-praised insert speed? This is where it comes from. When you write stuff directly to local memory, it had better be fast. The persistence cycle is simple: accept writes for 60 seconds, then flush the whole thing to disk. Wait for another 60 seconds, then flush again, and so on. Of course MongoDB also flushes the data when you shut it down. But, and here's the kicker, that flush will of course fail when you kill the process without mercy, using the KILL signal, just like the poor guy above apparently did. When you kill something that's writing a big set of binary data to disk, all bets are off. One bit landing on the wrong foot and the database can get corrupted.

Database Crashes are Unavoidable

This scenario can and does happen with e.g. MySQL too, it even happens with CouchDB. But the difference is that in MySQL you usually only have a slightly damaged region, which can be fixed by deleting and re-inserting it. In CouchDB, all that happens is that your last writes may be broken, but CouchDB simply walks all the way back to the last successful write and runs happily ever after.

My point here is simple: even when killed using the KILL signal, a database should not be unrecoverable. It simply shouldn't be allowed to happen. You can blame the guy all you want for using kill -9, but consider the fact that it's the process equivalent of a server or even just the database process crashing hard. Which happens, believe it or not.

Yes, you can and probably will have a replica eventually, but it shouldn't be the sole precondition for a durable database. And this is what horrifies me: people seem to accept that this is simply one of MongoDB's trade-offs and that it should just be considered normal. They shouldn't. It needs more guys like the one causing all the stir to bring up these issues, even if it's partly their own fault, to show the world what can happen when worse comes to worst.

People need to ask more questions, and not just accept answers like: don't use kill -9, or always have a replica around. Servers crash, and your database needs to be able to deal with it.

Durability Improvements in MongoDB 1.7/1.8

Now, the MongoDB folks aren't completely deaf, and I'm happy to report they've been working on improvements in the area of data durability for a while. You can play with the new durability option in the latest builds of the 1.7 branch, and just a couple of hours ago there was activity on improving the repair tools to better deal with corrupted databases. I welcome these changes, very much so. MongoDB has great traction, a pretty good feature set, and its speed seems to blow people's minds. Data durability has not been one of its strengths though, so I'm glad there's been a lot of activity in that area.

If you start the MongoDB server with the new --dur option, it will start keeping a journal. When your database crashes, the journal is simply replayed to restore all changes since the last successful flush. This is not a particularly special idea, because it's how your favorite relational database has been working for ages, and it's not unlike the storage model of other databases in the NoSQL space. It's a good trade-off between keeping good write speed and getting a much more durable dataset.
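
Enabling it is just a startup flag on the 1.7+ builds; the data path is made up:

    # Start mongod with journaling enabled.
    mongod --dur --dbpath /data/db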

When you kill your database harshly in between flushes with a good pile of writes in between, you don't lose a lot of data anymore, maybe a second's worth (just as you do with MySQL when you use InnoDB's delayed flushing), if any at all, but not much more than that. Note that these are observations based on a build that's now already more than a month old; the situation may have improved since then. Operations are put into a buffer in memory, from where they're both logged to disk into the journal and applied to the dataset. By the time data is written to memory, it has already been written to the journal. Journals are rotated once they reach a certain size and it's ensured that all their data has been applied to the dataset.

A recovery process applies all uncommitted changes from the journal when the database crashes. This way it's ensured that you only lose a minimal set of data, if any at all, when your database server crashes hard. In theory the journal could even be used to restore a corrupted database in a scenario as outlined above, so it's pretty neat in my opinion. Either way, the risk of losing data is now pretty low. In case you're curious for code, the magic happens in this method.

I for one am glad to see improvements in this area of MongoDB, and I'm secretly hoping that durability will become the default mode, though I don't see that happening anytime soon, for marketing reasons. Also, be aware that durability brings more overhead. In some initial tests however, the speed difference between non-durable and durable MongoDB was almost not worth mentioning, though I wouldn't call those tests representative; in general there's really no excuse not to use it.

It's not yet production ready, but nothing should keep you from playing with it to get an idea of what it does.

Bottom Line

It's okay to accept trade-offs with whatever database you choose to your own liking. However, in my opinion, the potential of losing all your data when you use kill -9 to stop it should not be one of them, nor should accepting that you always need a slave to achieve any level of durability. The problem is less with the fact that this is MongoDB's current way of doing persistence, and more with people implying that it's a perfectly good choice. I don't accept it as such. If you can live with it, which hopefully you won't have to for much longer anyway, that's fine with me, it's not my data anyway. Or maybe I'm just too paranoid.