By now it should be obvious that I'm quite fond of alternative data stores (call them NoSQL if you must). I've given quite a few talks on the subject recently, and had the honor of being a guest on the (German) heise Developer Podcast on NoSQL.

There are some comments and questions that pop up every time alternative databases are being talked about, especially by people deeply rooted in relational thinking. I've been there, and I know it requires some rethinking. I'm also quite aware that there are some controversial things that basically are the exact opposite of everything you learned in university.

I'd like to address a couple of those with some commentary and my personal experience (Disclaimer: my experience is not the universal truth, it's simply that: my experience, your mileage may vary). When I speak of things done in practice, I'm talking about how I witnessed things getting done in Real Life™, and how I've done them myself, both good and bad. I'm focussing on document databases, but in general everything below holds true for any other kind of non-relational database.

It's easy to say that all the nice features document databases offer are just aiming for one thing, to scale up. While that may or may not be true, it just doesn't matter for a lot of people. Scaling is awesome, and it's a problem everyone wants to solve, but in reality it's not the main issue, at least not for most people. Also, it's not an impossible thing to do even with MySQL, I've had my fun doing so, and it sure was an experience, but it can be done.

It's about getting stuff done. There's a lot more to alternative databases in general, and document databases in particular, that I like, not just the ability to scale up. They simply can make my life easier, if I let them. If I can gain productivity while still being aware of the potential risks and pitfalls, it's a big win in my book.

What you'll find, when you really think about it, is that everything below holds true no matter what database you're using. Depending on your use case, it can even apply to relational databases.

Relational Databases are all about the Data

Yes, they are. They are about trying to fit your data into a constrained schema, constrained in length, type, and other things if you see fit. They're about building relationships between your data in a strongly coupled way, think foreign key constraints. Whenever you need to add data, you need to migrate your schema. That's what they do. They're good at enforcing a set of ground rules on your data.

See where I'm going with this? Even though relational databases tried to be a perfect fit for data, they ended up being a pain once that data needed to evolve. If you haven't felt that pain yet, good for you. I certainly have. Tabular data sounds nice in theory, and is pretty easy to handle in Excel, but in practice, it causes some pain. A lot of that pain stemmed from people using MySQL, yes, but take that argument to the guy who wrote it and sold it to people as the nicest and simplest SQL database out there.

It's easy to get your data into a schema once, but it gets a lot harder to change the schema and the data into a different schema at a later point in time. While data sticks around, the schema evolves constantly. Something relational databases aren't very good at supporting.

Relational Databases Enforce Data Consistency

They sure do, that's what they were built for. Constraints, foreign keys, all the magic tricks. Take Rails as a counter-example. It fostered the idea that all that stuff is supposed to be part of the application, not the database. Does it have trade-offs? Sure, but it's part of your application. In practice, that was correct, for the most part, although I can hear a thousand Postgres users scream. There's always an area that requires constraints on the database level, otherwise they wouldn't have been created in the first place.

But most web applications can live fine without it, they benefit from being free about their data, to shape it in whichever way they like, adding consistency on the application level. The consistency suddenly lies in your hands, a responsibility not everyone is comfortable with. You're suddenly forced to think more about edge cases. But you sure as hell don't have to live without consistent data, quite the opposite. The difference is that you're taking care of the consistency yourself, in terms of your use case, not using a generic one-fits-all solution.

Relationships between data aren't always strict. They can be loosely linked, what's the point of enforcing consistency when you don't care if a piece of data still exists or not? You handle it gracefully in your application code if you do.
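
To make "handle it gracefully" a bit more concrete, here's a minimal sketch in Python. The in-memory dictionaries and the names users, comments and author_name are made up for illustration; they stand in for whatever store and lookup you actually use.

```python
# Hypothetical in-memory "collections", keyed by document id.
users = {"u1": {"name": "Alice"}}
comments = [
    {"body": "Nice post", "author_id": "u1"},
    {"body": "First!", "author_id": "u2"},  # u2 was deleted at some point
]


def author_name(comment, users):
    """Resolve a loose link, falling back instead of failing when the target is gone."""
    author = users.get(comment.get("author_id"))
    return author["name"] if author else "(deleted user)"


names = [author_name(c, users) for c in comments]
```

The dangling reference doesn't blow up, it just renders as a placeholder. Whether that's acceptable is a decision about your use case, not about the database.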

SQL is a Standard

The basics of SQL are similar, if not the same, but under the hood, there are subtle differences. Why? Because under the hood, every relational database works differently. Which is exactly what document databases acknowledge. Every database is different, trying to put a common language on top will only get you so far. If you want to get the best out of it, you're going to specialize.

Thinking in Map/Reduce as CouchDB or Riak force you to is no piece of cake. It takes a while to get used to the ideas around it and what implications it has for you and your data. It's worth it either way, but sometimes SQL is just a must, no question. Business reporting can be a big issue, if your company relies on supporting standard tools, you're out of luck.

While standards are important, in the end it's important what you need to do with your data. If a standard gets in your way, how is that helpful? Don't expect a standard query language for document databases any time soon. They all solve different types of problems in different ways, and they don't intend to hide that from you with a standard query language. If on the other hand, all you need is a dynamic language for doing ad-hoc queries, check out MongoDB.

Normalized Data is a Myth

I learned a lot in uni about all the different kinds of normalization. It just sounded so nice in theory. Model your data upfront, then normalize the hell out of it, until it's as DRY as the desert.

So far so good. I noticed one thing in practice: Normalized data almost never worked out. Why? Because you need to duplicate data, even in e-commerce applications, an area that's traditionally mentioned as an example where relational databases are going strong.

Denormalizing data is simply a natural step. Going back to the e-commerce example, you need to store a lot of things separately when someone places an order: Shipping and billing address, payment data used, product price and taxes, and so on. Should you do it all over the place? Of course not, not even in a document database. Even they encourage storing similar data to a certain extent, and with some of them, it's simply a must. But you're free to make these decisions on your own. They're not implying you need to stop normalizing, it still makes sense, even in a document database.
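
A rough sketch of what that duplication looks like for an order, written in Python. The record layouts and the place_order function are hypothetical, not taken from any real shop system. The point is only that the order copies what it needs to remember instead of pointing at data that may change later.

```python
import copy

# Hypothetical normalized records that may change after the order is placed.
customer = {"id": 1, "shipping_address": {"street": "Old Street 1", "city": "Berlin"}}
product = {"id": 7, "name": "Coffee Grinder", "price_cents": 4900, "tax_rate": 0.19}


def place_order(customer, product, quantity):
    """Copy everything the order must remember into the order itself."""
    return {
        "customer_id": customer["id"],
        "shipping_address": copy.deepcopy(customer["shipping_address"]),
        "items": [{
            "product_id": product["id"],
            "name": product["name"],
            "price_cents": product["price_cents"],
            "tax_rate": product["tax_rate"],
            "quantity": quantity,
        }],
    }


order = place_order(customer, product, 2)

# Later changes to the source records leave the placed order untouched.
product["price_cents"] = 5900
customer["shipping_address"]["street"] = "New Street 2"
```

The ids are still stored, so the normalized records stay reachable. The copied values are what the customer actually agreed to at the time.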

Schemaless is not Schemaless

But there's one important thing denormalization is not about, something that's being brought up quite frequently and misunderstood easily. Denormalization doesn't mean you're not thinking about any kind of schema. While the word schemaless is brought up regularly, schemaless is simply not schemaless.

Of course you'll end up with having documents of the same type, with a similar set of attributes. Some tools, for instance MongoDB, even encourage (if not force) you to store different types of documents in different collections. But here's the kicker, I deliberately used the word similar. They don't need to be all the same across all documents. One document can have a specific attribute, the other doesn't. If it doesn't, just assume it's empty, it's that easy. If it needs to be filled at some point, write data lazily, so that your schema eventually is complete again. It's evolving naturally, which does sound easy, but in practice requires more logic in your application to catch these corner cases.

So instead of running migrations that add new tables and columns, and in the end pushing around your data, you migrate the data on the next access, whether that's a read or a write is up to your particular use case. In the end you simply migrate data, not your schema. The schema will evolve eventually, but first and foremost, it's about the data, not the constraints they live in. The funny thing: In larger projects, I ended up doing the same thing with a relational database. It's just easier to do and gentler on the load than running a huge batch job on a production database.
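
Migrating on access could look roughly like the following Python sketch. The store dictionary, the version attribute "v" and the split of name into first_name and last_name are all assumptions made up for this example.

```python
# Hypothetical store: documents written at different times have different shapes.
store = {
    "a": {"name": "Ada Lovelace"},                               # old shape
    "b": {"first_name": "Alan", "last_name": "Turing", "v": 2},  # current shape
}


def migrate(doc):
    """Bring a document up to the current shape; a no-op for current documents."""
    if doc.get("v", 1) < 2:
        first, _, last = doc.pop("name", "").partition(" ")
        doc.update({"first_name": first, "last_name": last, "v": 2})
    return doc


def load(store, key):
    """Read a document, migrating and writing it back on access."""
    doc = migrate(store[key])
    store[key] = doc  # lazy write-back; with a real database this would be a save
    return doc


person = load(store, "a")
untouched = load(store, "b")
```

The extra logic mentioned above lives in migrate: it has to cope with every shape that's still out there, and it's the part that deserves tests.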

No Joins, No Dice

No document database supports joins, simple like that. If you need joins, you have two options: Use a database that supports joins, or adapt your documents so that they remove the need for joins.

Documents have one powerful advantage: It's easy to embed other documents. If there's data you'd usually fetch using a join, and that'd be suitable for embedding (and therefore oftentimes: denormalizing), there's your second option. Going back to the e-commerce example: Whereas in a relational database you'd need a lot of extra tables to keep that data around (unless you're serializing it into a single column), in a document database you just add it as embedded data to the order document. You have all the important data in one place, and you're able to fetch it in one go. Someone said that relational databases are a perfect fit for e-commerce. Funny, I've worked on a market platform, and I've found that to be a ludicrous statement. I'd have benefited from a looser data storage several times, joins be damned.

It's not always viable, sure, and it'd be foolish to stick with a document database if joins are an important criterion for your particular use case. In that case, no dice: it's relational data storage or bust.

Of course there's secret option number three, which is to just ignore the problem until it's a problem, just by going with a document database and see how you go, but obviously that doesn't come without risks. It's worth noticing though that Riak supports links between documents, and even fetching linked documents together with the parent in one request. In CouchDB on the other hand, you can emit linked documents in views. You can't be fully selective about the document data you're interested in, but if all you want is fetch linked documents, there are one or two ways to do that. Also, graph databases have made it their main focus to make traversal of associated documents an incredibly cheap operation. Something your relational database is pretty bad at.

Documents killed my Model

There's this myth that you just stop thinking about how to model your data with document databases or key-value storage. That myth is downright wrong. Just because you're using schemaless storage doesn't mean you stop thinking about your data, quite the opposite, you think even more about it, and in different ways, because you simply have more options to model and store it. Embedding documents is a nice luxury to have, but isn't always the right way to go, just like normalizing the crap out of a schema isn't always the way to go.

It's a matter of discipline, but so is relational modelling. You can make a mess of a document database just like you can make a mess of a relational database. When you migrate data on the fly in a document database, there's more responsibility in your hands, and it requires good care with regards to testing. The same is true for keeping track of data consistency. It's been moved from the database into your application's code. Is that a bad thing? No, it's a sign of the times. You're in charge of your data, it's not your database's task anymore to ensure it's correct and valid, it's yours. With great power comes great responsibility, but I sure like that fact about document databases. It's something I've been missing a lot when working with relational databases: The freedom to do whatever the heck I want with my data.

Read vs. Write Patterns

I just like including this simply because it always holds true, no matter what kind of database you're using. If you're not thinking about how you're going to access your data with both reads and writes, you should do something about that. In the end, your schema should reflect your business use case, but what good is that when it's awkward to access the data, when it takes joins across several tables to fetch the data you're interested in?

If you need to denormalize to improve read access, go for it, but be aware of the consequences. A schema is easy to build up, migrating on the go, but if document databases force you to do one thing, and one thing only, it's to think about how you're reading and writing your data. It's safe to say that you're not going to figure it all out upfront, but you're encouraged to put as much effort into it as you can. When you find out you're wrong down the line, you might be surprised to find that they make it even easier to change paths.

Do your Homework

Someone recently wrote a blog post on why he went back to MySQL from MongoDB, and one of his reasons was that it doesn't support transactions. While this is a stupid argument to bring up in hindsight, it makes one thing clear: You need to do research yourself, no one's going to do it for you. If you don't want to live up to that, use the tools you're familiar with, no harm done.

It should be pretty clear up front what your business use case requires, and what tools may or may not support you in fulfilling these requirements. Not all tool providers are upfront about all the downsides, but hey, neither was MySQL. Read up, try and learn. That's the only thing you can do, and no one will do it for you. Nothing has changed here, it's simply becoming more obvious, because you suddenly have a lot more options to work with.

Polyglot Data Storage

Which brings me to the most important part of them all: Document databases (and alternative, non-relational data stores in general) are not here to replace relational databases. They're living alongside of them, with both sides hopefully somewhat learning from each other. Your projects won't be about just one database any more, it's not unlikely you're going to end up using two or more, for different use cases.

Polyglot persistence is the future. If there's one thing I'm certain of, this is it. Don't let anyone fool you into thinking that their database is the only one you'll need, they all have their place. The hard part is to figure out what place that is. Again, that's up to you to find out. People ask me for particular use cases for non-relational databases, but honestly, there is no real distinction. Without knowing the tools, you'll never find out what the use cases are. Other people can just give you ideas, or talk about how they're using the tools, they can't draw the line for you.

Back to the Future

You shouldn't think of it as something totally new, document databases just don't hide these things from you. Lots of the things I mentioned here are things you should be doing anyway, no matter if you're using a relational or a non-relational data store. They should be common sense really. We're not trying to repeat what went wrong in history, we're learning from it.

If there's one thing you should do, it's to start playing with one of the new tools immediately. I shouldn't even be telling you this, since you should hone your craft all the time, and that includes playing the field and broadening your personal and professional horizon. Only then will you be able to judge what use case is a good fit for e.g. a document database. I'd highly suggest starting to play with e.g. CouchDB, MongoDB, Riak or Redis.

June was an exhausting month for me. I spoke at four different conferences, two of which were not in Berlin. I finished the last talk today, so time to recap the conferences and talks. In all I had good fun. It was a lot of work to get the presentations done (around 400 single slides altogether), but in all I would dare say that it was all more than good practice to work on my presentation skills and to lose a bit of the fear of talking in front of people. But I'll follow up on that stuff in particular in a later post.

RailsWayCon in Berlin

I have to admit that I didn't see much of the conference, I mainly hung around, talked to people, and gave a talk on Redis and how to use it with Ruby. Like last year the conference was mingled in with the International PHP Conference and the German Webinale, a somewhat web-related conference. I made a pretty comprehensive set of slides for Redis, available for your viewing pleasure.

Berlin Buzzwords in Berlin

Hadoop, Lucene, NoSQL, Berlin Buzzwords had it all. I spent most of my time in the talks on the topics around NoSQL, having been given the honor of opening the track with a general introduction on the topic. I can't remember having given a talk in front of this many people. The room took about 250, and it seemed pretty full. Not tooting my own horn here, I've never been more anxious before a talk of how it would go. Obviously there were heaps of people in the room who have only heard of the term, and people who work with or on the tools on a daily basis. Feedback was quite positive, so I guess it turned out pretty okay. Rusty Klophaus wrote two very good recaps of the whole event, read on about day one and day two.

The slide set for my talk has some 120 slides in all, trying to give a no-fuss overview of the NoSQL ecosystem and the ideas and inspirations. There are some historical references in the talk, because in general the technologies aren't revolutionary, they use ideas that've been around for a while and combine them with some newer ones. Do check out the slides for some more details on that.

MongoUK in London

10gen is running MongoDB related conferences in a couple of cities, one of them in London, where I was asked to speak on something related to MongoDB. Since I'm all about diversity, that's pretty much what I ended up talking about, with a hint of MongoDB sprinkled on top of it. Document databases, the web, the universe, all the philosophical foundation knowledge you could ask for. I talked about CouchDB, Riak, and about what makes MongoDB stand out from the rest.

Most enjoyable about MongoUK was to hear about real life experiences of MongoDB users, what kind of problems they had and such. Also, I finally got to see some of London and meet friends, but I'll write more about that (and coffee) on my personal blog. Again, the slide set is available for your document database comparison pleasure.

Cloud Expo Europe in Prague

Just 36 hours after I got back from London I jumped on the train to Prague to speak about MongoDB at Cloud Expo Europe. Cloud is something I can get on board with (hint: Scalarium), so why the hell not? It turned out to be a pretty enterprisey conference, but still, got some new food for thought on cloud computing in general.

I already gave a talk on MongoDB at Berlin's Ruby brigade, but I built a different slide set this time, improving on the details I found to be a bit confusing at first. Do check out the slides, if you don't know anything about MongoDB yet, it should give you a good idea.

Showing off

As you'll surely notice, my slides are all websites, and not on Slideshare. Two months ago I looked into Scott Chacon's Showoff, a tool to build web-based presentations that simply run as tiny JavaScript apps in the browser. I very much like that idea, because even though Keynote is still the king of the crop, it's still awful. Using Markdown, CSS and JavaScript appeals much more to the geek in me. It's so easy to crank out slides as simple text, and worry about the styling later. Plus, I can easily keep my slides in Git, and who doesn't enjoy that? I'd very much recommend giving it a go. If you want to look at some sources, all my talks and their sources are available on the GitHubs, MongoDB, Redis, NoSQL, document databases and again MongoDB.

It's a pleasure to build slides with Showoff, and it has helped me focus my slides on very short phrases and as few bullet points as possible. Sure, it's not Keynote and doesn't have all the fancy features, but I noticed that it forced me to focus more, and that keeping slides short helped me stay focussed, but again, more on that in a follow-up post.

Feel free to use my slides as inspiration to play with Showoff, there's surprisingly little magic involved. Also, if you think I should speak at a conference you know of or that you're organising, do get in touch.

For an article in a German magazine I've been researching MongoDB over the last week or so. While I didn't need a lot of the information I came across, I collected some nicely distilled notes on some of its inner workings. You won't find information on how to get data out of or into MongoDB. The notes deal with the way MongoDB treats and handles your data, a high-low-level view if you will. I tried to keep them as objective as possible, but I added some commentary below.

Most of this is distilled knowledge I gathered from the MongoDB documentation, credit for making such a good resource available for us to read goes to the Mongo team. I added some of my own conclusions where it made sense. They're doing a great job documenting it, and I can highly recommend spending time to go through as much of it as possible to get a good overview of the whys and hows of MongoDB. Also, thanks to Mathias Stearn for hooking me up with some more details on future plans and inner workings in general. If you want to know more about its inner workings, there's a webcast coming up where they're gonna explain how it works.

Basics

  • Name stems from humongous, though (fun fact) mongo has some unfortunate meanings in other languages than English (German for example)
  • Written in C++.
  • Lots of language drivers available, pushed and backed by the MongoDB team. Good momentum here.
  • According to The Changelog Show ([1]) MongoDB was originally part of a cloud web development platform, and at some point was extracted from the rest, open sourced and turned into what it is today.

Collections

  • Data in MongoDB is stored in collections, which in turn are stored in databases. Collections are a way of storing related data (think relational tables, but sans the schema). Collections contain documents, which in turn have keys, another name for attributes.
  • Data is limited to around 2 GB on 32-bit systems, because MongoDB uses memory-mapped files, as they're tied to the available memory addressing. (see [2])
  • Documents in collections usually have a similar data structure, but any arbitrary kind of document could be stored. Similarity is recommended for index efficiency. Documents can have a maximum size of 4MB.
  • Collections can be namespaced, i.e. logically nested: db.blog.posts, but the collection is still flat as far as MongoDB is concerned, purely an organizational means. Indexes created on a namespaced collection only seem to apply to the namespace they were created on though.
  • A collection is physically created as soon as the first document is created in it.
  • Default limit on number of namespaces per database is 24000 (includes all collections as they're practically the top level namespace in a database), which also includes indexes, so with the maximum of 40 indexes applied to each collection you could have 585 collections in a database. The default can be changed of course, but requires repairing the database if changed on an active instance.
  • While you can put all your data into one single collection, from a performance point of view, it seems to make sense to separate them into different collections, because it allows MongoDB to keep its indexes clean, as they won't index attributes for totally unrelated documents.
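
The figure of 585 collections follows from the namespace limit if you count one namespace for the collection itself plus one per index. A quick check in Python, using only the numbers quoted above (the variable names are mine):

```python
NAMESPACE_LIMIT = 24000            # default limit per database, as quoted above
MAX_INDEXES_PER_COLLECTION = 40

# Each collection uses one namespace for itself plus one per index.
namespaces_per_collection = 1 + MAX_INDEXES_PER_COLLECTION
max_collections = NAMESPACE_LIMIT // namespaces_per_collection
```

With fewer indexes per collection the number goes up accordingly, so in practice the limit is a lot less tight than it looks.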

Capped Collections

  • Capped collections are fixed-size collections that automatically remove aged entries by LRU. Sounds fancier than it probably is, I'm thinking that documents are just appended at the last writing index, which is reset to 0 when the limit of the collection is reached. Preferable for insert-only use cases, updates of existing documents fail when the data size is larger than before the update. This makes sense because moving an object would destroy the natural insertion order. Limited to ~1GB on 32-bit systems, sky's the limit on 64-bit.
  • Capped collections seem like a good tool for logging data, well knowing that old data is purged automatically, being replaced with new data when the limit is reached. Documents can't be deleted, only the entire collection can be dropped. Capped collections have no indexes on the _id by default, ensuring good write performance. Indexes generally not recommended to ensure high write performance. No index on _id means that walking the collection is preferred over looking up by a key.
  • Documents fetched from a capped collection are returned in the order of their insertion. Sorting in reverse natural order returns the newest first, think log tailing.
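
The guess above (append at a write index that wraps around to 0) can be sketched as a ring buffer in Python. This is a toy model of that guess, not how MongoDB implements it: the class CappedList is made up, and it caps by number of documents where the real thing caps by size in bytes.

```python
class CappedList:
    """Toy model of a capped collection, capped by document count rather than bytes."""

    def __init__(self, max_docs):
        self.slots = [None] * max_docs
        self.next_slot = 0   # the "last writing index"
        self.count = 0

    def insert(self, doc):
        self.slots[self.next_slot] = doc  # overwrites the oldest entry once full
        self.next_slot = (self.next_slot + 1) % len(self.slots)  # wraps to 0 at the limit
        self.count = min(self.count + 1, len(self.slots))

    def in_insertion_order(self):
        """Return the surviving documents, oldest first."""
        start = (self.next_slot - self.count) % len(self.slots)
        return [self.slots[(start + i) % len(self.slots)] for i in range(self.count)]


log = CappedList(max_docs=3)
for line in ["boot", "login", "query", "logout"]:
    log.insert({"msg": line})

oldest_first = [d["msg"] for d in log.in_insertion_order()]
newest_first = list(reversed(oldest_first))  # the order you want when tailing a log
```

After four inserts into three slots, "boot" is gone without anyone deleting it, which is exactly the property that makes this attractive for logging.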

Data Format

  • Data is stored and queried in BSON, think binary-serialized JSON-like data. Features are a superset of JSON, adding support for regular expressions, date, binary data, and their own object id type. All strings are stored in UTF-8 in BSON, sorting on the other hand does not (yet), it uses strcmp, so the order might be different from what you'd expect. There's a sort of specification for BSON, if you're into that kind of stuff: [3] and [4]
  • Documents are not identified by a simple ID, but by an object identifier type, optimized for storage and indexing. Uses machine identifier, timestamp and process id to be reasonably unique. That's the default, and the user is free to assign any value he wishes as a document's ID.
  • MongoDB has a "standard" way of storing references to other documents using the DBRef type, but it doesn't seem to have any advantages (e.g. fetch associated objects with parent) just yet. Some language drivers can take the DBRef object and dereference it.
  • Binary data is serialized in little-endian.
  • Being a binary format, MongoDB doesn't have to parse documents like with JSON, they're a valid in-memory presentation already when coming across the wire.
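
To illustrate how an identifier built from timestamp, machine and process id can be "reasonably unique", here's a sketch in Python. The function make_object_id, the exact byte layout and the trailing counter are assumptions for this example and aren't taken from the notes above, so don't read it as the actual object id format.

```python
import hashlib
import itertools
import os
import socket
import struct
import time

# Assumption: a per-process counter breaks ties between ids made in the same second.
_counter = itertools.count()


def make_object_id(now=None):
    """Build a 12-byte identifier from timestamp, machine, process id and a counter."""
    timestamp = int(time.time() if now is None else now)
    machine = hashlib.sha256(socket.gethostname().encode("utf-8")).digest()[:3]
    pid = os.getpid() & 0xFFFF
    count = next(_counter) & 0xFFFFFF
    return (
        struct.pack(">I", timestamp)   # 4 bytes, seconds since the epoch
        + machine                      # 3 bytes derived from the host name
        + struct.pack(">H", pid)       # 2 bytes of the process id
        + count.to_bytes(3, "big")     # 3 bytes of counter
    )


first = make_object_id(now=1262304000)
second = make_object_id(now=1262304000)
```

Putting the timestamp first means ids sort roughly by creation time, which is friendly to an index: new entries mostly land at one end of it.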

References

  • Documents can embed a tree of associated data, e.g. tags, comments and the like instead of storing them in different MongoDB documents. This is not specific to MongoDB, but document databases in general (see [5]), but when using find you can dereference nested objects with the dot, e.g. blog.posts.comments.body, and index them with the same notation.
  • It's mostly left to the language drivers to implement automatic dereferencing of associated documents.
  • It's possible to reference documents in other databases.
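
The dot notation for reaching into embedded data can be mimicked in a few lines of Python. The resolve function below is a hypothetical helper working on plain dictionaries and lists. It only shows the idea of walking a path through nested documents and isn't part of any MongoDB driver.

```python
def resolve(doc, path):
    """Collect every value reachable via a dotted path, descending into lists."""
    current = [doc]
    for key in path.split("."):
        found = []
        for node in current:
            # A list of embedded documents is searched element by element.
            candidates = node if isinstance(node, list) else [node]
            for item in candidates:
                if isinstance(item, dict) and key in item:
                    found.append(item[key])
        current = found
    return current


post = {
    "title": "Hello",
    "author": {"name": "Ada"},
    "comments": [{"body": "Nice"}, {"body": "Thanks"}],
}

bodies = resolve(post, "comments.body")
author = resolve(post, "author.name")
missing = resolve(post, "comments.rating")
```

A path that doesn't exist simply yields nothing, which fits the schemaless idea: a missing attribute is not an error.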

Indexes

  • Every document gets a default index on the _id attribute, which also enforces uniqueness. It's recommended to index any attribute that's being queried or sorted on.
  • Indexes can be set on any attribute or embedded attributes and documents. Indexes can also be created on multiple attributes, additionally specifying a sort order.
  • If an array attribute is indexed, MongoDB will index all the values in it (Multikeys).
  • Unique keys are possible, missing attributes are set to null to ensure a document with the same missing attribute can only be stored once.
  • If it can, MongoDB will only update indexes on keys that changed when updating a document, only if the document hasn't changed in size so much that it must be moved.
  • MongoDB up to 1.2 creates and updates indexes synchronously, 1.3 has support for updating indexes in the background.
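
Conceptually, a multikey index maps every single array element to the documents containing it. A small Python sketch of that idea, with a made-up build_index function and sample documents (a real index is a sorted on-disk structure, not a dictionary of sets):

```python
from collections import defaultdict


def build_index(docs, field):
    """Map each indexed value to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        value = doc.get(field)  # a missing attribute is indexed as None (null)
        # An array contributes one index entry per element (the multikey case).
        for key in (value if isinstance(value, list) else [value]):
            index[key].add(doc_id)
    return index


docs = {
    1: {"title": "Intro to Redis", "tags": ["redis", "nosql"]},
    2: {"title": "MongoDB notes", "tags": ["mongodb", "nosql"]},
    3: {"title": "Untagged"},
}

tag_index = build_index(docs, "tags")
```

The None entry also shows why a unique index allows only one document without the attribute: every such document would compete for the same key.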

Updates

  • Updates to documents are in-place, allowing for partial updates and atomic operations on attributes (set for all attributes, incr, decr on numbers, push, pop, pull et al. on arrays), also known as modifier operations. If an object grows out of the space originally allocated for it, it'll be moved, which is obviously a lot slower than updating in-place, since indexes need to be updated as well. MongoDB tries to adapt by allocating based on update history (see [6]). Writes are lazy.
  • Not using any modifier operation will result in the full document being updated.
  • Updates can be done with criteria, affecting a whole bunch of matching documents. Think "update ... where" in SQL. This allows for updating objects based on a particular snapshot, i.e. update based on id and some value in the criteria will only update when the document still has that value. This kind of update is atomic. Reliably updating multiple documents atomically (think transaction) is not possible. There's also findAndModify in 1.3 (see [7]) which allows atomically updating and returning a document.
  • Upserts insert when a record with the given criteria doesn't exist, otherwise updates the found record. They're executed on the collection. A normal save() will do that automatically for any given document. Think find_or_create_by in ActiveRecord.
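
Updates with criteria, modifiers and upserts are easier to grasp with a toy model. The Python below works on a plain list of dictionaries. The update function and its helpers are made up for this sketch and only borrow the $set, $inc and $push names; they aren't a driver API, and nothing here is actually atomic.

```python
def matches(doc, criteria):
    """True when every key in the criteria has exactly the given value in the document."""
    return all(doc.get(k) == v for k, v in criteria.items())


def apply_modifiers(doc, modifiers):
    """Apply a small subset of modifier operations in place."""
    for key, value in modifiers.get("$set", {}).items():
        doc[key] = value
    for key, amount in modifiers.get("$inc", {}).items():
        doc[key] = doc.get(key, 0) + amount
    for key, item in modifiers.get("$push", {}).items():
        doc.setdefault(key, []).append(item)


def update(collection, criteria, modifiers, upsert=False):
    """Update all matching documents; optionally insert when nothing matches."""
    matched = [d for d in collection if matches(d, criteria)]
    if not matched and upsert:
        matched = [dict(criteria)]
        collection.extend(matched)
    for doc in matched:
        apply_modifiers(doc, modifiers)
    return len(matched)


pages = [{"_id": 1, "state": "draft", "views": 0}]

# Compare-and-set: only applies while the document still has the expected state.
published = update(pages, {"_id": 1, "state": "draft"}, {"$set": {"state": "live"}})
stale = update(pages, {"_id": 1, "state": "draft"}, {"$set": {"state": "archived"}})

# Upsert: nothing matches _id 2, so a new document is created from the criteria.
update(pages, {"_id": 2}, {"$inc": {"views": 1}}, upsert=True)
```

The second update matches nothing because the state already changed. That's the snapshot idea from above: the criteria double as a check that nobody got there first.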

Querying

  • Results are returned as cursors, which walk a collection as they advance. That explains why you can potentially come across records that needed to be moved: a moved record pops up in a spot that's potentially after the cursor's current position, or, if there's free space, even in a spot before it. Cursors are fetched in batches of 100 documents or 4 MB of data, whichever is reached first.
  • That's also why it's better to store similar data in a separate collection. Traversing similar data is cheaper than traversing over totally unrelated data, the bigger the size of documents compared to the documents that match your find, the more data will have to be fetched from the database and skipped if it doesn't match your criteria.
  • Data is returned in natural order which doesn't necessarily relate to insertion order, as data can be moved if it doesn't fit into its old spot anymore when updated. For capped collections, natural order is always insertion order.

Durability

  • By default, data in MongoDB is flushed to disk every 60 seconds. Writes to MongoDB (i.e. document creates, updates and deletes) are not stored on disk until the next sync. Tradeoff high write performance vs. durability. Need more durability, reduce sync delay. Closest comparison to the durability behaviour is MySQL's MyISAM.
  • Data is not written transactionally, so if the server is killed during a write operation, the data is likely to be inconsistent or even corrupted and needs repair. Think classic file systems like ext2 or MyISAM.
  • In MongoDB 1.3 a database flush to disk can be enforced by sending the fsync command to the server.

Replication

  • Replication is the recommended way of ensuring data durability and failover in MongoDB. A new (i.e. bare and dataless) instance can be hooked onto another at any time, doing an initial cloning of all data, fetching only updates after that.
  • Replica pairs offer an auto-failover mechanism. Initially both settle on which is master and which is slave, the slave taking over should the master go down. Can be used e.g. in the Ruby driver using :left and :right options. There's an algorithm to handle changes when master and slave get out of sync, but it's not fully obvious to me (see [8]). Replica Pairs will be replaced by Replica Sets, allowing for more than one slave. The slave with the most recent data will be promoted master in case of the master going down. The slaves agree which one of them is the new master, so a client could ask any one server in the set which one of them is the master.
  • Replication is asynchronous, so updates won't propagate immediately to the slaves. There are ideas to require the write to be propagated to at least N slaves before reporting it to the client as successful (similar to the feature in MySQL 5.4). (see [9])
  • A master collects its writes in an oplog on which the slaves simply poll for changes. The oplog is a capped collection and therefore not a fully usable transaction log (not written to disk?) as old data is purged automatically, hence not reliable for restoring the database after a crash.
  • After initial clone, slaves poll once on the full oplog, subsequent polls remember the position where the previous poll ended.
  • Replication is not transactional, so data on the slave is subject to the same durability conditions as on the master. A slave still increases durability, though: it allows you to decrease sync times on it, which shortens the timespan during which data isn't written to disk anywhere in the setup.
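
The polling scheme described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not MongoDB's implementation: the `Master` and `Slave` classes, the sequence numbers and the resync error are all invented for the example. The capped operations log is modelled with a bounded `deque`, which drops the oldest entry when it is full.

```python
from collections import deque


class Master:
    """Toy master that records its writes in a capped operations log."""

    def __init__(self, log_size):
        self.log = deque(maxlen=log_size)  # oldest entries fall off the end
        self.next_seq = 0

    def write(self, op):
        self.log.append((self.next_seq, op))
        self.next_seq += 1


class Slave:
    """Toy slave that remembers where its previous poll ended."""

    def __init__(self):
        self.applied = []
        self.position = -1  # sequence number of the last applied operation

    def poll(self, master):
        # If the oldest entry left in the log is already past our position,
        # operations were purged before we saw them.
        if master.log and master.log[0][0] > self.position + 1:
            raise RuntimeError("operations were purged, full resync needed")
        for seq, op in master.log:
            if seq > self.position:
                self.applied.append(op)
                self.position = seq


master = Master(log_size=3)
slave = Slave()
master.write("insert a")
master.write("insert b")
slave.poll(master)           # applies both operations
master.write("update a")
slave.poll(master)           # applies only the new operation
print(slave.applied)         # ['insert a', 'insert b', 'update a']

for i in range(5):           # more writes than the log can hold
    master.write(f"insert {i}")
try:
    slave.poll(master)
    resync_needed = False
except RuntimeError:
    resync_needed = True
print(resync_needed)         # True
```

The last step shows why a capped log can't double as a transaction log: once entries are purged, they are gone for everyone who hasn't read them yet.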

Caching

  • With the default storage engine, caching is basically handled by the operating system's virtual memory manager, since it uses memory-mapped files. File cache == Database cache
  • Caching behaviour relies on the operating system, and can vary, not necessarily the same on every operating system.

Backup

  • If you can live with a temporary write lock on your database, MongoDB 1.3 offers fsync with lock to take a reliable snapshot of the database's files.
  • Otherwise, take the old school way of dumping the data using mongodump, or snapshotting/dumping from a slave database.

Storage

  • Data is stored in subsequently numbered data files, each new one being larger than the former, 2GB being the maximum size a data file can have.
  • Allocation of new datafiles doesn't seem to be exactly related to the amount of data currently being stored. E.g. storage size returned by MongoDB for a collection was 2874825392 bytes, but it had already created almost six gigabytes worth of database files. Maybe that's the result of padding space for records. I haven't found a clear documentation on this behaviour.
  • When MongoDB moves data into a different spot or deletes documents, it keeps track of the free space to reuse in the future. The command repairDatabase() can be used to compact it, but that's a slow and blocking operation.
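
One possible explanation for the gap between stored data and allocated files, sketched as arithmetic in Python. Everything about the allocation policy here is an assumption and not confirmed by the notes above: that the first data file is 64 MB, that each following file doubles in size up to the 2GB maximum, and that one extra file is preallocated ahead of need. The function name `allocated_files` is made up.

```python
MB = 1024 * 1024


def allocated_files(data_bytes, first_file=64 * MB, max_file=2048 * MB,
                    preallocate_next=True):
    """Return the sizes of the data files needed to hold data_bytes.

    Assumes sizes double from first_file up to max_file and that one
    additional file is preallocated. Both are assumptions.
    """
    sizes = []
    size = first_file
    while sum(sizes) < data_bytes:
        sizes.append(size)
        size = min(size * 2, max_file)
    if preallocate_next:
        sizes.append(size)
    return sizes


files = allocated_files(2874825392)  # storage size quoted above
print([size // MB for size in files])
print(round(sum(files) / 1024 ** 3, 2), "GB allocated")
```

Under those assumptions the quoted collection would end up with files of 64, 128, 256, 512, 1024, 2048 and 2048 MB, a bit under six gigabytes in total, which is at least in the same ballpark as what was observed.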

Concurrency

  • MongoDB refrains from using any kind of locking on data, it has no notion of a transaction or isolation levels. Concurrent writes will simply overwrite each other's data, as they go straight to memory. Exceptions are modifier operations that are guaranteed to be atomic. As there is no way to update multiple records in some sort of transaction, optimistic locking is not possible, at least in a fully reliable way. Since writes are in-place and in-memory first, they're wicked fast.
  • Reads from the database are usually done in cursors, fetching a batch of documents lazily while iterating through it. If records in the cursors are updated while the cursor is being read from, the updated data may or may not show up. There's no kind of isolation level (as there are no locks or snapshotting). Deleted records will be skipped. If a record is updated from another process so that the size increases and the object has to be moved to another spot there's a chance it's returned twice.
  • There's snapshot queries, but even they may or may not return inserted and deleted records. They do ensure that even updated records will be returned only once, but are slower than normal queries.
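
The difference between a plain overwrite and an atomic modifier is easiest to see with a counter. The following Python sketch uses a made-up in-memory `Collection` class as a stand-in for a document store, and replays two clients by hand instead of using real concurrency, so the outcome is deterministic.

```python
class Collection:
    """Toy in-memory collection, a hypothetical stand-in for a document store."""

    def __init__(self):
        self.docs = {}

    def find_one(self, doc_id):
        return dict(self.docs[doc_id])     # clients work on their own copy

    def save(self, doc):
        self.docs[doc["_id"]] = dict(doc)  # whole document, last writer wins

    def inc(self, doc_id, field, amount=1):
        """Modifier: read and write happen as one step on the server side."""
        self.docs[doc_id][field] += amount


pages = Collection()
pages.save({"_id": 1, "views": 0})

# Two clients read the same document, then both write it back.
client_a = pages.find_one(1)
client_b = pages.find_one(1)
client_a["views"] += 1
client_b["views"] += 1
pages.save(client_a)
pages.save(client_b)
lost = pages.find_one(1)["views"]

# The same two increments expressed as modifiers.
pages.save({"_id": 1, "views": 0})
pages.inc(1, "views")
pages.inc(1, "views")
kept = pages.find_one(1)["views"]

print(lost, kept)  # 1 2
```

In the first run one increment silently vanishes, because the second save overwrites the first. With modifiers, neither client ever holds a stale copy of the counter.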

Memory

  • New data is allocated in memory first, increments seem to be fully related to the amount of data saved.
  • MongoDB seems to be happy to hold on to whatever memory it can get, but at least during fsync it frees as much as possible. Sometimes it just went back to consuming about 512 MB real memory, other times it went down to just a couple of megs, I couldn't for the life of me make out a pattern.
  • When a new database file needs to be created, it looks like MongoDB is forcing all data to be flushed to disk, freeing a dramatic amount of memory. On normal fsyncs, there's no real pattern as to how MongoDB frees memory.
  • It's not obvious how a user can configure how much memory MongoDB can or should use, I guess it's not possible as of now. Memory-mapped files probably just use whatever's available, and are cleaned up automatically by the operating system's virtual memory system.
  • The need to add an additional caching layer is reduced, as object and database representation is the same, and file system and memory cache can work together to speed up access, there's no data conversion involved, at least not on MongoDB's side, data will just be sent serialized and unparsed across the wire. Obviously it depends on the use case if this is really an advantage or a secondary caching layer is still needed.

GridFS

  • Overcomes the 4MB limit on documents by chunking larger binary objects into smaller bits.
  • Can store metadata alongside file data. Metadata can be specified by the user and be arbitrary, e.g. contain access control information, tags, etc.
  • Chunks can be randomly accessed, so it's easy to fetch data whose position in the file is well-known. If random access is required, it makes sense to keep chunks small. Default size is 256K.
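
Chunking and random access as described above boil down to some integer arithmetic. Here's a minimal Python sketch of the idea; it does not use the GridFS API, and the helper names `split_into_chunks` and `read_range` are invented for the example. Only the 256K chunk size is taken from the notes.

```python
CHUNK_SIZE = 256 * 1024  # default chunk size mentioned above


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Split a bytes object into chunks keyed by their chunk number."""
    return {
        number: data[offset:offset + chunk_size]
        for number, offset in enumerate(range(0, len(data), chunk_size))
    }


def read_range(chunks, start, length, chunk_size=CHUNK_SIZE):
    """Read bytes [start, start + length), touching only the chunks needed."""
    first = start // chunk_size
    last = (start + length - 1) // chunk_size
    joined = b"".join(chunks[number] for number in range(first, last + 1))
    skip = start - first * chunk_size
    return joined[skip:skip + length]


payload = bytes(range(256)) * 4096  # 1 MB of sample data
chunks = split_into_chunks(payload)
print(len(chunks))  # 4
print(read_range(chunks, 300_000, 10) == payload[300_000:300_010])  # True
```

The smaller the chunks, the less unrelated data has to be fetched to serve a small range, which is why small chunks pay off when random access matters.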

Protocol Access

  • MongoDB's protocol is binary and in its own right proprietary, hence they offer a lot of language drivers to take that pain away from developers, but also offer a full specification on both BSON and the protocol.

Sharding

  • MongoDB has alpha support for sharding. Its functionality shouldn't be confused with Riak's way of partitioning, it's a whole different story. The current functionality is far from what is planned for production, so take everything listed here with a grain of salt, it merely presents the current state. The final sharding feature is supposed to be free of the restrictions listed here.
  • A shard ideally (but not necessarily) consists of two servers which form a replica pair, or a replica set in the future.
  • All shards are known to a number of config server instances that also know how and where data is partitioned to.
  • Data can be sharded by a specific key. That key can't be changed afterwards, neither can the key's value.
  • Keys chosen should be granular enough so that there's no potential of having too many records with the same key. Data is split into chunks of 50 MB, so with big documents it's probably better to store them in GridFS, as a chunk holds as few as ~12 documents when each takes up the full available space of 4 MB.
  • Sharding is handled by a number of mongos instances which are connected to the shards which in turn are all known to a number of mongod config server instances. These can run on the same machines as the data-handling mongod instances, with the risk that when the servers go down they also disappear. Having backup services seems to be appropriate in this scenario.
  • Sharding is still in alpha, e.g. currently replicated shards aren't supported in alpha 2, so a reliable sharding setup is currently not possible. If a shard goes down, the data on it is simply unavailable until it's brought back up. Until that happens, all reads will raise an error, even when looking up data that's known to be on the still available shards.
  • There's no auto-balancing to move chunks to new shards, but that can be done manually.
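
To illustrate what partitioning by a key means for routing, here's a small Python sketch. It assumes range-based partitioning with sorted split points, which is a simplification for illustration and not taken from MongoDB's code; the class `ChunkMap`, the shard names and the sample keys are all made up. The last lines redo the documents-per-chunk arithmetic from the notes.

```python
from bisect import bisect_right


class ChunkMap:
    """Toy routing table: sorted split points map key ranges to shards."""

    def __init__(self, split_points, shards):
        if len(shards) != len(split_points) + 1:
            raise ValueError("need exactly one shard per key range")
        self.split_points = split_points
        self.shards = shards

    def shard_for(self, key):
        # Count how many split points are <= key to find the key's range.
        return self.shards[bisect_right(self.split_points, key)]


routing = ChunkMap(split_points=["g", "p"],
                   shards=["shard-a", "shard-b", "shard-a"])
print(routing.shard_for("alice"))    # shard-a
print(routing.shard_for("mallory"))  # shard-b
print(routing.shard_for("zoe"))      # shard-a

# Documents per chunk if every document hits the document size limit.
CHUNK_MB, MAX_DOC_MB = 50, 4
print(CHUNK_MB // MAX_DOC_MB)        # 12
```

A key with too few distinct values would leave all its records in one range that can't be split any further, which is the granularity concern raised above.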

There you go. If you have something to add or to correct, feel free to leave a comment. I'm happy to stand corrected should I have drawn wrong conclusions anywhere.

As a user of CouchDB I gotta say, I was quite sceptical about some of MongoDB's approaches of handling data. Especially durability is something that I was worried about. But while I read through the documentation and played with MongoDB I realized that it's the same story as always: It depends. It's a problem when it's a problem. CouchDB and MongoDB don't necessarily cover the same set of use cases. There's enough use cases where the durability approach of MongoDB is acceptable compared to what you gain, e.g. in development speed, or speed when accessing data, because holy crap, that stuff is fast. There's a good reason for that, as I hope you'll agree after going through these notes. I'm glad I took the time to get to know it better, because the use cases kept popping up in my head where I would prefer it over CouchDB, which isn't always a sweet treat either.

If you haven't already, do give MongoDB a spin, go through their documentation, throw data at it. It's a fun database, and the entrance barrier couldn't be lower. It's a good combination of relational database technologies, with schemaless and JavaScript sprinkled on top.

Tags: nosql, mongodb

I'm gonna eat my own dog food here, and start you off with a collection of links and ideas of people using Redis. Redis' particular way of treating data requires some rethinking how to store your data to benefit from speed, atomicity and its data types. I've already written about Redis in abundance, this post's purpose is to complement those posts with real-world scenarios. Maybe you can gather some ideas on how to deal with things.

There's a couple of well-known use cases already, the most popular of them being Resque, a worker queue. RestMQ, an HTTP-based worker queue using Redis, was just recently released too. Neither makes use yet of the rather new blocking pop commands like Redactor does, so there's still room for improvement, and to make them even more reliable.

Ohm is a library to store objects in Redis. While I'm not sure I'd put this layer of abstraction on top of it, it's well worth looking at the code to get inspiration. Same is true for redis-types.

Redis' simplicity, atomicity and speed make it an excellent tool when tracking things directly from the web, e.g. through WebSockets or Comet. If you can use it asynchronously, all the better.

  • Affiliate Click Tracking with Rack and Redis.

    Simple approach to tracking clicks, I probably wouldn't use a list for all clicks, but instead have one for each path, but there's always several ways to get to your goal with Redis. Not exactly the same, but Almaz can track URLs visited by users in Rails applications.

    Update: Turns out that in the affiliate click tracking code above, the list is only used to push clicks into a queue, where they're popped off and handled by a worker, as pointed out by Kris in the comments.

  • Building a NLTK FreqDist on Redis

    Calculation of frequency distribution, with data stored in Redis.

  • Gemcutter: Download Statistics

    The RubyGems resource par excellence is going to use Redis's sorted sets to track daily download statistics. While just a proposal, the ideas are well applicable to all sorts of statistics being tracked in today's web applications.

  • Usage stats and Redis

    More on tracking views statistics with Redis.

  • Vanity - Experiment Driven Development

    Split testing tool based on Redis to integrate in your Rails application. Another kind of tracking statistics. If you didn't realize it up to now, Redis is an excellent tool for this kind of application. Data that you wouldn't want to load off to your main database, because let's face it, it's got enough crap to do already.

  • Flow Analysis & Time-based Bloom Filters

    Streaming data analysis for the masses.

  • Crowdsourced document analysis and MP expenses

    While being more prose than code, it still shows areas where Redis is a much better choice than e.g. MySQL.
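
Several of the links above track statistics as scores in sorted sets. To show the shape of that idea without needing a running Redis server, here's a Python sketch with a plain dictionary standing in for one sorted set per day. The class `DailyDownloads`, the gem names and the date are made up for illustration.

```python
from collections import defaultdict


class DailyDownloads:
    """Toy stand-in for one sorted set per day, mapping member to score."""

    def __init__(self):
        self.days = defaultdict(lambda: defaultdict(int))

    def record(self, day, gem, count=1):
        # Comparable to incrementing a member's score in a sorted set.
        self.days[day][gem] += count

    def top(self, day, limit=3):
        """Highest scores first, ties broken by name."""
        ranked = sorted(self.days[day].items(),
                        key=lambda pair: (-pair[1], pair[0]))
        return ranked[:limit]


downloads = DailyDownloads()
for gem in ["rails", "rack", "rails", "redis", "rails", "rack"]:
    downloads.record("2010-01-15", gem)
print(downloads.top("2010-01-15", limit=2))  # [('rails', 3), ('rack', 2)]
```

With a real sorted set the ranking is maintained by the server on every increment, so reading the top entries doesn't require sorting on the client.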

Using Redis to store any suitable kind of statistics is pretty much an immediate use case for a lot of web applications. I could think of several projects I've worked on that could gain something from moving certain parts of their application to Redis. It's the kind of data you just don't want to clutter your database with. Clicks, views, history and all that stuff put an unnecessary amount of data and load on it. The more data it accumulates, the harder it will be to get rid of, especially in MySQL.

It's not hard to tell that we're still far from having heaps of inspiration and real-life use cases to choose from, but these should give you an idea. If you want it can get a lot simpler too. When you're using Redis already, it makes sense to use it for storing Rails sessions.

Redis is a great way to share data between different processes, be it Ruby or something else. The atomic access to lists, strings and sets, together with speedy access ensures that you don't even need to worry about concurrency issues when reading and writing data. On Scalarium, we're using it mostly for sharing data between processes.

E.g., all communication between our system and clients on the instances we boot for our users is encrypted and signed. To ensure that all processes have access to the keys, they're stored conveniently in Redis. Even though that means the data is duplicated from our main database (which is CouchDB if you must know), access to Redis is a lot faster. We keep statistics about the instances in Redis too, because CouchDB is just not made for writing heaps and heaps of data quickly. Redis also tracks a request token that is used to authenticate internal requests in our asynchronous messaging system, to make sure that they can't be compromised from some external source. Each request gets assigned a unique token. The token is stored in Redis before the message is published and checked before the message is consumed. That way we turned Redis into a trusted source for shared data between web and worker processes.
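
The request token flow described above can be sketched like this in Python. It's a simplified model, not the actual Scalarium code: an in-memory set stands in for Redis, the names `TokenStore`, `publish` and `handle` are invented, and making tokens single-use is an assumption added for the example.

```python
import secrets


class TokenStore:
    """Toy shared store; in the setup described above Redis plays this role."""

    def __init__(self):
        self._tokens = set()

    def issue(self):
        token = secrets.token_hex(16)
        self._tokens.add(token)  # stored before the message is published
        return token

    def consume(self, token):
        """Return True once per issued token (single use is an assumption)."""
        if token in self._tokens:
            self._tokens.remove(token)
            return True
        return False


def publish(tokens, payload):
    return {"token": tokens.issue(), "payload": payload}


def handle(tokens, message):
    # Checked before the message is consumed.
    if not tokens.consume(message.get("token")):
        return "rejected"
    return "processed " + message["payload"]


tokens = TokenStore()
message = publish(tokens, "reboot instance")
first = handle(tokens, message)
replayed = handle(tokens, message)
forged = handle(tokens, {"token": "forged", "payload": "reboot instance"})
print(first, "|", replayed, "|", forged)
```

Only the first delivery of the genuine message is processed; the replay and the forged message are both rejected, because their tokens aren't in the shared store.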

The library memodis makes sharing data incredibly easy, it offers Redis-based memoization. When you assign a memodis'd attribute in your code, it'll be stored in Redis and therefore can be easily read from other processes.

Redis is incredibly versatile, and if you have a real-life Redis story or usage scenario to share, please do.

Tags: nosql, redis

The NoSQL landscape is a fickle thing, new tools popping up every week, broadening a spectrum that's already close to being ungraspable, especially when you're totally new to the whole thing. There's a couple of common misconceptions and wrong-doings that people who've been playing with the tools already tend to tell newbies in the landscape.

I'm guilty as charged too, I tend to tell people about the tools I already know. Being a good thing per se, because the recommendation is based more or less on experience, it leaves out one thing that I find to be the most important philosophy about post-relational (much nicer term than NoSQL) databases: It's all about your data, about its needs and how your application needs to access them. The times of generic, one-size-fits-all tools like MySQL, PostgreSQL and the like are over, it's well worth knowing how they're different, and what tool would be the best partner in crime to get stuff done.

No Size Fits All

While you could throw MySQL at a lot of problems, it was far from being the optimal choice in a lot of cases, but you could somehow bend it to your will. The new generation of tools tends to avoid having to be bent, instead they give you a freedom of choice. The freedom to analyse what your data is like, and what the right tool for your specific use case is.

Does that mean more work on your end? It sure does, but for the love of your data, it will be worth it. If you find the right partner, and it makes your life easier, it's a win on both ends. You'll be a happy developer (well, most of the time), and your data will be able to roam free, running naked across a meadow, hand in hand with the tool you chose.

Now I'm well aware that this sounds all bloomy, but that's what it boils down to. The choice is now up to you, that's why it's important to know what's out there, to play with the tools available, to know how and why they're different from each other.

To get you started, have a look at Vineet Gupta's excellent overview of the NoSQL landscape.

Don't Believe Everything You Hear

If someone tells you that you should try a specific tool, ask him why. If the answer is speed, or because it's written in Erlang and scales insanely well, it's time to call bullshit on him. MySQL can be fast too, that's not an issue. It's nice to be able to have a database that you can scale up to hundreds of nodes easily, but while the technology behind it is very interesting, and sometimes mind-blowing, it doesn't help if it's a pain to work with, or if there's no library support yet. Sure you could write your own, but if you're totally new to the field, you usually just want to play and learn. Hard thing to do if all you get is just an API and a very limited language support.

If someone tells you that e.g. MongoDB is fast, then there's reason for that, and it's good to be well aware of it and what consequences it has for operating your application. If someone tells you that CouchDB is awesome for building web applications, because it's built of the web, they're leaving out that a common use case like pagination is still an awful thing to implement with it. If someone tells you that Cassandra scales easily because it was built at Facebook, they're leaving out that its peculiar way of storing and accessing data is very specific to how sites like Facebook need to access their data. I could go on and on about it, but there's always two sides to a story.

Before judging a tool based on just the one side, look at the other side too. It might not be as big of a problem as you thought it would be, and either way, you know why things are the way they are. Look for tools with sites describing particular use cases, or areas where they're just not a good fit. If the tool builders aren't aware of use cases, strengths and weaknesses, how will you be?

In the end, even though they can be problems for others, they aren't necessarily problems for you. Your particular use case might be just fine with the downsides, while gaining a lot from the upsides. If it's not, at least you're more than free to go look somewhere else. At least now you have the (free) options to do so.

There's misconceptions out there being close to urban myths, and we're only two years or so into working with the new generation of tools. The only way you can avoid falling into a trap is to play with what's out there, to know their weaknesses and strengths. The only thing we can do to avoid having people fall into the trap is to better educate them, to give them real-world examples, use cases other than tagging and blog posts. Just saying that it scales better than xyz is not an argument, it's educating people on the wrong end.

It's Not About Speed And Scaling

If speed and scaling were our only problems, we'd be left in a big world of pain. As beautiful as these words are, I'm gonna go out on a limb and say that it's not a problem until it's a problem. Unless you're already Facebook or LinkedIn, you don't need to have that as a main factor when choosing the right tool. Sure, it's better if there's an easy way to scale up in the future, but what's the point if you need days to get a good setup before having written a single line of code?

Most NoSQL tools were built with some sort of scaling in mind (although people tend to easily confuse scaling, distribution, sharding and partitioning), so you're safe in most cases when it comes to the point where your application needs to handle more traffic.

I'm gonna go ahead and venture the guess that if you're deciding solely based on speed and scalability, you're doing it wrong. And I rarely use that phrase. You should be deciding based on the core feature set, why it does the things it does, and what consequences it'd have on your life as a developer.

Don't Compare Apples And Oranges

No tool is like the other. Just comparing e.g. MongoDB, Redis and MySQL is the wrong way to approach your problem, especially if you just look at speed and compare their feature sets. Feature sets and speed are usually different for a reason. Instead you should be comparing every tool with your data. How much do you need to bend the data to store and access it easily? Is it even possible to store it efficiently for your particular use case? Are potential trade-offs (e.g. data duplication to gain speedier access) worth risking? Is it the right fit in the way it handles updates, associations, writes, reads and queries in terms of your data and application? Then go right ahead and use it.

But don't just compare tools with each other whose only feature they have in common is the fact that they can store data, or things that are mostly depending on an application's specific needs. To give you an idea, this guy compares Redis and MongoDB by implementing a particular use case with both of them. That's the way you should be comparing tools.

The Heat Is On

We're going to see more tools popping up left and right, making it harder to keep up, and to make an informed decision. What I consider the best thing about most of them is that they're free. You can grab the source code, improve it or just look at how it handles your data. That's what makes them so awesome, their incentive is not to constrain your data, they're as open as possible about it, some tools even going as far as building solely on open standards to implement their whole stack (that'd be CouchDB if you're curious).

The whole point of this post is that it's up to you to find the perfect tool to hand your data to. I don't know about you, but me being able to find the right fit instead of squeezing my data into a database that tries to solve all problems at once, that's the most exciting prospect of post-relational databases for me. Our common goal should be to help people make that decision without getting too passionate about any particular tool. They all exist to fulfill some purpose, and we should be telling people about them.

There's a couple of sites to keep an eye on, e.g. MyNoSQL by Alex Popescu, he's keen on keeping up-to-date with what's going on in the NoSQL community. Another site with a growing collection of links to articles is nosql-databases.org. EngineYard published a series of blog posts on key-value stores in Ruby, in particular Cassandra, Redis, MongoDB, CouchDB, LDAP that's well worth checking out to get an idea of what's out there.

Tags: nosql

A very valid question is: What's a good use case for Redis? There's quite a few, as Redis isn't your every day key-value store, it allows you to keep lists and sets in your datastore, and to run atomic operations on them, like pushing and popping elements. All that stuff is incredibly fast, as obviously your data is held in memory and only persisted to the hard disk if necessary and, to top it off, asynchronously, while not reducing the throughput of the server itself.

The simplest and most obvious use case is a cache. Redis clocks in at almost the speed of Memcached, with a couple of features sprinkled on top. If you need a cache, but maybe have a use case where you also want to store data in it that you want to be persisted, Redis is a decent tool for your caching needs. If you already have a Memcached instance in place I'd look at my options before adding a new component to my infrastructure though.

Pushing and popping elements atomically, does that ring a bell? Correct, that's what you want from a worker queue. Look at delayed_job, you'll find that it uses a locking column in your jobs table. Some people argue that a database should not be the place where you keep your worker jobs. Up to a certain amount of work I disagree, but at some point the performance costs outweigh the benefits, and it's time to move on. Redis is a perfect fit here. No locking needed, just push on the list of jobs and pop back off it in your workers, simple like that. It's the GitHub way, and the more I think about it, the more sense it makes.

For Redis 1.1 Salvatore has been working on a proposal by Ezra from Engine Yard to implement a command that would move items from one list to another in one step, atomically. The idea is to mark a job as in progress, while not removing it entirely from the data storage. Reliable messaging anyone? It's such a simple yet genius idea, and Redis has most of the functionality already in place. There's heaps more planned for future Redis releases, I'd highly recommend keeping an eye on the mailing list and on Salvatore's Twitter stream.
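
Here's a sketch of why an atomic move between two lists gets you closer to reliable messaging. It's written in Python against a made-up in-memory `Lists` class, not against Redis or any client library; each method stands for what would be a single atomic command on the server, and the list names are arbitrary.

```python
from collections import deque


class Lists:
    """Toy in-memory lists; each method stands for one atomic server command."""

    def __init__(self):
        self.data = {}

    def push(self, name, item):
        self.data.setdefault(name, deque()).appendleft(item)

    def pop_and_push(self, source, destination):
        """Move the oldest item from source to destination in one step."""
        queue = self.data.get(source)
        if not queue:
            return None
        item = queue.pop()
        self.data.setdefault(destination, deque()).appendleft(item)
        return item

    def remove(self, name, item):
        self.data[name].remove(item)


lists = Lists()
lists.push("jobs", "resize image 1")
lists.push("jobs", "resize image 2")

# A worker claims a job, finishes it, and acknowledges it.
first_job = lists.pop_and_push("jobs", "in_progress")
lists.remove("in_progress", first_job)

# A worker claims the next job and crashes before acknowledging it.
lists.pop_and_push("jobs", "in_progress")
stuck = list(lists.data["in_progress"])
print(stuck)  # ['resize image 2']

# The job was never lost, so a monitor can put it back on the queue.
lists.pop_and_push("in_progress", "jobs")
print(list(lists.data["jobs"]))  # ['resize image 2']
```

With a plain pop, the second job would have disappeared together with the crashed worker. Because the move is a single step, there is no moment where the job exists in neither list.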

As I'm sure you noticed Redis is used for data storage in hurl, a neat little app to debug HTTP calls. Redis is simply used to store your personalized list of URLs you checked. Should some data be lost in between database dumps, it's not a big deal, it's not great sure, but not a big deal.

The simple answer for when to use Redis is: Whenever you want to store data fast that doesn't need to be 100% consistent. In the past projects I've worked on that includes classic examples of web application data, especially when there's social stuff sprinkled on top: ratings, comments, views, clicks, all the social stuff you could think of. With Redis, some of it is just a simple increment command or pushing something onto a list. Here's a nice example of affiliate click tracking using Rack and Redis.

Why is that a good match? Because if some of that data is lost, it doesn't make much of a difference. Throw in all the statistical or historical data you can think of that's accumulated in some way through your application, and could be recalculated if necessary. That data usually just keeps clogging up your database, and is harder and harder to get rid of as it grows.

Same is true for activity streams, logging history, all that stuff that is nonvolatile yet doesn't need to be fully consistent, where some data loss is acceptable. You'd be surprised how much of your data that includes. It does not, and let me be perfectly clear on that, include data that involves any sort of business transaction, be it for a shopping platform or for data involved in transactions for software as a service applications. While I don't insist you store that data in a relational database, at least it needs to go into a reliable and fully recoverable datastore.

One last example, the one that brought me and Redis together is Nanite, a self-assembling fabric of Ruby daemons. The mapper layer in Nanite keeps track of the state of the daemons in the cluster. That state can be kept on each mapper redundantly, but better yet, it should be stored in Redis. I've written a post about that a while back, but it's still another prime use for Redis. State that, should it or part of it get lost, will recover all by itself and automatically (best case scenario, but that's how it works in Nanite).

One thing to be careful about though is that Redis can only take as much data as it has memory available. Especially for data that has the potential to grow exponentially with users and their actions in your application, it's good to keep an eye on it and to do some basic calculations, but you should even do that when using something like MySQL. When in doubt, throw more memory at it. With Redis and its master-slave replication it's very easy to add a new machine with more memory, do one sync and promote the slave to the new master within a matter of minutes. Try doing that with MySQL.
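
The basic calculations mentioned above can be as simple as the following back-of-the-envelope sketch in Python. All numbers are invented for illustration: the events per user, the bytes per event and especially the `overhead` factor for keys and bookkeeping are assumptions you'd replace with measurements from your own data.

```python
def estimated_memory_mb(users, events_per_user, bytes_per_event, overhead=1.5):
    """Rough estimate for keeping every event in memory.

    overhead is a made-up fudge factor for keys and data structure
    bookkeeping, not a measured value.
    """
    raw_bytes = users * events_per_user * bytes_per_event
    return raw_bytes * overhead / (1024 * 1024)


for users in (10_000, 100_000, 1_000_000):
    print(users, "users:", round(estimated_memory_mb(users, 200, 64)), "MB")
```

Even a crude estimate like this tells you early whether the data will fit comfortably on one machine, or whether it's time to think about trimming history or adding memory.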

For me, this stuff is not about just being awesome. I've had countless situations where I had data that could've been handled more elegantly using something like Redis, or a fully persistent key-value store like Tokyo Tyrant. Now there's really no excuse not to get that pesky data clogging up your database out of there. These are just some examples.

By the way, if you want to know what your Redis server is doing, telnet to your Redis instance on port 6379, and just enter "monitor". Watch in awe as all the commands coming in from other clients appear on your screen.

In the next post we'll dig into how you can store data from your objects conveniently into Redis.

Redis, consider us for your next project.

Tags: nosql, redis

I like to think that there's never been a more exciting time when it comes to playing with new technologies. Sure, that's a bit selfish, but that's just how I feel. Doing Java after I got my diploma was interesting, but it wasn't exciting. Definitely not compared to the tools that keep popping up everywhere.

One "movement" (if you can even call it that) is NoSQL. I've never been particularly happy with relational databases, and I happily dropped MySQL and the like when an opportunity to work with something entirely new came up. Since it's my own project I'm not putting anything at risk, and I don't regret taking that step. We're working with two members of the NoSQL family in particular, CouchDB and Redis.

Last week people interested in and people working with and on these new and pretty fascinating tools came together for the first NoSQL meetup in Berlin. I talked about Redis, and before I keep blabbering on about it, here are my slides. The talks have been filmed, so expect an announcement for the videos soon-ish.

I was up against a tough competition, including CouchDB, Riak and MongoDB (but we're all friends, no hard feelings). During my talk, I might've overused the word awesome. But after all the talks were over, it hit me: Redis is awesome. It seriously is. Not because it does a lot of things, is distributed, written in Erlang (it's written in old-school, wicked fast C), has support for JSON (though that's planned), and all that stuff. No, it's awesome because it does only a very small set of work for you, but it does it extremely well, and wicked fast. I don't know about you, but I like tools like that. I took a tour of the C code last week, and even though my skills in that area are a bit rusty, it was quite pleasant to read, and easy to follow the flow.

I like Redis, and while I don't ask you to love it too, do yourself a favor and check it out. It gives Memcached a serious run for its money. Everyone loves benchmarks, and I do too, but I'm careful not to read too much into them. I ran Redis and Memcached through their paces, using the available Ruby libraries. I tried both the C-based and Ruby-based version for Memcached, and the canonical Ruby version for Redis. It's like a cache, with sugar sprinkled on top.

Without putting out any numbers, let me just say that, first it's a shame Rails is shipped with the Ruby version of the Memcached library, because it is sloooow. Okay not so slow you should be worried, but slower than the competition. Second, Redis clocks in right in the middle between both Memcached libraries. While it's faster than memcache-client, it's still a bit slower than memcached. Did I mention that the library for Redis is pure Ruby? Pretty impressive, especially considering what you get in return. Sit back for a moment, and think about how much work went into Memcached already, and how young Redis still is. Oh the possibilities.

Redis is more than just a key-value store, it's a lifestyle. No wait, that's something different. But it still requires you to think differently. Shouldn't be a surprise really, most of the new generation of data stores do. It takes any data you give to it, and you're good to go as long as it fits into your memory. Let me tell you, that's still a lot of data. Salvatore is constantly working on new features for Redis, so keep an eye on its GitHub repository. If you thought that pushing and popping elements atomically off lists was cool, there might be a big warm surprise for you in the near future.

I first came across it using Nanite, where it's used to store the state of the daemon cluster. Running it through its paces in preparation for the talk I realized how underused it is. For our use case, Redis is the perfect place to store stuff like history of system data, e.g. CPU usage, load, memory usage and the like. It's also a great fit for a worker queue, but since we have RabbitMQ in place, there's no need for that.

When you look at it closely, there's heaps of uses for Redis. Chris Wanstrath wrote about how he used it writing hurl, and Simon Willison also published a love letter to Redis, there's also more info on how you use it with the Ruby library over at the EngineYard blog, and James Edward Gray published a whole series on how to install, setup and use Redis with Ruby. Just like CouchDB I want to put Redis to more uses in the future. That doesn't mean I'm looking to find a problem for a solution, it just means that when I have a problem I'm gonna consider my options, and Redis is one of them. It's a perfect mix between a simple yet insanely speedy data store, but with the little twist that is Redis' way of persisting data.

Tags: redis, nosql

Call it NoSQL, call it post-relational, call it what you like, but it's hard to ignore that things are happening in the database world. A paradigm shift is not too far ahead, and it's a big one, and I for one am welcoming our post-relational overlords, whatever you call them: CouchDB, MongoDB (although you really shouldn't call a database MongoDB), Cassandra, Redis, Tokyo Cabinet, etc. I'm well aware that they're not all the same, but they do try to fill similar gaps: making data storage easy as pie, and offering storage that fits the kind of evolving data we usually find on the web.

The web is slowly moving away from relational databases and with that, SQL. Let me just say it upfront: I hate SQL. I physically hate it. It doesn't fit with my way of thinking about problems, and it doesn't fit with the web. That's my opinion anyway, but I'm sure I'm not alone with it.

There is, however, one area where object-oriented databases failed, and where the new generation of document databases will have similar problems. You could argue that object-oriented databases are in some way a predecessor to modern post-relational databases. They made storing objects insanely easy, no matter how complex they were, and they made navigating through object trees even easier and insanely fast. That made them applicable to some problems, but they weren't flexible enough in my opinion. Still, they laid some groundwork.

It mainly concerns The Enterprise and their giant collection of reporting tools. Everybody loves tools, and The Enterprise especially loves them. The more expensive, the better. Reporting tools are the base for those awesome pie charts they just love to fill entire PowerPoint presentations with. They work on "standardized" interfaces and languages, and therefore with SQL.

I've worked on a project where we switched from an object-oriented to a relational database just because of that. Sure, there are proprietary query languages, or there's JDOQL or JPQL when you're into JDO, EJB3 and the like. But they're nowhere near as powerful as SQL is. They're also not as brain-twisting. That should be a good thing really, but there you have it.

NoSQL databases are facing a similar dilemma. Just like object-oriented databases, they're awesome for just dumping data into them, more or less structured. It's easy to get it out too, and it's usually easy to aggregate the data in some way. What they don't do is speak SQL, so the reporting tools can't talk to them. Is it a big deal? Of course not, at least not in my opinion. But if it is some sort of deal, what can you do to work around that?

  • Ignore it. Simple, isn't it? The reporting requirement can usually be solved in a different way. Sure, it can be more work, but usually reporting is less of a killer than some might think. Give the client some way to express a query and let him at it. Give him a spare instance of your replicated database, and let him work off that data. Best thing you could do is pre-aggregate it as much as possible so there's less work for the client.

  • If you really need structured data in a relational database, consider replicating the data into one from your post-relational database of choice. I can hear you say: That guy's crazy, that'd involve so much work keeping the two in sync! No, it wouldn't. Create a fresh dump every time you need a current dataset, and load it into your SQL database. Simple as that.

  • Put an interface in front of the new database. Yes, it's insane, but I've done it, and it works. It doesn't have to be an SQL interface, just a common interface that works with one set of reporting tools. Yes, it's not ideal, but it's an option.

  • Don't ignore it, keep using a relational database. Yep, not all of us are lucky enough, someone still has to serve the market demands. Legacy projects or clients are forcing us to stick with the old and the dusty model of storing and retrieving data. Quite a lot of people are happy with that, but I'm not.
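The fresh-dump option from the list above could look something like this. It's a minimal sketch in Python, with an in-memory SQLite database standing in for the reporting database. The documents, the orders table and its columns are all made up for illustration; in practice the list would come from an export of your document database.

```python
import json
import sqlite3

# Hypothetical documents, standing in for an export from a document database.
documents = [
    {"_id": "order-1", "customer": "alice", "total": 30.0, "items": ["book"]},
    {"_id": "order-2", "customer": "bob", "total": 12.5},
    {"_id": "order-3", "customer": "alice", "total": 7.5, "coupon": "X1"},
]


def dump_to_sql(conn, docs):
    """Rebuild the reporting table from scratch, no incremental sync."""
    conn.execute("DROP TABLE IF EXISTS orders")
    conn.execute(
        "CREATE TABLE orders "
        "(id TEXT PRIMARY KEY, customer TEXT, total REAL, raw TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        # Pull out the fields reporting cares about; keep the full document
        # as JSON in `raw` so nothing is lost when shapes differ.
        [(d["_id"], d.get("customer"), d.get("total"), json.dumps(d)) for d in docs],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
dump_to_sql(conn, documents)

# From here on the reporting tools can use plain SQL.
report = conn.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(report)
```

Dropping and recreating the table on every run is what makes this cheap: there's nothing to keep in sync, only a dump to repeat whenever a current dataset is needed.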

I'm sure there are other options; these are just off the top of my head, and I can say that I've practiced all of them with more or less good results. I for one am sick of still having to use MySQL on new projects. I've had my fun with it, and sure there's a whole bunch of patches that make it a bit more fun, but it's still MySQL. Yes, I am aware that there's PostgreSQL, but it's the same story. Old, old and old.

Should you still try to get a new generation database into new projects? Yes, yes and yes, you definitely should. Consider yourself lucky if you succeed, because you're still an awesome early adopter. Even use SimpleDB if you must, but maybe reconsider before you really use it, it's not great. But don't lie to your clients, they should be aware of what they're getting into. It's no big deal, but the bigger they are, the more likely they have administrators not yet familiar with the new tools. The more people start using them now, though, the better they'll get before they hit the mainstream. Which they will eventually, rest assured. I'm ready, the web is ready, and the tools are ready. What about you?

Tags: nosql, databases