The NoSQL landscape is a fickle thing, with new tools popping up every week and broadening a spectrum that's already close to ungraspable, especially when you're totally new to the whole thing. There are a couple of common misconceptions and bad habits that people who've already been playing with the tools tend to pass on to newcomers.

I'm guilty as charged too: I tend to tell people about the tools I already know. That's a good thing per se, because the recommendation is based more or less on experience, but it leaves out what I find to be the most important philosophy about post-relational (much nicer term than NoSQL) databases: it's all about your data, about its needs and how your application needs to access it. The times of generic, one-size-fits-all tools like MySQL, PostgreSQL and the like are over. It's well worth knowing how the new tools differ, and which one would be the best partner in crime to get stuff done.

No Size Fits All

You could throw MySQL at a lot of problems and somehow bend it to your will, but in many cases it was far from the optimal choice. The new generation of tools tends to avoid having to be bent. Instead they give you freedom of choice: the freedom to analyse what your data is like, and what the right tool for your specific use case is.

Does that mean more work on your end? It sure does, but for the love of your data, it will be worth it. If you find the right partner, and it makes your life easier, it's a win on both ends. You'll be a happy developer (well, most of the time), and your data will be able to roam free, running naked across a meadow, hand in hand with the tool you chose.

Now I'm well aware that this all sounds rather flowery, but that's what it boils down to. The choice is now up to you. That's why it's important to know what's out there, to play with the tools available, and to know how and why they're different from each other.

To get you started, have a look at Vineet Gupta's excellent overview of the NoSQL landscape.

Don't Believe Everything You Hear

If someone tells you that you should try a specific tool, ask them why. If the answer is speed, or because it's written in Erlang and scales insanely well, it's time to call bullshit on them. MySQL can be fast too, that's not an issue. It's nice to have a database that you can easily scale up to hundreds of nodes, but while the technology behind it is very interesting, and sometimes mind-blowing, it doesn't help if it's a pain to work with, or if there's no library support yet. Sure, you could write your own, but if you're totally new to the field, you usually just want to play and learn. That's a hard thing to do if all you get is an API and very limited language support.

If someone tells you that e.g. MongoDB is fast, then there's a reason for that, and it's good to be well aware of it and of the consequences it has for operating your application. If someone tells you that CouchDB is awesome for building web applications, because it's built of the web, they're leaving out that a common use case like pagination is still an awful thing to implement with it. If someone tells you that Cassandra scales easily because it was built at Facebook, they're leaving out that its peculiar way of storing and accessing data is very specific to how sites like Facebook need to access their data. I could go on and on about it, but there are always two sides to a story.

Before judging a tool based on just the one side, look at the other side too. It might not be as big a problem as you thought it would be. Either way, you'll know why things are the way they are. Look for tools whose sites describe particular use cases, or areas where they're just not a good fit. If the tool builders aren't aware of use cases, strengths and weaknesses, how will you be?

In the end, even though these downsides can be problems for others, they aren't necessarily problems for you. Your particular use case might be just fine with the downsides while profiting a lot from the upsides. If it's not, you're more than free to go look somewhere else. At least now you have the (free) options to do so.

There are misconceptions out there that are close to urban myths, and we're only two years or so into working with the new generation of tools. The only way you can avoid falling into a trap is to play with what's out there, to know the tools' weaknesses and strengths. The only thing we can do to keep people from falling into the trap is to educate them better, to give them real-world examples and use cases other than tagging and blog posts. Just saying that it scales better than xyz is not an argument, it's educating people on the wrong end.

It's Not About Speed And Scaling

If speed and scaling were our only problems, we'd be left in a big world of pain. As beautiful as these words are, I'm gonna go out on a limb and say that it's not a problem until it's a problem. Unless you're already Facebook or LinkedIn, you don't need to have that as a main factor when choosing the right tool. Sure, it's better if there's an easy way to scale up in the future, but what's the point if you need days to get a good setup before having written a single line of code?

Most NoSQL tools were built with some sort of scaling in mind (although people tend to confuse scaling, distribution, sharding and partitioning), so in most cases you're safe when your application gets to the point where it needs to handle more traffic.

I'm gonna go ahead and venture the guess that if you're deciding solely based on speed and scalability, you're doing it wrong. And I rarely use that phrase. You should be deciding based on the core feature set, why it does the things it does, and what consequences it'd have on your life as a developer.

Don't Compare Apples And Oranges

No tool is like the other. Just comparing e.g. MongoDB, Redis and MySQL is the wrong way to approach your problem, especially if you just look at speed and compare their feature sets. Feature sets and speed are usually different for a reason. Instead you should be comparing every tool with your data. How much do you need to bend the data to store and access it easily? Is it even possible to store it efficiently for your particular use case? Are potential trade-offs (e.g. data duplication to gain speedier access) worth risking? Is it the right fit in the way it handles updates, associations, writes, reads and queries in terms of your data and application? If so, go right ahead and use it.

But don't just compare tools with each other when the only feature they have in common is that they can store data, or compare them on things that mostly depend on an application's specific needs. To give you an idea, this guy compares Redis and MongoDB by implementing a particular use case with both of them. That's the way you should be comparing tools.

The Heat Is On

We're going to see more tools popping up left and right, making it harder to keep up and to make an informed decision. What I consider the best thing about most of them is that they're free. You can grab the source code, improve it or just look at how it handles your data. That's what makes them so awesome: their incentive is not to constrain your data. They're as open as possible about it, with some tools even going as far as building their whole stack solely on open standards (that'd be CouchDB if you're curious).

The whole point of this post is that it's up to you to find the perfect tool to hand your data to. I don't know about you, but being able to find the right fit, instead of squeezing my data into a database that tries to solve all problems at once, is the most exciting prospect of post-relational databases for me. Our common goal should be to help people make that decision without getting too passionate about any particular tool. They all exist to fulfill some purpose, and we should be telling people about them.

There are a couple of sites to keep an eye on, e.g. MyNoSQL by Alex Popescu, who's keen on keeping up to date with what's going on in the NoSQL community. There's also another site with a growing collection of links to articles, and EngineYard published a series of blog posts on key-value stores in Ruby, covering Cassandra, Redis, MongoDB, CouchDB and LDAP in particular, that's well worth checking out to get an idea of what's out there.

Tags: nosql

I've been spending some quality time with two of my new favorite tools lately (CouchDB and Redis, duh!), and while integrating them into Scalarium some needs emerged, and as a result some smaller hacks. I don't want to deprive the world of their joy, so here they are.

First one is a tiny gem that will allow you to use Redis as a session store. What's so special about it, there's redis-store, right? Sure, but I couldn't for the life of me get it to work reliably. Seems that's due to some oddity in Rack or something. At least that's where my interest in investigating the issues further faded, so I decided to just rip the code off MemCacheStore, and there you have it: redis-session-store. Rails-only and proud of it.

While working on it I constantly kept a monitor process open on Redis. Great feature by the way, if not awesome. I used telnet, and somehow I constantly managed to hit Ctrl-C in the terminal I had the telnet session open in. Reconnecting manually is tedious, so I give you my little redis-monitor script:

Incredibly simple, but saves those precious moments you'd waste typing everything by hand.
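A minimal sketch of what such a helper could look like, not the original script: it assumes `redis-cli` is installed and on your PATH instead of going through telnet, and the function name `redis_monitor` and the `localhost:6379` defaults are placeholders.

```shell
# Hypothetical helper: the name and the default host/port are assumptions.
redis_monitor() {
  local host="${1:-localhost}"
  local port="${2:-6379}"
  # MONITOR streams every command the server processes until you interrupt it.
  redis-cli -h "$host" -p "$port" monitor
}
```

Dropped into your shell's rc file, a plain `redis_monitor` (or `redis_monitor otherhost 6380`) should get the stream back after an accidental Ctrl-C.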

Last but not least, here's a hack-ish patch to make CouchPotato (great CouchDB Ruby library by the way) dump view queries into the log file. The ugly part at the end is me trying to get the log that's output at the end of each request to include the DB time for CouchDB queries.

It's not great, but works for now. We'll very likely include something decent in CouchPotato without hacking into ActionController like that. Unfortunately, to get this far, there's really no other way. I tried faking ActiveRecord, but you open a whole other can of worms doing that, because a lot of code in Rails seems to rely on the existence of the ActiveRecord constant, assuming you're using the full AR stack when the constant is defined. Here's hoping that stuff is out the door in Rails 3. Haven't checked to be honest.

Dump that file into an initializer, and you're good to go (for the moment).

Tags: rails, redis, couchdb

After being annoyed with running multiple versions of Ruby just by using MacPorts I finally gave in and tried out rvm, the Ruby Version Manager. That stuff got even more annoying when I tried to make Bundler behave well with multiple Ruby versions, because it just doesn't by default. It's not really a problem with normal gems, but Bundler falls apart with its defaults when you're trying to run gems with native extensions. Hint: Set bundle_path to include RUBY_VERSION and make some links from one cache directory to another to not have every gem cached for every Ruby version.
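The first half of that hint can be sketched roughly like this. It's a sketch under assumptions: it uses the `BUNDLE_PATH` environment variable that current Bundler releases read, rather than the `bundle_path` setting mentioned above, and the helper name `bundle_path_for` and the `~/.bundle` location are hypothetical.

```shell
# Hypothetical helper: build a gem install path that is unique per Ruby version.
bundle_path_for() {
  printf '%s/.bundle/ruby-%s\n' "$HOME" "$1"
}

# Point Bundler at a version-specific directory, if a ruby is on the PATH.
if command -v ruby >/dev/null 2>&1; then
  BUNDLE_PATH="$(bundle_path_for "$(ruby -e 'print RUBY_VERSION')")"
  export BUNDLE_PATH
fi
```

With a separate directory per version, gems with native extensions should get compiled once per Ruby instead of clobbering each other.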

The promise of being able to easily switch between different versions while still having just one ruby binary, and not one called ruby1.9 as with MacPorts, is just neat. While installing them is straightforward, using them from e.g. TextMate is not great. The common solutions of just launching it from the command line or modifying the TextMate Ruby bundle (these changes will have to be made again with the next TextMate update) are not fully acceptable for me, because they still don't allow me to switch Ruby versions while TextMate is running. That's one "flaw" rvm has, at least for me. It switches the paths for the Ruby versions for the current shell, but it doesn't offer anything to set links in ~/.rvm/bin to the currently active Ruby version, at least as far as I know. No big deal. If it's by design I can live with that, though I do think it'd be a nice addition.

Anyway, I wanted to switch Ruby versions from my shell and have it affect the version I'm using to run my tests from TextMate too. The way to go seems to be rvm <version> --default which will set the default for all other shells. Be aware that it will do what it says, but I could live with that. It's more important to me to be able to make that switch than just having several shells with different versions in each. First step was to shorten that command, because let's face it, that's a lot of text. I added a function to my .zshrc. It should work just as well with bash, but really, you're still using bash?

rvmd() { rvm use "$1" --default; }

Now you can just rvmd 1.9.1 in your shell prompt and be done with it. Much better.

The other part was telling TextMate what Ruby binary to use. The problem outlined above made that a bit of a pain, so I broke out my shell scripting fu and cranked out this amazing wrapper script, using what rvm already dumps in your rc files:

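#!/usr/bin/env bash
# Load rvm if it's installed, so PATH points at the current default Ruby.
if [[ -s /Volumes/Users/pom/.rvm/scripts/rvm ]]; then
  source /Volumes/Users/pom/.rvm/scripts/rvm
fi
exec ruby "$@"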

Impressive, eh? It just sources the rvm script and then calls the ruby binary that is currently set as default. Make it executable, set a shell variable in TextMate called TM_RUBY that points to that script, and you're good to go.

Tags: ruby, textmate