I don't use RSpec much these days. I much prefer Shoulda; heck, I even started using Rails integration tests again (with Shoulda, of course), because sometimes the additional abstraction of Cucumber is just too much. Anyway, there are some things I liked about RSpec, and they weren't related to the features of the testing DSL itself, but to RSpec as a tool. It has a neat formatter that outputs the ten slowest-running tests, and I also found the colored list of full test names to be very helpful.

So I scratched my itch last weekend and brought that goodness to Shoulda. Our test suite is starting to get a bit slow, and the profiling has already served us well in finding the slowest tests. I like the rinse-and-repeat approach of squeezing some valuable dozens of seconds out of a test suite with that technique.

So without further ado, I give you shoulda-addons, my little patch set that brings both test profiling and a colored list of full test names to your Shoulda test suite. I'm sure it'd work with plain Test::Unit or MiniTest without much effort, but for now it's made for Shoulda, and it looks like this:

[Screenshot: shoulda-addons output, with the colored list of full test names and the slowest tests]

While adding in the profiling was pretty straightforward, getting the colored output was pretty messy, and I'm not proud of it, especially considering that Test::Unit and MiniTest go different routes when outputting the little dots, F's and E's.
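For the curious, here's a minimal sketch of the general idea behind the profiling part. This is not the actual shoulda-addons code, just an illustration assuming classic Test::Unit: wrap TestCase#run, record each test's wall-clock time, and print the slowest ones when the run finishes.

# Not the shoulda-addons implementation, just a sketch of the idea.
# Register our at_exit before requiring test/unit so it fires *after*
# the autorunner (at_exit hooks run in reverse order of registration).
at_exit do
  puts "\nTen slowest tests:"
  Test::Unit::TestCase.timings.sort_by { |_, time| -time }.first(10).each do |name, time|
    puts "  %.3fs %s" % [time, name]
  end
end

require 'test/unit'

class Test::Unit::TestCase
  @@timings = []

  def self.timings
    @@timings
  end

  alias_method :run_without_profiling, :run

  # Wrap every test run and record how long it took.
  def run(result, &block)
    started = Time.now
    run_without_profiling(result, &block)
    @@timings << [name, Time.now - started]
  end
end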

The package is up on GitHub, can be installed from Gemcutter via gem install shoulda-addons, and should work with Ruby 1.8 and 1.9. I also tested it with Mocha included, so let me know if something doesn't work for you.

I've had the dubious pleasure of working with a certain library, a layer for talking to most of the Amazon Web Services APIs. While working with web services is usually a particularly awful experience anyway, this library doesn't make much of an effort to hide their awkwardness; in fact, in some ways it makes things even worse. It's pretty old news that I enjoy bitching about code I don't like, but I also like to keep it positive by at least thinking about how it could be improved.

Let's take a moment and have a look at what part of their API is particularly nasty and how the situation could be improved. We actually wrote a micro-wrapper around that library to make things a bit less awful for us, and to at least keep their code out of ours.

I'm looking in particular at the EC2 part of the library. Here's the signature of the method that runs a new instance:

run_instances(image_id, min_count, max_count, group_ids, key_name,
              user_data='', addressing_type = nil, instance_type = nil,
              kernel_id = nil, ramdisk_id = nil, availability_zone = nil,
              block_device_mappings = nil)

Yes, that's 12 arguments you can hand to the method, most of them optional. Now, what does calling it look like?

run_instances('ami-123445', 1, 1, [], 'default')

So far so good. But what if you want to specify a different availability zone, e.g. one in the EU?

run_instances('ami-123445', 1, 1, [], 'default', '', nil, nil, nil, nil, 'eu-west1b')

Nice, eh? It's starting to get really ugly. The important parameters are hidden by unnecessary noise. And what's with max_count and min_count not having default values? I'd argue it would normally be quite reasonable to assume that you only want to run one instance when you call the method without those two parameters.

Anyway, that's just one example. In general, there's only one important piece of information, and that's the AMI identifier. Let's look at what this method could look like with some very common Rubyisms sprinkled on top, things that should be common sense when writing Ruby code, but that seem to have been lost here in favor of resembling the original API as closely as possible, even if that means writing cumbersome code:

run_instance('ami-123445', :availability_zone => 'eu-west1b', :ssh_key => 'default')

Now the method takes only one required argument and a hash of options. Is it more code to write? Yes. Are its intentions clearer? I'd argue they are. Is it close to what the API underneath expects as arguments? Closer than before. Unnecessary information is left out, because either the method itself can send sensible defaults, or the web service does it. The EC2 API even has a default for the instance type, and it doesn't care if you leave out the RAM disk or a specific kernel. It does what's necessary to ensure your instance is up.

Knowing that, it's not that hard to hide those details behind a slightly improved version of run_instances, in this case even called run_instance. Throw in a separate run_instances to handle the specifics of launching multiple instances at once. Is it more code to write? Sure. But it sure as hell makes for nicer usage of the library. Instead of having to look up the order of the method's parameters and fill in the blanks with nil, I only have to look up which options the method accepts.
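Here's a rough sketch of what such a wrapper could look like. The option names and the @ec2 instance variable are my own inventions; the underlying call is the library's run_instances from above.

# Hypothetical wrapper: one required argument, options for the rest.
def run_instance(ami, options = {})
  instances = @ec2.run_instances(
    ami,
    1, 1,                           # min_count and max_count: just one instance
    options[:groups] || [],
    options[:ssh_key] || 'default',
    '',                             # user_data
    nil,                            # addressing_type
    options[:instance_type],        # EC2 has a sensible default anyway
    nil, nil,                       # kernel_id, ramdisk_id
    options[:availability_zone]
  )
  instances.first                   # we asked for one instance, so return it directly
end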

In general, I don't like giving a method more than three arguments. More than that simply means it's doing too much, or that the data it needs should be wrapped in a simple data structure, e.g. a hash (which Ruby practically calls for) or a simple struct. The more arguments you give a method, the harder it is for users to figure out how to call it, especially if you make a whole stash of them optional: the order starts to matter, and users have to dig through endless documentation just to figure out where the argument most important to them needs to be placed among a pile of nils.

run_instances in said library returns an array of hashes. Those hashes contain the data returned by the web service, each entry describing one instance. Apart from the fact that you always have to fetch the first entry of that array when you've launched just one instance, it also does something weird. EC2 returns attributes in the hash using names like imageId, instanceId, keyName and so on. You could argue about the naming style, but the names sure are reasonable. What run_instances does is go ahead and prefix almost every attribute with aws_, but not all of them. Most methods in the library do the same, so at least there's some consistency, but it sure isn't great.

So while trying to resemble the API call, it makes you care about which attribute is which instead of just giving you the attributes as it receives them from the API. The answer to this is simple: give me the damn attributes as you get them. If you need to ensure there are no naming clashes, put them in a subhash, e.g. under the key :ec2 if you must.
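If the library insists on touching the attributes at all, a small normalizing step could undo the damage. This is a hypothetical helper, not anything the library offers:

# Strip the inconsistent aws_ prefixes and hand back the API's own names.
def normalize_attributes(instance)
  instance.inject({}) do |attributes, (key, value)|
    attributes[key.to_s.sub(/\Aaws_/, '').to_sym] = value
    attributes
  end
end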

Another example is this really simple method:

describe_instances(list=[])

It takes an array as its only argument, nothing more. Even if you just want the data for one instance, you have to call it like this:

describe_instances(['i-123454'])

I don't know about you, but I think that just looks gross. We're using Ruby, for chrissakes; there's a simple mechanism that ensures you get an array without the library's user having to specify one: splat arguments. It's a simple fix with great effect.

describe_instances('i-123454', 'i-213143')

There, much better. Again, I'd make the case for a convenience method that gets the data for just one instance and fetches the first element of the resulting array, as sketched below. If you must, generate the code for it, but it makes code using the library a lot easier on the eyes. Sure, it's more work for you as the library's author, but your users will be thankful, I'm sure.
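A sketch of what that could look like in a wrapper (names assumed, @ec2 again being an instance of the library's EC2 class):

# Splat arguments: callers pass one or more ids, the library still gets its array.
def describe_instances(*instance_ids)
  @ec2.describe_instances(instance_ids)
end

# Convenience method for the common single-instance case.
def describe_instance(instance_id)
  describe_instances(instance_id).first
end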

Last example, the library's layer to access S3. Now, when I reach for my bucket, I can use this method:

bucket(name, create=false, perms=nil, headers={})

The second parameter caught my eye: according to the RDoc, it creates the bucket if it doesn't exist. Looking at the code, though, it doesn't even check whether the bucket already exists; it just goes ahead and does a PUT request whenever you specify true. So to make the intention clear, you actually have to do the checking yourself and call the method again with create set to true, because if the bucket doesn't exist, bucket will just return nil. You could end up with something like this:

bucket('bucket-for-monsieur') or bucket('bucket-for-monsieur', true)

So the RDoc is actually lying to us, and we have to bend over and ruin a line of code just because our library is too lazy to do the job for us. If you ask me, the whole method signature looks a bit odd anyway. It too calls for a little makeover with an options hash, and it obviously should do what it says and check the existence of the bucket for us if we ask it to, and maybe raise an error otherwise, though the latter is a matter of taste. Sometimes I'd rather get an error than have to check for nil, since in this case you could argue it clearly is an exceptional situation.

bucket('bucket-for-monsieur', :create => true, :permissions => 'public')
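A sketch of how that could behave in a wrapper, again with hypothetical names on top of the library's bucket method from above:

def bucket(name, options = {})
  existing = @s3.bucket(name)
  return existing if existing
  # Only create the bucket when asked to; otherwise treat a missing
  # bucket as the exceptional situation it is.
  raise "bucket #{name} does not exist" unless options[:create]
  @s3.bucket(name, true, options[:permissions])
end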

Designing a good API is hard, especially when you're wrapping one that's already not great, though that's no excuse. It's well worth thinking about how others would want to use your library, even putting aside the extra work it might take to make it more usable. It's not too much to ask, and it'll sure make your users happy, because their code will look nicer too. The only way the above could fail harder is if the methods carried the real names of the API calls:

DescribeInstances(['i-123454'])

Gross! You can laugh now, but some Ruby libraries for accessing Amazon's Web Services don't shy away from exposing you to these decidedly un-Ruby-like method names.

Call it NoSQL, call it post-relational, call it what you like, but it's hard to ignore that things are happening in the database world. A paradigm shift is not too far ahead, and it's a big one, and I for one am welcoming our post-relational overlords. Whatever you call them, CouchDB, MongoDB (although you really shouldn't call a database MongoDB), Cassandra, Redis, Tokyo Cabinet, etc. I'm well aware that they're not all the same, but they do try to fill similar gaps: making data storage easy as pie, and fitting the kind of evolving data we usually find on the web.

The web is slowly moving away from relational databases and with that, SQL. Let me just say it upfront: I hate SQL. I physically hate it. It doesn't fit with my way of thinking about problems, and it doesn't fit with the web. That's my opinion anyway, but I'm sure I'm not alone with it.

There is, however, one area where object-oriented databases failed, and where the new generation of document databases will face similar problems. You could argue that object-oriented databases are in some ways a predecessor to modern post-relational databases: they made storing objects insanely easy, no matter how complex they were, and they made navigating through object trees even easier and insanely fast. That made them applicable to some problems, but they weren't flexible enough in my opinion. Still, they laid some groundwork.


That area mainly concerns The Enterprise and its giant collection of reporting tools. Everybody loves tools, and The Enterprise especially loves them; the more expensive, the better. Reporting tools are the basis for those awesome pie charts they love to fill entire PowerPoint presentations with. They work on "standardized" interfaces and languages, and therefore with SQL.

I've worked on a project where we switched from an object-oriented to a relational database just because of that. Sure, there are proprietary query languages, and there's JQL if you're into JDO, EJB3 and the like. But they're nowhere near as powerful as SQL. They're also not as brain-twisting. That should really be a good thing, but there you have it.

NoSQL databases face a similar dilemma. Just like object-oriented databases, they're awesome for dumping more or less structured data into. It's easy to get the data out too, and usually easy to aggregate it in some way. But the standard reporting tools can't talk to them. Is that a big deal? Of course not, at least not in my opinion. But if it is some sort of deal for you, what can you do to work around it?

  • Ignore it. Simple, isn't it? The reporting requirement can usually be solved in a different way. Sure, it can be more work, but reporting is usually less of a killer than some might think. Give the client some way to express a query and let them at it. Give them a spare instance of your replicated database and let them work off that data. The best thing you can do is pre-aggregate the data as much as possible, so there's less work left for the client.

  • If you really need structured data in a relational database, consider replicating the data into one from your post-relational database of choice. I can hear you say: that guy's crazy, that'd involve so much work keeping the two in sync! No, it wouldn't. Create a fresh dump every time you need a current dataset and load it into your SQL database, as sketched after this list. Simple as that.

  • Put an interface in front of the new database. Yes, it's insane, but I've done it, and it works. It doesn't have to be an SQL interface, just a common interface that works with one set of reporting tools. It's not ideal, but it's an option.

  • Don't ignore it, and keep using a relational database. Yep, not all of us are lucky; someone still has to serve the market's demands, and legacy projects or clients force us to stick with the old and dusty model of storing and retrieving data. Quite a lot of people are happy with that, but I'm not.
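To make the second option a bit more concrete, here's a minimal sketch of the dump-and-replicate idea, assuming CouchDB as the source (via the couchrest gem) and MySQL as the reporting target (via the mysql gem). The database names and the orders schema are made up for illustration:

require 'rubygems'
require 'couchrest'
require 'mysql'

couch = CouchRest.database("http://localhost:5984/orders")
mysql = Mysql.real_connect("localhost", "reporting", "secret", "reporting")

# Rebuild the reporting table from scratch on every run, no syncing involved.
mysql.query("TRUNCATE TABLE orders")
couch.documents(:include_docs => true)["rows"].each do |row|
  doc = row["doc"]
  mysql.query("INSERT INTO orders (id, total, created_at) VALUES ('%s', %f, '%s')" % [
    mysql.escape_string(doc["_id"]), doc["total"].to_f, doc["created_at"]
  ])
end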

I'm sure there are other options; these are just off the top of my head, and I can say I've practiced all of them with more or less good results. I for one am sick of still having to use MySQL on new projects. I've had my fun with it, and sure, there's a whole bunch of patches that make it a bit more fun, but it's still MySQL. And yes, I'm aware there's PostgreSQL, but it's the same story. Old, old and old.

Should you still try to get a new-generation database into new projects? Yes, yes and yes, you definitely should. Consider yourself lucky if you succeed, because you're still an awesome early adopter. Even use SimpleDB if you must, though maybe reconsider before you really do, it's not great. But don't lie to your clients; they should be aware of what they're getting into. It's no big deal, but the bigger they are, the more likely their administrators aren't yet familiar with the new tools. The more people start using them now, though, the better they'll get before they hit the mainstream. Which they will, eventually, rest assured. I'm ready, the web is ready, and the tools are ready. What about you?

Tags: nosql, databases