Measuring everything that moves, in a distributed system or even in a simple web app, is becoming the basis of thorough application monitoring.

However, one thing is starting to get in the way of getting good measurements of all layers in a system: the client libraries used to talk to network services, be it a database, an API, a message bus, anything that's bound to the intricate latency variances of the network stack.

Without full instrumentation of every part of the application's stack, it's going to be very hard to figure out what exactly a problem boils down to. Measuring client access to a network service, in addition to collecting data on the other end, e.g. the slow query log, allows you to pinpoint whether an issue lies in the network, in increased latency, or in parsing responses.

If the other end is not under your control, it's just as important to have this data available. Good metrics on request latencies to an external service, even a database hosted by a third party, give you a minimum amount of confidence that while you may not be able to fix the underlying problem, you at least have the data to show where it most likely lies. That's useful data to have when approaching the third-party vendor or hosting company about the issue.

Rails has set a surprisingly good example by way of ActiveSupport::Notifications. Controller requests are instrumented, just like database queries of any kind.

You can subscribe to the notifications and start collecting them in your own metrics tool. StatsD, Graphite and Librato Metrics are pretty great tools for this purpose.
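To give a rough idea of what that looks like, here's a subscriber that times every SQL query Rails publishes an event for and reports it to StatsD. This is a minimal sketch, assuming the statsd-ruby gem; the metric name is made up:

```ruby
# config/initializers/metrics.rb
require 'statsd'

STATSD = Statsd.new('localhost', 8125)

# 'sql.active_record' is the event ActiveRecord publishes for every query.
ActiveSupport::Notifications.subscribe('sql.active_record') do |_name, start, finish, _id, payload|
  # Skip schema queries, report everything else as a timing in milliseconds.
  next if payload[:name] == 'SCHEMA'
  STATSD.timing('db.query_time', (finish - start) * 1000)
end
```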

There's not much a client library needs to do to emit measurements of network requests. The Ruby ones could start by adding optional instrumentation based on AS::Notifications; keeping it optional ensures that ActiveSupport itself doesn't turn into a hard dependency. I'd love to see the notifications bit extracted into a separate library that's easier to integrate than pulling in the entire ActiveSupport ball of mud.
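On the emitting side, it doesn't take much. Here's a sketch of what a Ruby client library could do while keeping ActiveSupport optional; the event name and the client class are made up for illustration:

```ruby
class ExampleClient
  def get(key)
    instrument('request.example_client', key: key) do
      perform_request(:get, key)
    end
  end

  private

  # Only publish events when ActiveSupport::Notifications is loaded, so the
  # library works just fine without ActiveSupport being around.
  def instrument(name, payload = {}, &block)
    if defined?(ActiveSupport::Notifications)
      ActiveSupport::Notifications.instrument(name, payload, &block)
    else
      block.call
    end
  end

  def perform_request(method, key)
    # ... the actual network call would happen here
  end
end
```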

Node.js has EventEmitters, which are similar to AS::Notifications, and they lend themselves quite nicely to this purpose.

I've dabbled with this for riak-js, the Node.js library for Riak. There's an example that shows how to register listeners and collect metrics from the emitted events. The library itself just emits the events at the right spots and adds some timestamps so that event listeners can reconstruct the trail of a request.

It worked out pretty well, and it's easy to plug into a metrics library or to report the measurements directly to StatsD.

The thing that matters is that any library for a network service you write or maintain should have some sort of instrumentation built in. Your users and I will be forever grateful.

This goes both ways, too. Network servers need to be just as diligent in collecting and exposing data as the client libraries talking to them. Historically, though, a lot of servers already expose a lot of data, not always in a convenient format, but at least it's there.

Build every layer of your application and library with instrumentation in mind. Next time you have to tackle an issue in any part of the stack, you'll be glad you did.

Now go and measure everything!

Tags: monitoring

When I built the billing process for Travis CI's commercial offering, I decided to try out some new things to avoid callbacks in ActiveRecord models, including validations.

In 2010 I wrote about why callbacks and validations scattered about the persistence layer bother me. I recommend reading it to get the full background on this.

What I went for this time was a mix of a service layer that handles all the business logic and a layer of form objects that handle communication between the controller and the services, including validations.

The goal was to have simple Ruby objects to take care of these things. No special frameworks required. Inspiration in part stemmed from Django's form objects, though my implementation lacks the part that talks directly to the model, for instance to save data to the database. Quite intentionally so, as that part is up to the services layer.

One last thing I wanted to avoid was having to use attr_accessible in the persistence layer. In my view, that's not something persistence should be concerned with. It's a contract between the controller and the services it calls into, making sure parameters are properly narrowed down to the set required for any given operation.

Form Objects

For form objects, I looked at Scrivener, which was a great start. It's a very simple framework for form objects, and the code could barely be simpler, but because it implements its own set of validations, some are missing.

On top of that, it doesn't tie in that well with Rails' form handling, which requires some parts of ActiveModel to work properly. Scrivener is great when you integrate it with e.g. Sinatra and your own simple set of forms.

It's so simple that I decided to take its simple parts and merge them with ActiveModel's validations support. Thanks to Rails 3, that part has been extracted out of the ActiveRecord code and can be used for anything.

The beauty of form objects is that they allow you to specify different views on the same data. Every database record wrapped by ActiveRecord can have multiple representations depending on which data is required by a specific form.

Simple Forms with ActiveModel

Here's the base code for the forms. It doesn't have a name; it's just a snippet of code that's part of our Rails project:
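A minimal sketch of what such a base class can look like, built on ActiveModel; the class and method names here are assumptions, not the exact code from the project:

```ruby
require 'active_model'

class Form
  include ActiveModel::Validations
  include ActiveModel::Conversion
  extend ActiveModel::Naming

  # Declare the attributes a form accepts; anything else raises on assignment.
  def self.attribute(*names)
    attr_accessor(*names)
    attributes.concat(names)
  end

  def self.attributes
    @attributes ||= []
  end

  def initialize(params = {})
    params.each { |name, value| send("#{name}=", value) }
  end

  # Populate the form from an existing record, e.g. for edit forms.
  def self.from_model(model)
    new(attributes.each_with_object({}) { |name, attrs| attrs[name] = model.send(name) })
  end

  # Required by form_for to decide between create and update semantics.
  def persisted?
    false
  end
end
```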

It defines a few things that are required by Rails' form_for, but other than that it's straightforward. It can populate the form's attributes from the model handed in, which makes it suitable for reuse, for instance when editing an existing object or when validations fail on update.

Here's a sample form object:
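Something along these lines, with attributes made up for illustration:

```ruby
class BillingForm < Form
  attribute :name, :company, :address, :vat_id

  validates :name, :address, presence: true
  validates :vat_id, length: { maximum: 14 }, allow_blank: true
end
```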

It declares a few attributes and some validations. Thanks to ActiveModel, you can use anything its validations package provides in a form object.

By declaring its attributes, a form object provides a simple means of mass assignment protection, without requiring any sort of sanitization, without poisoning the model with attr_accessible, and without jumping through hoops in tests to create valid objects to work with.

If an attribute assigned to the form doesn't exist, the assignment will fail.
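With a base class like the sketch above, that failure is nothing more than a NoMethodError:

```ruby
BillingForm.new(name: 'ACME Corp', admin: true)
# => NoMethodError: undefined method `admin=' for #<BillingForm:0x007f...>
```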

In the controller...

The interaction with the controller is rather simple, no added complexity:
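Roughly like this; the controller, the service object and the parameter names are stand-ins for whatever the real application uses:

```ruby
class BillingsController < ApplicationController
  def update
    @billing_form = BillingForm.new(params[:billing])

    if @billing_form.valid? && UpdateBillingInfo.new(current_user, @billing_form).run
      redirect_to billing_path, notice: 'Billing details updated.'
    else
      render :edit
    end
  end
end
```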

I'm liking this approach a lot, and it's been in use for a few months. There'll be some refinements, but the simplicity of it all is what I find to be the best part of it.

It's all just plain Ruby objects with some additional behaviours. Add a simple service layer to this, and cluttered code in the ActiveRecord model is nicely split up into lots of smaller chunks that deal with very specific concerns.

Tags: rails, web

There are queues everywhere. This is the story of a few of them. The names of the queues are made up, but their story is real nonetheless.

First Act

The first queue, we'll call it Unicorn, handles requests for information, rendering the results in a beautiful markup language that's easy to read. It sits in front of the public library building, waiting for people to come in and ask for information.

Unicorn has a fixed number of peasants at its disposal to do work for it. When a request comes in, it sends one of them into the library to fetch the information. Peasants have access to a pretty big amount of data to choose from, but they have to be quick.

If one of them takes too long to fetch the information, Unicorn denies the request for information and strips the peasant of its duties on the spot, putting a new one in its place.

Unicorn is not very fail-safe, though. When it's swarmed by more requests than it can handle, the trade-off is that it simply stops delivering information in time.

It also isn't very good at noticing that every new peasant takes too long, and at stopping to accept requests accordingly. It just keeps accepting them, even if all of them time out.

Maybe it'd be smarter if Unicorn were more aware of an increased number of information requests not returning data in time, and slowed down processing or halted it altogether to figure out what the problem is?

Second Act

The second queue, we'll call it Octocat, handles requests from people to build something, say, a house, or a shack or a shelter, sometimes even a blue bikeshed.

To figure out what needs to be done, Octocat looks at the request's details, to determine what materials are required and which builder needs to be allocated to get the job done.

In some cases, Octocat sends a request to the warehouse to see if they have the required material in stock. Because the warehouse is a fully automated system and has no means of sending a messenger back to Octocat, Octocat calls a hotline to check the status. It listens to Rick Astley while it's on hold, waiting for the system to get back to it.

Sometimes, there's a problem in the warehouse and Octocat is stuck for a long time, and it can't process any other requests in the meantime. It doesn't want to miss the system getting back to it, so all its focus is on this one build request.

To speed things up a little, Octocat hired a second person. But now both of them are stuck in a waiting loop with the warehouse, unable to process more build requests. No matter how many people Octocat's company hires, at some point all of them will be stuck on hold, all of them listening to Rick Astley.

Wouldn't it be better if, while waiting for the warehouse, Octocat pressed # to cancel the request, hung up the phone, and processed another request in the meantime? It could just retry five minutes later to see if the system is able to process the request by then.

As time passes, it can just increase the waiting time between calls, as it gets less and less likely that the warehouse will be able to process the request this time around.
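Stripped of the allegory, that's plain exponential backoff. A minimal Ruby sketch; the warehouse call and the error class are hypothetical:

```ruby
def with_backoff(max_attempts: 5, base_delay: 300)
  attempts = 0
  begin
    yield
  rescue WarehouseUnavailable
    attempts += 1
    raise if attempts >= max_attempts
    # Wait 5, 10, 20, then 40 minutes before trying again.
    sleep(base_delay * 2**(attempts - 1))
    retry
  end
end

# check_warehouse_stock and WarehouseUnavailable stand in for the real call
# and its failure mode.
with_backoff { check_warehouse_stock(material) }
```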

Or it could put the current request at the end of the queue and come back to it later, trying to go through the process again at a later point in the day. Maybe the warehouse just has a problem finding information on this particular material, and other requests that don't require it will work out just fine.

If the warehouse is unable to process an increased number of requests, maybe Octocat should just cease calling altogether, giving the warehouse's employees time to clear things up and to process what has piled up in their inbox.

Third Act

The third queue processes long texts that were, for efficiency reasons, split up into smaller chunks. They're usually sent in Morse code for bandwidth efficiency, ready to be turned back into text.

We'll call it Logger. Logger has strict requirements for processing the chunks. He needs to put them back together very quickly; otherwise the readers on the other side will be unhappy, waiting for new text to appear. They're fast readers, so Logger has to make sure he delivers in a timely fashion.

The queue has to go through a lot of text, and it has to make sure that it processes it in the correct order. Otherwise the text wouldn't make sense anymore, with things taken out of context.

Logger relies on strict ordering of the messages it processes. It uses several minions to put the texts back together. To make sure ordering is properly applied, one minion always processes the chunks of a specific text.

Logger uses the text's title to figure out which minion is responsible. This has the advantage that Logger can call in more minions as more texts are coming in. As titles vary pretty wildly, Logger can just assume that work will be distributed efficiently enough. Of course there's still the chance that one minion gets a lot of longer texts, compared to the others, but overall, it should be fine.
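Outside the story, that's just hashing the partition key and taking it modulo the number of workers; a tiny Ruby sketch:

```ruby
require 'digest'

# Map a text's title to one of the minions; the same title always lands on
# the same minion, as long as the number of minions stays fixed.
def minion_for(title, minion_count)
  Digest::MD5.hexdigest(title).to_i(16) % minion_count
end

minion_for('A Tale of Two Queues', 8) # => an index between 0 and 7
```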

There is one downside to this system. Logger has to know the exact number of minions upfront. If one of them calls in sick, he has to find a replacement quickly, so that work on this minion's desk doesn't pile up.

If he can't find a replacement quickly, he has to reassign all the numbers and redistribute the work on their desks, which is a very dreadful process.

What if Logger could group minions into subdivisions, each controlled by a supervisor of its own, who in turn distributes the work across his team of minions?

With small groups, he can rely on the supervisors to increase and decrease the number of minions as needed. Logger would be oblivious to their shift schedules.

To split up the work more efficiently, Logger could also rely on the first letter of the title, splitting the alphabet into smaller sub-alphabets, e.g. A-E, F-M, and so on. He assigns the ranges directly to groups, and as groups come and go for their shifts, he can quickly reallocate ranges of letters to new groups. That still means work has to be redistributed, but Logger adds a group of messengers to the process who can shift stacks of texts quickly from one group to another.

If one group for some reason becomes unavailable, Logger could just adapt the way he schedules work and burden another team with its range. That might overall be a bit slower, but work would still be spread out evenly across the remaining groups.

Logger still has to make sure that all groups are on the same floor, though, so that the messengers don't have to climb stairs, which would lengthen the latency of redistributing the texts.

If Logger weren't bound to processing texts with very low latency, he could even consider placing groups in different buildings. If a fire breaks out in one of them, the other groups could still continue processing.

Fourth Act

The fourth queue is a builder; we'll call it Bob. Bob builds garages, houses and lots of other things.

Bob is a sloppy builder, though. He breaks things a lot, leaving windows broken, plaster with holes and floors uncleaned. Sometimes he even forgets to put a tile in, leaving an empty spot on the wall. Or he drops one of his tools on the floor, leaving a dent in the wood.

He tends not to be too careful and just assumes that everything he does turns out right. He pours concrete when it's raining.

Bob needs to get a grip and make sure his tasks are processed correctly. How could he do that?

Instead of ignoring mistakes, he could learn to accept them and take the appropriate measures to clean up after them. If he notices that he breaks things too often, he could slow down his work and make sure he gets it right. Or he could go out for a coffee and come back when he's a bit more confident that he'll get the job right.

If things are really bad, he can even start from scratch to make sure the end result is good. That might mean processing slows down, but Bob would be aware of his own failures. His mindset would change to making sure he gets the task right instead of leaving a mess everywhere he goes.

Bob's customers would be a lot happier if he did. It'd cost him more resources but he'd make a lot of people much happier, leaving every place he's worked on clean.

Queues, queues everywhere!

What do all the queues in this story have in common? They fail to back off exponentially when they encounter errors processing requests. They fail to make sure messages aren't lost when processing them fails. They fail to retry when delivering a message has failed. They fail to make sure their processing is idempotent. They assume that the resources required for processing messages are always available.
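Idempotency in particular is cheap to bolt on. A small sketch, assuming Redis as the bookkeeping store and messages that carry an id; both, along with the handle method, are assumptions for illustration:

```ruby
require 'redis'

REDIS = Redis.new

def process(message)
  # SETNX only succeeds the first time this message id is seen, so redelivering
  # the same message becomes a no-op instead of a duplicated side effect.
  return unless REDIS.setnx("processed:#{message[:id]}", Time.now.to_i)
  handle(message) # handle stands in for the real work
end
```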

There are queues everywhere, and they have a tendency to cause problems when used. We just assume they work all the time, and that we're able to process everything they throw at us in a timely fashion.

We do have the best of intentions, but they usually turn out wrong. When a queue starts to become the central backbone of a system, careful steps need to be taken to ensure the system can handle backpressure, increased failure rates, and the queue itself being unavailable.

Maybe we should start building our queues and the processes around them with the worst in mind and adjust our thinking accordingly? It's not queues, queues everywhere. It's failures, failures everywhere! Queues have a tendency to intensify failures by adding a less predictable element to our infrastructure. As Rick Branson put it:

"Keeping distributed systems running smoothly seems to be mostly about figuring out ways to not DDoS yourself."

A queue is a lot of fun until you're unable to keep up with what it's throwing at you, until your database's capacity doesn't match that of the queue, until you drop messages on the floor just because something broke in the backend, or until it floods your system with so many messages it can't process anything else in the meantime.

Maybe you already knew all of that, but I sure as heck had to learn all of these lessons the hard way, in a very short amount of time, within a matter of weeks, to be exact.

We're still working on picking up the pieces and cleaning up. There'll be fewer queues in the future, just as there will be a lot more of them. More on this soon!

The queue is dead, long live the queue!

Tags: messaging, queues