The simplicity of consistent hashing is pretty mind-blowing. Here you have a number of nodes in a cluster of databases, or in a cluster of web caches. How do you figure out where the data for a particular key goes in that cluster?

You apply a hash function to the key. That's it? Yeah, that's the whole deal of consistent hashing. It's in the name, isn't it?

The same key will always return the same hash code (hopefully), so once you've figured out how you spread out a range of keys across the nodes available, you can always find the right node by looking at the hash code for a key.

It's pretty ingenious, if you ask me. It was cooked up in the lab chambers at Akamai, back in the late nineties. You should go and read the original paper right after we're done here.

Consistent hashing solves the problem people desperately tried to apply sharding to pretty nicely and elegantly. I'm not going to bore you with the details on how exactly consistent hashing works. Mike Perham does a pretty good job at that already, and there are many more blog posts explaining implementations and theory behind it. Also, that little upcoming book of mine has a full-length explanation too. Here's a graphic showing the basic idea of consistent hashing, courtesy of Basho.

Consistent Hashing

Instead I want to look at the practical implications of consistent hashing in distributed databases and cache farms.

Easier to Avoid Hotspots

When you put data on nodes based on a random result, which is what the hash function calculates, a value that's a lot more random than the key it's based on, it's easier to avoid hotspots. Why?

Assume a key based on an increasing value, or a simple range of keys, based on the hour of the day, like 2011-12-11-13. You add new hours and therefore new data as time passes, and keys are stored based on the range they fall in. For example, the keys 2011-12-11-18 until 2011-12-11-23 are stored on the same node, with the rest of the keys stored on other nodes, just because the ranges or the partitioning scheme happen to be set up this way.

For a consumer-facing site, the evening hours are usually the busiest time of the day. They create more data, more writes, and possibly more reads too. For the hours between 18:00 and 23:00, all the load goes to the single node that carries all the relevant data.

But when you determine the location in the cluster based solely on the hash of the key, chances are much higher that two keys lexicographically close to each other end up on different nodes. Thus, the load is shared more evenly. The disadvantage is that you lose the order of keys.

There are partitioning schemes that can work around this, even with a range-based key location. HBase (and Google's BigTable, for that matter) stores ranges of data in separate tablets. As tablets grow beyond their maximum size, they're split up and the remaining parts re-distributed. The advantage of this is that the original range is kept, even as you scale up.

Consistent Hashing Enables Partitioning

When you have a consistent hash, everything looks like a partition. The idea is simple. Consistent hashing forms a keyspace, which is also called continuum, as presented in the illustration. As a node joins the cluster, it picks a random number, and that number determines the data it's going to be responsible for. Everything between this number and one that's next in the ring and that has been picked by a different node previously, is now belong to this node. The resulting partition could be of any size theoretically. It could be a tiny slice, or a large one.

First implementations of consistent hashing still had the problem that a node picking a random range of keys resulted in one node potentially carrying a larger keyspace than others, therefore still creating hotspots.

But the improvement was as simple as it was ingenious. A hash function has a maximum result set, a SHA-1 function has a bit space of 2^160. You do the math. Instead of picking a random key, a node could choose from a fixed set of partitions, like equally size pizza slices. But instead of picking the one with the most cheese on, everyone gets an equally large slice. The number of partitions is picked up front, and practically never changes over the lifetime of the cluster.

For good measure, here's a picture of a sliced pizza.

Consistent Pizza

Partitioning Makes Scaling Up and Down More Predictable

With a fixed number of partitions of the same size, adding new nodes becomes even less of a burden than with just consistent hashing. With the former, it was still unpredictable how much data had to be moved around to transfer ownership of all the data in the range of the new node. One thing's for sure, it already involved a lot less work than previous methods of sharding data.

With partitioning, a node simply claims partitions, and either explicitly or implicitly asks the current owners to hand off the data to them. As a partition can only contain so many keys, and randomness ensures a somewhat even spread of data, there's a lot less unpredictability about the data that needs to be transferred.

If that partitions just so happens to carry the largest object by far in you whole cluster, that's something even consistent hashing can't solve. It only cares for keys.

Going back to HBase, it cares for keys and the size of the tablet the data is stored in, as it breaks up tablets once they reach a threshold. Breaking up and reassigning a tablet requires coordination, which is not an easy thing to do in a distributed system.

Consistent Hashing and Partitioning Enable Replication

Consistent hashing made one thing a lot easier: replicating data across several nodes. The primary means for replication is to ensure data survives single or multiple machine failures. The more replicas you have, the more likely is your data to survive one or more hardware crashes. With three replicas, you can afford to lose two nodes and still serve the data.

With a fixed set of partitions, a new node can just pick the ones it's responsible for, and another stack of partitions it's going to be a replica for. When you really think about it, both processes are actually the same. The beauty of consistent hashing is that there doesn't need to be master for any piece of data. Every node is simply a replica of a number of partitions.

But replication has another purpose besides ensuring data availability.

Replication Reduces Hotspots (Even More!!!)

Having more than one replica of a single piece of data means you can spread out the request load even more. With three replicas of that data, residing on three different nodes, you can now load-balance between them. Neat!

With that, consistent hashing enables a pretty linear increase in capacity as you add more nodes to a cluster.

Consistent Hashing Enables Scalability and Availability

Consistent hashing allows you to scale up and down easier, and makes ensuring availability easier. Easier ways to replicate data allows for better availability and fault-tolerance. Easier ways to reshuffle data when nodes come and go means simpler ways to scale up and down.

It's an ingenious invention, one that has had a great impact. Look at the likes of Memcached, Amazon's Dynamo, Cassandra, or Riak. They all adopted consistent hashing in one way or the other to ensure scalability and availability.

Want to know more about distributed databases in general and Riak in particular? You'll like the Riak Handbook, a hands-on guide full of practical examples and advice on how to use Riak to ensure scalability and availability for your data.

In the next installment we're looking at the consequences and implications of losing key ordering in a Riak cluster.

Tags: nosql, riak

A couple of months ago I set out to write a book on NoSQL. It's about time I give an update on how it's been going, and when you can expect a book in your hands, or rather, on your screen.

After an initial burst of writing, I took somewhat of a break, so please excuse the delay in general. I spent the last weeks writing (a lot), and I'm currently trying to polish and finish up the existing content so that I can throw something out there for the world to peek at. Currently, the book covers MongoDB, Riak and Redis in varying detail, and I'm working to finish up loose ends to get into a good shape, before I'm starting work on more chapters.

A lot of people have asked me about pricing, distribution model, updates, and so on, so I'm following up with an FAQ section. In general, I'm as keen to get something out there as people have expressed their interest in reading it, believe me.

Turns out though: writing a book is hard. It takes a lot of work, a lot of discipline and creativity to come up with the right words and code examples. I'm not complaining, it's just something you don't realize from writing even slightly longer blog posts. It's still an incredible learning experience too, because I (and you) get to play with pretty much all of the features the databases covered have to offer.

So bear with me, I'm on it.

The book is not built around the idea that a big application is to be built with each database. I'm not a fan of that approach myself, as it makes it too easy to lose track of details. It's full of small examples, focused on specific features.

How many pages does it have?

As the book is still growing, and I'm still playing with layouting details, I can't give you an exact number, but the final book is probably going to have more than 200 pages.

What's the pricing going to be?

I haven't decided yet. It's not going to be in the single digits pricing range, and as the book is pretty dense with content, I don't want to undercharge. I'll keep you posted.

Will there be early access to the book?

Yes, there will be. You'll be able to buy the beta of the book for a reduced price, and follow the updates. Maybe even the commits on GitHub? I don't know. Let me know if that's something you're interested in.

Do you have some samples I can peek at?

Not yet. Layout is still far from final, but I'll throw something out as soon as an early access will be available.

What databases are being covered?

To reach my goal for a final release, I'm covering Redis, MongoDB, CouchDB, Riak, and Cassandra, all in varying detail. For some it makes more sense to go deeper than for others.

Are future updates included?

Yes, as content gets added, typos get fixed, and new databases pop up, I'll send updates to everyone buying the book. The updates are free. Consider buying the book a subscription for more chapters on other databases.

Are you extending the book with more databases over time?

Yes, I have an insatiable thirst to play with more databases, and I don't want to deprive you of experiencing that too.

Are you covering database SomeDB? I hear it's the next big thing!

For now, the list of databases is fixed. What's coming after that, on the other hand, is not. I'm open to suggestions, but I'd prefer some non-exotic over a very domain-specific database you wrote for a recent project. I'll set up some sort of voting when the final release is done.

What formats will be available?

I'm currently working on PDF and ePub, with Kindle to follow. Gotta have my priorities. A good-looking and readable PDF is my first priority, an ePub after that. Buying the book includes access to all formats.

Is there going to be a print edition?

Print is not a priority right now.

What are you using to write and generate the book?

The book is written in Markdown (I hate LaTeX), converted to HTML using Redcarpet, using Albino for syntax-highlighting, and converted to PDF using the awesome Prince XML library, I hope to eventually use DocRaptor to create the final result, as a Prince license is slightly out of budget, but DocRaptor is pretty affordable.

Where can I get updates on progress?

Mostly be following me or the handbook itself on Twitter.

Tags: nosql, books

This post is not about devops, it's not about lean startups, it's not about web scale, it's not about the cloud, and it's not about continuous deployment. This post is about you, the developer who's main purpose in life has always been to build great web applications. In a pretty traditional world you write code, you write tests for it, you deploy, and you go home. Until now.

To tell you the truth, that world has never existed for me. In all of my developer life I had to deal with all aspects of deployment, not just putting build artifacts on servers, but dealing with network outages, faulty network drivers, crashing hard disks, sudden latency spikes, analyzing errors coming from those pesky crawling bots on that evil internet of yours. I take a lot of this for granted, but working in infrastructure and closely with developers trying to get applications and infrastructure up and running on EC2 has taught me some valuable lessons to assume the worst. Not because developers are stupid, but because they like to focus on code, not infrastructure.

But here's the deal: your code and all your full-stack and unit tests is worth squat if they're not running out there on some server or infrastructure stack like Google Apps or Heroku. Without running somewhere in production, your code doesn't generate any business value, it's just a big pile of ASCII or UTF-8 characters that cost a lot of money to create, but didn't offer any return of investment yet.

Love Thy Infrastructure

Operations isn't hard, but necessary. You don't need to know everything about operations to become fluent in it, you just have to know enough to start and know how to use Google.

This is my collective dump from the last years of working both as a developer and that guy who does deployments and manages servers too. Most are lessons I learned the hard way, others just seemed logical to me when I learned about them the first time around.

Between you and me, having this skill set at hand makes you a much more valuable developer. Being able to analyze any problem in production and at least having a basic skill set to deal with it makes you a great asset for companies and clients to hold on to. Thought you should know, but I digress.

The most important lesson I can tell you right up front: love your infrastructure, it's the muscles and bones of your application, whereas your code running on it is nothing more than the skin.

Without Infrastructure, No-one Will Use Your Application

Big surprise. For users to be able to enjoy your precious code, it needs to run somewhere. It needs to run on some sort of infrastructure, and it doesn't matter if you're managing it, or if you're paying another company to take care of it for you.

Everything Is Infrastructure

Every little piece of software and hardware that's necessary to make your application available to users is infrastructure. The application server serving and executing your code, the web server, your email delivery provider, the service that tracks errors and application metrics, the servers or virtual machines your services are running on.

Every little piece of it can break at any time, can stall at any time. The more pieces you have in your application puzzle, the more breaking points you have. And everything that can break, will break. Usually not all at once, but most certainly when it's the least expected, or just when you really need your application to be available.

On Day One, You Build The Hardware

Everything starts with a bare metal server, even that cloud you've heard so much about. Knowing your way around everything that's related to setting up a full rack of servers on a single day, including network storage a fully configured switch with two virtual LANs and a master-slave database setup using a RAID 10 a bunch of SAS drives might not be something you need every day, but it sure comes in handy.

The good news is the internet is here for you. You don't need to know everything about every piece of hardware out there, but you should be able to investigate strengths and weaknesses, when an SSD is an appropriate tool to use, and when SAS drives will kick butt.

Learn to distinguish the different levels of RAID, why having an additional file system buffer on top of a RAID that doesn't have a backup battery for its own, internal write buffer is a bad idea. That's a pretty good start, and will make decisions much easier.

The System

Do you know what swap space is? Do you know what happens when it's used by the operating system, and why it's actually a terrible thing and gives a false sense of security? Do you know what happens when all available memory is exhausted?

Let me tell you:

  • When all available memory is allocated, the operating system starts swapping out memory pages to swap space, which is located on disk, a very slow disk, slow like a snail compared to fast memory.
  • When lots of stuff is written to and read from swap space on disk, I/O wait goes through the roof, and processes start to pile up waiting for their memory pages to be swapped out to or read from disk, which in turn increases load average, and almost brings the system to a screeching halt, but only almost.
  • Swap is terrible because it gives you a false sense of having additional resources beyond the available memory, while what it really does is slowing down performance in a way that makes it almost impossible for you to log into the affected system and properly analyze the problem.

This is basically operations level on the operating system level. It's not much you need to know here, but in my opinion it's essential. Learn about the most important aspects of a Unix or Linux system. You don't need to know everything, you don't need to know the specifics of Linux' process scheduler or the underlying datastructure used for virtual memory. But the more you know, the more informed your decisions will be when the rubber hits the road.

And yes, I think enabling swap on servers is a terrible idea. Let processes crash when they don't have any resources left. That at least will allow you to analyze and fix.

Production Problems Don't Solve Themselves

Granted, sometimes they do, but you shouldn't be happy about that. You should be willing to dig into whatever data you have posthumous to find whatever went wrong, whatever caused a strange latency spike in database queries, or caused an unusually high amount of errors in your application.

When a problem doesn't solve itself though, which is certainly the common case, someone needs to solve it. Someone needs to look at all the available data to find out what's wrong with your application, your servers or the network.

This person is not the unlucky operations guy who's currently on call, because let's face it, smaller startups just don't have an operations team.

That person is you.

Solve Deployment First

When the first line of code is written, and the first piece of your application is ready to be pushed on a server for someone to see, solve the problem of deployment. This has never been easier than it is today, and being able to push incremental updates from then on speeds up development and the customer feedback cycle considerably.

As soon as you can, build that Capfile, Ant file, or whatever build and deployment tools you're using, set up servers, or set up your project on an infrastructure platform like Scalarium, Heroku, Google Apps, or dotCloud. The sooner you solve this problem, the easier it will be to finally push that code of yours into production for everyone to use. I consider application deployment a solved problem. There's no reason why you shouldn't have it in place even in the earliest stages of a project.

The more complex a project gets over even just its initial lifecycle the easier it will be to add more functionality to an existing deployment setup instead of having to build everything from scratch.

Automate, Automate, Automate

Everything you do by hand, you should only be doing once. If there's any chance that particular action will be repeated at some point, invest the time to turn it into a script. It doesn't matter if it's a shell, a Ruby, a Perl, or a Python script. Just make it reusable. Typing things into a shell manually, or updating configuration files with an editor on every single server is tedious work, work that you shouldn't be doing manually more than once.

When you automate something once, it not only greatly increases execution speed the second and third time around, it reduces the chance of failure, of missing that one important step.

There's an abundance of tools available to automate infrastructure, hand-written script are only the simplest part of it. Once you go beyond managing just one or two servers, tools like Chef, Puppet and MCollective come in very handy to automate everything from setting up bare servers to pushing out configuration changes from a single point, to deploying code. Everything should be properly automated with some tool. Ideally you only use one, but looking at Chef and Puppet, both have their strength and weaknesses.

Changes in Chef aren't instant, unless you use the command line tool knife, which assumes SSH access to all servers you're managing. The bigger your organizations the less chance you'll have to be able to access all machines via SSH. Instant tools like mCollective that work based on a push agent system, are much better for these instant kinds of activities.

It's not important what kind of tool you use to automate, what's important is that you do it in the first place.

By the way, if your operations team restricts SSH access to machines for developers, fix that. Developers need to be able to analyze and fix incidents just like the operations folks do. There's no valid point in denying SSH access to developers. Period.

Introduce New Infrastructure Carefully

Whenever you add a new component, a new feature to an application, you add a new point of failure. Be it a background task scheduler, a messaging queue, an image processing chain or asynchronous mail delivery, it can and it will fail.

It's always tempting to add shiny new tools to the mix. Developers are prone to trying out new tools even though they've not yet fully proven themselves in production, or experience running them is still sparse. It's a good thing in one way, because without people daring to use new tools everyone else won't be able to learn from their experiences (you do share those experiences, do you?).

But on the other hand, you'll live the curse of the early adopter. Instead of benefiting from existing knowledge, you're the one bringing the knowledge into existence. You'll experience all the bugs that are still lurking in the darker corners of that shiny new database or message queue system. You'll spend time developing tools and libraries to work with the new stuff, time you could just as well be spending working on generating new business value by using existing tools that do the job similarly well. If you do decide for a new tool, be prepared to degrade back to other tools in the case of failure.

No matter if old or new, adding more infrastructure always has the potential for more things to break. Whenever you add something, be sure to know what you're getting yourself into, be sure to have fallback procedures in place, be sure everyone knows about the risks and the benefits. When something that's still pretty new breaks, you're usually on your own.

Make Activities Repeatable

Every activity in your application that causes other, dependent activities to be executed, needs to be repeatable, either by the user, or through some sort of administrative interface, or automatically if feasible. Think user confirmation emails, generating monthly reports, background tasks like processing uploads. Every activity that's out of the normal cycle of fetching records from a datasource and updating them is bound to fail. Heck, even that cycle will fail at some point due to some odd error that only comes up every once in a blue moon.

When an activity is repeatable, it's much easier to deal with outages of single components. When it comes back up, simply re-execute the tasks that got stuck.

This, however, requires one important thing: every activity must be idempotent. It must have the same outcome no matters how often it's being run. It must know what steps were already taken before it broke the last time around. Whatever's already been done, it shouldn't be done again. It should just pick up where it left off.

Yes, this requires a lot of work and care for state in your application. But trust me, it'll be worth it.

Use Feature Flips

New features can cause joy and more headaches. Flickr was one of the first to add something called feature flips, a simple way to enable and disable features for all or only specific users. This way you can throw new features onto your production systems without accidentally enabling it for all users, you can simply allow a small set of users or just your customer to use it and to play with it.

What's more important though, when a feature breaks in production for some reason, you can simply switch it off, disabling traffic on the systems involved, allowing you to take a breether and analyze the problem.

Feature flips come in many flavors, the simplest approach is to just use a configuration file to enable or disable them. Other approaches use a centralized database like Redis for that purpose, which has an added benefit for other parts of your application, but also adds new infrastructure components and therefore, more complexity and more points of failure.

Fail And Degrade Gracefully

What happens when you unplug your database server? Does your application throw in the towel by showing a 500 error, or is it able to deal with the situation and show a temporary page informing the user of what's wrong? You should try it and see what happens.

Whenever something non-critical breaks, your application should be able to deal with it without anything else breaking. This sounds like an impossible thing to do, but it's really not. It just requires care, care your standard unit tests won't be able to deliver, and thinking about where you want a breakage to leak to the user, or where you just ignore it, picking up work again as soon as the failed component becomes available again.

Failing gracefully can mean a lot of things, there's things that directly affect user experience, a database failure comes to mind, and things that the user will notice only indirectly, e.g. through delays in delivering emails or fetching data from an external service like Twitter, RSS feeds and so on.

When a major component in your application fails, a user will most likely be unable to use your application at all. When your database latency increases manifold, you have two options. Try to squeeze through as much as you can, accepting long waits on your user's side, or you can let him know that it's currently impossible to serve him in an acceptable time frame, and that you're actively working on fixing or improving the situations. Which you should, either way.

Delays in external services or asynchronous tasks are much harder for a user to notice. If fetching data from an external source, like an API, directly affects your site's latency, there's your problem.

Noticing problems in external services requires two things: monitoring and metrics. Only by tracking queue sizes, latency for calls to external services, mail queues and all things related to asynchronous tasks will you be able to tell when your users are indirectly affected by a problem in your infrastructure.

After all, knowing is half the battle.

Monitoring Sucks, You Need It Anyway

I've written in abundance on the virtues of monitoring, metrics and alerting. I can't say it enough how important having a proper monitoring and metrics gathering system in place is. It should be by your side from day one of any testing deployment.

Set up alerts for thresholds that seem like a reasonable place to start to you. Don't ignore alerting notifications, once you get into that habit, you'll miss that one important notification that's real. Instead, learn about your system and its thresholds over time.

You'll never get alerting and thresholds right the first time, you'll adapt over time, identifying false negatives and false positives, but if you don't have a system in place at all, you'll never know what hit your application or your servers.

If you're not using a tool to gather metrics like Munin, Ganglia, New Relic, or collectd, you'll be in for a big surprise once your application becomes unresponsive for some reason. You'll simply never find out what the reason was in the first place.

While Munin has basic built-in alerting capabilities, chances are you'll add something like Nagios or PagerDuty to the mix for alerting.

Most monitoring tools suck, you'll need them anyway.

Supervise Everything

Any process that's required to be running at any time needs to be supervised. When something crashes be sure there's an automated procedure in place that will either restart the process or notify you when it can't do so, degrading gracefully. Monit, God, bluepill, supervisord, RUnit, the number of tools available to you is endless.

Micromanaging people is wrong, but processes need that extra set of eyes on them at all times.

Don't Guess, Measure!

Whatever directly affects your users' experience affects your business. When your site is slow, users will shy away from using it, from generating revenue and therefore (usually) profit.

Whenever a user has to wait for anything, they're not willing to wait forever. If an uploaded video takes hours to process, they'll go to the next video hosting site. When a confirmation email takes hours to be delivered, they'll check out your competitor, taking the money with them.

How do you know that users have to wait? Simple, you track how long things in your application take, how many tasks are currently stuck in your processing queue, how long it took to process an upload. You stick metrics on anything that's directly or indirectly responsible for generating business value.

Without having a proper system to collect metrics in place, you'll be blind. You'll have no idea what's going inside your application at any given time. Since Coda Hale's talk "Metrics Everywhere" at CodeConf and the release of his metrics library for Scala, an abundance of libraries for different languages has popped up left and right. They make it easy to include timers, counters, and other types of metrics into your application, allowing you to instrument code where you see fit. Independently, Twitter has lead the way by releasing Ostrich, their own Scala library to collect metrics. The tools are here for you. Use them.

The most important metrics should be easily accessible on some sort of dashboard. You don't need a big fancy screen in your office right away, a canonical place, e.g. a website including the most important graphs and numbers, where everyone can go and see what's going on with a glance is a good start. Once you have that in place, the next step towards a company-visible dashboard is simple buying a big-ass screen.

All metrics should be collected in a tool like Ganglia, Munin or something else. These tools make analysis of historical data easy, they allow you to make predictions or correlate the metrics gathered in your applications to other statistics like CPU, memory usage, I/O waits, and so on.

The importance of monitoring and metrics cannot be stressed enough. There's no reason why you shouldn't have it in place. Setting up Munin is easy enough, setting up collection using an external service like New Relic or Scout is usually even easier.

Use Timeouts Everywhere

Latency is your biggest enemy in any networked environment. It creeps up on you like the shadow of the setting sun. There's a whole bunch of reasons why, e.g. database queries will suddenly see a spike in execution time, or external services suddenly take forever to answer even the simplest requests.

If your code doesn't have appropriate timeouts, requests will pile up and maybe never return, exhausting available resources (think connection pools) faster than Vettel does a round in Monte Carlo.

Amazon for example has internal contracts. Going to their home page involves dozens of requests to internal services. If any one of them doesn't respond in a timely manner, say 300 ms, the application serving the page will render a static piece snippet instead, but thereby decreasing the chance of selling something, directly affecting business value.

You need to treat every call to an external resource as something that can take forever, something that potentially blocks an application server process forever. When an application server process or thread is blocked, it can't serve any other client. When all processes and threads lock up waiting for a resource, your website is dead.

Timeouts make sure that resources are freed and made available again after a grace period. When a database query takes longer than usual, not only does your application need to know how to handle that case, but your database needs to. If your application has a timeout, but your database will happily keep sorting those millions of records in a temp file on disk, you didn't gain a lot. If two dependent resources are within your hands, both need to be aware of contracts and timeouts, both need to properly free resources when the request couldn't be served in a timely manner.

Use timeouts everywhere, but know how to handle them when they occur, know what to tell the user when his request didn't return quickly enough. There is no golden rule what to do with a timeout, it depends not just on your application, but on the specific use case.

Don't Rely on SLAs

The best service fails at some point. It will fail in the most epic way possible, not allowing any user to do anything. This doesn't have to be your service. It can be any service you directly or indirectly rely on.

Say, your code runs on Heroku. Heroku's infrastructure runs on Amazon's EC2. Therefore Heroku is prone to problems with EC2. If a provider like Heroku tells you they have a service level agreement in place that guarantees a minimum amount of availability per month or per year, that's worth squat to you, because they in turn rely on other external services, that may or may not offer different SLAs. This is not specific to Heroku, it's just an obvious example. Just because you outsourced infrastructure doesn't mean you're allowed to stop caring.

If your application runs directly on EC2, you're bound by the same problem. The same is true for any infrastructure provider you rely on, even a big hosting company where your own server hardware is colocated.

They all have some sort of SLA in place, and they all will screw you over with the terms of said SLA. When stuff breaks on their end, that SLA is not worth a single dime to you, even when you were promised to get your money back. It will never make up for lost revenue, for lost users and decreased uptime on your end. You might as well stop thinking about them in the first place.

What matters is what procedures any provider you rely on has in place in case of a failure. The important thing for you as one of their users is to not be left standing in the rain when your hosting world is coming close to an end. A communicative provider is more valuable than one that guarantees an impossible amount of availability. Things will break, not just for you. SLAs give you that false sense of security, the sense that you can blame an outage on someone else.

For more on this topic, Ben Black has written a two part series aptly named "Service Level Disagreements".

Know Your Database

You should know what happens inside your database when you execute any query. Period. You should know where to look when a query takes too long, and you should know what commands to use to analyze why it takes too long.

Do you know how an index is built? How and why your database picks one index over another? Why selecting a random record based on the wrong criteria will kill your database?

You should know these things. You should read "High Performance MySQL", or "Oracle Internals", or "PostgreSQL 9.0 High Performance". Sorry, I didn't mean to say you should, I meant you must read them.

Love Your Log Files

In case of an emergency, a good set of log files will mean the world to you. This doesn't just include the standard set of log files available on a Unix system. It includes your application and all services involved too.

Your application should log important events, anything that may seem useful to analyze an incident. Again, you'll never get this right the first time around, you'll never know up front all the details you may be interested in later. Adapt and improve, add more logging as needed. It should allow you to tune the log verbosity at runtime, either by a using a feature switch or by accepting a Unix signal.

Separate request logging from application logging. Data on HTTP requests is just as important as application logs, but it's easier if you can sift through them independently, they're also a lot easier to aggregate for services like Syslog or Loggly when they're on their own.

For you Rails developers out there: using Rails.logger is not an acceptable logging mechanism. All your logged statements will be intermingled with Rails next to unusable request logging output. Use a separate log file for anything that's important to your application.

Just like you should stick metrics on all things that are important to your business, log additional information when things get out of hand. Correlating log files with metrics gathered on your servers and in your application is an incredibly powerful way of analyzing incidents, even long after they occurred.

Learn the Unix Command Line

In case of a failure, the command line will be your best friend. Knowing the right tools to quickly sift through a set of log files, being able to find and set certain kernel parameters to adjust TCP settings, knowing how get the most important system statistics with just a few commands, and knowing where to look for a specific service's configuration. All these things are incredibly valuable in case of a failure.

Knowing your way around a Unix or Linux system, even with just a basic toolset is something that will make your life much easier, not just in operations, but also as a developer. The more tools you have at your disposal, the easier it will be for you to automate tasks, to not be scared of operations in general.

In times of an emergency, you can't afford to argue that your favorite editor is not installed on a system, you use what's available.

At Scale, Everything Breaks

Working at large scale is nothing anyone should strive for, it's a terrible burden, but an incredibly fascinating one. The need for scalability evolves over time, it's nothing you can easily predict or assume without knowing all the details, parameters and the future. Out of all three, at least one is 100% guess work.

The larger your infrastructure setup gets, the more things will break. The more servers you have, the larger the number of servers being not available at any time. That's nothing you need to respect right from the get go, it's something to keep in mind.

No service that's working at a larger scale was originally designed for it. The code and infrastructure were adapted, the services grew over time, and they failed a lot. Something to think about when you reach for that awesome scalable database before even having any running code.

Embrace Failure

The bottom line of everything is, stuff breaks, everything breaks at different scale. Embrace breakage and failure, it will help you learn and improve your knowledge and skill set over time. Analyze incidents using the data available to you, fix the problem, learn your lesson, and move on.

Don't embrace one thing though: never let a failure happen again if you know what caused it the first time around.

Web operations is not solely related to servers and installing software packages. Web operations involves everything required to keep an application available, and your code needs to play along.

Required Reading

As 101s go, this is a short overview of what I think makes up for a good starter set of operations skills. If you don't believe or trust me (which is a good thing), here's a list of further reading for you. By now, I consider most of these required reading even. The list isn't long, mind you. The truth as of today is still that you learn the most out of personal experience on production systems. Both require one basic skill though: you have to want to learn.

Shameless Plug

If you liked this article, you may enjoy the book I'm currently working on: "The NoSQL Handbook".

Tags: operations