I gave a talk about risk and safety in engineering at the DevOps user group in Frankfurt recently.

I talked about practical drift, normalization of deviance and the general ideas of risk and how complex systems make it almost impossible to predict all possible outcomes for a system running in production. The idea of the unknown unknowns (thanks, Donnie Rumsfeld!) and Black Swans (courtesy of Nassim Taleb) also came up.

A black swan, or an unknown unknown, is an event that is not just unlikely; no one has ever seen or considered it before. It's an accumulation of events so unlikely that their coming together is beyond the risks anyone would normally consider. 9/11 comes to mind.

I had a chat with one attendee, who suggested that, before you build a system, you look at its properties and look at the possible influences of each one, considering the possible risks of things, going further and further back the causal chain of possible events that could lead up to an incident in the system to be designed and built.

As engineers, this seems like a plausible idea to us. You sit down, you look at your system from all known angles, you measure things, you apply some math here and there.

We like to think of engineering as a predictable practice. Once something's built with the right measurements, with the right tools and with a touch of craftsmanship, it'll last.

As a German, this idea certainly appeals to me. If there's anything we enjoy doing, it's building machines, or parts for machines, or building machines to build parts of other machines.

The Boeing wing test

Take this picture, for instance. It's a magnificent sight, and it's a testimony to predictive engineering. It's the infamous wing test for the Boeing 787 Dreamliner.

For the test, the plane's wings are attached to a pretty impressive contraption. They're slowly pulled upwards to find the breaking point.

This test is intended to go way beyond the circumstances commonly found during normal flight operations, up to 150% of normal levels.

There's a video from a similar stress test for the Boeing 767 too. The wings break spectacularly at 154% of normal levels.

The engineers are cheering. The wings were built to withstand this kind of pressure, so it's only understandable, especially for us fellow engineers, that these guys are beyond happy to see their predictions realized in this test.

Ideally, you will never see wings being bent to these extremes.

Wings are but one piece in the big, complex system that is a modern plane.

A plane operates in an environment full of uncertainty. While we like to think we can predict the weather pretty well, its behavior cannot be controlled and can change in unpredicted, maybe even unprecedented ways. It is a system in itself.

This is where we come back to the idea that risk in complex systems can be assessed upfront, when designing, before building it.

A plane, on its own already a complex system, interacts with more complex systems. The humans steering it are one of them, the organization the pilots participate in is another. The weather is yet another complex system.

The interaction points of all these systems are almost boundless.

Engineers can try to predict all the possible states of a plane's operating environment. After all, someone is programming these states and the plane's responses to them.

But they can't predict how a human operator will interpret whatever information the system presents to them. Operating manuals are a common means of giving as much insight as possible, but they're bound to what was known to the designers of the system before it was put into production use.

This is where socio-technical systems come into play. Technology rarely stands on its own; it interacts with human operators to get the job done. Together, they form a system that's shaped and driven both by the technology and by the social interactions in the organization operating it.

Complex systems exist on the micro and the macro level

A plane's wing is bound to wind, jet stream, speed, the material used to build it, the flaps to adjust the plane's altitude. But it doesn't end there. It's bound to the care that went into building it, designing it, attaching it to the plane, and maintaining it.

With these examples alone, the wing is part of several feedback loops. In "Thinking in Systems", a feedback loop is how a system responds to changing conditions. The wing of a plane can respond to increasing pressure from upward winds by simply bending. But as we've seen above, it can only bend so far until it snaps.

But the wing is able to balance the increasing pressure nonetheless, helping to reduce the impact of increasing wind conditions on the plane.

The wing then interacts with the plane, with its wheels, with its speed, its jet engines, its weight. The plane interacts with the pilots, it interacts with the wind, with the overall weather, with ever-changing conditions around it.

The wing is therefore resilient. As per "Thinking in Systems":

Resilience is a measure of a system's ability to survive and persist within a variable environment. The opposite of resilience is brittleness and rigidity.

A wing is a complex system on the macro level, and it is constructed of much smaller complex systems at the micro level. It's part of even bigger complex systems (the plane), which are bound to even more complex systems (the pilot, weather conditions, the jet stream, volcanic ash).

These systems interact with each other through an endless number of entry and exit points. One system feeds another system.

Quoting from "Thinking in Systems":

Systems happen all at once. They are connected not just in one direction, but in many directions simultaneously.

"Thinking in Systems" talks about stock and flow. A stock is a system's capacity to fulfill its purpose. Flow is an input and output that the system is able to respond to.

Stock is the wing itself, the material it's made of, whereas flow is the set of inputs and outputs that affect the stock. For instance, one type of input for a wing is the speed of the air flowing around it; another is the pressure put on it by the jet stream. The wing responds in different ways to each possible input, at least insofar as it was knowingly constructed to handle them.

If pressure goes up, the wing bends. If the flow of air is fast enough, the wing generates lift, keeping the plane in the air.
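To make the stock-and-flow vocabulary concrete, here's a toy simulation of a wing as a stock with a balancing feedback loop. Everything here, the numbers, the linear elasticity, the snapping threshold, is invented for illustration; this is systems-thinking arithmetic, not aerodynamics:

```python
# Toy stock-and-flow model in the spirit of "Thinking in Systems".
# The "stock" is the wing's current deflection; the "flows" are the
# pressure pushing it upwards and the elastic force pulling it back.
# All numbers and the linear response are invented for illustration.

def step(deflection, pressure, stiffness=0.5, limit=10.0):
    """Advance the wing's state by one time step."""
    inflow = pressure                  # the load bends the wing further
    outflow = stiffness * deflection   # elasticity pushes back (a balancing loop)
    deflection += inflow - outflow
    if deflection > limit:
        raise RuntimeError("wing snapped")  # resilience has its bounds
    return deflection

deflection = 0.0
for _ in range(20):
    deflection = step(deflection, pressure=2.0)
# the balancing loop settles the stock near pressure / stiffness = 4.0
```

The interesting property is that the balancing loop absorbs a constant pressure by settling at a new equilibrium, while too much pressure pushes the stock past its limit — the same behaviour the wing test above demonstrates at full scale.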

Once you add more systems surrounding it, you increase the number of possible inputs and outputs. Some the wing knows how to respond to, others it may not.

As systems become complex, their behavior can become surprising.

The beauty of complex systems, and this is a tough one for engineers to accept, is that the system can respond to certain inputs whether it was intended to do so or not.

If pushed too far, systems may well fall apart or exhibit heretofore unobserved behavior. But, by and large, they manage quite well. And that is the beauty of systems: They can work so well. When systems work well, we see a kind of harmony in their functioning.

With so many complex systems involved, how can we possibly try and predict all events that could feed into any of the systems involved, and how they then play into the other complex systems?

Our human brains aren't exactly built to follow a nearly infinite number of input factors that could contribute to an infinite number of possible outcomes.

It's easier to learn about a system's elements than about its interconnections.

The Columbia disaster

Let's dwell on the topic of wings for a minute.

During the Columbia disaster on February 1, 2003, one of the low-signal problems the crew and mission control experienced was the loss of a few sensors in the left wing.

Those sensors showed an off-scale low reading, indicating that they had gone offline.

Going back to the launch, the left wing had been the impact zone of a piece of foam the size of a suitcase, a risk that was assessed but eventually deemed not to be hazardous to life.

The sensors went offline about five minutes before the shuttle disintegrated. Around the same time, people watching the shuttle's reentry from the ground noticed debris being shed.

The people at mission control didn't see these pictures; they were blind to what was going on with the shuttle.

Contact with the crew and the shuttle broke off five minutes later.

Mission control had no indication that the shuttle was going to crash. Their monitoring just showed the absence of some data, not all of it, at least initially.

A wing may just be one piece, but its connections to the bigger systems it's part of can go beyond what is deemed normal. Without any visuals, would you be able to conclude that the shuttle was disintegrating, killing the entire crew, just from seeing a few sensors go offline?

Constraints of building a system

When we set out to build something, we're bound by cost. Most things have a budget attached to them.

How we design and build the system is bound by these constraints, amongst others.

If we were to sit down and try to evaluate all possible outcomes, we would exhaust our budget before we even started building anything.

Should we manage to come up with an exhaustive catalog of possible risks, we then have to design the system in a way that protects it from all of them.

This, in turn, can have the curious outcome that our system loses resilience. Protecting it from all possible risks could end up creating a rigid system, one that is unable to respond to emerging risks by any other means than failing.

Therein lies the crux of complex systems and their endless possibilities of interacting with each other. However many interactions we manage to predict, there will always be more at some point in the future.

The conditions a system was designed for are bound to change over time as it is put into production use. Increasing usage, changing infrastructure, different operations personnel, to name a few.

The weather changes because of climate change, and it may take decades for the effect to have any impact on our plane's wings.

How complex systems fail

With a sheer infinite number of interactions, and emerging inputs increasing them even further, the system can have an incredible number of failure modes.

But, according to Richard Cook's "How Complex Systems Fail",

Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.

It requires multiple failures coming together for the system to fail.

With so many systems interacting with each other, predicting how and when a combination of failures is coming together feels beyond our mental capacity.

The human factor

What, then, holds our systems together when they're facing uncertainty in all directions?

Surprisingly, it's the human operator. With ever-increasing exposure to and experience of operating systems in production, the human is the truly adaptable element in the system's equation. As Richard Cook puts it:

Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure.

What is important for any organization is that these experiences are openly shared to increase overall exposure to these systems, to bring issues to light, to improve the system as its inputs and the system's response to them change over time. Because, depending on their exposure, the knowledge of the system's behaviour under varying circumstances can be unevenly spread.

Following this, maybe designing systems should focus more on building them with the human operator in mind than trying to protect them from as many possible causes of failure as possible, including the human operator.

Organization culture and risk

Assume your organization has a good track record when it comes to safety and assessing risk. Is that an indicator that future projects are in good hands? Is a history of risk assessment and safety enough to warrant continued safety?

According to Cook:

People continuously create safety.

Subsequently, a good safety track record is no indicator of future safety. Safety is not a one-time purchase; it is a continuing process, shifting between production and monetary pressures, people's workload, and the activities at the sharp end, on the production system.

The Challenger incident is an interesting example here. On January 28, 1986, the Challenger shuttle lifted off the launchpad, only to disintegrate in the atmosphere 73 seconds later. The flight's designation was STS-51-L.

NASA, going back to the Apollo program, inarguably had a history of successfully completing missions, even to the Moon. They had good experience constructing and running hazardous equipment in production.

But with the Shuttle program, the organization found itself in different circumstances. In the wake of the Vietnam War, budgets were cut significantly, and as a consequence staff shrank to about a third of its original size.

NASA relied a lot more on external contractors to work on specific parts of the Space Shuttle, such as the solid rocket boosters propelling the shuttle into the atmosphere.

For budget reasons, the boosters' design was based on that of the Titan rocket. Everyone at NASA assumed that the design was not only a good fit, but that there was sufficient experience with it in the organization.

Something else was different with the Shuttle program: NASA suddenly found itself under production pressure from potential customers. The program aimed to be as economical as possible, with up to 50 launches per year to make sure that costs were fully covered by revenue. The US military was very much interested in using the Shuttles as a means of transporting satellites and other gear into space.

Following the changes in production pressure and the increased reliance on external contractors, NASA introduced a bigger management structure. Four layers of managers and sub-managers eventually existed at NASA, with every sub-manager reporting up the chain, representing their own teams.

When the first Shuttles were launched, the team responsible for the booster rockets noticed behaviour that was different from their prior experience.

The joints holding the segments of the rockets together were rotating, and the O-rings sealing those joints either burnt through under certain circumstances or behaved in unexpected ways at very low temperatures. When rubber drops below certain temperatures, it stiffens up, potentially leaving it unable to move and fulfill its duty.

Most of these conditions were only seen in isolation rather than together affecting a single flight. For most of them, the team thought they understood the respective risks.

All these issues were known to the engineering teams involved; they were even considered critical to human life.

Before every launch, NASA held an assessment meeting where all critical issues were discussed. The issues found with the solid rocket boosters were brought up regularly in the summaries given by their respective managers. There were slides showing notes on the issues, and the risks were discussed as well.

With every launch, the engineers learned a few new things about the behaviour of the solid booster rocket. Some of these things made it up the reporting chain, others didn't.

On the evening of the fatal Challenger launch, the teams came together to talk about the final go or no go.

A few of the engineers from the contracting companies had doubts about the launch, as the forecast for Cape Canaveral predicted very low temperatures, lower than during any previous launch of a Space Shuttle.

While the engineers voiced their concerns and initially suggested delaying the launch, management eventually overruled them and gave the go for launch.

Again from Richard Cook:

All ambiguity is resolved by actions of practitioners at the sharp end of the system.

There were a lot of miscommunication issues in this meeting alone, but the problem goes much deeper. The layers of management within the organization added an unintended filtering mechanism to safety issues and risks.

During presentations in assessment and pre-launch meetings, information was usually presented in slide form. In the Challenger days they used overhead projectors; in later years, engineers and management resorted to PowerPoint.

Regardless of the tool, the information was presented in a denser form (denser with every management layer), using bullet points, with several things compacted into a single slide.

This had the curious effect that the relevant information, the data that could have indicated real risks, lost its salience, intermingled as it was with all the other information.

The Columbia accident suffered from similar problems. From the Columbia Accident Investigation Board's Report Vol. 1:

As information gets passed up an organization hierarchy, from people who do analysis to mid-level managers to high-level leadership, key explanations and supporting information is filtered out. In this context, it is easy to understand how a senior manager might read this PowerPoint slide and not realize that it addresses a life-threatening situation.

Edward Tufte has written an excellent analysis of the use of PowerPoint in assessing the risks during the Columbia incident. Salience and the loss of detail in condensed information play a big part in it.

The bottom line is that even in the most risk-aware organizations and hazardous environments, assessing safety is an incredibly hard but continuous process. Your organization can drift into a state where a risky component or behaviour becomes the norm.

In "The Challenger Launch Decision", Diane Vaughan coined the term "normalization of deviance." What used to be considered a risk has now become a normal part of the system's accepted behaviour.

Scott Snook later refined this into "practical drift": "the slow steady uncoupling of practice from written procedure."

Sidney Dekker later made it even more concrete and coined the term "drift into failure": "a gradual, incremental decline into disaster driven by environmental pressure, unruly technology and social processes that normalize growing risk."

How do you prevent practical drift, or drift into failure? Constant awareness, uncondensed sharing of information, open feedback loops, reduced procedural friction, loose layers, involving the people at the sharp end of the action as much as possible, written reports instead of slide decks, as suggested by Tufte?

Maybe all of the above. I'd be very interested in your thoughts and experiences.

In our circles, automation is most frequently associated with infrastructure, tools like Puppet and Chef, automating away your server provisioning and deployments.

While this is an important part, and by now ingrained in our culture thanks to DevOps, it's far from the only part that defines automation in distributed systems.

The biggest challenge of a distributed system is to run reliably, without much human intervention, and entirely on its own.

It has to handle network partitions and increased error conditions in parts of the system, deal with scalability issues, and provide good performance throughout.

It has to deal with failure constantly.

We'd like the impact of handling failures to be as small as possible, and to require as little human involvement as possible. Ideally, the distributed system takes care of itself as much as possible.

We've introduced barriers, bulkheads, circuit breakers and exponential back-offs, all to make sure our distributed systems don't kill themselves. Unfortunately, they have a tendency to do exactly that, especially under increased load or increased failure rates.
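To make one of those defenses concrete, here's a minimal sketch of exponential back-off with jitter; the base delay, cap, retry budget, and the choice of ConnectionError as the retryable failure are all arbitrary assumptions:

```python
import random
import time

def backoff_delays(base=0.1, cap=10.0, attempts=6):
    """Yield exponentially growing, jittered delays between retries.

    Full jitter (each delay drawn from [0, base * 2**attempt]) spreads
    clients out so they don't hammer a struggling service in lockstep.
    The base delay, cap and retry budget are arbitrary choices.
    """
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(operation):
    """Run operation, backing off exponentially on connection errors."""
    for delay in backoff_delays():
        try:
            return operation()
        except ConnectionError:
            time.sleep(delay)
    raise RuntimeError("giving up after repeated failures")
```

The jitter matters: without it, all the clients that failed together retry together, recreating exactly the load spike that caused the failures in the first place.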

This is the part where automation gets interesting and incredibly hard. This is the part where you decide how a system should behave when certain conditions occur in production. This is the part where you consider how you want humans to notice and interpret unusual conditions, and how they should respond to them.

According to "Thinking in Systems", complex systems are based on feedback loops. They continue to feed activity in the system, and changes in the feedback loop affect the outcome with every pass.

In distributed systems, the most common feedback loop you'll find is a reinforcing feedback loop. Instead of balancing the system toward a desired outcome, it amplifies its own output with every pass, in a way that affects other components in the system.

Slow request times cause requests to pile up and hammer the system more and more. Back-offs and retries hit the system more than once, causing an increase in overall request volume. Slow disks affect all parts that read from them, causing delays throughout. I'm sure you can think of lots of other scenarios.
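The retry scenario is easy to put into numbers. A back-of-the-envelope simulation (all figures invented) shows how failures feeding back as retries turn a small capacity dip into ever-growing load:

```python
# Toy model of a retry storm, a reinforcing feedback loop.
# All numbers are invented: 100 new requests arrive per tick, and
# every request the service failed to handle is retried next tick.

def simulate(ticks, capacity=120, new_per_tick=100, retries_per_failure=1):
    backlog = 0
    offered_load = []
    for _ in range(ticks):
        offered = new_per_tick + backlog * retries_per_failure
        backlog = max(0, offered - capacity)  # failures feed back as retries
        offered_load.append(offered)
    return offered_load

healthy = simulate(5)                # stays flat at [100, 100, 100, 100, 100]
degraded = simulate(5, capacity=80)  # climbs: [100, 120, 140, 160, 180]
```

As long as capacity exceeds the incoming rate, the load stays flat; once it dips below, every tick's failures add to the next tick's load, and the loop reinforces itself.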

As we've learned, humans make poor monitors for automated systems, so they need to be as tightly integrated into the process as possible, with the highest possible visibility into what the system is doing.

While the system should be able to defend itself against production issues, it shouldn't cut out the operator, quite the opposite: the operator should be an integral part of the process of running it.

But everything the system is doing, what it has been tasked by its engineers to do under specific circumstances, is automation.

What is automation?

What follows is the simplest definition I could think of that applies best to the system we build. I'd love to hear your take on it.

Automation is everything a complex system decides on doing in response to ever-changing and emerging conditions.


A few weeks back I came across a post that struck home in several ways. "How I Fired myself" (cached version) is a short story of a developer who accidentally deleted the entire users table in production while working on a new feature. You should read the whole thing, go ahead, I'll wait for you.

What struck home was not just that he accidentally deleted data from the production database. I've certainly done similar things: accidentally removing data, setting MySQL options at runtime that caused the whole process to crash, amongst other things.

The thing that really struck me was the story that unfolded after the incident. I came across the article just when I was reading Sidney Dekker's "The Field Guide to Human Error", a fascinating read, if I may add.

If you look at what happened after the incident, it's clear that everyone blamed him, as if he'd had the malicious intent to just delete the whole table and cause unhappiness. His boss accused him of potentially having lost the company millions, putting aside the possibility that he'd helped make those millions too, and that he certainly didn't come in to work that day intending to lose a few of the company's millions.

This kind of reprimanding and the pressure from the team is what eventually caused this poor guy to quit his job. Which is a shame, because there is a lot more to this incident than meets the eye.

Nassim Taleb points out in "The Black Swan": "we are explanation-seeking animals who tend to think that everything has an identifiable cause and grab the most apparent one as the explanation."

We're quick to blame the human who seemingly caused accidents or deleted data from the production database. But that misses out on learning a bigger lesson, learning and improving the organization around the human.

As Scott Snook put it in "Friendly Fire", "look beyond individual error by framing puzzling behavior in complex organizations as individuals struggling to make sense."

There are two things that jump out when reading the text. The first is the fact that he was testing his local code against the production database, with full access to create and remove data.

Add to that the fact that backups for the production database had been disabled (by someone in the company) two months before the incident. Let's look at them in more detail.

Testing against the production database

You could start arguing right away that this developer has been irresponsible testing his local code against the production database.

But if you reframe it to look at the bigger picture, the question emerges: why does an organization whose data is worth millions let developers test their local code against the production database in the first place?

It is not uncommon for young startups to start out like this. It's just much easier, and during startup crunch time, any means that helps the team move and ship faster is acceptable, even if there's a slight risk involved.

But, the longer the team continues to work in a mode like this, the more it gets used to testing against production data, removing and recreating data as needed. It becomes accepted practice. Every day that passes without any incident makes people more confident that they can continue to do what they've done for months or maybe even years.

In "Friendly Fire", Snook introduces the concept of practical drift, "the slow steady uncoupling of practice from written procedure."

While there may not have been a written procedure saying not to develop against the production database, I was still reminded of this story. The team drifted into the belief that they could continue to do what they'd done for a while without any issues.

What's more interesting to ask is why the organization didn't encourage developers to work in their own sandboxes, or at least on a staging system where they can't harm the valuable production data.

Asking why will very likely get you to more puzzle pieces that need to be questioned. Pieces that just happened to come together on this one occasion to cause lots of harm to the business. While no one saw this coming, there was always a non-zero chance that it could happen.

In this case, it's very likely the organization that hadn't allowed the development or operations teams to set up proper environments for developers. Just like the bosses who reprimanded the poor guy, they possibly didn't feel there was enough time or money around to invest in a proper development environment.

In a similar vein, another question could be why the developers had full access to the production database. Why were there no procedures in place for deleting data on the production database? You can keep going, and you will uncover more details that all came together to help trigger this one accident.

Database backups disabled

Deleting data is not something that's done every day, but incidents where data gets accidentally removed during normal maintenance are not unheard of. There's always a possibility for this to happen, even during normal operations. Just think of the Amazon Elastic Load Balancer outage last Christmas.

What really screamed out at me was that someone in the company had cancelled the automated database backups, without any automated means set up to take their place.

Think about it, data that's supposedly worth millions has not been backed up for two months.

Was it this developer's fault that the normal safety net for any database operation wasn't in place? Highly unlikely.

We're again looking at an issue in the wider organization this guy was working in. The obvious question is: why were the backups cancelled in the first place? Was it because they were deemed too expensive? Was operations working on a replacement that was less costly, but never got around to deploying it because more pressing issues needed to be handled?

I found all this utterly fascinating, and I wanted to sit down with everyone involved to figure out why all these things were the way they were, and how they could come together in this one occasion to cause harm to the entire organization.

But most importantly, what can this organization learn and improve so that an issue under similar circumstances becomes less likely? Note that I didn't say to prevent these accidents from happening again. They will happen again; the real question is how the entire organization will handle the next one.

There is no single root cause

If you look at the things that came together to form this single incident, you'll notice that it wasn't just one little thing that caused it. It wasn't the developer who just so happened to delete the wrong table.

It was a number of causes that came together to strike hard, all of them very likely to be bigger issues inside the organization rather than a problem with the individual. Again, quoting from "The Black Swan": "a small input in a complex system can lead to nonrandom large results, depending on very special conditions."

The further back you go in time, the more you'll find a tangled web of lots of little causes, which in turn have their own causes, that just so happened to come together, seemingly at random, to form this event that no one saw coming. But "while in theory randomness is an intrinsic property, in practice, randomness is incomplete information, what I called opacity" ("The Black Swan").

The important take-away, framed nicely by this little story: each little thing is necessary, but only jointly are they sufficient.

Every failure is an opportunity to learn, to improve how you run your business. Wouldn't it be a waste to ignore this invaluable insight?

Hi, I'm Mathias, and I'm a developer. Unlike a lot of you at this conference, I'm far from being a monitoring expert. If anything, I'm a user, a tinkerer of all the great tools we're hearing about at this conference.

I help run a little continuous integration service called Travis CI. For that purpose I've built several home-baked tools that help us collect metrics and trigger alerts.

I want to start with a little story. I spend quality time at coffee shops and I enjoy peeking over the shoulders of the guy who's roasting coffee beans. Next to the big roasting machine they commonly have a laptop with pretty graphs showing how the temperature in the roaster changes over time. On two occasions I found myself telling them: "Hey cool, I like graphs too!"

On the first occasion I looked at the graph and noticed that it'd update itself every 2-3 seconds. I mentioned that to the roaster and he said: "Yeah, I'd really love it if it could update every second." In just two seconds the temperature in the roaster can already drop by almost a degree (Celsius), so he was lacking the granularity to get the best insight into his system.

The second roaster did have one second resolution, and I swooned. But I noticed that every minute or so, he wrote down the current temperature on a sheet of paper. The first guy had done that too. I was curious why they'd do that. He told me that he took it as his reference sheet for the next roasting batch. I asked why he didn't have the data stored in the system. He replied that he didn't trust it enough, because if it lost the information he wouldn't have a reference for his next roasting sheet.

He also keeps a set of coffee bean samples around from previous roasts, roasts where the outcome is known to have resulted in a great roasting result. Even coffee roasters have confirmation bias, though to be fully fair, when you're new to the job, any sort of reference can help you move forward.

This was quite curious. They had the technology yet they didn't trust it enough with their data. But heck, they had one-second resolution and they had the technology to measure data from live sensors in real time.

During my first jobs as a developer touching infrastructure, five minute collection intervals and RRDtool graphs were still very much en vogue. My alerts basically came from Monit throwing unhelpful emails at me stating that some process just changed from one state to another.

Since my days with Munin a lot has changed. We went through the era of #monitoringsucks, which fortunately, quickly turned into the era of #monitoringlove. It's been pretty incredible watching this progress as someone who loves tinkering with new and shiny tools and visualization possibilities. We've seen the emergence of crazy new visualization ideas, like the horizon chart, and we've seen the steady rise of using modern web technologies to render charts, while seeing RRDtool being taken to the next level to visualize time series data.

New approaches providing incredibly detailed insight into network traffic and providing stream analysis of time series data have emerged.

One second resolution is what we're all craving, looking at beautiful and constantly updating charts of 95th percentile values.
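As an aside, the reason 95th percentiles keep showing up on those dashboards is that averages hide tail latency. Here's a minimal sketch of the nearest-rank percentile calculation; the sample numbers are made up for illustration:

```python
import math

def percentile(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    # Nearest-rank: the value at the ceiling of pct% of the sample count.
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# A handful of request latencies in milliseconds; two slow outliers.
latencies_ms = [12, 15, 11, 230, 14, 13, 16, 12, 18, 500]
p95 = percentile(latencies_ms, 95)  # dominated by the tail, unlike the mean
```

The mean of those samples is around 84 ms, which describes no actual request; the p95 tells you what your slowest users experience.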

And yet, how many of you are still using Nagios?

There are great advances happening in monitoring at the moment, and I enjoy watching them as someone who greatly benefits from them.

Yet I'm worried that all these advances still don't focus enough on the ones who are supposed to use them: humans.

There's lots of work going on to make monitoring technology more accessible, yet I feel like we haven't solved the first problem at hand: making monitoring easy to get into for people who are new to the field.

Monitoring still involves a lot of looking at graphs, correlating several different time series after the fact, and figuring out and checking for thresholds to trigger alerts. In the end, you still find yourself looking at one or more graphs trying to figure out what the hell it means.

Tracking metrics has become very popular, thanks to Coda Hale's metrics library, which inspired a whole slew of libraries for all kinds of languages, and tools like StatsD, which made it very easy to throw any kind of metric at them and have it pop up in a system like Graphite, Librato Metrics, Ganglia, etc.

Yet the biggest question that I get every time I talk to someone about monitoring, in particular people new to the idea, is: "what should I even monitor?"

With all the tools we have at hand, helping people to find the data that matters for their systems is still among the biggest hurdles that must be conquered to actually make sense of metrics.

Can we do a better job of educating people about what they should track, what they could track, and how they can figure out the most important metrics for their system? It took us six months to find the single metric that best reflects the current state of our system. I called it the soul metric: the one metric that matters most to our users and customers.

We started tracking the time since the last build was started and since the last build was finished.

On our commercial platform, where customers run builds for their own products and customer projects, the weekend is very quiet. We only run one tenth the number of builds on a Sunday compared to a normal weekday. Sometimes we don't run any builds for 60 minutes. Suddenly, checking when a build was last triggered makes a lot less sense.

Suddenly we're confronted with the issue that we need to look at multiple metrics in the same context to see if a build should even have been started, as that fact is solely based on a customer pushing code. We're suddenly measuring the absence of data (no new commits) and correlating it with data derived from several attributes of the system, like no running builds and no build requests being processed.

The only reasonable solution I could come up with, mostly thanks to talking to Eric from Papertrail, is this: if you need to measure something that depends on activity taking place, you have to make sure that activity is generated on a regular basis.
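To sketch the idea (with hypothetical names and thresholds, not Travis CI's actual code): trigger a no-op canary job on a fixed schedule, so there is always recent activity whose age you can measure.

```python
import time

CANARY_INTERVAL = 300      # trigger a no-op "canary" build every 5 minutes
STALENESS_THRESHOLD = 900  # worry if nothing has happened for 15 minutes

class HeartbeatMonitor:
    """Track the time since the last observed activity.

    Real builds may legitimately pause (quiet Sundays), so a canary job
    is scheduled every CANARY_INTERVAL seconds. If even the canary stops
    showing up, something in the pipeline is actually broken.
    """

    def __init__(self, clock=time.time):
        self.clock = clock
        self.last_seen = clock()

    def record_activity(self):
        # Called whenever any build (real or canary) is processed.
        self.last_seen = self.clock()

    def seconds_since_activity(self):
        return self.clock() - self.last_seen

    def is_stale(self):
        return self.seconds_since_activity() > STALENESS_THRESHOLD
```

The `clock` parameter exists so the logic can be tested without waiting 15 minutes; in production it defaults to wall-clock time.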

In hindsight, it's so obvious, though it brings up a question: if the thing that generates the activity fails, does that mean the system isn't working? Is this worth an alert, is this worth waking someone up for? Certainly not.

This leads to another interesting question: if I need to create activity to measure it, and if my monitoring system requires me to generate this activity to be able to put a graph and an alert on it, isn't my monitoring system wrong? Are all the monitoring systems wrong?

If a coffee roaster doesn't trust his tools enough to give him a consistent insight into the current, past and future roasting batches, isn't that a weird mismatch between humans and the system that's supposed to give them the assurance that they're on the right path?

A roaster still trusts his instincts more than he trusts the data presented to him. After all, it's all about the resulting coffee bean.

Where does that take us and the current state of monitoring?

We spend an eternity looking at graphs right after an alert was triggered because a certain threshold was crossed. Does that alert even mean anything? Is it important right now? A human operator still has to decide whether it's worth the trouble or whether they should just ignore the alert.

As much as I enjoy staring at graphs, I'd much rather do something more important than that.

I'd love for my monitoring system to be able to tell me that something out of the ordinary is currently happening. It has all the information at hand to make that decision at least with a reasonable probability.

But much more than that, I'd like our monitoring system to be built for humans, reducing the barrier to entry for adding monitoring and metrics to an application and to infrastructure without much hassle. How will we get there?

Looking at the current state of monitoring, there's a strong focus on technology, which is great, because it helps solve bigger issues like data storage, visualization and presentation, and stream analysis. I'd love to see this all converge on the single thing that has to make the call in the end: a human. Helping them make a good decision should be very high on our list.

There is a fallacy in this wish, though. With more automation comes a cognitive bias to trust what the system is telling me. Can the data presented to me be fully trusted? Did the system actually make the right call in sending me an alert? This is something only a human can figure out, just as a coffee roaster needs to trust his instincts even though the variables are slightly different for every roast.

We want to spare our users from having to keep a piece of paper around that tells them exactly what happened the last time this alert was triggered. We want to make sure they don't have to look at samples of beans at different stages to find confirmation for the problem at hand. If the end user always compares previous samples of data to the most recent one, the only thing they'll look for is confirmation.

Lastly, the interfaces of the monitoring tools we work with every day are designed to be efficient, and they're designed to dazzle with visualization, yet they're still far from easy to use. If we want everyone in our company to be able to participate in running a system in production, we have to make sure the systems we provide have interfaces that treat their users as what they are: people.

But most importantly, I'd like to see the word spread on monitoring and metrics: making our user interfaces more accessible and telling the tale of how we monitor our systems, so other people can monitor theirs. There's a lot to learn from each other, and I love things like hangops and Ops School, they're great starts to get the word out.

Because it's easier to write things down to realize where you are, to figure out where you want to be.

Failure is still one of the most undervalued things in our business, in most businesses really. We still tend to point fingers elsewhere, blame the other department, or try anything to cover our asses.

How about we do something else instead? We embrace failure openly, turn it into our company's culture and do everything we can to make sure every failure is turned into a learning experience, into an opportunity?

Let me start with some illustrating examples.

Wings of Fury

In 2010, Boeing tested the wings of a brand new 787 Dreamliner. In a giant hangar, they set up a contraption that'd pull the wings of a 787 up, with so much pull that the wings were bound to break.

Eventually, after they'd been flexed upwards of 25 feet, the wings broke spectacularly.

The amazing bit: all the engineers watching it happen started to cheer and applaud.

Why? Because they had anticipated the failure under the exact circumstances where it occurred: at about 150% of the load wings handle in normal operation.

They can break things loud and proud, they can predict when their engineering work falls apart. Can we do the same?

Safety first

I've been reading a great book, "The Power of Habit", and it outlines another story of failure and how tackling that was turned into an opportunity to improve company culture.

When Paul O'Neill, later to become Secretary of the Treasury, took over management of Alcoa, one of the United States' largest aluminum producers, he made it his first and foremost goal to tackle the safety issues in the company's production plants.

He put rules in place that any accidents must be reported to him within just a few hours, including remedies on how this kind of accident will be prevented in the future.

While his main focus was preventing failures, because they could harm or even kill workers, what he eventually managed to do was implement a company culture in which even the smallest suggestion from any worker to improve safety or efficiency would be considered and handed up the chain of management.

This fostered a culture of highly increased communication between production plants, between managers, between workers.

Failures and accidents still happened, but were in sharp decline, as every single one was taken as an opportunity to learn and improve the situation to prevent them from happening again.

It was a chain of post-mortems, if you will. O'Neill's interest was to make everyone part of improving the overall situation without having to fear blame. Everyone was made to feel like an important part of the company. By then, 15,000 people worked at Alcoa.

This had an interesting effect on the company. In twelve years, O'Neill managed to increase Alcoa's revenues from $1.5 billion to $23 billion.

His policies became an integral part of the company's culture and ensured that everyone working for it felt like an integral part of the production chain.

Floor workers were given permission to shut down the production chain if they deemed it necessary, and they were encouraged to blow the whistle when they noticed even the slightest risk in any activity in the company's facilities.

To be quite fair, competitors were pretty much in the dark about these practices, which gave Alcoa a great advantage on the market.

But within a decade of running the company, he transformed its culture into something that sounds strikingly similar to the ideas of DevOps. He managed to make everyone feel responsible for delivering a great product, and to enable everyone to take charge should something go wrong.

All that is based on the premise of trust. Trust that when someone speaks up, they will be taken seriously.

Three Habits of Failure

If you look at the examples above, some patterns come up. There are companies outside of our field that have mastered or at least taken on an attitude of accepting that failure is inevitable, anticipating failure and dealing with and learning from failure.

Looking at some more examples it occurred to me that even doing one of these things will improve your company's culture significantly.

How do we fare?

We fail, a lot. It's in the nature of the hardware we use and the software we build. Networks partition, hard drives fail, and software bugs creep into systems, which can lead to cascading failures.

But do we, as a community, take enough advantage of what we learn from each outage?

Does your company hold post-mortem meetings after a production outage? Do you write public post-mortems for your customers?

If you don't, what's keeping you from doing so? Is it fear of giving your competitors an advantage? Is it fear of giving away too many internal details? Fear of admitting fault in public?

There's a great advantage in making this information public. Usually, it doesn't really concern your customers what happened in all detail. What does concern them is knowing that you're in control of the situation.

A post-mortem follows three Rs: regret, reason and remedy.

They're a means to say sorry to your customers, to tell them that you know what caused the issues and how you're going to fix them.

On the other hand, post-mortems are a great learning opportunity for your peer ops and development people.

Web Operations

This learning is an important part of improving the awareness of web operations, especially during development. There's a great deal to be learned from other people's experiences.

Web operations is a field that is mostly learning by doing right now. Which is an important part of the profession, without a doubt.

If you look at the available books, there are currently three books that give insight into what it means to build and run reliable and scalable systems.

"Release It!", "Web Operations" and "Scalable Internet Architectures" are the ones that come to mind.

My personal favorite is "Release It!", because it raises developer awareness on how to handle and prevent production issues in code.

It's great to see the circuit breaker and the bulkhead pattern introduced in this book now being popularized by Netflix, who openly write about their experiences implementing it.

Netflix is a great example here. They're very open about what they do, they write detailed post-mortems when there's an outage. You should read their engineering blog, same for Etsy's.

Why? Because it attracts engineering talent.

If you're looking for a job, which company would you rather work for? One that encourages taking risks while also taking responsibility for fixing issues when failure does come up, one that enables a culture of fixing and improving things as a whole? Or one that's quick to put blame?

I'd certainly choose the former.

Over the last two years, Amazon has also realized how important this is. Their post-mortems have become very valuable for anyone interested in the things that can happen in multi-tenant, distributed systems.

If you remember the most recent outage on Christmas Eve, they even had the guts to come out and say that production data was deleted by accident.

Can you imagine the shame these developers must feel? But can you imagine a culture where the issue itself is considered an opportunity to learn instead of blaming or firing you? If only to learn that accessing production data needs stricter policies.

It's a culture I'd love to see fostered in every company.

Regarding ops education, there have been some great developments over the last year that are worth mentioning. hangops is a nice little circle, streamed live (mostly) every Friday and available for anyone to watch on YouTube afterwards.

Ops School has started a great collection of introductory material on operations topics. It's still very young, but it's a great start, and you can help move it forward.

Travis CI

At Travis CI, we're learning from failure, a lot. As a continuous integration platform, it started out as a hobby project and was built with a lot of positive assumptions.

It used to be a distributed system that always assumed everything would work correctly all the time.

As we grew and added more languages and more projects, this ideal fell apart pretty quickly.

This is a symptom of a lot of developer-driven projects: there's just so little public information on how to do it right, on how other companies build and run distributed systems to make them work reliably.

We decided to turn every failure into an opportunity to share our learnings. We're an open source project, so it only makes sense to be open about our problems too.

Our audience and customers, who are mostly developers themselves, seem to appreciate that. I for one am convinced that we owe it to them.

I encourage you to do the same: share details on your development and on how you run your systems. You'll be surprised how introducing these changes can affect how you work as a team.

Cultural evolution

This insight didn't come easy. We're a small team, and we were all on board with the general idea of openness about our operational work and about the failures in our system.

That openness brings with it the need to own your systems, to own your failures. It took a while for us to get used to working together as a team to get these issues out of the way as quickly as possible and to find a path for a fix.

In the beginning, it was still too easy to look elsewhere for the cause of the problem. Blame is one side of the story, hindsight bias is the other. It's too easy to point out that the issue has been brought up in the past, but that doesn't contribute anything to fixing it.

A more helpful attitude than saying "I've been saying this has been broken for months" is saying "Here's how I'll fix it." You own your failures.

The only thing that matters is delivering value to the customer. Putting aside blame and admitting fault, while doing everything you can to make sure the issue is under control, is, in my opinion, the only way you can do that with everyone in your company on board.

Accepting this might just help transform your company's culture significantly.

Two years ago, I wrote about the virtues of monitoring. A lot has changed, a lot has improved, and I've certainly learned a lot since I wrote that initial overview on monitoring as a whole.

There have been a lot of improvements to existing tools, and new players have entered the monitoring market. Infrastructure as a whole has become more and more interesting for service businesses to build around.

On the other hand, awareness for monitoring, good metrics, logging and the like has been rising significantly.

At the same time #monitoringsucks raised awareness that a lot of monitoring tools are still stuck in the late nineties when it comes to user interface and the way they work.

Independent of new and old tools, I've had the pleasure of learning a lot more about the real virtues of monitoring, about how it affects daily work and how it evolves over time. This post is about discussing some of these insights.

Monitoring all the way down

When you start monitoring even just small parts of an application, the need for more detail and for information about what's going on in a system arises quickly. You start with an innocent number of application level metrics, add metrics for database and external API latencies, start tracking system level and business metrics.

As you add monitoring to one layer of the system, the need to get more insight into the layer below comes up sooner or later.

One layer has just been tackled recently in a way that's accessible for anyone: communication between services on the network. Boundary has built some pretty cool monitoring stuff that gives you incredibly detailed insight into how services talk to each other, by way of their protocol, how network traffic from inside and outside a network develops over time, and all that down to the second.

The real time view is pretty spectacular to behold.

If you go down even further on a single host, you get to the level where you can monitor disk latencies.

Or you could measure the effect of screaming at a disk array in a running system. dtrace is a pretty incredible tool, and I hope to see it spread and become widely available on Linux systems. It allows you to inject instrumentation into arbitrary parts of the host system, making it possible to measure any system call without a lot of overhead.

Heck, even our customer support tool allows us to track metrics for response times, as well as how many tickets each staff member handled and for how long.

It's easy to start obsessing about monitoring and metrics, but there comes a time, when you either realize that you've obsessed for all the right reasons, or you add more monitoring.

Mo' monitoring, mo' problems

The crux of monitoring more layers of a system is that with more monitoring, you can and will detect more issues.

Consider Boundary, for example. It gives you insight into a layer you haven't had insight before, at least not at that granular level. For example, round trip times of liveness traffic in a RabbitMQ cluster.

This gives you a whole new pile of data to obsess about. It's good because that insight is very valuable. But it requires more attention, and more issues require investigation.

You also need to learn how a system behaving normally is reflected in those new systems, and what constitutes unusual behaviour. It takes time to learn and to interpret the data correctly.

In the long run though, that investment is well worth it.

Monitoring is an ongoing process

When we started adding monitoring to Travis CI, we started small. But we quickly realized which metrics really matter and which parts of the application and the surrounding infrastructure need more insight, more metrics, more logging.

With every new component deployed to production, new metrics need to be maintained, more logging and new alerting need to be put in place.

The same is true for new parts of the infrastructure. With every new system or service added, new data needs to be collected to ensure the service is running smoothly.

A sense of which metrics are important and which aren't is something that develops over time. Metrics can come and go; the requirements for metrics are subject to change, just as they are for code.

As you add new metrics, old metrics might become less useful, or you need more metrics in other parts of the setup to make sense of the new ones.

It's a constant process of refining the data you need to have the best possible insight into a running system.

Monitoring can affect production systems

The more data you collect, with higher and higher resolution, the more you run the risk of affecting a running system. Business metrics regularly pulled from the database can become a burden on the database that's supposed to serve your customers.

Pulling data out of running systems is a traditional approach to monitoring, one that's unlikely to go away any time soon. However, it's an approach that's less and less feasible as you increase resolution of your data.

Guaranteeing that this collection process is low on resources is hard. It's even harder to get a system up and running that can handle high-resolution data from a lot of services sent concurrently.

So new approaches have started to pop up to tackle this problem. Instead of pulling data from running processes, the processes themselves collect data and regularly push it to aggregation services which in turn send the data to a system for further aggregation, graphing, and the like.

StatsD is without a doubt the most popular one, and it has sparked a ton of ports to different languages.

Instead of relying on TCP with its long connection handshakes and timeouts, StatsD uses UDP. The processes sending data to it stuff short messages into a UDP socket without worrying about whether or not the data arrives.

If some data doesn't make it because of network issues, that only leaves a small dent. It's more important for the system to serve customers than for it to wait around for the aggregation service to become available again.
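A minimal fire-and-forget client illustrates the trade-off. This is a sketch around the StatsD wire format for counters and timers, not a full client library:

```python
import socket

class StatsdClient:
    """Minimal fire-and-forget StatsD client (a sketch, not a complete one)."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, metric, value=1):
        # StatsD counter wire format: "<name>:<value>|c"
        self._send(f"{metric}:{value}|c")

    def timing(self, metric, ms):
        # StatsD timer wire format: "<name>:<milliseconds>|ms"
        self._send(f"{metric}:{ms}|ms")

    def _send(self, payload):
        try:
            # UDP: no handshake, no blocking, no delivery guarantee.
            self.sock.sendto(payload.encode("ascii"), self.addr)
        except OSError:
            # Losing a sample leaves a small dent; blocking the
            # request path to report a metric would be far worse.
            pass
```

A caller just does `StatsdClient().incr("builds.started")` and moves on; whether the packet arrives is the aggregator's problem, not the application's.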

While StatsD solves the problem of easily collecting and aggregating data without affecting production systems, there's now the problem of being able to inspect the high-resolution data in meaningful ways. Historical analysis and alerting on high-resolution data becomes a whole new challenge.

Riemann has popularized looking at monitoring data as a stream, to which you can apply queries, and form reactions based on those queries. You can move the data window inside the stream back and forth, so you can compare data in a historical context before deciding on whether it's worth an alert or not.
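Riemann itself is configured in Clojure; purely to illustrate the sliding-window idea, here's a sketch in Python. The window size and the factor-of-three rule are arbitrary choices for the example, not anything Riemann prescribes:

```python
from collections import deque

class SlidingWindow:
    """Keep the last `size` values of a metric stream and flag outliers."""

    def __init__(self, size=60):
        # deque with maxlen drops the oldest value as new ones arrive,
        # giving us a moving window over the stream.
        self.events = deque(maxlen=size)

    def push(self, value):
        self.events.append(value)

    def mean(self):
        return sum(self.events) / len(self.events)

    def is_anomalous(self, value, factor=3.0):
        # Judge a new value against recent history rather than a
        # fixed global threshold.
        if len(self.events) < 10:
            return False  # not enough history to make a call
        return value > factor * self.mean()
```

The point is the shape of the computation: decisions are made relative to a window of recent context, which is what lets a system say "this is out of the ordinary" instead of "this crossed a line someone picked a year ago".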

Systems like StatsD and Riemann make it a lot easier to aggregate data without having to rely on polling. Services can just transmit their data without worrying much about how and where it's used, whether for log aggregation, graphing or alerting.

The important realization is that with increasing need for scalability and distributed systems, software needs to be built with monitoring in mind.

Imagine a RabbitMQ that, instead of making you poll data from it, sends its metrics as messages at a configurable interval to a configurable fanout exchange. You could choose to consume the data and submit it to a system like StatsD or Riemann, or you could ignore it and the broker would just discard the data.

Who's monitoring the monitoring?

Another fallacy of monitoring is that it needs to be reliable. For it to be fully reliable it needs to be monitored. Wait, what?

Every process that is required to aggregate metrics, to trigger alerts, to analyze logs needs to be running for the system to work properly.

So monitoring in turn needs its own supervision to make sure it's working at all times. As monitoring grows, it requires maintenance and operations work to take care of it.

Which makes it a bit of a burden for small teams.

Lots of new companies have sprung up to serve this need. Instead of having to worry about running services for logs, metrics and alerting yourself, you can leave that to companies who are more experienced in running them.

Librato Metrics, Papertrail, OpsGenie, LogEntries, Instrumental, NewRelic, DataDog, to name a few. Other companies take the burden of having to run your own Graphite system away from you.

It's been interesting to see new companies pop up in this field, and I'm looking forward to seeing this space develop. The competition from the commercial space is bound to trigger innovation and improvements on the open source front as well.

We're heavy users of external services for log aggregation, collecting metrics and alerting. Simply put, they know better how to run that platform than we do, and it allows us to focus on delivering the best possible customer value.

Monitoring is getting better

Lots of new tools have sprung up in the last two years. While their development started earlier than that, the most prominent tools are probably Graphite and Logstash. Cubism brings new ideas on how to visualize time series data, and it's one of the several dozen dashboards that Graphite's flexible API has sparked. Tasseo is another one of them, a successful experiment in having an at-a-glance dashboard with the most important metrics in one convenient overview.

It'll still be a while until we see the ancient tools like Nagios, Icinga and others improve, but the competition is ramping up. Sensu is one open source alternative to keep an eye on.

I'm looking forward to seeing how the monitoring space evolves over the next two years.

Over the last year, as we started turning Travis CI into a hosted product, we added a ton of metrics and monitoring. While we started out slow, we soon figured out which metrics are key and which are necessary to monitor the overall behavior of the system.

I built us a custom collector that rakes in metrics from our database and from the API exposed by RabbitMQ. It soon dawned on me that these are our core metrics, and that they don't just need graphs; we need to be alerted when they cross thresholds.

The first iteration of that dumped alerts into Campfire. Given that we're a small team and the room might be empty at times, that was just not sufficient for an infrastructure platform that's used by customers and open source projects around the world, at any time of the day.

So we added alerting, by way of OpsGenie. It's set up to trigger alerts via iPhone push notifications, with escalation via SMS should an alert not be acknowledged or closed within 10 minutes. Eventually, escalation needs to happen via voice calls so that someone really picks up. It's easy to miss a vibrating iPhone when you're sound asleep, but much harder when it keeps ringing until someone picks up.
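The escalation logic boils down to picking a louder channel the longer an alert sits unacknowledged. A sketch of that idea (the steps and the 10-minute timeout mirror the setup described above, but the code is illustrative, not OpsGenie's actual API):

```python
ACK_TIMEOUT = 600  # escalate if an alert isn't acknowledged within 10 minutes

# Hypothetical escalation chain, loudest channel last.
ESCALATION_STEPS = ["push_notification", "sms", "voice_call"]

def escalation_step(seconds_unacknowledged):
    """Pick a notification channel based on how long the alert sat unacked."""
    step = min(seconds_unacknowledged // ACK_TIMEOUT, len(ESCALATION_STEPS) - 1)
    return ESCALATION_STEPS[int(step)]
```

Each unacknowledged timeout bumps the alert up one channel, and once the loudest channel is reached it stays there until a human responds.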

A Pager for every Developer

Just recently I read an interview with Werner Vogels on architecture and operations at Amazon. He said something that stuck with me: "You build it, you run it."

That got me thinking. Should developers of platforms be fully involved in the operations side of things?

A quick survey on Twitter showed that there are some companies where developers are paged when there are production issues, others fully rely on their operations team.

There's merit to both, but I could think of a few reasons why developers should be carrying a pager just like operations does.

You stay connected to what your code does in production. When code is developed, the common tool to manage expectations is to write tests. Unfortunately, no unit test, no integration test will be fully able to reproduce circumstances of what your code is doing in production.

You start thinking about your code running. Reasoning about what a particular piece of code is doing under specific production circumstances is hard, but not entirely impossible. When you're the one responsible for having it run smoothly and serve your customers, this goes up to a whole new level.

Metrics, instrumentation, alerting, logging and error handling suddenly become a natural part of your coding workflow. You start making your software more operable, because you're the one who has to run it. While software should be easy to operate in any circumstances, it commonly isn't. When you're the one having to deal with production issues, that suddenly has a very different appeal.

Code beauty suddenly becomes a bit less important than making sure your code can handle errors, timeouts and increased latencies. It's kind of an ironic twist. Code that's resilient to production issues might not have a pretty DSL, it might not be the most beautiful code, but it may be able to sustain whatever issue is thrown at it.

Last, when you're responsible for running things in production, you're forced to learn about the entire stack of an application, not just the code bits, but its runtime, the host system, hardware, network. All that turns into something that feels a lot more natural over time.

I consider that a good thing.

There'll always be situations where something needs to be escalated to the operations team, with their deeper knowledge of the hardware, the network and the like. But if code breaks in production and it affects customers, developers should be on the front lines of fixing it, just like the operations team.

Even more so for teams that don't have any operations people on board. At some point, a simple exception tracker just doesn't cut it anymore, especially when no one gets paged on critical errors.

Being On Call

For small teams in particular, there's a pickle that needs to be solved: who gets up in the middle of the night when an alert goes off?

When you have just a few people on the team, like your average bootstrapping startup, does an on call schedule make sense? This is something I haven't fully figured out yet.

We're currently in the fortunate position that one of our team members is in New Zealand, but we have yet to find a good way to assign on call when he's out or for when he's back on this side of the world.

The folks at dotCloud have written about their schedule, thank you! Hey, you should share your pager and on-call experiences too!

Currently we have a first come first serve setup. When an alert comes in and someone sees it, it gets acknowledged and looked into. If that involves everyone coming online, that's okay for now.

However, it's not an ideal setup, because being able to handle an alert means being able to log into remote systems, restart apps, inspect the database, and look at the monitoring charts. Thanks to the iPhone and iPad, most of that is already possible today.

But to be fully equipped to handle any situation, it's good to have a laptop at hand.

This brings up the question: who's carrying a laptop, and when? Which in turn means that some sort of on-call schedule is still required.

We're still struggling with this, so I'd love to read more about how other companies and teams handle it.


During a recent hangops discussion, there was a chat about developers being on call. It brought up an interesting idea, a playbook on how to handle specific alerts.

It's a document explaining things to look into when an alert comes up. Ideally, an alert already includes a link to the relevant section in the book. This is something operations and developers should work on together to make sure all fronts are covered.

It takes away some of the scare of being on call, as you can be sure there's some guidance when an issue comes up.

It also helps refine monitoring and alerts, and makes sure there are appropriate measures available to handle any of them. If there aren't, that part needs improving.

I'm planning on building a playbook for Travis as we go along and refine our monitoring and alerts, it's a neat idea.

Sleepless in Seattle

There's a psychological side to being on-call that needs a lot of getting used to: the thought that an alert could go off at any time. While that's a natural thing, as failures do happen all the time, it's easy for it to mess with your head. It certainly did that for me.

Lying in bed, not being able to sleep because your mind is waiting for an alert, is not a great feeling. It takes getting used to. It's also why having an on-call schedule is preferable to an all-hands scenario. When only one person is on call, teammates can at least be sure to get a good night's sleep. As the schedule should be rotating, everyone gets to have that luxury on a regular basis.

It does one thing though: it pushes you to make sure alerts only go off for relevant issues. Not everything needs to be fixed right away; some issues could be taken care of by improving the code, others are only temporary blips caused by increased network latency and will resolve themselves after just a few minutes. Alerting someone on every other exception raised doesn't cut it anymore; alerts need to be precise and only be triggered when the error is severe enough to affect customers directly. Getting this right is the hard part, and it takes time.

All that urges you to constantly improve your monitoring setup, to increase relevance of alerts, and to make sure that everyone on the team is aware of the issues, how they can come up and how they can be fixed.

It's a good thing.


Recently, I've been thinking a lot about failure, my daughter, risk and punishment, and the whole culture that has evolved around trying to avoid failure, trying to point fingers or putting blame elsewhere.

Simplest example: my daughter spills something over the table. What's the first reaction? Scolding or punishment of sorts. I'm guilty as charged. I read something pretty simple and wonderful recently, a very short read titled "Father Forgets".

That read got me thinking: why do we tend to punish failure immediately? It's not just something to do with our kids, it's human nature. We tend to put blame elsewhere, we tend to get defensive because people turn to us to fix a problem, when something is broken in production, for example.

Why can't we instead make failure a part of our culture? Not just at home, with our kids, but in our work place?

As soon as people feel like they need to get defensive, or they're blamed for a problem that occurred due to a recent change of theirs, negativity hits everyone on the team. It's hard to stay calm, it's hard to stay focused on what really matters: that something is broken in production, affecting your customers.

As soon as people feel threatened or pressured, they get defensive, or they feel down because some of their own code broke something. Their vision is clouded. Finding the problem's cause and implementing a solution suddenly becomes a blur, something that's hard to focus on. Even though that's what really matters.

When people feel like failure is not an option, they'll stop taking risks. When people stop taking risks, your team and your company are doomed; innovation comes to a grinding halt. Most of us are in the lucky position that lives don't depend on our work. We can try new things, iterate quickly, and discard or improve them.

If my daughter doesn't take any risks because I keep punishing or scolding her, she might just stop trying altogether. The analogy is an odd one, but there's a striking similarity.

If a problem comes up, you fix it, you learn your lesson, you make sure it doesn't happen again, you move on. It can be that simple. When everyone on the team feels like failure is an accepted part of running an application, fixing the problems as they occur as a team becomes a lot easier.

In the end, it's not a question of if something breaks, it's a question of when it breaks. And the answer is: all the time. Great teams focus on the one thing that matters in these situations: how best to resolve the situation, and on being ready when it happens.

Embrace outages, the most common failure of our craft. Take a deep breath, tune out distractions (including managers) and try to find joy in digging through data and finding what's causing a problem. Turn it from a seemingly frustrating experience into a personal challenge: you find the problem, you fix it, you make customers happy again. Rinse, repeat.

Failure is cool.

This post is not about devops, it's not about lean startups, it's not about web scale, it's not about the cloud, and it's not about continuous deployment. This post is about you, the developer whose main purpose in life has always been to build great web applications. In a pretty traditional world, you write code, you write tests for it, you deploy, and you go home. Until now.

To tell you the truth, that world has never existed for me. In all of my developer life I had to deal with all aspects of deployment, not just putting build artifacts on servers, but dealing with network outages, faulty network drivers, crashing hard disks, sudden latency spikes, analyzing errors coming from those pesky crawling bots on that evil internet of yours. I take a lot of this for granted, but working in infrastructure and closely with developers trying to get applications and infrastructure up and running on EC2 has taught me some valuable lessons about assuming the worst. Not because developers are stupid, but because they like to focus on code, not infrastructure.

But here's the deal: your code and all your full-stack and unit tests are worth squat if they're not running out there on some server or on an infrastructure stack like Google App Engine or Heroku. Without running somewhere in production, your code doesn't generate any business value; it's just a big pile of ASCII or UTF-8 characters that cost a lot of money to create but haven't offered any return on investment yet.

Love Thy Infrastructure

Operations isn't hard, but it is necessary. You don't need to know everything about operations to become fluent in it; you just have to know enough to get started, and know how to use Google.

This is my collective dump from the last years of working both as a developer and that guy who does deployments and manages servers too. Most are lessons I learned the hard way, others just seemed logical to me when I learned about them the first time around.

Between you and me, having this skill set at hand makes you a much more valuable developer. Being able to analyze any problem in production and at least having a basic skill set to deal with it makes you a great asset for companies and clients to hold on to. Thought you should know, but I digress.

The most important lesson I can tell you right up front: love your infrastructure, it's the muscles and bones of your application, whereas your code running on it is nothing more than the skin.

Without Infrastructure, No-one Will Use Your Application

Big surprise. For users to be able to enjoy your precious code, it needs to run somewhere. It needs to run on some sort of infrastructure, and it doesn't matter if you're managing it, or if you're paying another company to take care of it for you.

Everything Is Infrastructure

Every little piece of software and hardware that's necessary to make your application available to users is infrastructure. The application server serving and executing your code, the web server, your email delivery provider, the service that tracks errors and application metrics, the servers or virtual machines your services are running on.

Every little piece of it can break at any time, can stall at any time. The more pieces you have in your application puzzle, the more breaking points you have. And everything that can break, will break. Usually not all at once, but most certainly when it's the least expected, or just when you really need your application to be available.

On Day One, You Build The Hardware

Everything starts with a bare metal server, even that cloud you've heard so much about. Knowing your way around everything related to setting up a full rack of servers in a single day, including network storage, a fully configured switch with two virtual LANs, and a master-slave database setup using RAID 10 on a bunch of SAS drives, might not be something you need every day, but it sure comes in handy.

The good news is the internet is here for you. You don't need to know everything about every piece of hardware out there, but you should be able to investigate strengths and weaknesses, when an SSD is an appropriate tool to use, and when SAS drives will kick butt.

Learn to distinguish the different RAID levels, and why having an additional file system buffer on top of a RAID controller that doesn't have a backup battery for its own internal write buffer is a bad idea. That's a pretty good start, and it will make decisions much easier.

The System

Do you know what swap space is? Do you know what happens when it's used by the operating system, and why it's actually a terrible thing and gives a false sense of security? Do you know what happens when all available memory is exhausted?

Let me tell you:

  • When all available memory is allocated, the operating system starts swapping out memory pages to swap space, which is located on disk, a very slow disk, slow like a snail compared to fast memory.
  • When lots of stuff is written to and read from swap space on disk, I/O wait goes through the roof, and processes start to pile up waiting for their memory pages to be swapped out to or read from disk, which in turn increases load average, and almost brings the system to a screeching halt, but only almost.
  • Swap is terrible because it gives you a false sense of having additional resources beyond the available memory, while what it really does is slow down performance in a way that makes it almost impossible to log into the affected system and properly analyze the problem.

This is basic operations knowledge at the operating system level. There's not much you need to know here, but in my opinion it's essential. Learn about the most important aspects of a Unix or Linux system. You don't need to know everything; you don't need to know the specifics of Linux's process scheduler or the underlying data structures used for virtual memory. But the more you know, the more informed your decisions will be when the rubber hits the road.

And yes, I think enabling swap on servers is a terrible idea. Let processes crash when they don't have any resources left. That at least will allow you to analyze and fix.

Production Problems Don't Solve Themselves

Granted, sometimes they do, but you shouldn't be happy about that. You should be willing to dig into whatever data you have, post-mortem, to find whatever went wrong, whatever caused a strange latency spike in database queries or an unusually high number of errors in your application.

When a problem doesn't solve itself though, which is certainly the common case, someone needs to solve it. Someone needs to look at all the available data to find out what's wrong with your application, your servers or the network.

This person is not the unlucky operations guy who's currently on call, because let's face it, smaller startups just don't have an operations team.

That person is you.

Solve Deployment First

When the first line of code is written, and the first piece of your application is ready to be pushed on a server for someone to see, solve the problem of deployment. This has never been easier than it is today, and being able to push incremental updates from then on speeds up development and the customer feedback cycle considerably.

As soon as you can, build that Capfile, Ant file, or whatever build and deployment tool you're using, set up servers, or set up your project on an infrastructure platform like Scalarium, Heroku, Google App Engine, or dotCloud. The sooner you solve this problem, the easier it will be to finally push that code of yours into production for everyone to use. I consider application deployment a solved problem. There's no reason why you shouldn't have it in place even in the earliest stages of a project.

As a project grows more complex, even over just its initial lifecycle, it's much easier to add functionality to an existing deployment setup than to have to build everything from scratch later.

Automate, Automate, Automate

Everything you do by hand, you should only be doing once. If there's any chance that particular action will be repeated at some point, invest the time to turn it into a script. It doesn't matter if it's a shell, a Ruby, a Perl, or a Python script. Just make it reusable. Typing things into a shell manually, or updating configuration files with an editor on every single server is tedious work, work that you shouldn't be doing manually more than once.

When you automate something once, it not only greatly increases execution speed the second and third time around, it reduces the chance of failure, of missing that one important step.
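As a tiny illustration (the template contents and host names are invented for this example), even a few lines of Ruby can replace hand-editing the same configuration file on every server:

```ruby
require 'erb'

# Render one config file per host from a single template instead of
# editing each file by hand. Template and host list are made up.
TEMPLATE = ERB.new(<<~CONF)
  server_name <%= host %>
  worker_processes <%= workers %>
CONF

def render_config(host, workers: 4)
  TEMPLATE.result_with_hash(host: host, workers: workers)
end

%w[app1.example.com app2.example.com].each do |host|
  File.write("#{host}.conf", render_config(host))
end
```

The second time you run it, it takes seconds instead of minutes, and there's no step to forget.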

There's an abundance of tools available to automate infrastructure; hand-written scripts are only the simplest part of it. Once you go beyond managing just one or two servers, tools like Chef, Puppet and MCollective come in very handy to automate everything from setting up bare servers to pushing out configuration changes from a single point, to deploying code. Everything should be properly automated with some tool. Ideally you only use one, but looking at Chef and Puppet, both have their strengths and weaknesses.

Changes in Chef aren't instant unless you use the command line tool knife, which assumes SSH access to all servers you're managing. The bigger your organization, the less likely you'll be able to access all machines via SSH. Tools like MCollective, which work with a push-based agent system, are much better suited for these kinds of instant activities.

It's not important what kind of tool you use to automate, what's important is that you do it in the first place.

By the way, if your operations team restricts SSH access to machines for developers, fix that. Developers need to be able to analyze and fix incidents just like the operations folks do. There's no valid point in denying SSH access to developers. Period.

Introduce New Infrastructure Carefully

Whenever you add a new component, a new feature to an application, you add a new point of failure. Be it a background task scheduler, a messaging queue, an image processing chain or asynchronous mail delivery, it can and it will fail.

It's always tempting to add shiny new tools to the mix. Developers are prone to trying out new tools even though they've not yet fully proven themselves in production, or experience running them is still sparse. It's a good thing in one way, because without people daring to use new tools, everyone else won't be able to learn from their experiences (you do share those experiences, don't you?).

But on the other hand, you'll live the curse of the early adopter. Instead of benefiting from existing knowledge, you're the one bringing the knowledge into existence. You'll experience all the bugs that are still lurking in the darker corners of that shiny new database or message queue system. You'll spend time developing tools and libraries to work with the new stuff, time you could just as well spend generating new business value using existing tools that do the job similarly well. If you do decide on a new tool, be prepared to fall back to other tools in case of failure.

No matter if old or new, adding more infrastructure always has the potential for more things to break. Whenever you add something, be sure to know what you're getting yourself into, be sure to have fallback procedures in place, be sure everyone knows about the risks and the benefits. When something that's still pretty new breaks, you're usually on your own.

Make Activities Repeatable

Every activity in your application that causes other, dependent activities to be executed needs to be repeatable, either by the user, through some sort of administrative interface, or automatically if feasible. Think user confirmation emails, generating monthly reports, background tasks like processing uploads. Every activity that's out of the normal cycle of fetching records from a datasource and updating them is bound to fail. Heck, even that cycle will fail at some point due to some odd error that only comes up once in a blue moon.

When an activity is repeatable, it's much easier to deal with outages of single components. When it comes back up, simply re-execute the tasks that got stuck.

This, however, requires one important thing: every activity must be idempotent. It must have the same outcome no matter how often it's run. It must know what steps were already taken before it broke the last time around. Whatever's already been done shouldn't be done again; it should just pick up where it left off.

Yes, this requires a lot of work and care for state in your application. But trust me, it'll be worth it.
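A sketch of what that can look like in practice: a confirmation email task that records its own progress, so re-running it is safe. The `User` struct and the delivery method here are hypothetical placeholders, not real application code.

```ruby
# A repeatable, idempotent activity: the task records that it already
# ran, so executing it again after a crash or retry is a no-op.
User = Struct.new(:email, :confirmation_sent_at) do
  attr_reader :deliveries

  def send_confirmation!
    return if confirmation_sent_at # already done, nothing to repeat
    deliver_confirmation
    self.confirmation_sent_at = Time.now
  end

  def deliver_confirmation
    # stands in for actual email delivery; we just count invocations
    @deliveries = (@deliveries || 0) + 1
  end
end

user = User.new("jane@example.com", nil)
user.send_confirmation!
user.send_confirmation! # safe: the mail goes out exactly once
```

The state check at the top is the whole trick: once an activity knows whether it already ran, "just re-execute everything that got stuck" becomes a safe recovery strategy.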

Use Feature Flips

New features can cause joy, and also more headaches. Flickr was one of the first to add something called feature flips, a simple way to enable and disable features for all or only specific users. This way you can push new features onto your production systems without accidentally enabling them for all users; you can simply allow a small set of users, or just your customer, to use and play with them.

What's more important though: when a feature breaks in production for some reason, you can simply switch it off, disabling traffic on the systems involved, allowing you to take a breather and analyze the problem.

Feature flips come in many flavors, the simplest approach is to just use a configuration file to enable or disable them. Other approaches use a centralized database like Redis for that purpose, which has an added benefit for other parts of your application, but also adds new infrastructure components and therefore, more complexity and more points of failure.
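As an illustration of the simplest approach, a configuration-file-based flip might look like this sketch (the flag names and file format are invented; real implementations like Flickr's differ):

```ruby
require 'yaml'

# Feature flips from a YAML config: a flag is either on for everyone,
# off, or enabled for a small list of users.
class Features
  def initialize(yaml)
    @flags = YAML.safe_load(yaml) || {}
  end

  def enabled?(name, user: nil)
    flag = @flags.fetch(name.to_s, false)
    case flag
    when true, false then flag
    when Array then !!(user && flag.include?(user)) # per-user rollout
    else false
    end
  end
end

features = Features.new(<<~YAML)
  new_dashboard: true
  beta_search: [alice, bob]
YAML

features.enabled?("new_dashboard")              # on for everyone
features.enabled?("beta_search", user: "alice") # on for selected users only
```

Switching a broken feature off is then a one-line config change and a reload, not a redeploy.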

Fail And Degrade Gracefully

What happens when you unplug your database server? Does your application throw in the towel by showing a 500 error, or is it able to deal with the situation and show a temporary page informing the user of what's wrong? You should try it and see what happens.

Whenever something non-critical breaks, your application should be able to deal with it without anything else breaking. This sounds like an impossible thing to do, but it's really not. It just requires care, care your standard unit tests won't be able to deliver, and thinking about where you want a breakage to leak to the user, or where you just ignore it, picking up work again as soon as the failed component becomes available again.
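A sketch of that kind of care (the recommendation service and all names here are made up): when a non-critical component fails, serve a static fallback instead of leaking a 500 to the user.

```ruby
# Degrade gracefully: if the (hypothetical) recommendation service is
# down, fall back to a static list instead of breaking the whole page.
FALLBACK_RECOMMENDATIONS = ["Popular item A", "Popular item B"].freeze

def fetch_personalized_recommendations(user)
  raise IOError, "recommendation service unreachable" # simulate an outage
end

def recommendations_for(user)
  fetch_personalized_recommendations(user)
rescue IOError
  # log the failure here, then carry on with the degraded experience
  FALLBACK_RECOMMENDATIONS
end

recommendations_for("jane") # serves the fallback while the service is down
```

The page stays up, the failure gets logged, and the personalized version comes back on its own once the component recovers.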

Failing gracefully can mean a lot of things. There are failures that directly affect the user experience, a database failure comes to mind, and failures the user will notice only indirectly, e.g. through delays in delivering emails or in fetching data from an external service like Twitter, RSS feeds and so on.

When a major component in your application fails, a user will most likely be unable to use your application at all. When your database latency increases manifold, you have two options: try to squeeze through as much as you can, accepting long waits on your users' side, or let them know that it's currently impossible to serve them in an acceptable time frame, and that you're actively working on fixing or improving the situation. Which you should be, either way.

Delays in external services or asynchronous tasks are much harder for a user to notice. If fetching data from an external source, like an API, directly affects your site's latency, there's your problem.

Noticing problems in external services requires two things: monitoring and metrics. Only by tracking queue sizes, latency for calls to external services, mail queues and all things related to asynchronous tasks will you be able to tell when your users are indirectly affected by a problem in your infrastructure.

After all, knowing is half the battle.

Monitoring Sucks, You Need It Anyway

I've written in abundance on the virtues of monitoring, metrics and alerting. I can't stress enough how important it is to have a proper monitoring and metrics gathering system in place. It should be by your side from day one of any testing deployment.

Set up alerts for thresholds that seem like a reasonable starting point to you. Don't ignore alert notifications; once you get into that habit, you'll miss the one important notification that's real. Instead, learn about your system and its thresholds over time.

You'll never get alerting and thresholds right the first time, you'll adapt over time, identifying false negatives and false positives, but if you don't have a system in place at all, you'll never know what hit your application or your servers.

If you're not using a tool to gather metrics like Munin, Ganglia, New Relic, or collectd, you'll be in for a big surprise once your application becomes unresponsive for some reason. You'll simply never find out what the reason was in the first place.

While Munin has basic built-in alerting capabilities, chances are you'll add something like Nagios or PagerDuty to the mix for alerting.

Most monitoring tools suck, you'll need them anyway.

Supervise Everything

Any process that's required to be running at any time needs to be supervised. When something crashes be sure there's an automated procedure in place that will either restart the process or notify you when it can't do so, degrading gracefully. Monit, God, bluepill, supervisord, RUnit, the number of tools available to you is endless.
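To make the idea concrete, here's a toy sketch of what those supervisors do under the hood (the names and limits are invented; use Monit and friends for the real thing):

```ruby
# A toy supervisor: restart the child process when it exits with a
# failure, and give up after too many restarts so a human gets paged.
def supervise(max_restarts: 5, backoff: 1)
  restarts = 0
  loop do
    pid = fork { yield }  # run the supervised work in a child process
    Process.wait(pid)
    break if $?.success?  # clean exit: nothing left to do
    restarts += 1
    raise "still crashing after #{restarts} restarts" if restarts >= max_restarts
    sleep backoff         # give the world a moment before retrying
  end
end

# Usage (hypothetical): supervise { exec("my-app-server") }
```

The restart loop, the backoff, and the "give up and notify" limit are exactly the knobs tools like Monit and supervisord expose in their config files.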

Micromanaging people is wrong, but processes need that extra set of eyes on them at all times.

Don't Guess, Measure!

Whatever directly affects your users' experience affects your business. When your site is slow, users will shy away from using it, and with them goes revenue and therefore (usually) profit.

Whenever a user has to wait for anything, they're not willing to wait forever. If an uploaded video takes hours to process, they'll go to the next video hosting site. When a confirmation email takes hours to be delivered, they'll check out your competitor, taking the money with them.

How do you know that users have to wait? Simple, you track how long things in your application take, how many tasks are currently stuck in your processing queue, how long it took to process an upload. You stick metrics on anything that's directly or indirectly responsible for generating business value.

Without having a proper system to collect metrics in place, you'll be blind. You'll have no idea what's going on inside your application at any given time. Since Coda Hale's talk "Metrics Everywhere" at CodeConf and the release of his metrics library for Scala, an abundance of libraries for different languages has popped up left and right. They make it easy to include timers, counters, and other types of metrics in your application, allowing you to instrument code where you see fit. Independently, Twitter has led the way by releasing Ostrich, their own Scala library to collect metrics. The tools are here for you. Use them.
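To make "stick metrics on anything" concrete, here's a minimal hand-rolled registry. This is an illustration of the idea only, not the API of Coda Hale's metrics library or Ostrich, and the metric names are made up.

```ruby
# A tiny metrics registry: timers for anything worth measuring,
# counters for anything worth counting.
class Metrics
  def initialize
    @timers = Hash.new { |h, k| h[k] = [] }
    @counters = Hash.new(0)
  end

  # Time a block and record the elapsed seconds under the given name.
  def time(name)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    @timers[name] << Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    result
  end

  def increment(name)
    @counters[name] += 1
  end

  def snapshot
    { timers: @timers, counters: @counters }
  end
end

METRICS = Metrics.new
METRICS.time("upload.process") { sleep 0.01 } # wrap the code path you care about
METRICS.increment("upload.completed")
```

In a real setup you'd ship the snapshot off to Ganglia, Munin or a hosted service on a regular interval instead of keeping it in memory.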

The most important metrics should be easily accessible on some sort of dashboard. You don't need a big fancy screen in your office right away; a canonical place, e.g. a website including the most important graphs and numbers, where everyone can go and see what's going on at a glance, is a good start. Once you have that in place, the next step towards a company-visible dashboard is simply buying a big-ass screen.

All metrics should be collected in a tool like Ganglia, Munin or something similar. These tools make analysis of historical data easy; they allow you to make predictions or correlate the metrics gathered in your applications with other statistics like CPU and memory usage, I/O waits, and so on.

The importance of monitoring and metrics cannot be stressed enough. There's no reason why you shouldn't have it in place. Setting up Munin is easy enough, setting up collection using an external service like New Relic or Scout is usually even easier.

Use Timeouts Everywhere

Latency is your biggest enemy in any networked environment. It creeps up on you like the shadow of the setting sun. There's a whole bunch of reasons why, e.g. database queries will suddenly see a spike in execution time, or external services suddenly take forever to answer even the simplest requests.

If your code doesn't have appropriate timeouts, requests will pile up and maybe never return, exhausting available resources (think connection pools) faster than Vettel does a lap in Monte Carlo.

Amazon, for example, has internal contracts. Loading their home page involves dozens of requests to internal services. If any one of them doesn't respond in a timely manner, say 300 ms, the application serving the page will render a static snippet instead, thereby decreasing the chance of selling something and directly affecting business value.

You need to treat every call to an external resource as something that can take forever, something that potentially blocks an application server process forever. When an application server process or thread is blocked, it can't serve any other client. When all processes and threads lock up waiting for a resource, your website is dead.

Timeouts make sure that resources are freed and made available again after a grace period. When a database query takes longer than usual, not only does your application need to know how to handle that case, your database needs to as well. If your application has a timeout but your database will happily keep sorting those millions of records in a temp file on disk, you didn't gain a lot. If two dependent resources are within your hands, both need to be aware of contracts and timeouts, and both need to properly free resources when a request couldn't be served in a timely manner.

Use timeouts everywhere, but know how to handle them when they occur, and know what to tell users when their request didn't return quickly enough. There is no golden rule for handling a timeout; it depends not just on your application, but on the specific use case.
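In Ruby, for instance, that might look like the following sketch: explicit connect and read timeouts on an HTTP call, with the degraded path in one place. The URL and limits are examples, not recommendations.

```ruby
require 'net/http'

# Give the call a deadline instead of letting it block a server
# process forever; on failure, return nil and let the caller degrade.
def fetch_with_deadline(url, open_timeout: 1, read_timeout: 2)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: open_timeout,
                  read_timeout: read_timeout) do |http|
    http.get(uri.request_uri).body
  end
rescue Net::OpenTimeout, Net::ReadTimeout, SocketError, SystemCallError
  # DNS and connection failures degrade the same way as timeouts here
  nil
end

# Caller decides what to do: cached copy, static snippet, or an error page.
fetch_with_deadline("http://nonexistent.invalid/status") || "service unavailable"
```

Returning nil is just one convention; the important part is that the timeout and the fallback are explicit decisions, not accidents.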

Don't Rely on SLAs

The best service fails at some point. It will fail in the most epic way possible, not allowing any user to do anything. This doesn't have to be your service. It can be any service you directly or indirectly rely on.

Say, your code runs on Heroku. Heroku's infrastructure runs on Amazon's EC2. Therefore Heroku is prone to problems with EC2. If a provider like Heroku tells you they have a service level agreement in place that guarantees a minimum amount of availability per month or per year, that's worth squat to you, because they in turn rely on other external services, that may or may not offer different SLAs. This is not specific to Heroku, it's just an obvious example. Just because you outsourced infrastructure doesn't mean you're allowed to stop caring.

If your application runs directly on EC2, you're bound by the same problem. The same is true for any infrastructure provider you rely on, even a big hosting company where your own server hardware is colocated.

They all have some sort of SLA in place, and they all will screw you over with the terms of said SLA. When stuff breaks on their end, that SLA is not worth a single dime to you, even when you were promised to get your money back. It will never make up for lost revenue, for lost users and decreased uptime on your end. You might as well stop thinking about them in the first place.

What matters is what procedures any provider you rely on has in place in case of a failure. The important thing for you as one of their users is to not be left standing in the rain when your hosting world is coming close to an end. A communicative provider is more valuable than one that guarantees an impossible amount of availability. Things will break, not just for you. SLAs give you that false sense of security, the sense that you can blame an outage on someone else.

For more on this topic, Ben Black has written a two part series aptly named "Service Level Disagreements".

Know Your Database

You should know what happens inside your database when you execute any query. Period. You should know where to look when a query takes too long, and you should know what commands to use to analyze why it takes too long.

Do you know how an index is built? How and why your database picks one index over another? Why selecting a random record based on the wrong criteria will kill your database?

You should know these things. You should read "High Performance MySQL", "Oracle Internals", or "PostgreSQL 9.0 High Performance". Sorry, I didn't mean to say you should; I meant you must read them.

Love Your Log Files

In case of an emergency, a good set of log files will mean the world to you. This doesn't just include the standard set of log files available on a Unix system. It includes your application and all services involved too.

Your application should log important events, anything that may seem useful for analyzing an incident. Again, you'll never get this right the first time around; you'll never know up front all the details you may be interested in later. Adapt and improve, add more logging as needed. It should allow you to tune the log verbosity at runtime, either by using a feature switch or by accepting a Unix signal.

Separate request logging from application logging. Data on HTTP requests is just as important as application logs, but it's easier if you can sift through them independently. They're also a lot easier to aggregate in services like Syslog or Loggly when they're on their own.

For you Rails developers out there: using Rails.logger is not an acceptable logging mechanism. All your logged statements will be intermingled with Rails' next-to-unusable request logging output. Use a separate log file for anything that's important to your application.
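A sketch of both suggestions in a few lines of Ruby (the file name and the choice of signal are just examples):

```ruby
require 'logger'

# A dedicated application log, separate from the framework's request
# log, with verbosity adjustable at runtime via a Unix signal.
APP_LOG = Logger.new("application.log")
APP_LOG.level = Logger::INFO

# Toggle between INFO and DEBUG when the process receives USR1, so you
# can crank up verbosity during an incident without restarting anything.
Signal.trap("USR1") do
  APP_LOG.level = APP_LOG.level == Logger::INFO ? Logger::DEBUG : Logger::INFO
end

APP_LOG.info("payment processed")     # always logged
APP_LOG.debug("raw gateway response") # only visible after toggling to DEBUG
```

A `kill -USR1 <pid>` from the shell then flips the running process into debug mode while you're chasing a problem, and flips it back when you're done.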

Just like you should stick metrics on all things that are important to your business, log additional information when things get out of hand. Correlating log files with metrics gathered on your servers and in your application is an incredibly powerful way of analyzing incidents, even long after they occurred.

Learn the Unix Command Line

In case of a failure, the command line will be your best friend. Knowing the right tools to quickly sift through a set of log files, being able to find and set certain kernel parameters to adjust TCP settings, knowing how to get the most important system statistics with just a few commands, and knowing where to look for a specific service's configuration: all these things are incredibly valuable in case of a failure.

Knowing your way around a Unix or Linux system, even with just a basic toolset, will make your life much easier, not just in operations but also as a developer. The more tools you have at your disposal, the easier it will be for you to automate tasks and to not be scared of operations in general.

In times of an emergency, you can't afford to argue that your favorite editor is not installed on a system; you use what's available.

At Scale, Everything Breaks

Working at large scale is nothing anyone should strive for; it's a terrible burden, but an incredibly fascinating one. The need for scalability evolves over time. It's not something you can easily predict or assume without knowing all the details, the parameters, and the future. Out of those three, at least one is 100% guesswork.

The larger your infrastructure setup gets, the more things will break. The more servers you have, the larger the number of servers that are unavailable at any given time. That's not something you need to address right from the get-go, but it's something to keep in mind.

No service that's working at larger scale was originally designed for it. The code and infrastructure were adapted, the services grew over time, and they failed a lot. Something to think about when you reach for that awesome scalable database before even having any running code.

Embrace Failure

The bottom line is: stuff breaks, and everything breaks differently at scale. Embrace breakage and failure; it will help you learn and improve your knowledge and skill set over time. Analyze incidents using the data available to you, fix the problem, learn your lesson, and move on.

Don't embrace one thing, though: never let a failure happen again once you know what caused it the first time around.

Web operations is not solely about servers and installing software packages. It involves everything required to keep an application available, and your code needs to play along.

Required Reading

As 101s go, this is a short overview of what I think makes for a good starter set of operations skills. If you don't believe or trust me (which is a good thing), here's a list of further reading for you. By now, I even consider most of these required reading. The list isn't long, mind you. The truth, as of today, is still that you learn the most from personal experience on production systems. Both require one basic skill, though: you have to want to learn.

Shameless Plug

If you liked this article, you may enjoy the book I'm currently working on: "The NoSQL Handbook".

Tags: operations