A few weeks back I came across a post that struck home in several ways. "How I Fired myself" (cached version) is a short story of a developer who accidentally deleted the entire users table in production while working on a new feature. You should read the whole thing, go ahead, I'll wait for you.

What struck home was not just that he accidentally deleted data from the production database. I've certainly done similar things myself, accidentally removing data, setting MySQL options at runtime that caused the whole process to crash, amongst other things.

The thing that really struck me was the story that unfolded after the incident. I came across the article just when I was reading Sidney Dekker's "The Field Guide to Human Error", a fascinating read, if I may add.

If you look at what happened after the incident, it's clear that everyone blames him, as if he had the malicious intent to just delete the whole table and cause unhappiness. His boss accuses him of potentially having lost the company millions, putting aside the possibility that he's helped make these millions too, and that he very likely didn't come in to work that day intending to lose a few of the company's millions.

This kind of reprimanding and the pressure from the team are what eventually caused this poor guy to quit his job. Which is a shame, because there is a lot more to this incident than meets the eye.

Nassim Taleb points out in "The Black Swan": "we are explanation-seeking animals who tend to think that everything has an identifiable cause and grab the most apparent one as the explanation."

We're quick to blame the human who seemingly caused the accident or deleted data from the production database. But that misses the bigger lesson: learning from the incident and improving the organization around the human.

As Scott Snook put it in "Friendly Fire", "look beyond individual error by framing puzzling behavior in complex organizations as individuals struggling to make sense."

There are two things that jump out when reading the text. The first is the fact that he's testing his local code against the production database, with full access to create and remove data.

Add to that the fact that backups for the production database had been disabled (by someone in the company) two months before the incident. Let's look at both in more detail.

Testing against the production database

You could argue right away that this developer was irresponsible to test his local code against the production database.

But if you reframe it and look at the bigger picture, the question emerges: why does an organization whose data is worth millions let developers test their local code against the production database in the first place?

It is not uncommon for young startups to start out like this. It's just much easier, and during startup crunch time any means that helps the team move and ship faster is acceptable, even if there's a slight risk involved.

But, the longer the team continues to work in a mode like this, the more it gets used to testing against production data, removing and recreating data as needed. It becomes accepted practice. Every day that passes without any incident makes people more confident that they can continue to do what they've done for months or maybe even years.

In "Friendly Fire", Snook introduces the concept of practical drift, "the slow steady uncoupling of practice from written procedure."

While there may not have been a written procedure saying not to develop against the production database, I was still reminded of this concept. The team drifted into the belief that they could keep doing what they'd been doing for a while without any issues.

What's more interesting to ask is why the organization didn't encourage developers to work in their own sandboxes, or at least on a staging system, where they can't harm the valuable production data.

Asking why will very likely lead you to more puzzle pieces that need to be questioned. Pieces that just happened to come together on this one occasion to cause lots of harm to the business. While no one saw this coming, there was always a non-zero chance that it could happen.

In this case, it's very likely the organization that didn't allow the development or operations teams to set up proper environments for developers. The same bosses who reprimanded the poor guy possibly didn't feel there was enough time or money around to invest in a proper development environment.

In a similar vein, another question could be why the developers had full access to the production database. Why were there no procedures required for deleting data from the production database? You can keep going, and you will uncover more details that all came together to help trigger this one accident.

Database backups disabled

Deleting data is not something that's done every day, but incidents where data gets accidentally removed during normal maintenance are not unheard of. There's always a possibility for this to happen, even during normal operations. Just think of the Amazon Elastic Load Balancer outage last Christmas.

What really screamed out at me was that someone in the company had cancelled the automated database backups, without any automated means set up to take their place.

Think about it: data that's supposedly worth millions hadn't been backed up for two months.

Was it this developer's fault that the normal safety net for any database operation wasn't in place? Highly unlikely.

We're again looking at an issue in the wider organization this guy was working in. The obvious question is: why were the backups cancelled in the first place? Was it because they were deemed too expensive? Was operations working on a replacement that was less costly but never got around to deploying it because there were more pressing issues that needed to be handled?

I found all this utterly fascinating, and I wanted to sit down with everyone involved to figure out why all these things were the way they were, and how they could come together in this one occasion to cause harm to the entire organization.

But most importantly, what the organization could learn and improve so that an issue under similar circumstances becomes less likely. Note that I didn't say to prevent these accidents from happening again. They will happen again; the real question is how the entire organization will handle the next one.

There is no single root cause

If you look at the things that came together to form this single incident, you'll notice that it wasn't just one little thing that caused it. It wasn't the developer who just so happened to delete the wrong table.

It was a number of causes that came together to strike hard, all of them very likely to be bigger issues inside the organization rather than a problem with the individual. Again, quoting from "The Black Swan": "a small input in a complex system can lead to nonrandom large results, depending on very special conditions."

The further back you go in time, the more you'll find a tangled web of lots of little causes, which in turn have their own causes, that just so happened to come together, seemingly at random, to form this event that no one saw coming. But "while in theory randomness is an intrinsic property, in practice, randomness is incomplete information, what I called opacity" ("The Black Swan").

The important take-away, which is framed nicely by this little story: each little thing is necessary, but only jointly are they sufficient.

Every failure is an opportunity to learn, to improve how you run your business. Wouldn't it be a waste to ignore this invaluable insight?

Hi, I'm Mathias, and I'm a developer. Unlike a lot of you at this conference, I'm far from being a monitoring expert. If anything, I'm a user, a tinkerer with all the great tools we're hearing about at this conference.

I help run a little continuous integration service called Travis CI. For that purpose I built several home-baked things that help us collect metrics and trigger alerts.

I want to start with a little story. I spend quality time at coffee shops and I enjoy peeking over the shoulders of the guy who's roasting coffee beans. Next to the big roasting machine they commonly have a laptop with pretty graphs showing how the temperature in the roaster changes over time. On two occasions I found myself telling them: "Hey cool, I like graphs too!"

On the first occasion I looked at the graph and noticed that it'd update itself every 2-3 seconds. I mentioned that to the roaster and he said: "Yeah, I'd really love it if it could update every second." In just two seconds the temperature in the roaster can already drop by almost a degree (Celsius), so he was lacking the granularity to get the best insight into his system.

The second roaster did have one second resolution, and I swooned. But I noticed that every minute or so, he wrote down the current temperature on a sheet of paper. The first guy had done that too. I was curious why they'd do that. He told me that he took it as his reference sheet for the next roasting batch. I asked why he didn't have the data stored in the system. He replied that he didn't trust it enough, because if it lost the information he wouldn't have a reference for his next batch.

He also keeps a set of coffee bean samples around from previous roasts, roasts where the outcome is known to have been great. Even coffee roasters have confirmation bias, though to be fully fair, when you're new to the job, any sort of reference can help you move forward.

This was quite curious: they had the technology, yet they didn't trust it enough with their data. But heck, they had one-second resolution and the means to measure data from live sensors in real time.

During my first jobs as a developer touching infrastructure, five-minute collection intervals and RRDtool graphs were still very much en vogue. My alerts basically came from Monit throwing unhelpful emails at me stating that some process had just changed from one state to another.

Since my days with Munin a lot has changed. We went through the era of #monitoringsucks, which, fortunately, quickly turned into the era of #monitoringlove. It's been pretty incredible watching this progress as someone who loves tinkering with new and shiny tools and visualization possibilities. We've seen the emergence of crazy new visualization ideas, like the horizon chart, and we've seen the steady rise of using modern web technologies to render charts, while seeing RRDtool being taken to the next level to visualize time series data.

New approaches providing incredibly detailed insight into network traffic and providing stream analysis of time series data have emerged.

One second resolution is what we're all craving, looking at beautiful and constantly updating charts of 95th percentile values.

And yet, how many of you are still using Nagios?

There are great advances in monitoring happening at the moment, and I enjoy watching them as someone who greatly benefits from them.

Yet, I'm worried that all these advances still don't focus enough on the ones who are supposed to use them: humans.

There's lots of work going on to solve problems to make monitoring technology more accessible, yet I feel like we haven't solved the first problem at hand: to make monitoring something that's easy to get into for people new to the field.

Monitoring still involves a lot of looking at graphs, correlating several different time series after the fact, and figuring out and checking for thresholds to trigger alerts. In the end, you still find yourself looking at one or more graphs trying to figure out what the hell it means.

Tracking metrics has become very popular, thanks to Coda Hale's metrics library, which inspired a whole slew of libraries for all kinds of languages, and tools like StatsD, which made it very easy to throw any kind of metric at them and have it pop up in a system like Graphite, Librato Metrics, Ganglia, etc.
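To make that concrete, here's a minimal sketch of what throwing a metric at StatsD looks like. It assumes a StatsD daemon listening on localhost:8125 and speaks its plain-text UDP protocol directly; the metric names and the run_build() helper are made up for illustration.

```python
import socket
import time

# StatsD speaks a plain-text protocol over UDP: "name:value|type",
# where type is c (counter), g (gauge) or ms (timer).
STATSD_ADDR = ("127.0.0.1", 8125)  # assumes a local StatsD daemon
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def statsd_send(metric: str) -> None:
    sock.sendto(metric.encode("utf-8"), STATSD_ADDR)

def run_build() -> None:
    time.sleep(0.1)  # stand-in for the work being measured

# Count an event and time an operation; StatsD aggregates the values
# and forwards them to a backend like Graphite or Librato Metrics.
statsd_send("builds.started:1|c")

start = time.time()
run_build()
statsd_send("builds.duration:%d|ms" % int((time.time() - start) * 1000))
```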

Yet the biggest question that I get every time I talk to someone about monitoring, in particular people new to the idea, is: "what should I even monitor?"

With all the tools we have at hand, helping people to find the data that matters for their systems is still among the biggest hurdles that must be conquered to actually make sense of metrics.

Can we do a better job of educating people on what they should track, what they could track, and how they can figure out the most important metrics for their system? It took us six months to find the single metric that best reflects the current state of our system. I called it the soul metric, the one metric that matters most to our users and customers.

We started tracking the time since the last build was started and since the last build was finished.
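In practice that boils down to a pair of gauges. Here's a rough sketch of the idea (the names and timestamps are illustrative, not our actual code): report the age of the most recent build events and watch for values that keep climbing.

```python
import time

def seconds_since(timestamp: float) -> int:
    """Age of an event, e.g. when the last build started or finished."""
    return int(time.time() - timestamp)

# Hypothetical timestamps of the most recent build events.
last_build_started_at = time.time() - 120
last_build_finished_at = time.time() - 300

# Reported as gauges (e.g. via StatsD) and graphed over time; a value
# that keeps growing means no builds are being started or finished.
print("builds.last_started.age:%d|g" % seconds_since(last_build_started_at))
print("builds.last_finished.age:%d|g" % seconds_since(last_build_finished_at))
```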

On our commercial platform, where customers run builds for their own products and customer projects, the weekend is very quiet. We only run one tenth of the number of builds on a Sunday compared to a normal weekday. Sometimes we don't run any build in 60 minutes. Suddenly checking when a build was last triggered makes a lot less sense.

Suddenly we're confronted with the issue that we need to look at multiple metrics in the same context to see if a build should even have been started, as that fact is solely based on a customer pushing code. We're looking at measuring the absence of data (no new commits) and correlating it with data derived from several attributes of the system, like no running builds and no build requests being processed.

The only reasonable solution I could come up with, mostly thanks to talking to Eric from Papertrail, is that if you need to measure something that requires the existence of activity, you have to make sure this activity is generated on a regular basis.
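Here's a sketch of that idea, with an entirely hypothetical canary endpoint standing in for whatever triggers a synthetic build in your system: generate a small amount of activity on a fixed schedule, so the "time since last build" metric keeps meaning something even on a quiet Sunday.

```python
import time
import urllib.request

# Hypothetical endpoint that triggers a no-op "canary" build; the URL
# is illustrative, not a real Travis CI API.
CANARY_URL = "https://ci.example.com/api/canary-build"
INTERVAL = 15 * 60  # seconds between synthetic builds

def trigger_canary_build() -> bool:
    try:
        # POST an empty body to kick off the canary build.
        with urllib.request.urlopen(CANARY_URL, data=b"") as response:
            return response.status == 200
    except OSError:
        return False

while True:
    # Even when no customer pushes code, there is now at least one
    # build every INTERVAL seconds to keep the metric alive.
    trigger_canary_build()
    time.sleep(INTERVAL)
```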

In hindsight, it's so obvious, though it brings up a question: if the thing that generates the activity fails, does that mean the system isn't working? Is this worth an alert, is this worth waking someone up for? Certainly not.

This leads to another interesting question: if I need to create activity to measure it, and if my monitoring system requires me to generate this activity to be able to put a graph and an alert on it, isn't my monitoring system wrong? Are all the monitoring systems wrong?

If a coffee roaster doesn't trust his tools enough to give him a consistent insight into the current, past and future roasting batches, isn't that a weird mismatch between humans and the system that's supposed to give them the assurance that they're on the right path?

A roaster still trusts his instincts more than he trusts the data presented to him. After all, it's all about the resulting coffee bean.

Where does that take us and the current state of monitoring?

We spend an eternity looking at graphs, right after an alert was triggered because a certain threshold was crossed. Does that alert even mean anything, is it important right now? It's where a human operator still has to decide if it's worth the trouble or if they should just ignore the alert.

As much as I enjoy staring at graphs, I'd much rather do something more important than that.

I'd love for my monitoring system to be able to tell me that something out of the ordinary is currently happening. It has all the information at hand to make that decision at least with a reasonable probability.
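To illustrate the kind of decision I mean, here's a deliberately naive sketch of flagging "out of the ordinary" values: compare each new sample against the mean and standard deviation of a sliding window. Real systems need something far more robust (seasonality, trends, noisy metrics), and none of this reflects any particular tool's implementation.

```python
from collections import deque
from math import sqrt

def anomalous(history: deque, value: float, sigmas: float = 3.0) -> bool:
    """Flag a value that sits far outside the recent history."""
    if len(history) < 10:
        return False  # not enough data to judge yet
    mean = sum(history) / len(history)
    variance = sum((x - mean) ** 2 for x in history) / len(history)
    std = sqrt(variance)
    return std > 0 and abs(value - mean) > sigmas * std

window = deque(maxlen=60)  # e.g. the last 60 one-second samples
for sample in [12, 13, 12, 11, 13, 12, 12, 14, 13, 12, 12, 95]:
    if anomalous(window, sample):
        print("out of the ordinary:", sample)
    window.append(sample)
```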

But much more than that, I'd like our monitoring system to be built for humans, reducing the barrier of entry for adding monitoring and metrics to an application and to infrastructure without much hassle. How will we get there?

Looking at the current state of monitoring, there's a strong focus on technology, which is great, because it helps solve bigger issues like data storage, visualization and presentation, and stream analysis. I'd love to see this all converge on the single thing that has to make the call in the end: a human. Helping them make a good decision should be very high on our list.

There is a fallacy in this wish though. With more automation comes a cognitive bias to trust what the system is telling me. Can the data presented to me be fully trusted? Did the system actually make the right call in sending me an alert? This is something only a human can figure out, just as a coffee roaster needs to trust his instincts even though the variables for every roast are slightly different.

We want to avoid our users having to keep a piece of paper around that tells them exactly what happened the last time this alert was triggered. We want to make sure they don't have to look at samples of beans at different stages to find confirmation for the problem at hand. If the end user always compares previous samples of data to the most recent one, the only thing they'll look for is confirmation.

Lastly, the interfaces of the monitoring tools we work with every day are designed to be efficient, they're designed to dazzle with visualization, yet they're still far from being easy to use. If we want everyone in our company to be able to participate in running a system in production, we have to make sure the systems we provide have interfaces that treat them as what they are: people.

But most importantly, I'd like to see the word spread on monitoring and metrics, making our user interfaces more accessible and telling the tale of how we monitor our systems, and how other people can monitor theirs. There's a lot to learn from each other, and I love things like hangops and OpsSchool, they're great starts to get the word out.

Because writing things down makes it easier to realize where you are and to figure out where you want to be.

Failure is still one of the most undervalued things in our business, in most businesses really. We still tend to point fingers elsewhere, blame the other department, or try anything to cover our asses.

How about we do something else instead: embrace failure openly, make it part of our company culture, and do everything we can to make sure every failure is turned into a learning experience, into an opportunity?

Let me start with some illustrating examples.

Wings of Fury

In 2010, Boeing tested the wings of a brand new 787 Dreamliner. In a giant hangar, they set up a contraption that'd pull the wings of a 787 up, with so much pull that the wings were bound to break.

Eventually, and after they'd been flexed upwards of 25 feet, the wings broke spectacularly.

The amazing bit: all the engineers watching it happen started to cheer and applaud.

Why? Because they anticipated the failure under the exact circumstances where it happened, at about 150% of what the wings handle in normal operation.

They can break things loud and proud, they can predict when their engineering work falls apart. Can we do the same?

Safety first

I've been reading a great book, "The Power of Habit", and it outlines another story of failure and how tackling that was turned into an opportunity to improve company culture.

When Paul O'Neill, later to become Secretary of the Treasury, took over management of Alcoa, one of the United States' largest aluminum production companies, he made it his first and foremost priority to tackle the safety issues in the company's production plants.

He put rules in place requiring that any accident be reported to him within just a few hours, along with remedies for how that kind of accident would be prevented in the future.

While his main focus was to prevent failures, because they would harm or even kill workers, what he eventually managed to do was implement a company culture where even the smallest suggestion from any worker to improve safety or efficiency would be considered and handed up the chain of management.

This fostered a culture of highly increased communication between production plants, between managers, between workers.

Failures and accidents still happened, but were in sharp decline, as every single one was taken as an opportunity to learn and improve the situation to prevent them from happening again.

It was a chain of post-mortems, if you will. O'Neill's interest was to make everyone part of improving the overall situation without having to fear blame. Everyone was made to feel like an important part of the company. By then, 15,000 people worked at Alcoa.

This had an interesting effect on the company. In twelve years, O'Neill managed to increase Alcoa's revenues from $1.5 billion to $23 billion.

His policies became an integral part of the company's culture and ensured that everyone working for it felt like an integral part of the production chain.

Floor workers were given permission to shut down the production chain if they deemed it necessary, and were encouraged to speak up when they noticed even the slightest risk in any activity in the company's facilities.

To be quite fair, competitors were pretty much in the dark about these practices, which gave Alcoa a great advantage on the market.

But within a decade of running the company, he had transformed its culture into something that sounds strikingly similar to the ideas of DevOps. He managed to make everyone feel responsible for delivering a great product and enabled everyone to take charge should something go wrong.

All that is based on the premise of trust. Trust that when someone speaks up, they will be taken seriously.

Three Habits of Failure

If you look at the examples above, some patterns come up. There are companies outside of our field that have mastered, or at least taken on, an attitude of accepting that failure is inevitable, anticipating failure, and dealing with and learning from failure.

Looking at some more examples it occurred to me that even doing one of these things will improve your company's culture significantly.

How do we fare?

We fail, a lot. It's in the nature of the hardware we use and the software we build. Networks partition, hard drives fail, and software bugs creep into systems that can lead to cascading failures.

But do we, as a community, take enough advantage of what we learn from each outage?

Does your company hold post-mortem meetings after a production outage? Do you write public post-mortems for your customers?

If you don't, what's keeping you from doing so? Is it fear of giving your competitors an advantage? Is it fear of giving away too many internal details? Fear of admitting fault in public?

There's a great advantage in making this information public. Usually, it doesn't really concern your customers what happened in all detail. What does concern them is knowing that you're in control of the situation.

A post-mortem follows three Rs: regret, reason and remedy.

They're a means to say sorry to your customers, to tell them that you know what caused the issues and how you're going to fix them.

On the other hand, post-mortems are a great learning opportunity for your peer ops and development people.

Web Operations

This learning is an important part of improving the awareness of web operations, especially during development. There's a great deal to be learned from other people's experiences.

Web operations is a field that is mostly learning by doing right now. Which is an important part of the profession, without a doubt.

If you look at the available books, there are currently three books that give insight into what it means to build and run reliable and scalable systems.

"Release It!", "Web Operations" and "Scalable Internet Architectures" are the ones that come to mind.

My personal favorite is "Release It!", because it raises developer awareness on how to handle and prevent production issues in code.

It's great to see the circuit breaker and the bulkhead pattern introduced in this book now being popularized by Netflix, who openly write about their experiences implementing them.
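For anyone who hasn't come across it, here's a minimal sketch of the circuit breaker pattern in the spirit of "Release It!". It's not Netflix's implementation, just the core idea: after repeated failures, stop calling the troubled dependency and fail fast until a timeout has passed.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed.

    Closed: calls pass through and failures are counted.
    Open: calls fail fast until the reset timeout has passed.
    Half-open: one trial call decides whether to close or re-open.
    """

    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open, failing fast")
            # Timeout expired: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = time.time()  # trip (or re-trip) the breaker
            raise
        else:
            self.failures = 0
            self.opened_at = None  # success closes the breaker again
            return result

# Usage: breaker.call(some_flaky_remote_call, arg1, arg2)
```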

Netflix is a great example here. They're very open about what they do, they write detailed post-mortems when there's an outage. You should read their engineering blog, same for Etsy's.

Why? Because it attracts engineering talent.

If you're looking for a job, which company would you rather work for? One that encourages taking risks while also taking responsibility for fixing issues when failure does come up, and that enables a culture of fixing and improving things as a whole, or one that's quick to put blame?

I'd certainly choose the former.

Over the last two years, Amazon has also realized how important this is. Their post-mortems have become very valuable for anyone interested in things that can happen in multi-tenant, distributed systems.

If you remember the most recent outage on Christmas Eve, they even had the guts to come out and say that production data was deleted by accident.

Can you imagine the shame these developers must feel? But can you imagine a culture where the issue itself is considered an opportunity to learn instead of blaming or firing you? If only to learn that accessing production data needs stricter policies.

It's a culture I'd love to see fostered in every company.

Regarding ops education, there were some great developments last year that are worth mentioning. hangops is a nice little circle, streamed live (mostly) every Friday, and available for anyone to watch on YouTube afterwards.

Ops School has started a great collection of introductory material on operations topics. It's still very young, but it's a great start, and you can help move it forward.

Travis CI

At Travis CI, we're learning from failure, a lot. As a continuous integration platform, it started out as a hobby project and was built with a lot of positive assumptions.

It used to be a distributed system that always assumed everything would work correctly all the time.

As we grew and added more languages and more projects, this ideal fell apart pretty quickly.

This is a symptom of a lot of developer-driven projects: there's just so little public information on how to do it right, on how distributed systems are built and run at other companies so that they work reliably.

We decided to turn every failure into an opportunity to share our learnings. We're an open source project, so it only makes sense to be open about our problems too.

Our audience and customers, who are mostly developers themselves, seem to appreciate that. I for one am convinced that we owe it to them.

I encourage you to do the same, to share details on your development and on how you run your systems. You'll be surprised how introducing these changes can affect how you work together as a team.

Cultural evolution

This insight didn't come easy. We're a small team, and we were all on board with the general idea of openness about our operational work and about the failures in our system.

That openness brings with it the need to own your systems, to own your failures. It took a while for us to get used to working together as a team to get these issues out of the way as quickly as possible and to find a path to a fix.

In the beginning, it was still too easy to look elsewhere for the cause of the problem. Blame is one side of the story, hindsight bias is the other. It's too easy to point out that the issue has been brought up in the past, but that doesn't contribute anything to fixing it.

A more helpful attitude than saying "I've been saying this has been broken for months" is to say "Here's how I'll fix it." You own your failures.

The only thing that matters is delivering value to the customer. Putting aside blame and admitting fault while doing everything you can to make sure the issue is under control is, in my opinion, the only way you can do that, with everyone in your company on board.

Accepting this might just help transform your company's culture significantly.