A couple of years ago, I was introduced to the idea and practice of post-mortems through a talk by John Allspaw. I owe him a lot; he's been an inspiration.

Back then he talked about post-mortems as a company-internal means to triage what led to and what happened during an outage or production incident.

A post-mortem is a meeting where all stakeholders can and should be present, and where people should bring together their view of the situation and the facts that were found during and after the incident. The purpose is to collect as much data as possible and to figure out how the impact of a similar future incident can be reduced.

One precondition for a useful post-mortem is that it must be blameless. The purpose is not to put blame on anyone on the team, the purpose is to figure out what happened and how to improve it.

This can have a significant impact on a company's culture. Speaking for myself, the idea of blameless post-mortems has changed the way I think about operations, how I think about working on a team, how I think about running a company.

Humans Have Good Intentions

The normal focus of any team is to deliver value, whether to customers, to stakeholders, or to other teams in the same organization.

During an outage, this view can unfortunately change fast, and for the worse, mostly unintentionally. Under pressure, it's easy to put blame on someone. Someone may have accidentally changed a configuration on a production system, or accidentally dropped a table in the production database.

But the most important idea of a blameless post-mortem is that humans are generally well-intentioned. When someone does something wrong, the assumption should be that it happened through circumstances beyond a single human.

Humans usually act in good faith, to the best of their knowledge, and within the constraints and world view of an organization.

That's where you need to start looking for problems.

Disregard the notion of human error. It's not helpful in finding out what's broken and what you can fix, because it assumes that what's broken, and what needs to be fixed, are the humans in your organization.

Humans work with and in complex systems, whether it's your organization or the production environment. Both feed into each other, both influence each other. The humans acting in these systems trigger behaviours that no one has foreseen, that no one can possibly foresee. That makes humans mere actors, parts of complex systems that can be influenced by an infinite number of factors.

That's where you need to start looking.

It's complex systems all the way down. The people on your team and in your company are trying their best to make sense of what they do and of how they interact with each other.


With the idea that the humans working in an organization are generally well-intentioned comes a different idea.

The idea of trust.

When you entrust a team with running your production systems, but you don't trust them to make the right decisions when things go wrong, or even when things go right, you're in deep cultural trouble.

Trust means that mistakes aren't punishable acts, they're opportunities to learn.

Trust means that everyone on the team, especially the people working at the sharp end of the action, can speak up, point out issues, work on improving them.

Trust means that everyone on the team will jump in to help rather than complain when there's a production incident.

Focus on Improvement

In the old world, we focused on mean time between failures (MTBF), on maximizing the time between production incidents. We also focused on trying to figure out who was at fault, who was to blame for an issue. Firing that person was usually the most logical consequence.

In the new world, we focus on learning from incidents, on improving the environment in an organization, the environment its people are working in. The environment that contributes to how people working in it behave both during normal work and during stressful situations.

When systems are designed, no one can predict all the ways in which they behave. There are too many factors contributing to how a system runs in production, to how your organization behaves as a whole. It's impossible to foresee all of them.

Running a system in production, going through outages, doing post-mortems, all these contribute to a continuous process of improving, of learning about those systems.

Bonus: Get Rid of the Root Cause

A common notion is still that there is this one single cause that you can blame a failure on, the infamous root cause. Whether it's a human, whether it's a component in your system, something can be blamed, and fixing it will make the problem go away.

With the idea of complex systems, this notion is bogus. There is no single root cause.

As Sidney Dekker put it:

What you call root cause is simply the place where you stop looking any further.

So many complex systems interact with each other: your organization, the teams and people in it, the production environment, the other environments it interacts with. Too much input from one of these systems can trigger unexpected behaviour in another part.

As Richard Cook put it in "How Complex Systems Fail":

Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.

While they won't transform your organization overnight, accepting post-mortems into your operational workflow can be a catalyst for long-term change.

Tags: ops

I've been reading "Thinking, Fast and Slow", which is an interesting exploration of how the mind works. The book introduces two systems that make up how we think.

One is fast and intuitive, it's effortless to use. It monitors our surroundings, processes data, finds possible correlations, letting us draw quick conclusions.

The other is slow and requires effort to work with. It's the one that empathy requires us to use, for instance. When we need to really think about a problem, contemplate something another person tells us, this system is being used.

Three quotes in the book stood out to me:

A reliable way to make people believe in falsehoods is frequent repetition, because familiarity is not easily distinguished from truth.

This one as well:

When uncertain, System 1 bets on an answer, and the bets are guided by experience. The rules of the betting are intelligent: recent events and the current context have the most weight in determining an interpretation. When no recent event comes to mind, more distant memories govern.

Here's the last one:

The confirmatory bias of System 1 favors uncritical acceptance of suggestions and exaggeration of the likelihood of extreme and improbable events.

I've been contemplating how this explains our behaviour in stressful situations in particular, during a production outage, for instance.

The three quotes play quite well together, and they do make sense if you're familiar with handling an outage situation.

When something is broken, we tend to look at metrics and logs first to find something that stands out, unusual patterns. We look for patterns that match previous experiences, and if we see any, we tend to be quick to draw conclusions based on those patterns and the previous experiences.

Confirmation bias is the name for this phenomenon, and it presents an interesting challenge well beyond handling production outages. Given that System 1 is the one that drives our initial impressions of any situation, it seems impossible to overcome.

Another quote from the book:

System 1 is highly adept in one form of thinking — it automatically and effortlessly identifies causal connections between events, sometimes even when the connection is spurious.

Sometimes we come across scenarios that we find an explanation for, but it may not even be the correct one. When we run into such a scenario again, drawing the same conclusion requires less and less effort, as our System 1 is quick to recognize the pattern and give us a solution, even if it's the wrong one.

Repetition of events and of our own conclusions, as false as they may be, only leads us to believe the falsehood even more.

Your beliefs, and even your emotional attitude, may change (at least a little) when you learn that the risk of an activity you disliked is smaller than you thought.

When you put these things together, they can form a provoking thought. I've been writing about practical drift recently. According to Scott Snook in "Friendly Fire", practical drift is the slow steady uncoupling of practice from written procedure.

Suppose you run into an incident you haven't seen before. It's been an unknown risk until then, and in turn, you didn't know what the potential consequences were, or you misjudged the risk of this particular scenario.

It did happen, the incident had a certain impact, and you learned some of the contributing factors and signals that helped you identify the problem. They're now your freshest memory of a recent incident.

The next incident with similar signals is sure to guide you in a similar direction; that's the influence of System 1.

As these issues continue to come up, possibly untreated, unfixed, you and your organization get used to the signals and the incident itself. There's now an unwritten procedure on how to handle this particular situation. In the worst case, someone gets an alert, ignores it, thinking "It'll clear."

This is where you slowly and collectively drift away from your initial assumption that this kind of incident is risky. System 1 continues to help you to identify the patterns and come up with the solution. It worked the last time, right?

It's fascinating that, when you look at practical drift from this perspective, it seems inevitable.

Tags: ops, humanfactors

For the last two years, I've been working on Travis CI, a hosted continuous integration and deployment platform. It first started out as a free service for open source projects on GitHub, but has since evolved into a hosted product for private projects as well.

The fascinating bit for me was, right from the start, that the entire platform, inarguably an infrastructure product, is built and runs on top of other infrastructure products.

Travis CI, for the most part, runs on infrastructure managed and operated by other people.

Most of the code runs on Heroku, our RabbitMQ is hosted by CloudAMQP, our database is run by Heroku Postgres, our build servers are managed by Blue Box, our logs go into Papertrail, our metrics to Librato, our alerts come from OpsGenie, our status page is hosted on StatusPage.io, even our Chef server, the one bit that we use to customize some of our servers, is hosted.

In a recent episode of Hangops, we talked about buying vs. building infrastructure. I thought it was well worth elaborating on why we went on to build Travis CI on top of other infrastructure products rather than build and automate these things ourselves.

Operational Expenses vs. Time Spent Building

The most obvious reason why you'd want to buy rather than build is to save time.

In a young company trying to build out a product, anything that saves you time but adds value is worth paying for.

You can build anything, if you have the time to do it.

This is an important trade-off. You're spending money, possibly on a monthly basis to use a service rather than spend the time building it yourself.

A status page is a classic example. Surely it should be doable to build one yourself in just a few days, yes?

But then, to be really useful, your custom status page needs an API, for easy interaction with your Hubot. Maybe you also want a way to integrate metrics from external sources, and you want it to include things like scheduled maintenances.

On top of that, (hopefully) a pretty user interface.

That's the better part of two weeks, if not more. On top of that, you need to run it in production too. It's one more distraction from running your core product.

Other people may be able to do this a lot better than you. They help you free up time to work on things that are relevant to your products and your customers.

In return, you pay a monthly fee.

Surely, you say, building it yourself is practically free compared to a monthly fee?

Your time is very valuable. It's more valuable spent on your own product than on building other things around it.

The problem with time spent building things is that you can't put a number on it. You're basically comparing a monthly fee to what looks like a big fat zero. Because heck, you built it yourself, it didn't cost anything.

This is the classic tradeoff of using paid services. You could very well build it yourself, but you could also spend your time on other things. You could also use and run an open source solution, but that too needs to be maintained, operated and upgraded.

If this sounds one-sided, that's unintentional. I have a history of building things myself, racking my own servers, provisioning them myself.

But there are things to keep in mind when it comes to spending time building things. There's a non-zero cost attached to this, it's just not visible as the monthly invoice you're getting from a service. That cost is hard to fathom as it's hard to put a numeric value on the time spent building it.

When you have the resources and can afford to, it makes sense to start pulling things in-house.

For us, not having to take care of a big chunk of our infrastructure ourselves is a big benefit, allowing us to focus on the more relevant bits.

But letting other folks run core parts of your infrastructure doesn't come without risks either.

Risks of Downtime and Maintenance

When you buy into the idea of someone else maintaining more or less vital parts of your infrastructure, there's a risk involved.

You're bound to any problems they might have with their infrastructure, with their code. In multi-tenant systems, any operational issues tend to ripple through the system and affect several customers rather than just one.

You're also bound to their maintenance schedules. Amazon's RDS service, to pick one example, allows you to specify a maintenance window for your database instances through their API.

The full risk of how this affects your own application is hard, if not impossible, to calculate.

A part of your infrastructure could go down at any time, and it's mostly out of your hands to respond to it. What you can and should do is harden your code to work around such outages, if at all possible.
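One common way to do that hardening is to retry transient failures with exponential backoff and jitter, so your application rides out short provider hiccups instead of failing immediately. This is a minimal sketch, not our actual implementation; the function names and the retried exception type are illustrative:

```python
import random
import time


def with_retries(func, attempts=5, base_delay=0.5):
    """Call func, retrying transient failures with exponential backoff.

    A sketch: a production version would also cap total elapsed time
    and distinguish retryable errors from fatal ones.
    """
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the failure
            # back off exponentially, with jitter to avoid
            # every client retrying in lockstep
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Wrapping calls to an external service this way turns a brief outage into a few seconds of latency rather than an error for your users.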

One question to ask is how vital this particular piece of infrastructure is to your application and therefore, your business.

If it's in the critical path, if it affects your application significantly when it goes down, there are options. Not all managed services are automatically multi-tenant. Some offer the ability to have dedicated but managed setups. Some even offer high-availability capabilities to reduce the impact of single nodes going down.

Both our PostgreSQL database and our RabbitMQ setup are critical parts of Travis CI. Without the database, we can't store or read any data. Without our message queue, we can't push build logs and build jobs through the system, effectively leaving the system unable to run any tests for our customers.

We started out on multi-tenant setups for both. On our PostgreSQL database, the load was eventually way too high for the small size of the database setup.

For our RabbitMQ, we were easily impacted by other clients in the system. RabbitMQ in particular can be gnarly to work with when lots of clients share the same cluster. One client producing an unusual amount of messages can grind everyone else in the system to a halt.

Eventually, we ran both parts on dedicated infrastructure, but still fully managed. There's still a chance of things going down, of course. But the impact is less than if an entire multi-tenant cluster goes down.

Putting parts that were in the critical path on dedicated infrastructure has been working well for us. The costs certainly went up, but we just couldn't keep making excuses for why Travis CI was down.

When it comes to buying into other people running your infrastructure, don't be afraid to ask how they manage it. Do they have a status page that is actively used? How do they handle downtimes?

Operational openness is important when other people manage parts of your infrastructure.

It's inevitable that something bad will happen in their infrastructure that affects you. How they deal with these scenarios is what's relevant.

Security and Privacy

With multi-tenant infrastructure, you're confronted with curious challenges, and they can affect you in ways that only studying your local laws and the provider's terms of service will fully reveal.

Security and privacy are two big issues you need to think about when entrusting your data to a third party. The recent MongoHQ security incident has brought up this issue in an unprecedented way, and we've had our own issues with security in the past.

Note that these issues could come up just the same when you're running your own infrastructure. But just like outages, security and privacy breaches can have much wider ranging ripple effects on multi-tenant infrastructure.

How can you best handle this? Encrypting your data is one way to approach the situation. Encrypt anything that's confidential, that you want to protect with at least one small extra layer of security to reduce the attack surface on it.

We encrypt SSH keys and OAuth tokens, the most private data that's entrusted to our systems. Of course, the encryption keys aren't stored in the database.
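As a sketch of that pattern: encrypt secrets with a symmetric key that lives outside the database, for example in the application's environment, so a database leak alone doesn't expose them. The `SECRETS_KEY` variable and helper names are made up for this example; it assumes the third-party `cryptography` package, whose `Fernet` class provides authenticated symmetric encryption:

```python
import os

from cryptography.fernet import Fernet  # third-party: pip install cryptography


def encrypt_secret(plaintext: bytes) -> bytes:
    """Encrypt a secret before it's written to the database.

    SECRETS_KEY is a hypothetical config var holding a Fernet key,
    set in the environment, never stored next to the data.
    """
    return Fernet(os.environ["SECRETS_KEY"]).encrypt(plaintext)


def decrypt_secret(ciphertext: bytes) -> bytes:
    """Decrypt a secret read back from the database."""
    return Fernet(os.environ["SECRETS_KEY"]).decrypt(ciphertext)
```

It's a small extra layer, but it means an attacker needs both the database contents and the application environment to get at the secrets.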

When buying infrastructure rather than building it, keep a good eye on what your providers do and how they handle security and your data. This is just as important as handling outages, if not even more so.

Make sure that your privacy/security statements reflect which services you're using and how you handle your customers' data with them. It may not sound like much, but transparency goes a long way.

One unfortunate downside of infrastructure services, Heroku add-ons come to mind, is the lack of fine-grained access privileges. Only some of the add-ons we use allow us to create separate user accounts with separate permissions.

It's one of the downsides of the convenience of just having a URL added to your application's environment and starting to use an add-on right away.

Judging the impact of the trade-off is, again, up to you. Sometimes convenience trumps security, but other times (most times?), security is more important than convenience.

Your users' data is important to your users, so it should be just as important to you.

Scaling up and out

We started out small, with just a few Heroku dynos and a small database setup, a shared RabbitMQ setup to boot.

In fact, initially Travis CI ran on just one dyno, then two, then just a few more when a second application was split out.

This worked up to a few thousand tests per day. But as we scaled up, that wasn't sufficient.

I was sceptical at first whether we could scale up while remaining on managed infrastructure rather than building our own. Almost two years later, it's still working quite well.

Important bits have been moved to dedicated setups, the databases (we have four clusters, eight database servers in total) and our RabbitMQ service, which we needed to move to a cluster setup.

Most hosted services give you means to scale up. For Heroku apps, you add more dynos, or you increase the capacity of a single dyno.

For their databases (or Amazon RDS, for that matter), you upgrade the underlying server, which is simple enough to do. For RabbitMQ, you go for a bigger plan that gives you more dedicated resources, higher throughput, and the like.
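With the Heroku toolbelt, both directions are one-line operations; the process name and dyno counts here are examples, not our actual setup:

```shell
# scale out: run more dynos of the web process
heroku ps:scale web=4

# scale up: move the same process to larger dynos
heroku ps:resize web=2X
```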

Figuring out the limits of hosted infrastructure services is hard. If you send a log service thousands of messages per second, even by mistake, how do they respond? Can they handle it?

There's only one way to find out: ask them!
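Until you have an answer, a client-side throttle can keep a runaway process from flooding a provider in the first place. Here's a minimal token-bucket sketch; all names are my own, and a real setup would decide whether to drop, sample, or buffer the messages over the limit:

```python
import time


class TokenBucket:
    """Client-side throttle for shipping logs to a hosted service.

    Allows short bursts up to `capacity` messages, and a sustained
    throughput of `rate` messages per second.
    """

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill tokens based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: caller drops or buffers the message
```

A guard like this costs almost nothing and turns "we accidentally DDoSed our log provider" into "we dropped some debug lines."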

With most of the bits that we need to scale out, we're confident that hosted services will give us the means to do so for quite some time. After that, we can still talk to them and figure out what we can do beyond their normal offerings.

Scaling up is a challenge, as Joe Ruscio put it on the aforementioned hangops episode: "Scaling is always violent."

It was violent on occasion for us as well.

We may need more dedicated bits in the future for specialized use, things like ZooKeeper for distributed consensus. But most of our tools are still running nicely on hosted infrastructure.

Operational insight

One thing that was originally bugging me about a few of our core services was the lack of operational insight.

With infrastructure beyond your control, getting insight into what's happening can be challenging.

We had to ask Heroku support quite a few times for insight into our database host machine. For figuring out whether or not an upgrade to a larger plan or instance is required, this can be essential. It certainly was for us. This situation has been improving, and from what I've heard it will improve even more in the future.

But for an infrastructure provider, offering this kind of insight can be challenging too. Heroku Postgres has improved quite a lot, and we get better insight into what's happening in our database now, thanks to datascope and their means of dumping metrics into the logs, which you can then aggregate with a service like Librato.

Most providers have great people working for them. When in doubt, ask them about anything that's on your mind. The services we work with are usually very helpful and insightful. The Heroku Postgres team is a knowledge goldmine in itself.