Hi, I'm Mathias, and I'm a developer. Unlike a lot of you at this conference, I'm far from being a monitoring expert. If anything, I'm a user, a tinkerer of all the great tools we're hearing about at this conference.

I help run a little continuous integration service called Travis CI. For that purpose I built several home-baked things that help us collect metrics and trigger alerts.

I want to start with a little story. I spend quality time at coffee shops and I enjoy peeking over the shoulders of the guy who's roasting coffee beans. Next to the big roasting machine they commonly have a laptop with pretty graphs showing how the temperature in the roaster changes over time. On two occasions I found myself telling them: "Hey cool, I like graphs too!"

On the first occasion I looked at the graph and noticed that it'd update itself every 2-3 seconds. I mentioned that to the roaster and he said: "Yeah, I'd really love it if it could update every second." In just two seconds the temperature in the roaster can already drop by almost a degree (Celsius), so he was lacking the granularity to get the best insight into his system.

The second roaster did have one second resolution, and I swooned. But I noticed that every minute or so, he wrote down the current temperature on a sheet of paper. The first guy had done that too. I was curious why they'd do that. He told me that he took it as his reference sheet for the next roasting batch. I asked why he didn't have the data stored in the system. He replied that he didn't trust it enough, because if it lost the information he wouldn't have a reference for his next roasting sheet.

He also keeps a set of coffee bean samples around from previous roasts, roasts that are known to have produced a great result. Even coffee roasters have confirmation bias, though to be fully fair, when you're new to the job, any sort of reference can help you move forward.

This was quite curious. They had the technology yet they didn't trust it enough with their data. But heck, they had one-second resolution and they had the technology to measure data from live sensors in real time.

During my first jobs as a developer touching infrastructure, five minute collection intervals and RRDtool graphs were still very much en vogue. My alerts basically came from Monit throwing unhelpful emails at me stating that some process just changed from one state to another.

Since my days with Munin a lot has changed. We went through the era of #monitoringsucks, which, fortunately, quickly turned into the era of #monitoringlove. It's been pretty incredible watching this progress as someone who loves tinkering with new and shiny tools and visualization possibilities. We've seen the emergence of crazy new visualization ideas, like the horizon chart, and we've seen the steady rise of using modern web technologies to render charts, while seeing RRDtool being taken to the next level to visualize time series data.

New approaches providing incredibly detailed insight into network traffic and providing stream analysis of time series data have emerged.

One second resolution is what we're all craving, looking at beautiful and constantly updating charts of 95th percentile values.

And yet, how many of you are still using Nagios?

There are great advances in monitoring at the moment, and I enjoy watching them as someone who greatly benefits from them.

Yet, I'm worried that all these advances still don't focus enough on the single thing that's supposed to use them: humans.

There's lots of work going on to solve problems to make monitoring technology more accessible, yet I feel like we haven't solved the first problem at hand: to make monitoring something that's easy to get into for people new to the field.

Monitoring still involves a lot of looking at graphs, correlating several different time series after the fact, and figuring out and checking for thresholds to trigger alerts. In the end, you still find yourself looking at one or more graphs trying to figure out what the hell it means.

Tracking metrics has become very popular, thanks to Coda Hale's metrics library, which inspired a whole slew of libraries for all kinds of languages, and tools like StatsD, which made it very easy to throw any kind of metric at them and have it pop up in a system like Graphite, Librato Metrics, Ganglia, etc.

Yet the biggest question that I get every time I talk to someone about monitoring, in particular people new to the idea, is: "what should I even monitor?"

With all the tools we have at hand, helping people to find the data that matters for their systems is still among the biggest hurdles that must be conquered to actually make sense of metrics.

Can we do a better job of educating people what they should track, what they could track, and how they can figure out the most important metrics for their system? It took us six months to find the single metric that best reflects the current state of our system. I called it the soul metric, the one metric that matters most to our users and customers.

We started tracking the time since the last build was started and since the last build was finished.

On our commercial platform, where customers run builds for their own products and customer projects, the weekend is very quiet. We only run one tenth of the number of builds on a Sunday compared to a normal weekday. Sometimes we don't run any build in 60 minutes. Suddenly checking when a build was last triggered makes a lot less sense.

Suddenly we're confronted with the issue that we need to look at multiple metrics in the same context to see if a build should even have been started, as the fact itself is solely based on a customer pushing code. We're suddenly looking at measuring the absence of data (no new commits) and correlating it with data derived from several attributes of the system, like no running builds and no build request being processed.

The only reasonable solution I could come up with, and it's mostly thanks to talking to Eric from Papertrail, is this: if you need to measure something but it requires the existence of an activity, you have to make sure this activity is generated on a regular basis.
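As a rough illustration of that idea, here is a minimal sketch in Python. It assumes a hypothetical canary build that is triggered at a fixed interval, so that silence longer than the interval plus some grace period can be treated as a problem. The function name, the parameters and the numbers are all made up for this example and are not taken from the actual Travis CI setup.

```python
def build_pipeline_status(last_build_started_at, now, heartbeat_interval, grace):
    """Classify the pipeline based on the time since the last build started.

    All arguments are in seconds. heartbeat_interval is how often the
    (hypothetical) canary build is triggered; grace is extra slack before
    silence is considered a problem.
    """
    silence = now - last_build_started_at
    if silence <= heartbeat_interval + grace:
        return "ok"
    return "stale"


# A canary build every 15 minutes with 2 minutes of slack:
# 10 minutes of silence is fine, a bit over 17 minutes is not.
print(build_pipeline_status(0, 600, 900, 120))
print(build_pipeline_status(0, 1021, 900, 120))
```

With regular activity guaranteed, a quiet Sunday no longer looks the same as a broken system, because at least the canary should always show up.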

In hindsight, it's so obvious, though it brings up a question: if the thing that generates the activity fails, does that mean the system isn't working? Is this worth an alert, is this worth waking someone up for? Certainly not.

This leads to another interesting question: if I need to create activity to measure it, and if my monitoring system requires me to generate this activity to be able to put a graph and an alert on it, isn't my monitoring system wrong? Are all the monitoring systems wrong?

If a coffee roaster doesn't trust his tools enough to give him a consistent insight into the current, past and future roasting batches, isn't that a weird mismatch between humans and the system that's supposed to give them the assurance that they're on the right path?

A roaster still trusts his instincts more than he trusts the data presented to him. After all, it's all about the resulting coffee bean.

Where does that take us and the current state of monitoring?

We spend an eternity looking at graphs, right after an alert was triggered because a certain threshold was crossed. Does that alert even mean anything, is it important right now? It's where a human operator still has to decide if it's worth the trouble or if they should just ignore the alert.

As much as I enjoy staring at graphs, I'd much rather do something more important than that.

I'd love for my monitoring system to be able to tell me that something out of the ordinary is currently happening. It has all the information at hand to make that decision at least with a reasonable probability.
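One simple way to approximate "out of the ordinary" is to compare the latest value of a metric against its recent history. The sketch below uses a plain z-score with an arbitrary threshold of three standard deviations. It is only an illustration of the idea under those assumptions, not a description of what any particular monitoring tool does, and the function name is made up.

```python
from statistics import mean, pstdev


def is_unusual(history, latest, threshold=3.0):
    """Return True if latest deviates strongly from the recent history.

    history is a list of recent values for one metric. threshold is the
    number of standard deviations we tolerate (an arbitrary choice here).
    """
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest != mu  # perfectly flat history, any change stands out
    return abs(latest - mu) / sigma > threshold


recent_queue_sizes = [10, 11, 9, 10, 10, 11, 9, 10]
print(is_unusual(recent_queue_sizes, 10.5))  # small wobble
print(is_unusual(recent_queue_sizes, 20))    # clear outlier
```

A check like this still only gives a probability, which is why a human has to make the final call.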

But much more than that, I'd like our monitoring system to be built for humans, reducing the barrier of entry for adding monitoring and metrics to an application and to infrastructure without much hassle. How will we get there?

Looking at the current state of monitoring, there's a strong focus on technology, which is great, because it helps solve bigger issues like data storage, visualization and presentation, and stream analysis. I'd love to see this all converge on the single thing that has to make the call in the end: a human. Helping them make a good decision and getting there should be very high on our list.

There is a fallacy in this wish though. With more automation comes a cognitive bias to trust what the system is telling me. Can the data presented to me be fully trusted? Did the system actually make the right call in sending me an alert? This is something only a human can figure out, just as a coffee roaster needs to trust his instincts even though the variables for every roast are slightly different.

We want to spare our users from having to keep a piece of paper around that tells them exactly what happened the last time this alert was triggered. We want to make sure they don't have to look at samples of beans at different stages to find confirmation for the problem at hand. If the end user always looks at previous samples of data to compare it to the most recent one, the only thing they'll look for is confirmation.

Lastly, the interfaces of the monitoring tools we work with every day are designed to be efficient, they're designed to dazzle with visualization, yet they're still far from being easy to use. If we want everyone in our company to be able to participate in running a system in production, we have to make sure we provide them with interfaces that treat them as what they are: people.

But most importantly, I'd like to see the word spread on monitoring and metrics, to see us make our user interfaces more accessible and tell the tale of how we monitor our systems and how other people can monitor theirs. There's a lot to learn from each other, and I love things like hangops and OpsSchool, they're great starts to get the word out.

Because it's easier to write things down to realize where you are, to figure out where you want to be.

Failure is still one of the most undervalued things in our business, in most businesses really. We still tend to point fingers elsewhere, blame the other department, or try anything to cover our asses.

How about we do something else instead? We embrace failure openly, turn it into our company's culture and do everything we can to make sure every failure is turned into a learning experience, into an opportunity?

Let me start with some illustrating examples.

Wings of Fury

In 2010, Boeing tested the wings of a brand new 787 Dreamliner. In a giant hangar, they set up a contraption that'd pull the wings of a 787 up, with so much pull that the wings were bound to break.

Eventually, and after they've been flexed upwards of 25 feet, the wings broke spectacularly.

The amazing bit: all the engineers watching it happen started to cheer and applaud.

Why? Because they anticipated the failure under the exact circumstances where it broke, at about 150% of what wings handle in normal operation.

They can break things loud and proud, they can predict when their engineering work falls apart. Can we do the same?

Safety first

I've been reading a great book, "The Power of Habit", and it outlines another story of failure and how tackling that was turned into an opportunity to improve company culture.

When Paul O'Neill, later to become Secretary of the Treasury, took over management of Alcoa, one of the United States' largest aluminum production companies, he made it his first and foremost priority to tackle the safety issues in the company's production plants.

He put rules in place that any accidents must be reported to him within just a few hours, including remedies on how this kind of accident will be prevented in the future.

While his main focus was to prevent failures, because they would harm or even kill workers, what he eventually managed to do is to implement a company culture where even the smallest suggestions to improve safety or to improve efficiency from any worker would be considered and would be handed up the chain of management.

This fostered a culture of highly increased communication between production plants, between managers, between workers.

Failures and accidents still happened, but were in sharp decline, as every single one was taken as an opportunity to learn and improve the situation to prevent them from happening again.

It was a chain of post-mortems if you will. O'Neill's interest was to make everyone part of improving the overall situation without having to fear blame. Everyone was made to feel like they're an important part of the company. By then, 15000 people worked at Alcoa.

This had an interesting effect on the company. In twelve years, O'Neill managed to increase Alcoa's revenues from $1.5 billion to $23 billion.

His policies became an integral part of the company's culture and ensured that everyone working for it felt like an integral part of the production chain.

Floor workers were given permission to shut down the production chain if they deemed it necessary and were encouraged to blow the whistle when they noticed even the slightest risk in any activity in the company's facilities.

To be quite fair, competitors were pretty much in the dark about these practices, which gave Alcoa a great advantage on the market.

But within a decade of running the company, he transformed it into a culture that sounds strikingly similar to the ideas of DevOps. He managed to make everyone feel responsible for delivering a great product and for everyone to be enabled to take charge should something go wrong.

All that is based on the premise of trust. Trust that when someone speaks up, they will be taken seriously.

Three Habits of Failure

If you look at the examples above, some patterns come up. There are companies outside of our field that have mastered or at least taken on an attitude of accepting that failure is inevitable, anticipating failure and dealing with and learning from failure.

Looking at some more examples it occurred to me that even doing one of these things will improve your company's culture significantly.

How do we fare?

We fail, a lot. It's in the nature of the hardware we use and the software we build. Networks partition, hard drives fail, software bugs creep into systems and can lead to cascading failures.

But do we, as a community, take enough advantage of what we learn from each outage?

Does your company hold post-mortem meetings after a production outage? Do you write public post-mortems for your customers?

If you don't, what's keeping you from doing so? Is it fear of giving your competitors an advantage? Is it fear of giving away too many internal details? Fear of admitting fault in public?

There's a great advantage in making this information public. Usually, it doesn't really concern your customers what happened in all detail. What does concern them is knowing that you're in control of the situation.

A post-mortem follows three Rs: regret, reason and remedy.

They're a means to say sorry to your customers, to tell them that you know what caused the issues and how you're going to fix them.

On the other hand, post-mortems are a great learning opportunity for your peer ops and development people.

Web Operations

This learning is an important part of improving the awareness of web operations, especially during development. There's a great deal to be learned from other people's experiences.

Web operations is a field that is mostly learning by doing right now. Which is an important part of the profession, without a doubt.

If you look at the available books, there are currently three books that give insight into what it means to build and run reliable and scalable systems.

"Release It!", "Web Operations" and "Scalable Internet Architectures" are the ones that come to mind.

My personal favorite is "Release It!", because it raises developer awareness on how to handle and prevent production issues in code.

It's great to see the circuit breaker and the bulkhead pattern introduced in this book now being popularized by Netflix, who openly write about their experiences implementing it.
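For readers who haven't come across the circuit breaker pattern, here is a minimal sketch of the general idea in Python: after a number of consecutive failures, calls to a flaky dependency are skipped for a while instead of piling up. This is a simplified illustration, not the implementation from the book or from Netflix, and the class name, parameters and defaults are assumptions made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call skipped")
            # Enough time has passed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to an external API in something like `breaker.call(fetch_status)` (with `fetch_status` being whatever function talks to the dependency) means a struggling dependency fails fast instead of tying up the caller with timeouts.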

Netflix is a great example here. They're very open about what they do, they write detailed post-mortems when there's an outage. You should read their engineering blog, same for Etsy's.

Why? Because it attracts engineering talent.

If you're looking for a job, which company would you rather work for? One that encourages taking risks while also taking responsibility for fixing issues when failure does come up, and that enables a culture of fixing and improving issues as a whole? Or one that puts blame on people?

I'd certainly choose the former.

Over the last two years, Amazon has also realized how important this is. Their post-mortems have gotten very valuable for anyone interested in things that can happen in multi-tenant, distributed systems.

If you remember the most recent outage on Christmas Eve, they even had the guts to come out and say that production data was deleted by accident.

Can you imagine the shame these developers must feel? But can you imagine a culture where the issue itself is considered an opportunity to learn instead of blaming or firing you? If only to learn that accessing production data needs stricter policies.

It's a culture I'd love to see fostered in every company.

Regarding ops education, there have been some great things last year that are worth mentioning. hangops is a nice little circle, streamed live (mostly) every Friday, and available for anyone to watch on YouTube afterwards.

Ops School has started a great collection of introductory material on operations topics. It's still very young, but it's a great start, and you can help move it forward.

Travis CI

At Travis CI, we're learning from failure, a lot. As a continuous integration platform, it started out as a hobby project and was built with a lot of positive assumptions.

It used to be a distributed system that always assumed everything would work correctly all the time.

As we grew and added more languages and more projects, this ideal fell apart pretty quickly.

It is a symptom of a lot of projects that are developer-driven, because there's just so little public information on how to do it right, on how distributed systems are built and run at other companies for them to work reliably.

We decided to turn every failure into an opportunity to share our learnings. We're an open source project, so it only makes sense to be open about our problems too.

Our audience and customers, who are mostly developers themselves, seem to appreciate that. I for one am convinced that we owe it to them.

I encourage you to do the same, to share details on your development, on how you run your systems. It'll be surprising how introducing these changes can affect working as a team as a whole.

Cultural evolution

This insight didn't come easy. We're a small team, and we were all on board with the general idea of openness about our operational work and about the failures in our system.

That openness brings with it the need to own your systems, to own your failures. It took a while for us to get used to working together as a team to get these issues out of the way as quickly as possible and to find a path for a fix.

In the beginning, it was still too easy to look elsewhere for the cause of the problem. Blame is one side of the story, hindsight bias is the other. It's too easy to point out that the issue has been brought up in the past, but that doesn't contribute anything to fixing it.

The more helpful attitude than saying "I've been saying this has been broken for months" is to say "Here's how I'll fix it." You own your failures.

The only thing that matters is delivering value to the customer. Putting aside blame and admitting fault while doing everything you can to make sure the issue is under control is, in my opinion, the only way you can do that, with everyone in your company on board.

Accepting this might just help transform your company's culture significantly.

Two years ago, I wrote about the virtues of monitoring. A lot has changed, a lot has improved, and I've certainly learned a lot since I wrote that initial overview on monitoring as a whole.

There have been a lot of improvements to existing tools, and new players entered the market of monitoring. Infrastructure as a whole got more and more interesting for service businesses built around it.

On the other hand, awareness for monitoring, good metrics, logging and the like has been rising significantly.

At the same time #monitoringsucks raised awareness that a lot of monitoring tools are still stuck in the late nineties when it comes to user interface and the way they work.

Independent of new and old tools, I've had the pleasure of learning a lot more about the real virtues of monitoring, about how it affects daily work and how it evolves over time. This post is about discussing some of these insights.

Monitoring all the way down

When you start monitoring even just small parts of an application, the need for more detail and for information about what's going on in a system arises quickly. You start with an innocent number of application level metrics, add metrics for database and external API latencies, start tracking system level and business metrics.

As you add monitoring to one layer of the system, the need to get more insight into the layer below comes up sooner or later.

One layer has just been tackled recently in a way that's accessible for anyone: communication between services on the network. Boundary has built some pretty cool monitoring stuff that gives you incredibly detailed insight into how services talk to each other, by way of their protocol, how network traffic from inside and outside a network develops over time, and all that down to the second.

The real time view is pretty spectacular to behold.

If you go down even further on a single host, you get to the level where you can monitor disk latencies.

Or you could measure the effect of screaming at a disk array of a running system. dtrace is a pretty incredible tool, and I hope to see it spread and become widely available on Linux systems. It allows you to inject instrumentation into arbitrary parts of the host system, making it possible to measure any system call without a lot of overhead.

Heck, even our customer support tool allows us to track metrics for response times, how many tickets each staff member handled, and for how long.

It's easy to start obsessing about monitoring and metrics, but there comes a time when you either realize that you've obsessed for all the right reasons, or you add more monitoring.

Mo' monitoring, mo' problems

The crux of monitoring more layers of a system is that with more monitoring, you can and will detect more issues.

Consider Boundary, for example. It gives you insight into a layer you haven't had insight into before, at least not at that granular level. For example, round trip times of liveness traffic in a RabbitMQ cluster.

This gives you a whole new pile of data to obsess about. It's good because that insight is very valuable. But it requires more attention, and more issues require investigation.

You also need to learn how a system behaving normally is reflected in those new systems, and what constitutes unusual behaviour. It takes time to learn and to interpret the data correctly.

In the long run though, that investment is well worth it.

Monitoring is an ongoing process

When we started adding monitoring to Travis CI, we started small. But we quickly realized what metrics really matter and what parts of the application and the infrastructure around it need more insight, more metrics, more logging.

With every new component deployed to production, new metrics need to be maintained, more logging and new alerting need to be put in place.

The same is true for new parts of the infrastructure. With every new system or service added, new data needs to be collected to ensure the service is running smoothly.

Knowing which metrics are important and which aren't is something that develops over time. Metrics can come and go, the requirements for metrics are subject to change, just as they are for code.

As you add new metrics, old metrics might become less useful, or you need more metrics in other parts of the setup to make sense of the new ones.

It's a constant process of refining the data you need to have the best possible insight into a running system.

Monitoring can affect production systems

The more data you collect, with higher and higher resolution, the more you run the risk of affecting a running system. Business metrics regularly pulled from the database can become a burden on the database that's supposed to serve your customers.

Pulling data out of running systems is a traditional approach to monitoring, one that's unlikely to go away any time soon. However, it's an approach that's less and less feasible as you increase resolution of your data.

Guaranteeing that this collection process is low on resources is hard. It's even harder to get a system up and running that can handle high-resolution data from a lot of services sent concurrently.

So new approaches have started to pop up to tackle this problem. Instead of pulling data from running processes, the processes themselves collect data and regularly push it to aggregation services which in turn send the data to a system for further aggregation, graphing, and the like.

StatsD is without a doubt the most popular one, and it has sparked a ton of forks in different languages.

Instead of relying on TCP with its long connection handshakes and timeouts, StatsD uses UDP. The processes sending data to it stuff short messages into a UDP socket without worrying about whether or not the data arrives.

If some data doesn't make it because of network issues, that only leaves a small dent. It's more important for the system to serve customers than for it to wait around for the aggregation service to become available again.
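To make the fire-and-forget idea concrete, here is a minimal sketch of a StatsD-style client in Python. It relies on the plain-text `name:value|type` message format and on 8125 as the conventional StatsD port. The function names and the choice of localhost as the target are assumptions for this example, and real client libraries add things like sampling and batching.

```python
import socket


def format_metric(name, value, metric_type="c"):
    """Build a StatsD-style message, e.g. b"builds.started:1|c" for a counter."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Send one metric over UDP without waiting for any kind of reply.

    Returns False if the packet could not even be handed to the network
    stack. A True result does not mean the data arrived, which is the
    whole point: the caller never blocks on the aggregation service.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(format_metric(name, value, metric_type), (host, port))
        return True
    except OSError:
        return False


send_metric("builds.started", 1)            # increment a counter
send_metric("builds.duration", 320, "ms")   # record a timing in milliseconds
```

Because nothing here waits on the receiver, a StatsD outage costs the application a few lost data points rather than slow requests.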

While StatsD solves the problem of easily collecting and aggregating data without affecting production systems, there's now the problem of being able to inspect the high-resolution data in meaningful ways. Historical analysis and alerting on high-resolution data becomes a whole new challenge.

Riemann has popularized looking at monitoring data as a stream, to which you can apply queries, and form reactions based on those queries. You can move the data window inside the stream back and forth, so you can compare data in a historical context before deciding on whether it's worth an alert or not.
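The idea of comparing a new value against a window of recent history can be sketched in a few lines. Note that this is plain Python for illustration only, not Riemann's actual API (Riemann is configured in Clojure), and the class name, window size and alert factor are assumptions made for this example.

```python
from collections import deque


class SlidingWindow:
    """Keeps the most recent `size` values of a metric stream."""

    def __init__(self, size):
        self.values = deque(maxlen=size)  # old values fall out automatically

    def add(self, value):
        self.values.append(value)

    def average(self):
        if not self.values:
            return None
        return sum(self.values) / len(self.values)


def should_alert(window, latest, factor=2.0):
    """Alert only if latest is well above the recent baseline."""
    baseline = window.average()
    if baseline is None or baseline <= 0:
        return False  # no usable history yet
    return latest > baseline * factor


window = SlidingWindow(size=3)
for latency_ms in [100, 10, 10, 10]:  # the old spike of 100 has left the window
    window.add(latency_ms)

print(should_alert(window, 15))  # within twice the baseline of 10
print(should_alert(window, 25))  # more than twice the baseline
```

The decision is made against the historical context inside the window rather than against a fixed threshold.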

Systems like StatsD and Riemann make it a lot easier for systems to aggregate data without having to rely on polling. Services can just transmit their data without worrying much about how and where they're used for other purposes like log aggregation, graphing or alerting.

The important realization is that with increasing need for scalability and distributed systems, software needs to be built with monitoring in mind.

Imagine a RabbitMQ that, instead of you having to poll the data from it, sends its metrics as a message at a configurable interval to a configurable fanout. You can choose to consume the data and submit it to a system like StatsD or Riemann, or you can ignore it and the broker will just discard the data.

Who's monitoring the monitoring?

Another fallacy of monitoring is that it needs to be reliable. For it to be fully reliable it needs to be monitored. Wait, what?

Every process that is required to aggregate metrics, to trigger alerts, to analyze logs needs to be running for the system to work properly.

So monitoring in turn needs its own supervision to make sure it's working at all times. As monitoring grows it requires maintenance and operations to take care of it.

Which makes it a bit of a burden for small teams.

Lots of new companies have sprung into life serving this need. Instead of having to worry about running services for logs, metrics and alerting by themselves, it can be left to companies who are more experienced in running them.

Librato Metrics, Papertrail, OpsGenie, LogEntries, Instrumental, NewRelic, DataDog, to name a few. Other companies take the burden of having to run your own Graphite system away from you.

It's been interesting to see new companies pop up in this field, and I'm looking forward to seeing this space develop. The competition from the commercial space is bound to trigger innovation and improvements on the open source front as well.

We're heavy users of external services for log aggregation, collecting metrics and alerting. Simply put, they know better how to run that platform than we do, and it allows us to focus on delivering the best possible customer value.

Monitoring is getting better

Lots of new tools have sprung up in the last two years. While development on them started earlier than that, the most prominent tools are probably Graphite and Logstash. Cubism brings new ideas on how to visualize time series data, and it's one of the several dozen dashboards sparked by Graphite's existence and the flexibility its API offers. Tasseo is another one of them, a successful experiment of having an at-a-glance dashboard with the most important metrics in one convenient overview.

It'll still be a while until we see the ancient tools like Nagios, Icinga and others improve, but the competition is ramping up. Sensu is one open source alternative to keep an eye on.

I'm looking forward to seeing how the monitoring space evolves over the next two years.

Over the last year, as we started turning Travis CI into a hosted product, we added a ton of metrics and monitoring. While we started out slow, we soon figured out which metrics are key and which are necessary to monitor the overall behavior of the system.

I built us a custom collector that rakes in metrics from our database and from the API exposed by RabbitMQ. It soon dawned on me that these are our core metrics, and that graphs alone aren't enough for them: we need to be alerted when they cross thresholds.

The first iteration of that dumped alerts into Campfire. Given that we're a small team and the room might be empty at times, that was just not sufficient for an infrastructure platform that's used by customers and open source projects around the world, at any time of the day.

So we added alerting, by way of OpsGenie. It's set up to trigger alerts via iPhone push notifications and escalations via SMS, should an alert not have been acknowledged or closed within 10 minutes. Eventually, escalation needs to be done via voice calls so that someone really picks up. It's easy to miss a vibrating iPhone when you're sound asleep, but much harder so when it keeps on vibrating until someone picks up.
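The escalation chain described above can be written down as a small table of delays and channels. The sketch below is only an illustration in Python and not how OpsGenie is actually configured. The push and SMS steps follow the setup described here, while the 20 minute delay for voice calls is purely an assumption, since voice escalation is still a wish at this point.

```python
# (minutes since the alert opened, channel to notify)
ESCALATION_STEPS = [
    (0, "push"),    # immediate iPhone push notification
    (10, "sms"),    # SMS if not acknowledged or closed within 10 minutes
    (20, "voice"),  # assumption: voice call as the last resort
]


def channels_due(minutes_open, acknowledged):
    """Return every channel that should have been notified by now."""
    if acknowledged:
        return []  # someone is on it, stop escalating
    return [channel for delay, channel in ESCALATION_STEPS if minutes_open >= delay]


print(channels_due(0, acknowledged=False))
print(channels_due(12, acknowledged=False))
print(channels_due(25, acknowledged=True))
```

Keeping the policy as data makes it easy to see at a glance who gets bothered, how, and when.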

A Pager for every Developer

Just recently I read an interview with Werner Vogels on architecture and operations at Amazon. He said something that stuck with me: "You build it, you run it."

That got me thinking. Should developers of platforms be fully involved in the operations side of things?

A quick survey on Twitter showed that there are some companies where developers are paged when there are production issues, others fully rely on their operations team.

There's merit to both, but I could think of a few reasons why developers should be carrying a pager just like operations does.

You stay connected to what your code does in production. When code is developed, the common tool to manage expectations is to write tests. Unfortunately, no unit test, no integration test will be fully able to reproduce circumstances of what your code is doing in production.

You start thinking about your code running. Reasoning about what a particular piece of code is doing under specific production circumstances is hard, but not entirely impossible. When you're the one responsible for having it run smoothly and serve your customers, this goes up to a whole new level.

Metrics, instrumentation, alerting, logging and error handling suddenly become a natural part of your coding workflow. You start making your software more operable, because you're the one who has to run it. While software should be easy to operate in any circumstances, it commonly isn't. When you're the one having to deal with production issues, that suddenly has a very different appeal.

Code beauty is suddenly a bit less important than making sure your code can treat errors, timeouts, increased latencies. Kind of an ironic twist like that. Code that's resilient to production issues might not have a pretty DSL, it might not be the most beautiful code, but it may be able to sustain whatever issue is thrown at it.

Last, when you're responsible for running things in production, you're forced to learn about the entire stack of an application, not just the code bits, but its runtime, the host system, hardware, network. All that turns into something that feels a lot more natural over time.

I consider that a good thing.

There'll always be situations where something needs to be escalated to the operations team, with deeper knowledge of the hardware, network and the like. But if code breaks in production, and it affects customers, developers should be on the front of fixing it, just like the operations team.

Even more so for teams that don't have any operations people on board. At some point, a simple exception tracker just doesn't cut it anymore, especially when no one gets paged on critical errors.

Being On Call

For small teams in particular, there's a pickle that needs to be solved: who gets up in the middle of the night when an alert goes off?

When you have just a few people on the team, like your average bootstrapping startup, does an on call schedule make sense? This is something I haven't fully figured out yet.

We're currently in the fortunate position that one of our team members is in New Zealand, but we have yet to find a good way to assign on call when he's out or for when he's back on this side of the world.

The folks at dotCloud have written about their schedule, thank you! Hey, you should share your pager and on-call experiences too!

Currently we have a first come first serve setup. When an alert comes in and someone sees it, it gets acknowledged and looked into. If that involves everyone coming online, that's okay for now.

However, it's not an ideal setup, because being able to handle an alert means being able to log into remote systems, restart apps, inspect the database, look at the monitoring charts. Thanks to iPhone and iPad most of that is already possible today.

But to be fully equipped to handle any situation, it's good to have a laptop at hand.

This brings up the question: who's carrying a laptop and when? Which in turn means that some sort of on-call schedule is still required.

We're still struggling on this, so I'd love to read more about how other companies and teams handle that.

Playbooks

During a recent hangops discussion, there was a chat about developers being on call. It brought up an interesting idea, a playbook on how to handle specific alerts.

It's a document explaining things to look into when an alert comes up. Ideally, an alert already includes a link to the relevant section in the book. This is something operations and developers should work on together to make sure all fronts are covered.

It takes away some of the scare of being on call, as you can be sure there's some guidance when an issue comes up.

It also helps refine monitoring and alerts and make sure there are appropriate measures available to handle any of them. If there are not, that part needs improving.

I'm planning on building a playbook for Travis as we go along and refine our monitoring and alerts. It's a neat idea.

Sleepless in Seattle

There's a psychological side to being on-call that needs a lot of getting used to: the thought that an alert could go off at any time. While that's a natural thing, as failures do happen all the time, it's easy to mess up your head. It certainly did that for me.

Lying in bed, not being able to sleep, because your mind is waiting for an alert, it's not a great feeling. It takes getting used to. It's also why having an on-call schedule is preferable over an all hands scenario. When only one person is on call, team mates can at least be sure to get a good night's sleep. As the schedule should be rotating, everyone gets to have that luxury on a regular basis.

It does one thing though: it pushes you to make sure alerts only go off for relevant issues. Not everything needs to be fixed right away, some issues could be taken care of by improving the code, others are only temporary fluxes because of increased network latency and will resolve themselves after just a few minutes. Alerting someone on every other exception raised doesn't cut it anymore, alerts need to be concise and only be triggered when the error is severe enough and affects customers directly. Getting this right is the hard part, and it takes time.

All that urges you to constantly improve your monitoring setup, to increase relevance of alerts, and to make sure that everyone on the team is aware of the issues, how they can come up and how they can be fixed.

It's a good thing.


Recently, I've been thinking a lot about failure, my daughter, risk and punishment, and the whole culture that has evolved around trying to avoid failure, trying to point fingers or putting blame elsewhere.

Simplest example: my daughter spills something over the table. What's the first reaction? Scolding or punishment of sorts. I'm guilty as charged. I read something pretty simple and wonderful recently, a very short read titled "Father Forgets".

That read got me thinking: why do we tend to punish failure immediately? It's not just something to do with our kids, it's human nature. We tend to put blame elsewhere, we tend to get defensive because people turn to us to fix a problem, when something is broken in production, for example.

Why can't we instead make failure a part of our culture? Not just at home, with our kids, but in our work place?

As soon as people feel like they need to get defensive, or they're blamed for a problem that occurred due to a recent change of theirs, negativity hits everyone on the team. It's hard to stay calm, it's hard to stay focused on what really matters: that something is broken in production, affecting your customers.

As soon as people feel threatened or pressured, they get defensive or they feel down because some of their own code broke something. Their vision is clouded. Finding the problem's cause and implementing a solution is suddenly just a blur, something that's hard to focus on. Even though that's what really matters.

When people feel like failure is not an option, they'll stop taking risks. When people stop taking risks, your team and your company is doomed, innovation comes to a grinding halt. Most of us are in the lucky position that lives don't depend on our work. We can try new things, iterate quickly, disregard or improve them.

If my daughter doesn't take any risks because I keep punishing or scolding her, she might just stop trying altogether. The analogy is an odd one, but there's a striking similarity.

If a problem comes up, you fix it, you learn your lesson, you make sure it doesn't happen again, you move on. It can be that simple. When everyone on the team feels like failure is an accepted part of running an application, fixing the problems as they occur as a team becomes a lot easier.

In the end, it's not a question of *if* something breaks, it's rather about when it breaks. And the answer is: all the time. Great teams focus on what matters in these situations: being ready when something breaks, and resolving it as well as they can.

Embrace outages, the most common failure of our craft. Take a deep breath, phase out distractions (including managers) and try to find joy in digging through data and finding what's causing a problem. Turn it from a seemingly frustrating experience into a personal challenge. You find the problem, you fix it, you make customers happy again. Rinse, repeat.

Failure is cool.

This post is not about devops, it's not about lean startups, it's not about web scale, it's not about the cloud, and it's not about continuous deployment. This post is about you, the developer whose main purpose in life has always been to build great web applications. In a pretty traditional world you write code, you write tests for it, you deploy, and you go home. Until now.

To tell you the truth, that world has never existed for me. In all of my developer life I had to deal with all aspects of deployment, not just putting build artifacts on servers, but dealing with network outages, faulty network drivers, crashing hard disks, sudden latency spikes, analyzing errors coming from those pesky crawling bots on that evil internet of yours. I take a lot of this for granted, but working in infrastructure and closely with developers trying to get applications and infrastructure up and running on EC2 has taught me some valuable lessons to assume the worst. Not because developers are stupid, but because they like to focus on code, not infrastructure.

But here's the deal: your code and all your full-stack and unit tests are worth squat if they're not running out there on some server or infrastructure stack like Google Apps or Heroku. Without running somewhere in production, your code doesn't generate any business value, it's just a big pile of ASCII or UTF-8 characters that cost a lot of money to create, but hasn't offered any return on investment yet.

Love Thy Infrastructure

Operations isn't hard, but it is necessary. You don't need to know everything about operations to become fluent in it, you just have to know enough to start and know how to use Google.

This is my collective dump from the last years of working both as a developer and that guy who does deployments and manages servers too. Most are lessons I learned the hard way, others just seemed logical to me when I learned about them the first time around.

Between you and me, having this skill set at hand makes you a much more valuable developer. Being able to analyze any problem in production and at least having a basic skill set to deal with it makes you a great asset for companies and clients to hold on to. Thought you should know, but I digress.

The most important lesson I can tell you right up front: love your infrastructure, it's the muscles and bones of your application, whereas your code running on it is nothing more than the skin.

Without Infrastructure, No-one Will Use Your Application

Big surprise. For users to be able to enjoy your precious code, it needs to run somewhere. It needs to run on some sort of infrastructure, and it doesn't matter if you're managing it, or if you're paying another company to take care of it for you.

Everything Is Infrastructure

Every little piece of software and hardware that's necessary to make your application available to users is infrastructure. The application server serving and executing your code, the web server, your email delivery provider, the service that tracks errors and application metrics, the servers or virtual machines your services are running on.

Every little piece of it can break at any time, can stall at any time. The more pieces you have in your application puzzle, the more breaking points you have. And everything that can break, will break. Usually not all at once, but most certainly when it's the least expected, or just when you really need your application to be available.

On Day One, You Build The Hardware

Everything starts with a bare metal server, even that cloud you've heard so much about. Knowing your way around everything that's related to setting up a full rack of servers on a single day, including network storage, a fully configured switch with two virtual LANs, and a master-slave database setup using a RAID 10 across a bunch of SAS drives, might not be something you need every day, but it sure comes in handy.

The good news is the internet is here for you. You don't need to know everything about every piece of hardware out there, but you should be able to investigate strengths and weaknesses, when an SSD is an appropriate tool to use, and when SAS drives will kick butt.

Learn to distinguish the different levels of RAID, why having an additional file system buffer on top of a RAID that doesn't have a backup battery for its own, internal write buffer is a bad idea. That's a pretty good start, and will make decisions much easier.

The System

Do you know what swap space is? Do you know what happens when it's used by the operating system, and why it's actually a terrible thing and gives a false sense of security? Do you know what happens when all available memory is exhausted?

Let me tell you:

  • When all available memory is allocated, the operating system starts swapping out memory pages to swap space, which is located on disk, a very slow disk, slow like a snail compared to fast memory.
  • When lots of stuff is written to and read from swap space on disk, I/O wait goes through the roof, and processes start to pile up waiting for their memory pages to be swapped out to or read from disk, which in turn increases load average, and almost brings the system to a screeching halt, but only almost.
  • Swap is terrible because it gives you a false sense of having additional resources beyond the available memory, while what it really does is slow down performance in a way that makes it almost impossible for you to log into the affected system and properly analyze the problem.
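If you want to keep an eye on how much swap a Linux machine is actually using, the numbers are in /proc/meminfo. Here's a small Python sketch that pulls them out; the `swap_usage` helper is my own name for it, and it only assumes the usual `SwapTotal` and `SwapFree` lines.

```python
def swap_usage(meminfo_text):
    """Return (used_kb, total_kb) of swap from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            # The first field after the colon is the number, in kB where a unit is given.
            values[key.strip()] = int(fields[0])
    total = values.get("SwapTotal", 0)
    free = values.get("SwapFree", 0)
    return total - free, total
```

On a Linux box you'd feed it the contents of /proc/meminfo. Anything other than zero swap in use is worth a closer look at what's eating memory.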

This is basically operations at the operating system level. There's not much you need to know here, but in my opinion it's essential. Learn about the most important aspects of a Unix or Linux system. You don't need to know everything, you don't need to know the specifics of Linux's process scheduler or the underlying data structure used for virtual memory. But the more you know, the more informed your decisions will be when the rubber hits the road.

And yes, I think enabling swap on servers is a terrible idea. Let processes crash when they don't have any resources left. That at least will allow you to analyze and fix.

Production Problems Don't Solve Themselves

Granted, sometimes they do, but you shouldn't be happy about that. You should be willing to dig into whatever data you have after the fact to find out what went wrong, what caused a strange latency spike in database queries, or an unusually high number of errors in your application.

When a problem doesn't solve itself though, which is certainly the common case, someone needs to solve it. Someone needs to look at all the available data to find out what's wrong with your application, your servers or the network.

This person is not the unlucky operations guy who's currently on call, because let's face it, smaller startups just don't have an operations team.

That person is you.

Solve Deployment First

When the first line of code is written, and the first piece of your application is ready to be pushed on a server for someone to see, solve the problem of deployment. This has never been easier than it is today, and being able to push incremental updates from then on speeds up development and the customer feedback cycle considerably.

As soon as you can, build that Capfile, Ant file, or whatever build and deployment tools you're using, set up servers, or set up your project on an infrastructure platform like Scalarium, Heroku, Google Apps, or dotCloud. The sooner you solve this problem, the easier it will be to finally push that code of yours into production for everyone to use. I consider application deployment a solved problem. There's no reason why you shouldn't have it in place even in the earliest stages of a project.

As a project grows more complex, even over just its initial lifecycle, it's much easier to add more functionality to an existing deployment setup than to build everything from scratch.

Automate, Automate, Automate

Everything you do by hand, you should only be doing once. If there's any chance that particular action will be repeated at some point, invest the time to turn it into a script. It doesn't matter if it's a shell, a Ruby, a Perl, or a Python script. Just make it reusable. Typing things into a shell manually, or updating configuration files with an editor on every single server is tedious work, work that you shouldn't be doing manually more than once.

When you automate something once, it not only greatly increases execution speed the second and third time around, it reduces the chance of failure, of missing that one important step.

There's an abundance of tools available to automate infrastructure, hand-written scripts are only the simplest part of it. Once you go beyond managing just one or two servers, tools like Chef, Puppet and MCollective come in very handy to automate everything from setting up bare servers to pushing out configuration changes from a single point, to deploying code. Everything should be properly automated with some tool. Ideally you only use one, but looking at Chef and Puppet, both have their strengths and weaknesses.

Changes in Chef aren't instant, unless you use the command line tool knife, which assumes SSH access to all servers you're managing. The bigger your organization, the less likely you are to have SSH access to all machines. Tools like MCollective, which work based on a push agent system, are much better for these instant kinds of activities.

It's not important what kind of tool you use to automate, what's important is that you do it in the first place.

By the way, if your operations team restricts SSH access to machines for developers, fix that. Developers need to be able to analyze and fix incidents just like the operations folks do. There's no valid point in denying SSH access to developers. Period.

Introduce New Infrastructure Carefully

Whenever you add a new component, a new feature to an application, you add a new point of failure. Be it a background task scheduler, a messaging queue, an image processing chain or asynchronous mail delivery, it can and it will fail.

It's always tempting to add shiny new tools to the mix. Developers are prone to trying out new tools even though they've not yet fully proven themselves in production, or experience running them is still sparse. It's a good thing in one way, because without people daring to use new tools, no one else would be able to learn from their experiences (you do share those experiences, don't you?).

But on the other hand, you'll live the curse of the early adopter. Instead of benefiting from existing knowledge, you're the one bringing the knowledge into existence. You'll experience all the bugs that are still lurking in the darker corners of that shiny new database or message queue system. You'll spend time developing tools and libraries to work with the new stuff, time you could just as well be spending working on generating new business value by using existing tools that do the job similarly well. If you do decide for a new tool, be prepared to degrade back to other tools in the case of failure.

No matter if old or new, adding more infrastructure always has the potential for more things to break. Whenever you add something, be sure to know what you're getting yourself into, be sure to have fallback procedures in place, be sure everyone knows about the risks and the benefits. When something that's still pretty new breaks, you're usually on your own.

Make Activities Repeatable

Every activity in your application that causes other, dependent activities to be executed, needs to be repeatable, either by the user, or through some sort of administrative interface, or automatically if feasible. Think user confirmation emails, generating monthly reports, background tasks like processing uploads. Every activity that's out of the normal cycle of fetching records from a datasource and updating them is bound to fail. Heck, even that cycle will fail at some point due to some odd error that only comes up every once in a blue moon.

When an activity is repeatable, it's much easier to deal with outages of single components. When it comes back up, simply re-execute the tasks that got stuck.

This, however, requires one important thing: every activity must be idempotent. It must have the same outcome no matter how often it's being run. It must know what steps were already taken before it broke the last time around. Whatever's already been done shouldn't be done again. It should just pick up where it left off.

Yes, this requires a lot of work and care for state in your application. But trust me, it'll be worth it.
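Here's a minimal sketch of what that can look like in Python, using a confirmation email as the example. The `ConfirmationEmailTask` class and the in-memory `completed` set are assumptions for illustration; in a real application that state would live in your database.

```python
class ConfirmationEmailTask:
    """Sends a confirmation email at most once per user, however often it is run."""

    def __init__(self, send_email):
        self.send_email = send_email   # callable doing the actual delivery
        self.completed = set()         # stand-in for state kept in a database

    def run(self, user_id):
        if user_id in self.completed:
            return False               # already done, nothing to repeat
        self.send_email(user_id)       # may raise; then nothing is recorded
        self.completed.add(user_id)    # only recorded after success
        return True
```

When delivery fails, nothing is recorded, so simply running the task again later picks up where it left off. Running it a second time after a success does nothing.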

Use Feature Flips

New features can cause joy and more headaches. Flickr was one of the first to add something called feature flips, a simple way to enable and disable features for all or only specific users. This way you can throw new features onto your production systems without accidentally enabling them for all users, you can simply allow a small set of users or just your customer to use them and to play with them.

What's more important though, when a feature breaks in production for some reason, you can simply switch it off, disabling traffic on the systems involved, allowing you to take a breather and analyze the problem.

Feature flips come in many flavors, the simplest approach is to just use a configuration file to enable or disable them. Other approaches use a centralized database like Redis for that purpose, which has an added benefit for other parts of your application, but also adds new infrastructure components and therefore, more complexity and more points of failure.
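The configuration-based flavor can be sketched in a few lines of Python. The `FeatureFlips` class and the shape of its configuration are my own assumptions here, not how Flickr or anyone else does it; the dictionary could just as well be loaded from a file.

```python
class FeatureFlips:
    """Enable a feature for everyone, for specific users, or for no one."""

    def __init__(self, config):
        # Example: {"new_dashboard": {"enabled": True, "users": ["alice"]}}
        self.config = config

    def enabled(self, feature, user=None):
        settings = self.config.get(feature)
        if settings is None or not settings.get("enabled", False):
            return False                      # unknown or switched off
        allowed = settings.get("users")       # no list means everyone
        return allowed is None or user in allowed
```

Setting `enabled` to false is the kill switch: it turns the feature off for everyone, including the users on the list.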

Fail And Degrade Gracefully

What happens when you unplug your database server? Does your application throw in the towel by showing a 500 error, or is it able to deal with the situation and show a temporary page informing the user of what's wrong? You should try it and see what happens.

Whenever something non-critical breaks, your application should be able to deal with it without anything else breaking. This sounds like an impossible thing to do, but it's really not. It just requires care, care your standard unit tests won't be able to deliver, and thinking about where you want a breakage to leak to the user, or where you just ignore it, picking up work again as soon as the failed component becomes available again.

Failing gracefully can mean a lot of things, there's things that directly affect user experience, a database failure comes to mind, and things that the user will notice only indirectly, e.g. through delays in delivering emails or fetching data from an external service like Twitter, RSS feeds and so on.

When a major component in your application fails, a user will most likely be unable to use your application at all. When your database latency increases manifold, you have two options. Try to squeeze through as much as you can, accepting long waits on your user's side, or you can let him know that it's currently impossible to serve him in an acceptable time frame, and that you're actively working on fixing or improving the situation. Which you should, either way.
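As a rough sketch of the second option, here's what catching a database failure and showing a temporary page could look like in Python. `DatabaseUnavailable`, `render_page` and the page text are hypothetical names for this example; your framework and database driver will have their own exceptions.

```python
class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) data layer when the database can't be reached."""


MAINTENANCE_PAGE = "We're having trouble reaching our database and are working on it."


def render_page(load_data, render):
    """Return (status, body); degrade to a temporary page instead of a bare 500."""
    try:
        data = load_data()
    except DatabaseUnavailable:
        return 503, MAINTENANCE_PAGE
    return 200, render(data)
```

The decision about where a failure is allowed to leak to the user is made in one place, instead of being left to whatever exception happens to bubble up.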

Delays in external services or asynchronous tasks are much harder for a user to notice. If fetching data from an external source, like an API, directly affects your site's latency, there's your problem.

Noticing problems in external services requires two things: monitoring and metrics. Only by tracking queue sizes, latency for calls to external services, mail queues and all things related to asynchronous tasks will you be able to tell when your users are indirectly affected by a problem in your infrastructure.

After all, knowing is half the battle.

Monitoring Sucks, You Need It Anyway

I've written in abundance on the virtues of monitoring, metrics and alerting. I can't say it enough how important having a proper monitoring and metrics gathering system in place is. It should be by your side from day one of any testing deployment.

Set up alerts for thresholds that seem like a reasonable place to start to you. Don't ignore alerting notifications, once you get into that habit, you'll miss that one important notification that's real. Instead, learn about your system and its thresholds over time.

You'll never get alerting and thresholds right the first time, you'll adapt over time, identifying false negatives and false positives, but if you don't have a system in place at all, you'll never know what hit your application or your servers.

If you're not using a tool to gather metrics like Munin, Ganglia, New Relic, or collectd, you'll be in for a big surprise once your application becomes unresponsive for some reason. You'll simply never find out what the reason was in the first place.

While Munin has basic built-in alerting capabilities, chances are you'll add something like Nagios or PagerDuty to the mix for alerting.

Most monitoring tools suck, you'll need them anyway.

Supervise Everything

Any process that's required to be running at any time needs to be supervised. When something crashes, be sure there's an automated procedure in place that will either restart the process or notify you when it can't do so, degrading gracefully. Monit, God, bluepill, supervisord, runit, the number of tools available to you is endless.
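To show the basic idea, and only the idea, here's a toy supervisor in Python. It's no replacement for any of the tools above; the `supervise` function and its parameters are made up for this sketch.

```python
import subprocess


def supervise(command, max_restarts=3, notify=print):
    """Run command, restarting it when it exits non-zero, up to max_restarts times.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(command)
        if exit_code == 0:
            return restarts            # clean exit, nothing left to do
        if restarts >= max_restarts:
            # Can't keep it running, so tell a human instead.
            notify(f"giving up on {command!r} after {restarts} restarts")
            return restarts
        restarts += 1
```

The important part is the last branch: when restarting doesn't help, someone gets notified instead of the process silently staying dead.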

Micromanaging people is wrong, but processes need that extra set of eyes on them at all times.

Don't Guess, Measure!

Whatever directly affects your users' experience affects your business. When your site is slow, users will shy away from using it, from generating revenue and therefore (usually) profit.

Whenever a user has to wait for anything, they're not willing to wait forever. If an uploaded video takes hours to process, they'll go to the next video hosting site. When a confirmation email takes hours to be delivered, they'll check out your competitor, taking the money with them.

How do you know that users have to wait? Simple, you track how long things in your application take, how many tasks are currently stuck in your processing queue, how long it took to process an upload. You stick metrics on anything that's directly or indirectly responsible for generating business value.

Without having a proper system to collect metrics in place, you'll be blind. You'll have no idea what's going on inside your application at any given time. Since Coda Hale's talk "Metrics Everywhere" at CodeConf and the release of his metrics library for Scala, an abundance of libraries for different languages has popped up left and right. They make it easy to include timers, counters, and other types of metrics into your application, allowing you to instrument code where you see fit. Independently, Twitter has led the way by releasing Ostrich, their own Scala library to collect metrics. The tools are here for you. Use them.

The most important metrics should be easily accessible on some sort of dashboard. You don't need a big fancy screen in your office right away, a canonical place, e.g. a website including the most important graphs and numbers, where everyone can go and see what's going on with a glance is a good start. Once you have that in place, the next step towards a company-visible dashboard is simply buying a big-ass screen.
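If you've never used one of these libraries, counters and timers boil down to something like the following Python sketch. The `Metrics` class is a toy I made up for this post, not the API of any of the libraries mentioned; a real one would also ship the numbers off to your metrics system.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """In-memory counters and timers, for illustration only."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def increment(self, name, by=1):
        self.counters[name] += by

    @contextmanager
    def timer(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            # Record the duration even when the timed block raises.
            self.timings[name].append(time.monotonic() - start)
```

Wrapping the code that processes an upload in `with metrics.timer("uploads.duration"):` and calling `metrics.increment("uploads.processed")` afterwards is all the instrumentation that's needed to answer how long users have to wait.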

All metrics should be collected in a tool like Ganglia, Munin or something else. These tools make analysis of historical data easy, they allow you to make predictions or correlate the metrics gathered in your applications to other statistics like CPU, memory usage, I/O waits, and so on.

The importance of monitoring and metrics cannot be stressed enough. There's no reason why you shouldn't have it in place. Setting up Munin is easy enough, setting up collection using an external service like New Relic or Scout is usually even easier.

Use Timeouts Everywhere

Latency is your biggest enemy in any networked environment. It creeps up on you like the shadow of the setting sun. There's a whole bunch of reasons why, for example, database queries will suddenly see a spike in execution time, or external services suddenly take forever to answer even the simplest requests.

If your code doesn't have appropriate timeouts, requests will pile up and maybe never return, exhausting available resources (think connection pools) faster than Vettel does a round in Monte Carlo.

Amazon, for example, has internal contracts. Going to their home page involves dozens of requests to internal services. If any one of them doesn't respond in a timely manner, say 300 ms, the application serving the page will render a static snippet instead, thereby decreasing the chance of selling something and directly affecting business value.

You need to treat every call to an external resource as something that can take forever, something that potentially blocks an application server process forever. When an application server process or thread is blocked, it can't serve any other client. When all processes and threads lock up waiting for a resource, your website is dead.

Timeouts make sure that resources are freed and made available again after a grace period. When a database query takes longer than usual, not only does your application need to know how to handle that case, but your database needs to. If your application has a timeout, but your database will happily keep sorting those millions of records in a temp file on disk, you didn't gain a lot. If two dependent resources are within your hands, both need to be aware of contracts and timeouts, both need to properly free resources when the request couldn't be served in a timely manner.

Use timeouts everywhere, but know how to handle them when they occur, know what to tell the user when his request didn't return quickly enough. There is no golden rule what to do with a timeout, it depends not just on your application, but on the specific use case.
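One way to handle a timeout is the static-snippet approach: wait for a bounded time, then fall back. Here's a minimal Python sketch using a thread pool from the standard library. `call_with_timeout` is a name I made up, and the pool size of four is an arbitrary choice for the example.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_seconds, fallback):
    """Return fn()'s result, or fallback if it doesn't answer in time."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        # The worker thread keeps running in the background. The call itself
        # still needs its own timeout to actually free the resource.
        return fallback
```

Note the comment in the except branch: this only stops the caller from waiting. If the slow call never returns, it still occupies a worker, which is exactly why both sides of a call need to know about timeouts.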

Don't Rely on SLAs

The best service fails at some point. It will fail in the most epic way possible, not allowing any user to do anything. This doesn't have to be your service. It can be any service you directly or indirectly rely on.

Say, your code runs on Heroku. Heroku's infrastructure runs on Amazon's EC2. Therefore Heroku is prone to problems with EC2. If a provider like Heroku tells you they have a service level agreement in place that guarantees a minimum amount of availability per month or per year, that's worth squat to you, because they in turn rely on other external services, that may or may not offer different SLAs. This is not specific to Heroku, it's just an obvious example. Just because you outsourced infrastructure doesn't mean you're allowed to stop caring.

If your application runs directly on EC2, you're bound by the same problem. The same is true for any infrastructure provider you rely on, even a big hosting company where your own server hardware is colocated.

They all have some sort of SLA in place, and they all will screw you over with the terms of said SLA. When stuff breaks on their end, that SLA is not worth a single dime to you, even when you were promised to get your money back. It will never make up for lost revenue, for lost users and decreased uptime on your end. You might as well stop thinking about them in the first place.

What matters is what procedures any provider you rely on has in place in case of a failure. The important thing for you as one of their users is to not be left standing in the rain when your hosting world is coming close to an end. A communicative provider is more valuable than one that guarantees an impossible amount of availability. Things will break, not just for you. SLAs give you that false sense of security, the sense that you can blame an outage on someone else.

For more on this topic, Ben Black has written a two part series aptly named "Service Level Disagreements".

Know Your Database

You should know what happens inside your database when you execute any query. Period. You should know where to look when a query takes too long, and you should know what commands to use to analyze why it takes too long.

Do you know how an index is built? How and why your database picks one index over another? Why selecting a random record based on the wrong criteria will kill your database?
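
Here's a minimal sketch of what asking the database can look like. It uses SQLite's `EXPLAIN QUERY PLAN` purely because SQLite ships with Python and needs no server; MySQL, PostgreSQL and Oracle each have their own `EXPLAIN` variants with different output. The `users` table and the index name are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

def query_plan(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)

lookup = "SELECT name FROM users WHERE email = ?"

# Without an index, the planner has to scan the whole table.
plan_before = query_plan(lookup, ("user42@example.com",))

# With an index on the filtered column, it can search the index instead.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(lookup, ("user42@example.com",))

print("without index:", plan_before)
print("with index:   ", plan_after)
```

The exact wording of the plan differs between SQLite versions, but the first plan should report a scan of the table and the second one a search using `idx_users_email`.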

You should know these things. You should read "High Performance MySQL", or "Oracle Internals", or "PostgreSQL 9.0 High Performance". Sorry, I didn't mean to say you should, I meant you must read them.

Love Your Log Files

In case of an emergency, a good set of log files will mean the world to you. This doesn't just include the standard set of log files available on a Unix system. It includes your application and all services involved too.

Your application should log important events, anything that may be useful when analyzing an incident. Again, you'll never get this right the first time around, because you'll never know up front all the details you may be interested in later. Adapt and improve, and add more logging as needed. Your application should also allow you to tune the log verbosity at runtime, either by using a feature switch or by accepting a Unix signal.
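
Tuning verbosity with a Unix signal could look roughly like the following Python sketch. The logger name `myapp` and the choice of `SIGUSR1` are assumptions made for the example, and the code only works on Unix systems.

```python
import logging
import signal

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
app_log = logging.getLogger("myapp")  # hypothetical application logger
app_log.setLevel(logging.INFO)

def toggle_debug(signum, frame):
    """Flip between INFO and DEBUG each time the signal arrives."""
    if app_log.level == logging.DEBUG:
        app_log.setLevel(logging.INFO)
    else:
        app_log.setLevel(logging.DEBUG)

# SIGUSR1 is an arbitrary choice; any signal your process doesn't
# already use for something else would do.
signal.signal(signal.SIGUSR1, toggle_debug)
```

With this in place, sending the signal to the running process from a shell (for example with `kill -USR1` and the process ID) switches it to debug logging without a restart, and sending it again switches it back.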

Separate request logging from application logging. Data on HTTP requests is just as important as application logs, but it's easier if you can sift through the two independently. They're also a lot easier to aggregate for services like Syslog or Loggly when they're on their own.
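
One way to keep the two apart, sketched here with Python's standard `logging` module: give each concern its own logger and its own file. The logger names, file names, and log lines are invented for the example, and a temporary directory stands in for a real log directory.

```python
import logging
import os
import tempfile

def build_logger(name, path):
    """Create a logger that writes only to its own file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep entries out of the root logger's output
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger

log_dir = tempfile.mkdtemp()  # stand-in for a real log directory
request_log_path = os.path.join(log_dir, "request.log")
app_log_path = os.path.join(log_dir, "application.log")

request_log = build_logger("request", request_log_path)
app_log = build_logger("app", app_log_path)

# Request data and application events end up in different files.
request_log.info("GET /builds/42 200 13ms")
app_log.info("build 42 finished with status=passed")
```

Each file can then be shipped, rotated, and searched on its own schedule.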

For you Rails developers out there: using Rails.logger is not an acceptable logging mechanism. All your logged statements will be intermingled with Rails' next-to-unusable request logging output. Use a separate log file for anything that's important to your application.

Just like you should stick metrics on all things that are important to your business, log additional information when things get out of hand. Correlating log files with metrics gathered on your servers and in your application is an incredibly powerful way of analyzing incidents, even long after they occurred.

Learn the Unix Command Line

In case of a failure, the command line will be your best friend. That means knowing the right tools to quickly sift through a set of log files, being able to find and set certain kernel parameters to adjust TCP settings, knowing how to get the most important system statistics with just a few commands, and knowing where to look for a specific service's configuration. All these things are incredibly valuable in case of a failure.

Knowing your way around a Unix or Linux system, even with just a basic toolset, is something that will make your life much easier, not just in operations, but also as a developer. The more tools you have at your disposal, the easier it will be for you to automate tasks and to not be scared of operations in general.

In an emergency, you can't afford to argue that your favorite editor is not installed on a system; you use what's available.

At Scale, Everything Breaks

Working at large scale is nothing anyone should strive for. It's a terrible burden, but an incredibly fascinating one. The need for scalability evolves over time, and it's nothing you can easily predict or assume without knowing all the details, the parameters, and the future. Out of all three, at least one is 100% guesswork.

The larger your infrastructure setup gets, the more things will break. The more servers you have, the larger the number of servers that are unavailable at any given time. That's nothing you need to account for right from the get-go, but it's something to keep in mind.
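
A bit of back-of-the-envelope arithmetic shows why, sketched here in Python. The 99.9% per-server availability is a made-up figure, and the calculation assumes that servers fail independently of each other, which real outages often don't respect.

```python
def expected_unavailable(servers, availability):
    """Average number of servers that are down at any given moment."""
    return servers * (1 - availability)

def chance_any_unavailable(servers, availability):
    """Probability that at least one server is down at a given moment."""
    return 1 - availability ** servers

# Hypothetical fleet sizes, each server assumed to be up 99.9% of the time.
for servers in (1, 10, 100, 1000):
    down = expected_unavailable(servers, 0.999)
    chance = chance_any_unavailable(servers, 0.999)
    print(f"{servers:>5} servers: {down:.3f} down on average, "
          f"{chance:.1%} chance that at least one is down")
```

With a single server, something being down is a rare event. With a thousand of them, it's closer to the normal state of affairs.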

No service that's working at a larger scale was originally designed for it. The code and infrastructure were adapted, the services grew over time, and they failed a lot. Something to think about when you reach for that awesome scalable database before even having any running code.

Embrace Failure

The bottom line is: stuff breaks, and everything breaks at a different scale. Embrace breakage and failure, because it will help you learn and improve your knowledge and skill set over time. Analyze incidents using the data available to you, fix the problem, learn your lesson, and move on.

Don't embrace one thing though: never let a failure happen again if you know what caused it the first time around.

Web operations is not solely about servers and installing software packages. Web operations involves everything required to keep an application available, and your code needs to play along.

Required Reading

As 101s go, this is a short overview of what I think makes for a good starter set of operations skills. If you don't believe or trust me (which is a good thing), here's a list of further reading for you. By now, I consider most of these required reading. The list isn't long, mind you. The truth as of today is still that you learn the most from personal experience on production systems. Both require one basic skill though: you have to want to learn.


Shameless Plug

If you liked this article, you may enjoy the book I'm currently working on: "The NoSQL Handbook".
