Monitoring for Humans
Posted on 28 Mar 2013 by Mathias Meyer
Hi, I'm Mathias, and I'm a developer. Unlike a lot of you at this conference, I'm far from being a monitoring expert. If anything, I'm a user, a tinkerer of all the great tools we're hearing about at this conference.
I help run a little continuous integration service called Travis CI. For that purpose I built several home-baked things that help us collect metrics and trigger alerts.
I want to start with a little story. I spend quality time at coffee shops and I enjoy peeking over the shoulders of the guy who's roasting coffee beans. Next to the big roasting machine they commonly have a laptop with pretty graphs showing how the temperature in the roaster changes over time. On two occasions I found myself telling them: "Hey cool, I like graphs too!"
On the first occasion I looked at the graph and noticed that it'd update itself every 2-3 seconds. I mentioned that to the roaster and he said: "Yeah, I'd really love it if it could update every second." In just two seconds the temperature in the roaster can already drop by almost a degree (Celsius), so he was lacking the granularity to get the best insight into his system.
The second roaster did have one second resolution, and I swooned. But I noticed that every minute or so, he wrote down the current temperature on a sheet of paper. The first guy had done that too. I was curious why they'd do that. He told me that he took it as his reference sheet for the next roasting batch. I asked why he didn't have the data stored in the system. He replied that he didn't trust it enough, because if it lost the information he wouldn't have a reference for his next roasting sheet.
He also keeps a set of coffee bean samples around from previous roasts, roasts that are known to have turned out great. Even coffee roasters have confirmation bias, though to be fully fair, when you're new to the job, any sort of reference can help you move forward.
This was quite curious. They had the technology yet they didn't trust it enough with their data. But heck, they had one-second resolution and they had the technology to measure data from live sensors in real time.
During my first jobs as a developer touching infrastructure, five minute collection intervals and RRDtool graphs were still very much en vogue. My alerts basically came from Monit throwing unhelpful emails at me stating that some process just changed from one state to another.
Since my days with Munin a lot has changed. We went through the era of #monitoringsucks, which, fortunately, quickly turned into the era of #monitoringlove. It's been pretty incredible watching this progress as someone who loves tinkering with new and shiny tools and visualization possibilities. We've seen the emergence of crazy new visualization ideas, like the horizon chart, and we've seen the steady rise of using modern web technologies to render charts, while seeing RRDtool being taken to the next level to visualize time series data.
New approaches providing incredibly detailed insight into network traffic and providing stream analysis of time series data have emerged.
One second resolution is what we're all craving, looking at beautiful and constantly updating charts of 95th percentile values.
And yet, how many of you are still using Nagios?
There are great advances in monitoring at the moment, and I enjoy watching them as someone who greatly benefits from them.
Yet, I'm worried that all these advances still don't focus enough on the single thing that's supposed to use them: humans.
There's lots of work going on to solve problems to make monitoring technology more accessible, yet I feel like we haven't solved the first problem at hand: to make monitoring something that's easy to get into for people new to the field.
Monitoring still involves a lot of looking at graphs, correlating several different time series after the fact, and figuring out and checking for thresholds to trigger alerts. In the end, you still find yourself looking at one or more graphs trying to figure out what the hell it means.
Tracking metrics has become very popular, thanks to Coda Hale's metrics library, which inspired a whole slew of libraries for all kinds of languages, and tools like StatsD, which made it very easy to throw any kind of metric at them and have it pop up in a system like Graphite, Librato Metrics, Ganglia, etc.
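To show how little it takes to throw a metric at StatsD, here is a minimal Python sketch. It relies on the plain-text StatsD line format (`name:value|type`) sent over UDP, with 8125 as the conventional port. The metric names and the helper functions `statsd_line` and `send_metric` are made up for illustration, not part of any library.

```python
import socket


def statsd_line(name, value, kind):
    """Format one metric in the plain-text StatsD line format: name:value|type.

    Common types are "c" (counter), "g" (gauge) and "ms" (timer).
    """
    return f"{name}:{value}|{kind}"


def send_metric(name, value, kind, host="127.0.0.1", port=8125):
    """Fire and forget a single metric over UDP.

    The host and port are assumptions: a StatsD daemon listening locally
    on the conventional port 8125.
    """
    payload = statsd_line(name, value, kind).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Calling `send_metric("builds.started", 1, "c")` would bump a counter. Because it's UDP, the application never blocks or fails just because the metrics daemon is down.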
Yet the biggest question that I get every time I talk to someone about monitoring, in particular people new to the idea, is: "what should I even monitor?"
With all the tools we have at hand, helping people to find the data that matters for their systems is still among the biggest hurdles that must be conquered to actually make sense of metrics.
Can we do a better job of educating people about what they should track, what they could track, and how they can figure out the most important metrics for their system? It took us six months to find the single metric that best reflects the current state of our system. I called it the soul metric, the one metric that matters most to our users and customers.
We started tracking the time since the last build was started and since the last build was finished.
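A metric like this can be sketched in a few lines of Python. This is only an illustration of the idea, not the code we actually run. The class name `BuildClock` and its methods are hypothetical, and the clock is injectable so the logic can be checked without waiting.

```python
import time


class BuildClock:
    """Remembers when a build last started and last finished."""

    def __init__(self, now=time.time):
        self._now = now            # injectable clock, defaults to wall time
        self.last_started = None
        self.last_finished = None

    def build_started(self):
        self.last_started = self._now()

    def build_finished(self):
        self.last_finished = self._now()

    def seconds_since_start(self):
        """Seconds since the last build started, or None if none started yet."""
        if self.last_started is None:
            return None
        return self._now() - self.last_started

    def seconds_since_finish(self):
        """Seconds since the last build finished, or None if none finished yet."""
        if self.last_finished is None:
            return None
        return self._now() - self.last_finished
```

Both values would be reported as gauges at a regular interval. A steadily climbing value is the signal worth looking at.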
On our commercial platform, where customers run builds for their own products and customer projects, the weekend is very quiet. We only run one tenth of the number of builds on a Sunday compared to a normal weekday. Sometimes we don't run any build in 60 minutes. Suddenly checking when a build was last triggered makes a lot less sense.
Suddenly we're confronted with the issue that we need to look at multiple metrics in the same context to see if a build should even have been started, as the fact itself is solely based on a customer pushing code. We're suddenly looking at measuring the absence of data (no new commits) and correlating it with data derived from several attributes of the system, like no running builds and no build request being processed.
The only reasonable solution I could come up with, and it's mostly thanks to talking to Eric from Papertrail, is this: if you need to measure something but it requires the existence of an activity, you have to make sure this activity is generated on a regular basis.
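Here is a rough Python sketch of that idea: trigger a synthetic activity on a fixed schedule and treat the system as stalled only when several of them in a row go missing. The `Heartbeat` name, the five-minute interval in the usage note and the number of tolerated misses are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    interval: float          # seconds between synthetic activities
    missed_allowed: int = 2  # how many intervals may pass before we worry

    def due(self, last_sent, now):
        """True when it is time to trigger the next synthetic activity."""
        return last_sent is None or now - last_sent >= self.interval

    def stalled(self, last_completed, now):
        """True when no activity completed within the tolerated window.

        If nothing has ever completed, this also reports a stall, so a
        freshly started system needs one successful run before it is quiet.
        """
        if last_completed is None:
            return True
        return now - last_completed > self.interval * self.missed_allowed
```

With `Heartbeat(interval=300)`, a quiet Sunday no longer looks like an outage: as long as the synthetic build completes at least once every ten minutes, the absence of customer commits is just that.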
In hindsight, it's so obvious, though it brings up a question: if the thing that generates the activity fails, does that mean the system isn't working? Is this worth an alert, is this worth waking someone up for? Certainly not.
This leads to another interesting question: if I need to create activity to measure it, and if my monitoring system requires me to generate this activity to be able to put a graph and an alert on it, isn't my monitoring system wrong? Are all the monitoring systems wrong?
If a coffee roaster doesn't trust his tools enough to give him a consistent insight into the current, past and future roasting batches, isn't that a weird mismatch between humans and the system that's supposed to give them the assurance that they're on the right path?
A roaster still trusts his instincts more than he trusts the data presented to him. After all, it's all about the resulting coffee bean.
Where does that take us and the current state of monitoring?
We spend an eternity looking at graphs, right after an alert was triggered because a certain threshold was crossed. Does that alert even mean anything, is it important right now? It's where a human operator still has to decide if it's worth the trouble or if they should just ignore the alert.
As much as I enjoy staring at graphs, I'd much rather do something more important than that.
I'd love for my monitoring system to be able to tell me that something out of the ordinary is currently happening. It has all the information at hand to make that decision at least with a reasonable probability.
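What "out of the ordinary" could mean in its simplest form: compare the newest value to the recent history and flag it when it sits several standard deviations away from the mean. The Python sketch below is one naive way to do that, not how any particular monitoring system works. The function name and the threshold of three standard deviations are assumptions, and the approach presumes the metric is reasonably stable and roughly normally distributed.

```python
from statistics import mean, stdev


def is_unusual(history, value, threshold=3.0):
    """Flag `value` when it is more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False            # not enough data to judge anything
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu      # perfectly flat history: any change stands out
    return abs(value - mu) / sigma > threshold
```

Real metrics have daily and weekly rhythms, so a fixed window like this will misfire on a Monday morning. It only shows that the data needed to make a first guess is already there.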
But much more than that, I'd like our monitoring system to be built for humans, reducing the barrier of entry for adding monitoring and metrics to an application and to infrastructure without much hassle. How will we get there?
Looking at the current state of monitoring, there's a strong focus on technology, which is great, because it helps solve bigger issues like data storage, visualization and presentation, and stream analysis. I'd love to see this all converge on the single thing that has to make the call in the end: a human. Helping them make a good decision and getting there should be very high on our list.
There is a fallacy in this wish though. With more automation comes a cognitive bias to trust what the system is telling me. Can the data presented to me be fully trusted? Did the system actually make the right call in sending me an alert? This is something only a human can figure out, just as a coffee roaster needs to trust his instincts even though the variables for every roast are slightly different.
We want to spare our users from having to keep a piece of paper around that tells them exactly what happened the last time this alert was triggered. We want to make sure they don't have to look at samples of beans at different stages to find confirmation for the problem at hand. If the end user always looks at previous samples of data to compare them to the most recent one, the only thing they'll look for is confirmation.
Lastly, the interfaces of the monitoring tools we work with every day are designed to be efficient, they're designed to dazzle with visualization, yet they're still far from being easy to use. If we want everyone in our company to be able to participate in running a system in production, we have to make sure we provide them with interfaces that treat them as what they are: people.
But most importantly, I'd like to see the word spread on monitoring and metrics, making our user interfaces more accessible and telling the tale of how we monitor our systems and how other people can monitor theirs. There's a lot to learn from each other, and I love things like hangops and OpsSchool, they're great starts to get the word out.
Because it's easier to write things down to realize where you are, to figure out where you want to be.
Failure is Always an Option
Posted on 21 Jan 2013 by Mathias Meyer
Failure is still one of the most undervalued things in our business, in most businesses really. We still tend to point fingers elsewhere, blame the other department, or try anything to cover our asses.
How about we do something else instead? We embrace failure openly, turn it into our company's culture and do everything we can to make sure every failure is turned into a learning experience, into an opportunity?
Let me start with some illustrating examples.
Wings of Fury
In 2010, Boeing tested the wings of a brand new 787 Dreamliner. In a giant hangar, they set up a contraption that'd pull the wings of a 787 up, with so much pull that the wings were bound to break.

Eventually, after they'd been flexed upwards of 25 feet, the wings broke spectacularly.
The amazing bit: all the engineers watching it happen started to cheer and applaud.
Why? Because they anticipated the failure under the exact circumstances where it broke, at about 150% of what the wings handle in normal operation.
They can break things loud and proud, they can predict when their engineering work falls apart. Can we do the same?
Safety first
I've been reading a great book, "The Power of Habit", and it outlines another story of failure and how tackling that was turned into an opportunity to improve company culture.
When Paul O'Neill, later to become Secretary of the Treasury, took over management of Alcoa, one of the United States' largest aluminum production companies, he made it his first and foremost priority to tackle the safety issues in the company's production plants.
He put rules in place that any accidents must be reported to him within just a few hours, including remedies on how this kind of accident will be prevented in the future.
While his main focus was to prevent failures, because they would harm or even kill workers, what he eventually managed to do is to implement a company culture where even the smallest suggestions to improve safety or to improve efficiency from any worker would be considered and would be handed up the chain of management.
This fostered a culture of highly increased communication between production plants, between managers, between workers.
Failures and accidents still happened, but were in sharp decline, as every single one was taken as an opportunity to learn and improve the situation to prevent them from happening again.
It was a chain of post-mortems if you will. O'Neill's interest was to make everyone part of improving the overall situation without having to fear blame. Everyone was made to feel like they're an important part of the company. By then, 15000 people worked at Alcoa.
This had an interesting effect on the company. In twelve years, O'Neill managed to increase Alcoa's revenues from $1.5 billion to $23 billion.
His policies became an integral part of the company's culture and ensured that everyone working for it felt like an integral part of the production chain.
Floor workers were given permission to shut down the production chain if they deemed it necessary and were encouraged to blow the whistle when they noticed even the slightest risk in any activity in the company's facilities.
To be quite fair, competitors were pretty much in the dark about these practices, which gave Alcoa a great advantage on the market.
But within a decade of running the company, he transformed it into a culture that sounds strikingly similar to the ideas of DevOps. He managed to make everyone feel responsible for delivering a great product and for everyone to be enabled to take charge should something go wrong.
All that is based on the premise of trust. Trust that when someone speaks up, they will be taken seriously.
Three Habits of Failure
If you look at the examples above, some patterns come up. There are companies outside of our field that have mastered or at least taken on an attitude of accepting that failure is inevitable, anticipating failure and dealing with and learning from failure.
Looking at some more examples it occurred to me that even doing one of these things will improve your company's culture significantly.
How do we fare?
We fail, a lot. It's in the nature of the hardware we use and the software we build. Networks partition, hard drives fail, software bugs creep into systems and can lead to cascading failures.
But do we, as a community, take enough advantage of what we learn from each outage?
Does your company hold post-mortem meetings after a production outage? Do you write public post-mortems for your customers?
If you don't, what's keeping you from doing so? Is it fear of giving your competitors an advantage? Is it fear of giving away too many internal details? Fear of admitting fault in public?
There's a great advantage in making this information public. Usually, it doesn't really concern your customers what happened in all detail. What does concern them is knowing that you're in control of the situation.
A post-mortem follows three Rs: regret, reason and remedy.
They're a means to say sorry to your customers, to tell them that you know what caused the issues and how you're going to fix them.
On the other hand, post-mortems are a great learning opportunity for your peer ops and development people.
Web Operations
This learning is an important part of improving the awareness of web operations, especially during development. There's a great deal to be learned from other people's experiences.
Web operations is a field that is mostly learning by doing right now. Which is an important part of the profession, without a doubt.
If you look at the available books, there are currently three books that give insight into what it means to build and run reliable and scalable systems.
"Release It!", "Web Operations" and "Scalable Internet Architectures" are the ones that come to mind.
My personal favorite is "Release It!", because it raises developer awareness on how to handle and prevent production issues in code.
It's great to see the circuit breaker and the bulkhead patterns introduced in this book now being popularized by Netflix, who openly write about their experiences implementing them.
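For readers who haven't come across the circuit breaker yet, here is a minimal Python sketch of the general idea: after a number of consecutive failures, stop calling the failing dependency for a while and fail fast instead, then let a single trial call through. This is neither the book's nor Netflix's implementation. The class names, the default thresholds and the injectable clock are assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, now=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self._now = now                   # injectable clock
        self._failures = 0
        self._opened_at = None            # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._now() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call skipped")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures in a row, (re)opens it.
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._now()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a remote call as `breaker.call(fetch_status, url)` (with `fetch_status` being whatever function talks to the dependency) keeps a struggling service from being hammered by retries, which is exactly the kind of cascading failure the pattern is meant to contain.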
Netflix is a great example here. They're very open about what they do, they write detailed post-mortems when there's an outage. You should read their engineering blog, same for Etsy's.
Why? Because it attracts engineering talent.
If you're looking for a job, which company would you rather work for? One that encourages taking risks while also taking responsibility for fixing issues when failure does come up, one that enables a culture of fixing and improving issues as a whole rather than assigning blame? Or one that doesn't?
I'd certainly choose the former.
Over the last two years, Amazon has also realized how important this is. Their post-mortems have become very valuable for anyone interested in things that can happen in multi-tenant, distributed systems.
If you remember the most recent outage on Christmas Eve, they even had the guts to come out and say that production data was deleted by accident.
Can you imagine the shame these developers must feel? But can you imagine a culture where the issue itself is considered an opportunity to learn instead of blaming or firing you? If only to learn that accessing production data needs stricter policies.
It's a culture I'd love to see fostered in every company.
Regarding ops education, there have been some great things last year that are worth mentioning. hangops is a nice little circle, streamed live (mostly) every Friday, and available for anyone to watch on YouTube afterwards.
Ops School has started a great collection of introductory material on operations topics. It's still very young, but it's a great start, and you can help move it forward.
Travis CI
At Travis CI, we're learning from failure, a lot. As a continuous integration platform, it started out as a hobby project and was built with a lot of positive assumptions.
It used to be a distributed system that always assumed everything would work correctly all the time.
As we grew and added more languages and more projects, this ideal fell apart pretty quickly.
It is a symptom of a lot of projects that are developer-driven, because there's just so little public information on how to do it right, on how distributed systems are built and run at other companies for them to work reliably.
We decided to turn every failure into an opportunity to share our learnings. We're an open source project, so it only makes sense to be open about our problems too.
Our audience and customers, who are mostly developers themselves, seem to appreciate that. I for one am convinced that we owe it to them.
I encourage you to do the same, to share details on your development, on how you run your systems. It'll be surprising how introducing these changes can affect working as a team as a whole.
Cultural evolution
This insight didn't come easy. We're a small team, and we were all on board with the general idea of openness about our operational work and about the failures in our system.
That openness brings with it the need to own your systems, to own your failures. It took a while for us to get used to working together as a team to get these issues out of the way as quickly as possible and to find a path for a fix.
In the beginning, it was still too easy to look elsewhere for the cause of the problem. Blame is one side of the story, hindsight bias is the other. It's too easy to point out that the issue has been brought up in the past, but that doesn't contribute anything to fixing it.
The more helpful attitude than saying "I've been saying this has been broken for months" is to say "Here's how I'll fix it." You own your failures.
The only thing that matters is delivering value to the customer. Putting aside blame and admitting fault while doing everything you can to make sure the issue is under control is, in my opinion, the only way you can do that, with everyone in your company on board.
Accepting this might just help transform your company's culture significantly.
This essay is an extended version of a talk I gave at Paperless Post about coffee and customer happiness. While the talk was originally titled "Coffee and the Art of Software Maintenance", I figured that customer happiness is overall a much better fit for the topic.
For coffee, maintaining and improving your craft and making customers happy are two means to the same end: to have loyal customers who tell their friends about you.
Geeks everywhere!
I'm a coffee geek, and I spend a lot of time in coffee shops. But rather than spend it on my laptop, writing code, I spend the time watching and talking to the fine people making my coffee, the baristas.
Baristas are geeks, just like we are. They love talking about the latest toys, about which espresso machine is better than the other, they compare paper filters with cloth, and they take detailed notes on the different aromas of coffee when they’re cupping it.
The craft of coffee making is quite fascinating, both from the perspective of precision and customer care. But let's start with a little story.
In June 2010 I had the pleasure of visiting a rather special coffee shop. The London roaster Square Mile had opened a popup shop that only served filter coffee. No milk beverages, not even espresso. Just filter coffee.
It was called Penny University.
The greatest coffee shop in the world
The shop consisted of a bar and six stools. It offered a very simple menu, with three different kinds of coffee served at any time. Every coffee was brewed using a different technique and served with a piece of chocolate matching the taste of the coffee.
For instance, the Yirgacheffe from Ethiopia was brewed with a Hario V60, which so happens to bring out its delicate and sometimes lemony flavours. It was served with a piece of chocolate that also had a lemon flavor.
You could either choose to have just a single brew or to try all three varieties in a three course menu. The latter would require you to sit in for 30 minutes with the barista giving you his full attention, explaining flavors, origin and the brewing technique.
It was one of the greatest coffee experiences I've had so far. The setting, the barista, the attention to detail, the barista's focus on delivering the best possible value, it all added up to something very special and unique.
As I later found out, I was served by the owner of Square Mile, 2007 World Barista Champion James Hoffmann.
Sadly, the shop closed after three months.
Meanwhile, in Berlin
As if by coincidence, after that the coffee scene in Berlin started to take off. Since then, I've had the pleasure of hanging out with a lot of fine baristas from all over the world chatting about coffee, all in the comfort of my hometown. Especially at The Barn, a shop that opened around the same time, I learned to appreciate the precise finesse of making coffee. It's a downward spiral.
At some point what I've learned started having effects on what I do for a living: building and running software, and making customers happy by providing them with the best possible value.
Each necessary, but only jointly sufficient
Let's look at precision and what makes a good cup of coffee.
While a good cup of coffee is a subjective experience, a barista strives for one thing: to make every cup of coffee as great as the next.
To achieve that goal, every variance must be removed. Every step of the brew process must be subject to the same conditions.
This is truly an art, though it sounds surprisingly boring, as the ultimate goal is to have a process that's repeatable every single time. Consistency is a barista’s prime directive.
The variables start with the hardness of the water, continue with finding the right coffee grind setting, which varies from coffee to coffee, and extend to making sure the temperature of the water is always the same.
Add to that water flow, circulation and agitation of coffee grounds during the brew, measuring the water used to brew (the same volume of water weighs less when it's hot than when it's cold), weighing the coffee beans and timing the whole brew.
Of course every variable can be different depending on what brew method is used for the coffee.
A barista has to make sure he can measure every single variable to make sure the brewing conditions are the same every time. This is true both for espresso and filter coffee. Plus, every variable can vary depending on the coffee bean, the roast, and its origin.
If he needs to change something, he can only change one variable at a time to make an informed decision on whether the change had a positive or a negative impact on the resulting brew.
Changing even one variable can have terrible results, leading to a less enjoyable cup. Grind the coffee beans too coarsely, and the coffee will have less taste: it's under-extracted.
Use too little water, and the coffee will be over-extracted. Choose a temperature that's too hot, and the coffee will be less enjoyable, and the customer will have to wait for it to cool down. Use boiling water and you might kill some of the flavors that make the coffee at hand so unique.
You'll find these conditions mostly in the really good coffee shops out there, where people care about their craft. The Starbucks around the corner will make you a latte that burns your tongue, which is unacceptable to what I'd consider a professional barista.
Does all that sound familiar?
Metrics, metrics everywhere!
Over the last two years or so we've seen the operational trend to measure everything. Every variable that can change when code is running in production is measured over time.

Just one variable changing at runtime can have catastrophic results on the whole software, possibly leading to cascading failures or triggering other bugs in the code that have remained undetected so far. Metrics and measuring give you the assurance that if something goes wrong, if something goes off the normal flow, you will notice it immediately.
The same is true for changing code. I find it particularly hard to change code without knowing how it currently behaves in production. Just like with brewing coffee, changing multiple parts of a certain feature at once can lead to behavior that’s hard to reason about.
I prefer doing single changes at a time to see how they behave in isolation. Rather than seeing this as a restriction because of fear of breaking things, I see that as a culture of introducing a single seam at a time to see if it breaks or not. Breaking one thing at a time is much preferable to breaking many.
The important bit is that a company's culture needs to ensure that teams can iterate around these smaller changes quickly, continuously monitoring how they behave in production.
Continuous Coffee Delivery
It's the equivalent of a barista shipping dozens if not hundreds of cups of coffee per day. It’s continuous delivery, a culture fully embraced by the barista at your favorite coffee shop. There can be tiny variances in every single cup, but the barista focuses on keeping them as small as possible and on changing only one thing at a time to be able to get measurements on its effects quickly.
I’ve seen baristas taste my brew before serving it, always ready to chuck it and make a fresh one from scratch, should the end result not satisfy their own quality standards. A smoke test, if you will. It's a great little detail that looks odd at first but makes a lot of sense when you know how many variables are involved.
To round things off, a good barista practices every day. A few dry runs before opening shop and after make sure that variations in the coffee bean are continuously evened out by adapting the brewing process. As coffee beans deteriorate over time (usually a few days to a few weeks) they get drier, and they need a different grind setting.
Of course, this also involves learning new tools, new brewing techniques, choosing the one best applied for a particular brewing method.
I've been surprised many times by how similar all this is to our own work, to writing, shipping and running code.
Talk that talk
It’s fun and interesting to talk to baristas about their work. I've found a lot of them to be happy to share details about what they're doing and why, and they seem to be just as happy to know that there are people who are not just interested in a good cup of joe, but also in how it came to be. They're passionate about their work, just as you are about your code.
Talk to them long enough and they'll think you're working in coffee too. It's pretty fun, it's the equivalent of your customer talking to you about the nuances of concurrency in different programming languages.
It's something that's easy to forget when you spend most of your time with people doing similar work as you do. Compared to a barista, you're just brewing code instead of coffee.
It’s great to talk to other people who are passionate about their work and providing the best value for their customers. It's reaffirming that you're on the right track when you realize that other professions follow similar philosophies.
There's another variable that I have yet to mention: the coffee bean itself. A lot of coffee shops, unsatisfied with the coffee they got from other sources, start looking into roasting their own. They want to take that last variable out of the equation that's under someone else's control.
Plan to throw one (hundred kilos) away
Unfortunately, roasting coffee opens a whole new can of worms. Just like it takes time to find the right values for brewing coffee, you need to find the right temperature and roasting time for every single coffee bean.
To get there, lots of coffee gets thrown away. A coffee shop in Berlin recently started roasting, and they went through several hundred kilos of green beans before they came up with a satisfying end result. Let me tell you that the end result is pretty spectacular.
What they basically apply here is rapid prototyping. They iterate around several bags of coffee to find the right conditions to extract the best possible aroma from the bean.
It sounds insane to throw away all that coffee, but it has to be done to make sure the customer gets the best possible value when buying it.
This is why specialty coffee is more expensive than your bag of Starbucks or the coffee you buy at the supermarket. The value for the person enjoying it is a lot higher as there's a lot more to be experienced than just black coffee.
Unsurprisingly, even bad coffee is these days sold for a premium. When you extrapolate K-cups to the volume of a single bag of Cafe Grumpy beans, you end up paying the same or even more.
The value proposition is convenience. The overall experience is worse than when controlling all the brewing steps yourself, but at least you can be sure to get a cup of coffee quickly.
The craft of coffee has a lot of similarities to software development and maintenance. It's a gradual process, with lots of learning and experience involved.
When you run a coffee shop, there comes the time when roasting yourself is the only option, because you want to have control over everything or because the coffee you buy elsewhere is below your quality standards. Or simply because it's more convenient to do everything in-house.
That's like eventually writing your own custom software components or starting to own your infrastructure more and more over time. You need the control to ensure the best possible service to your customers. It means more work on your end, but if it can ensure that your customers are happy, it's well worth the effort.
Coffee is a personal experience
The one thing that I admire the most about baristas is that they're close to the customer all the time. The customer can follow along every step her coffee takes to get into her hands.
The customer is free to talk to the barista along the process, and most baristas are more than willing to share their insight, what the coffee tastes like and where it came from.
At some Intelligentsia shops, you're even assigned your personal barista who takes you through the entire process of making your coffee. I'm very much in love with that idea. If you stretch that idea to running an internet business, it's similar to having a single support person who takes you through the lifetime of a ticket. As a customer you know that the person on the other end will know all the details about the issue at hand. It makes the whole experience of customer service a lot more personal.
I went to a coffee shop in Toronto and asked the barista about their favorite coffee, which I commonly do when I'm presented with a lot of choices I haven't tried before. I ended up with a rather dark Sumatran brew from the Clover, one of the greatest technical coffee inventions of all time (sadly, the company was bought by Starbucks). It was a bit too dark for my taste.
As a courtesy, she offered me another brew, on the house of course. She took responsibility for her recommendation not meeting my taste and offered me something else for free.
This face-to-face communication also makes it harder to be angry about something. It's still possible, but it's also a lot easier to react to an angry customer when he's right in front of you. If it happens, you offer a free beverage.
Customer experience trumps everything else
That's one of my biggest learnings of the last year, and I have my favorite coffee shops to thank for the inspiration. Personal customer experience trumps everything else, even for a business that's solely accessed through the internet.
You could think that a barista telling you all about their secrets or how to brew excellent coffee will make you stay at home and start making your own coffee all the time.
And so you will. But you will keep coming back because the barista knows you by name, because they learn your taste in coffee, because they give you free samples, because they let you try new coffees first.
That kind of experience is priceless.
A lot of coffee shops have customer loyalty cards. You get a stamp for every coffee and the next coffee is free. I think those loyalty cards are great, and I'm contemplating how they could be applied to internet businesses.
But consider this: instead of knowing that your next coffee will be free, a barista randomly gives you free drinks, new coffee blends, an extra shot of espresso.
Without expecting that next coffee to be free, your happiness levels will be infinitely higher. It's something that I found to make for even more loyal customers and to give them an overall much more personal experience. The surprise trumps every single stamp on your loyalty card.
It's one of the reasons why we send each of our customers a bag of coffee beans. It seems so unrelated to our business, but all of us care about good coffee. And what makes it for the customer is the surprise, them not expecting anything like that from an internet business.
It's also why MailChimp sent out almost 30000 t-shirts last year. After you've successfully launched your first campaign, they send an email to congratulate you and offer to send you a t-shirt. A great and unexpected gesture of customer love. It's worth noting that the shirts are of a great quality, which definitely adds to the surprise.
The similarities of running a coffee shop to running an online business and maintaining software are pretty striking, and you'd think that's only natural, as lots of crafts and running a business are very similar.
Yet the subtleties are what makes every single one of them special, and it's worth looking at them in more detail to see if you can improve your own skills based on the gained knowledge or if you can improve your business' customer relationship efforts.
Both the precision and the customer experience of a good barista and a great coffee shop serve one thing: the best possible value for a customer, a great cup of coffee. If you can get one cup of coffee right and make a customer happy, they'll come again, and again, and again.
Getting a customer to stick around, turning them into your most loyal customer, that's the best thing any business, any developer building a customer-facing product can ask for.
The Virtues of Monitoring, Redux
Posted on 10 Jan 2013 by Mathias Meyer
Two years ago, I wrote about the virtues of monitoring. A lot has changed, a lot has improved, and I've certainly learned a lot since I wrote that initial overview on monitoring as a whole.
There have been a lot of improvements to existing tools, and new players have entered the monitoring market. Infrastructure as a whole has become more and more interesting to businesses building services around it.
On the other hand, awareness for monitoring, good metrics, logging and the like has been rising significantly.
At the same time #monitoringsucks raised awareness that a lot of monitoring tools are still stuck in the late nineties when it comes to user interface and the way they work.
Independent of new and old tools, I've had the pleasure of learning a lot more about the real virtues of monitoring, about how it affects daily work and how it evolves over time. This post is about discussing some of these insights.
Monitoring all the way down
When you start monitoring even just small parts of an application, the need for more detail and for information about what's going on in a system arises quickly. You start with an innocent number of application level metrics, add metrics for database and external API latencies, start tracking system level and business metrics.
As you add monitoring to one layer of the system, the need to get more insight into the layer below comes up sooner or later.
One layer has just been tackled recently in a way that's accessible for anyone: communication between services on the network. Boundary has built some pretty cool monitoring stuff that gives you incredibly detailed insight into how services talk to each other, by way of their protocol, how network traffic from inside and outside a network develops over time, and all that down to the second.
The real time view is pretty spectacular to behold.

If you go down even further on a single host, you get to the level where you can monitor disk latencies.
Or you could measure the effect of screaming at a disk array of a running system. dtrace is a pretty incredible tool, and I hope to see it spread and become widely available on Linux systems. It allows you to inject instrumentation into arbitrary parts of the host system, making it possible to measure any system call without a lot of overhead.
Heck, even our customer support tool allows us to track metrics for response times, how many tickets each staff member handled, and for how long.

It's easy to start obsessing about monitoring and metrics, but there comes a time, when you either realize that you've obsessed for all the right reasons, or you add more monitoring.
Mo' monitoring, mo' problems
The crux of monitoring more layers of a system is that with more monitoring, you can and will detect more issues.
Consider Boundary, for example. It gives you insight into a layer you haven't had insight into before, at least not at that granular level. For example, round trip times of liveness traffic in a RabbitMQ cluster.

This gives you a whole new pile of data to obsess about. It's good because that insight is very valuable. But it requires more attention, and more issues require investigation.
You also need to learn how a system behaving normally is reflected in those new systems, and what constitutes unusual behaviour. It takes time to learn and to interpret the data correctly.
In the long run though, that investment is well worth it.
Monitoring is an ongoing process
When we started adding monitoring to Travis CI, we started small. But we quickly realized what metrics really matter and what parts of the application and the infrastructure around it need more insight, more metrics, more logging.
With every new component deployed to production, new metrics need to be maintained, more logging and new alerting need to be put in place.
The same is true for new parts of the infrastructure. With every new system or service added, new data needs to be collected to ensure the service is running smoothly.
Knowing which metrics are important and which aren't is experience that develops over time. Metrics can come and go, and the requirements for metrics are subject to change, just as they are for code.
As you add new metrics, old metrics might become less useful, or you need more metrics in other parts of the setup to make sense of the new ones.
It's a constant process of refining the data you need to have the best possible insight into a running system.
Monitoring can affect production systems
The more data you collect, with higher and higher resolution, the more you run the risk of affecting a running system. Business metrics regularly pulled from the database can become a burden on the database that's supposed to serve your customers.
Pulling data out of running systems is a traditional approach to monitoring, one that's unlikely to go away any time soon. However, it's an approach that's less and less feasible as you increase resolution of your data.
Guaranteeing that this collection process is low on resources is hard. It's even harder to get a system up and running that can handle high-resolution data from a lot of services sent concurrently.
So new approaches have started to pop up to tackle this problem. Instead of pulling data from running processes, the processes themselves collect data and regularly push it to aggregation services which in turn send the data to a system for further aggregation, graphing, and the like.
StatsD is without a doubt the most popular of these aggregation services, and it has sparked a ton of forks in different languages.
Instead of relying on TCP with its long connection handshakes and timeouts, StatsD uses UDP. The processes sending data to it stuff short messages into a UDP socket without worrying about whether or not the data arrives.
If some data doesn't make it because of network issues, that only leaves a small dent. It's more important for the system to serve customers than for it to wait around for the aggregation service to become available again.
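To make the fire-and-forget idea concrete, here's a minimal sketch of a client pushing a counter over UDP. The `name:value|c` line is StatsD's plain-text counter format and 8125 is its default port; the function names and the metric name are my own placeholders, not part of any StatsD library.

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter increment in StatsD's plain-text format: <name>:<value>|c"""
    return f"{name}:{value}|c".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> bool:
    """Stuff one datagram into a UDP socket and move on.

    There is no handshake and no acknowledgement. Any local socket error is
    swallowed so that metrics can never take down the code path serving customers.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (host, port))
        return True
    except OSError:
        return False


# Hypothetical metric name, for illustration only.
send_metric(format_counter("builds.started"))
```

Note that a `True` return only means the datagram was handed to the operating system, not that anything received it. That's exactly the trade-off described above.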
While StatsD solves the problem of easily collecting and aggregating data without affecting production systems, there's now the problem of being able to inspect the high-resolution data in meaningful ways. Historical analysis and alerting on high-resolution data becomes a whole new challenge.
Riemann has popularized looking at monitoring data as a stream, to which you can apply queries, and form reactions based on those queries. You can move the data window inside the stream back and forth, so you can compare data in a historical context before deciding on whether it's worth an alert or not.
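The idea of comparing a recent window against historical context can be sketched in a few lines. This is not Riemann's API or query language, just a plain Python illustration under my own assumptions: the class name, the window sizes and the "twice the baseline" rule are all made up for the example.

```python
from collections import deque
from statistics import mean


class WindowedStream:
    """Keep a short recent window and a longer baseline of older values."""

    def __init__(self, window: int, history: int):
        self.recent = deque(maxlen=window)     # the newest `window` values
        self.baseline = deque(maxlen=history)  # values that aged out of `recent`

    def push(self, value: float) -> None:
        # When the recent window is full, its oldest value moves to the baseline
        # before the new value pushes it out.
        if len(self.recent) == self.recent.maxlen:
            self.baseline.append(self.recent[0])
        self.recent.append(value)

    def is_anomalous(self, factor: float = 2.0) -> bool:
        # Without a full window and some history there is nothing to compare against.
        if not self.baseline or len(self.recent) < self.recent.maxlen:
            return False
        return mean(self.recent) > factor * mean(self.baseline)


stream = WindowedStream(window=3, history=10)
for latency_ms in [100] * 10:
    stream.push(latency_ms)
steady = stream.is_anomalous()   # False: recent values match the baseline

for latency_ms in [500] * 3:
    stream.push(latency_ms)
spiking = stream.is_anomalous()  # True: recent mean is 500, baseline mean is 100
```

A single slow request wouldn't trip this check, only a sustained shift of the whole window would, which is the point of looking at the stream in context before deciding on an alert.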
Systems like StatsD and Riemann make it a lot easier for systems to aggregate data without having to rely on polling. Services can just transmit their data without worrying much about how and where it's used for other purposes like log aggregation, graphing or alerting.
The important realization is that with increasing need for scalability and distributed systems, software needs to be built with monitoring in mind.
Imagine RabbitMQ that instead of you having to poll the data from it, sends its metrics as a message at a configurable interval to a configurable fanout. You can choose to consume the data and submit it to a system like StatsD or Riemann, or you can ignore it and the broker will just discard the data.
Who's monitoring the monitoring?
Another fallacy of monitoring is that it needs to be reliable. For it to be fully reliable it needs to be monitored. Wait, what?
Every process that is required to aggregate metrics, to trigger alerts, to analyze logs needs to be running for the system to work properly.
So monitoring in turn needs its own supervision to make sure it's working at all times. As monitoring grows it requires maintenance and operations to take care of it.
Which makes it a bit of a burden for small teams.
Lots of new companies have sprung into life serving this need. Instead of having to worry about running services for logs, metrics and alerting by themselves, it can be left to companies who are more experienced in running them.
Librato Metrics, Papertrail, OpsGenie, LogEntries, Instrumental, NewRelic, DataDog, to name a few. Other companies take the burden of having to run your own Graphite system away from you.
It's been interesting to see new companies pop up in this field, and I'm looking forward to seeing this space develop. The competition from the commercial space is bound to trigger innovation and improvements on the open source front as well.
We're heavy users of external services for log aggregation, collecting metrics and alerting. Simply put, they know better how to run that platform than we do, and it allows us to focus on delivering the best possible customer value.
Monitoring is getting better

Lots of new tools have sprung up in the last two years. While development on them started earlier than that, the most prominent tools are probably Graphite and Logstash. Cubism brings new ideas on how to visualize time series data. It's one of the several dozen dashboards that Graphite's existence, and the flexibility it offers through its API, has sparked. Tasseo is another one of them, a successful experiment of having an at-a-glance dashboard with the most important metrics in one convenient overview.
It'll still be a while until we see the ancient tools like Nagios, Icinga and others improve, but the competition is ramping up. Sensu is one open source alternative to keep an eye on.
I'm looking forward to seeing how the monitoring space evolves over the next two years.
On Pager Duty
Posted on 02 Jan 2013 by Mathias Meyer
Over the last year, as we started turning Travis CI into a hosted product, we added a ton of metrics and monitoring. While we started out slow, we soon figured out which metrics are key and which are necessary to monitor the overall behavior of the system.
I built us a custom collector that rakes in metrics from our database and from the API exposed by RabbitMQ. It soon dawned on me that these are our core metrics, and that they need not only graphs, we need to be alerted when they cross thresholds.
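The threshold part of such a collector boils down to something like the sketch below. It is not our actual collector; the metric names, the limits and the `breached` helper are placeholders for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str   # name of the collected metric
    limit: float  # alert when the sampled value rises above this


def breached(samples: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return one alert message per metric whose latest sample exceeds its limit."""
    alerts = []
    for threshold in thresholds:
        value = samples.get(threshold.metric)
        # A metric missing from this collection run is skipped here; a real setup
        # would probably want to alert on missing data as well.
        if value is not None and value > threshold.limit:
            alerts.append(
                f"{threshold.metric} is {value}, above threshold {threshold.limit}"
            )
    return alerts


# Hypothetical samples, e.g. one from the RabbitMQ API and one from the database.
samples = {"queue.builds.ready": 120, "db.jobs.queued": 5}
thresholds = [Threshold("queue.builds.ready", 100), Threshold("db.jobs.queued", 50)]
alerts = breached(samples, thresholds)
```

Whatever `breached` returns is what gets handed to the alerting channel, be that a chat room or a pager.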
The first iteration of that dumped alerts into Campfire. Given that we're a small team and the room might be empty at times, that was just not sufficient for an infrastructure platform that's used by customers and open source projects around the world, at any time of the day.
So we added alerting, by way of OpsGenie. It's set up to trigger alerts via iPhone push notifications and escalations via SMS, should an alert not have been acknowledged or closed within 10 minutes. Eventually, escalation needs to be done via voice calls so that someone really picks up. It's easy to miss a vibrating iPhone when you're sound asleep, but much harder when it keeps on vibrating until someone picks up.
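The escalation rule itself is simple enough to write down. The sketch below is not OpsGenie's API, only an illustration of the policy described above; the function and channel names are assumptions.

```python
from typing import List

ESCALATE_AFTER_SECONDS = 10 * 60  # escalate when unacknowledged for 10 minutes


def channels_to_notify(raised_at: float, now: float, acknowledged: bool) -> List[str]:
    """Decide which channels an open alert should currently go out on.

    `raised_at` and `now` are timestamps in seconds.
    """
    if acknowledged:
        return []  # someone is on it, stop notifying
    if now - raised_at >= ESCALATE_AFTER_SECONDS:
        return ["push", "sms"]  # nobody reacted in time, escalate
    return ["push"]


fresh = channels_to_notify(raised_at=0, now=60, acknowledged=False)
stale = channels_to_notify(raised_at=0, now=600, acknowledged=False)
handled = channels_to_notify(raised_at=0, now=900, acknowledged=True)
```

Adding voice calls would just be another, later tier in the same function.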
A Pager for every Developer
Just recently I read an interview with Werner Vogels on architecture and operations at Amazon. He said something that stuck with me: "You build it, you run it."
That got me thinking. Should developers of platforms be fully involved in the operations side of things?
A quick survey on Twitter showed that there are some companies where developers are paged when there are production issues, others fully rely on their operations team.
There's merit to both, but I could think of a few reasons why developers should be carrying a pager just like operations does.
You stay connected to what your code does in production. When code is developed, the common tool to manage expectations is to write tests. Unfortunately, no unit test, no integration test will be fully able to reproduce circumstances of what your code is doing in production.
You start thinking about your code running. Reasoning about what a particular piece of code is doing under specific production circumstances is hard, but not entirely impossible. When you're the one responsible for having it run smoothly and serve your customers, this goes up to a whole new level.
Metrics, instrumentation, alerting, logging and error handling suddenly become a natural part of your coding workflow. You start making your software more operable, because you're the one who has to run it. While software should be easy to operate in any circumstances, it commonly isn't. When you're the one having to deal with production issues, that suddenly has a very different appeal.
Code beauty is suddenly a bit less important than making sure your code can treat errors, timeouts, increased latencies. Kind of an ironic twist like that. Code that's resilient to production issues might not have a pretty DSL, it might not be the most beautiful code, but it may be able to sustain whatever issue is thrown at it.
Last, when you're responsible for running things in production, you're forced to learn about the entire stack of an application, not just the code bits, but its runtime, the host system, hardware, network. All that turns into something that feels a lot more natural over time.
I consider that a good thing.
There'll always be situations where something needs to be escalated to the operations team, with deeper knowledge of the hardware, network and the like. But if code breaks in production, and it affects customers, developers should be on the front line of fixing it, just like the operations team.
Even more so for teams that don't have any operations people on board. At some point, a simple exception tracker just doesn't cut it anymore, especially when no one gets paged on critical errors.
Being On Call
For small teams in particular, there's a pickle that needs to be solved: who gets up in the middle of the night when an alert goes off?
When you have just a few people on the team, like your average bootstrapping startup, does an on call schedule make sense? This is something I haven't fully figured out yet.
We're currently in the fortunate position that one of our team members is in New Zealand, but we have yet to find a good way to assign on call when he's out or for when he's back on this side of the world.
The folks at dotCloud have written about their schedule, thank you! Hey, you should share your pager and on-call experiences too!
Currently we have a first come first serve setup. When an alert comes in and someone sees it, it gets acknowledged and looked into. If that involves everyone coming online, that's okay for now.
However, it's not an ideal setup, because being able to handle an alert means being able to log into remote systems, restart apps, inspect the database, look at the monitoring charts. Thanks to iPhone and iPad most of that is already possible today.
But to be fully equipped to handle any situation, it's good to have a laptop at hand.
This brings up the question: who's carrying a laptop and when? Which in turn means that some sort of on-call schedule is still required.
We're still struggling with this, so I'd love to read more about how other companies and teams handle that.
Playbooks
During a recent hangops discussion, there was a chat about developers being on call. It brought up an interesting idea, a playbook on how to handle specific alerts.
It's a document explaining things to look into when an alert comes up. Ideally, an alert already includes a link to the relevant section in the book. This is something operations and developers should work on together to make sure all fronts are covered.
It takes away some of the scare of being on call, as you can be sure there's some guidance when an issue comes up.
It also helps refine monitoring and alerts and make sure there are appropriate measures available to handle any of them. If there are not, that part needs improving.
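Linking alerts to the playbook could look something like this sketch. The URL, the alert keys and the section anchors are all hypothetical; the point is only that an alert without a playbook entry becomes visible as a gap.

```python
# Hypothetical location of the playbook document.
PLAYBOOK_BASE_URL = "https://example.com/playbook"

# Hypothetical mapping from alert keys to section anchors in the playbook.
PLAYBOOK_SECTIONS = {
    "queue_backlog": "queue-backlog",
    "db_connections": "database-connections",
}


def alert_message(alert_key: str, summary: str) -> str:
    """Attach the relevant playbook link to an alert, or flag the missing entry."""
    section = PLAYBOOK_SECTIONS.get(alert_key)
    if section is None:
        # No guidance exists yet: say so, so that the playbook gets improved.
        return f"{summary} (no playbook entry yet, please add one)"
    return f"{summary} | playbook: {PLAYBOOK_BASE_URL}#{section}"


covered = alert_message("queue_backlog", "Build queue is backing up")
missing = alert_message("disk_full", "Disk usage above 90%")
```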
I'm planning on building a playbook for Travis as we go along and refine our monitoring and alerts. It's a neat idea.
Sleepless in Seattle
There's a psychological side to being on-call that needs a lot of getting used to: the thought that an alert could go off at any time. While that's a natural thing, as failures do happen all the time, it's easy to mess up your head. It certainly did that for me.
Lying in bed, not being able to sleep, because your mind is waiting for an alert, it's not a great feeling. It takes getting used to. It's also why having an on-call schedule is preferable over an all hands scenario. When only one person is on call, team mates can at least be sure to get a good night's sleep. As the schedule should be rotating, everyone gets to have that luxury on a regular basis.
It does one thing though: it pushes you to make sure alerts only go off for relevant issues. Not everything needs to be fixed right away. Some issues could be taken care of by improving the code, others are only temporary fluctuations because of increased network latency and will resolve themselves after just a few minutes. Alerting someone on every other exception raised doesn't cut it anymore. Alerts need to be concise and only be triggered when the error is severe enough and affects customers directly. Getting this right is the hard part, and it takes time.
All that urges you to constantly improve your monitoring setup, to increase relevance of alerts, and to make sure that everyone on the team is aware of the issues, how they can come up and how they can be fixed.
It's a good thing.






