This essay is an extended version of a talk I gave at Paperless Post about coffee and customer happiness. While the talk was originally titled "Coffee and the Art of Software Maintenance", I figured that customer happiness is overall a much better fit for the topic.

For coffee, maintaining and improving your craft and making customers happy are two means to the same end: to have loyal customers who tell their friends about you.

Geeks everywhere!

I'm a coffee geek, and I spend a lot of time in coffee shops. But rather than spend it on my laptop, writing code, I spend the time watching and talking to the fine people making my coffee, the baristas.

Baristas are geeks, just like we are. They love talking about the latest toys, about which espresso machine is better than the other, they compare paper filters with cloth, and they take detailed notes on the different aromas of coffee when they’re cupping it.

The craft of coffee making is quite fascinating, both from the perspective of precision and customer care. But let's start with a little story.

In June 2010 I had the pleasure of visiting a rather special coffee shop. The London roaster Square Mile had opened a popup shop that only served filter coffee. No milk beverages, not even espresso. Just filter coffee.

It was called Penny University.

Penny University

The greatest coffee shop in the world

The shop consisted of a bar and six stools. It offered a very simple menu, with three different kinds of coffee served at any time. Every coffee was brewed using a different technique and served with a piece of chocolate matching the taste of the coffee.

For instance, the Yirgacheffe from Ethiopia was brewed with a Hario V60, which just so happens to bring out its delicate and sometimes lemony flavors. It was served with a piece of chocolate that also had a lemon flavor.

Penny University

You could either choose to have just a single brew or to try all three varieties in a three course menu. The latter would require you to settle in for 30 minutes, with the barista giving you his full attention, explaining flavors, origin and the brewing technique.

It was one of the greatest coffee experiences I've had so far. The setting, the barista, the attention to detail, the barista's focus on delivering the best possible value, it all added up to something very special and unique.

As I later found out, I was served by the owner of Square Mile, 2007 World Barista Champion James Hoffmann.

Sadly, the shop closed after three months.

Meanwhile, in Berlin

As if by coincidence, the coffee scene in Berlin started to take off after that. Since then, I've had the pleasure of hanging out with a lot of fine baristas from all over the world, chatting about coffee, all in the comfort of my hometown. Especially at The Barn, a shop that opened around the same time, I learned to appreciate the precise finesse of making coffee. It's a downward spiral.

At some point, what I'd learned started having an effect on what I do for a living: building and running software, and making customers happy by providing them with the best possible value.

Coffee time at The Barn

Each necessary, but only jointly sufficient

Let's look at precision and what makes a good cup of coffee.

While a good cup of coffee is a subjective experience, a barista strives for one thing: to make every cup of coffee as great as the next.

To achieve that goal, every source of variance must be removed. Every step of the brew process must be subject to the same conditions.

This is truly an art, though it sounds surprisingly boring, as the ultimate goal is to have a process that's repeatable every single time. Consistency is a barista’s prime directive.

The variables start with the hardness of the water, include finding the right grind setting, which varies from coffee to coffee, and extend to making sure the temperature of the water is always the same.

Add to that water flow, circulation and agitation of coffee grounds during the brew, measuring the water used to brew (water has a different weight when it's hot compared to when it's cold), weighing the coffee beans and timing the whole brew.

Of course every variable can be different depending on what brew method is used for the coffee.

A barista has to be able to measure every single variable to make sure the brewing conditions are the same every time. This is true both for espresso and filter coffee. Plus, every variable can vary depending on the coffee bean, the roast, and its origin.

If he needs to change something, he can only change one variable at a time to make an informed decision on whether the change had a positive or a negative impact on the resulting brew.

Getting even one variable wrong can have terrible consequences, leading to a much less enjoyable result. Grind the coffee beans too coarse, and the coffee will have less taste; it's under-extracted.

Use too little water, and the coffee will be over-extracted. Choose a temperature that's too hot, and the coffee will be less enjoyable, and the customer will have to wait for it to cool down. Use boiling water and you might kill some of the flavors that make the coffee at hand so unique.

You'll find these conditions mostly in the really good coffee shops out there, where people care about their craft. The Starbucks around the corner will make you a latte that burns your tongue, which is unacceptable to anyone I'd consider a professional barista.

Does all that sound familiar?

Metrics, metrics everywhere!

Over the last two years or so, we've seen an operational trend towards measuring everything. Every variable that can change when code is running in production is measured over time.

Even one variable changing at runtime can have catastrophic results for the whole system, possibly leading to cascading failures or triggering bugs in the code that have remained undetected so far. Metrics and measuring give you the assurance that if something goes wrong, if something strays from the normal flow, you will notice it immediately.

The same is true for changing code. I find it particularly hard to change code without knowing how it currently behaves in production. Just like with brewing coffee, changing multiple parts of a certain feature at once can lead to behavior that’s hard to reason about.

I prefer making a single change at a time to see how it behaves in isolation. Rather than seeing this as a restriction born of the fear of breaking things, I see it as a culture of introducing a single seam at a time to see if it breaks or not. Breaking one thing at a time is much preferable to breaking many.

The important bit is that a company's culture needs to ensure that teams can iterate around these smaller changes quickly, continuously monitoring how they behave in production.

Continuous Coffee Delivery

It's the equivalent of a barista shipping dozens if not hundreds of cups of coffee per day. It's continuous delivery, a culture fully embraced by the barista at your favorite coffee shop. There can be tiny variances in every single cup, but the barista focuses on keeping them as small as possible and on changing only one thing at a time to be able to measure its effects quickly.

I’ve seen baristas taste my brew before serving it, always ready to chuck it and make a fresh one from scratch, should the end result not satisfy their own quality standards. A smoke test, if you will. It's a great little detail that looks odd at first but makes a lot of sense when you know how many variables are involved.

To round things off, a good barista practices every day. A few dry runs before and after opening shop make sure that variations in the coffee beans are continuously evened out by adapting the brewing process. As coffee beans deteriorate over time (usually a few days to a few weeks) they get drier, and they need a different grind setting.

Of course, this also involves learning new tools and new brewing techniques, and choosing the one best suited to a particular coffee.

I've been surprised many times by how similar all this is to our own work, to writing, shipping and running code.

Talk that talk

Blue Bottle

It’s fun and interesting to talk to baristas about their work. I've found a lot of them to be happy to share details about what they're doing and why, and they seem to be just as happy to know that there are people who are not just interested in a good cup of joe, but also in how it came to be. They're passionate about their work, just as you are about your code.

Talk to them long enough and they'll think you're working in coffee too. It's pretty fun, it's the equivalent of your customer talking to you about the nuances of concurrency in different programming languages.

It's something that's easy to forget when you spend most of your time with people doing similar work as you do. Compared to a barista, you're just brewing code instead of coffee.

It’s great to talk to other people who are passionate about their work and providing the best value for their customers. It's reaffirming that you're on the right track when you realize that other professions follow similar philosophies.

There's another variable that I have yet to mention: the coffee bean itself. A lot of coffee shops, unsatisfied with the coffee they get from other sources, start looking into roasting their own. They want to take that last variable that's under someone else's control out of the equation.

Plan to throw one (hundred kilos) away

Copenhagen II

Unfortunately, roasting coffee opens a whole new can of worms. Just like it takes time to find the right values for brewing coffee, you need to find the right temperature and roasting time for every single coffee bean.

To get there, lots of coffee gets thrown away. A coffee shop in Berlin recently started roasting, and they went through several hundred kilos of green beans before they came up with a satisfying end result. Let me tell you that the end result is pretty spectacular.

What they're basically applying here is rapid prototyping. They iterate through several bags of coffee to find the right conditions to extract the best possible aroma from the bean.

It sounds insane to throw away all that coffee, but it has to be done to make sure the customer gets the best possible value when buying it.

This is why specialty coffee is more expensive than your bag of Starbucks or the coffee you buy at the supermarket. The value for the person enjoying it is a lot higher as there's a lot more to be experienced than just black coffee.

Unsurprisingly, even bad coffee is sold at a premium these days. When you extrapolate K-cups to the volume of a single bag of Cafe Grumpy beans, you end up paying the same or even more.

The value proposition is convenience. The overall experience is worse than when controlling all the brewing steps yourself, but at least you can be sure to get a cup of coffee quickly.

The craft of coffee has a lot of similarities to software development and maintenance. It's a gradual process, with lots of learning and experience involved.

When you run a coffee shop, there comes a time when roasting yourself is the only option, because you want to have control over everything, because the coffee you buy elsewhere is below your quality standards, or simply because it's more convenient to do everything in-house.

That's like eventually writing your own custom software components or starting to own your infrastructure more and more over time. You need the control to ensure the best possible service to your customers. It means more work on your end, but if it can ensure that your customers are happy, it's well worth the effort.

Four Barrel

Coffee is a personal experience

The one thing that I admire the most about baristas is that they're close to the customer all the time. The customer can follow along every step her coffee takes to get into her hands.

The customer is free to talk to the barista along the process, and most baristas are more than willing to share their insight, what the coffee tastes like and where it came from.

At some Intelligentsia shops, you're even assigned a personal barista who takes you through the entire process of making your coffee. I'm very much in love with that idea. If you stretch that idea to running an internet business, it's similar to having a single support person take you through the lifetime of a ticket. As a customer, you know that the person on the other end will know all the details about the issue at hand. It makes the whole experience of customer service a lot more personal.

I went to a coffee shop in Toronto and asked the barista about their favorite coffee, which I commonly do when I'm presented with a lot of choices I haven't tried before. I ended up with a rather dark Sumatran brew from the Clover (one of the greatest technical coffee inventions of all time; sadly, the company was bought by Starbucks), and it was a bit too dark for my taste.

As a courtesy, she offered to make me another brew, on the house of course. She took charge of her recommendation not meeting my taste and offered me something else for free.

This face-to-face communication also makes it harder to be angry about something. It's still possible, but it's also a lot easier to react to an angry customer when he's right in front of you. If it happens, you offer a free beverage.

Customer experience trumps everything else

That's one of my biggest learnings of the last year, and I have my favorite coffee shops to thank for the inspiration. Personal customer experience trumps everything else, even for a business that's solely accessed through the internet.

You could think that a barista telling you all about their secrets or how to brew excellent coffee will make you stay at home and start making your own coffee all the time.

And so you will. But you will keep coming back because the barista knows you by name, because they learn your taste in coffee, because they give you free samples, because they let you try new coffees first.

That kind of experience is priceless.

A lot of coffee shops have customer loyalty cards. You get a stamp for every coffee and the next coffee is free. I think those loyalty cards are great, and I'm contemplating how they could be applied to internet businesses.

But consider this: instead of knowing that your next coffee will be free, a barista randomly gives you free drinks, new coffee blends, an extra shot of espresso.

Without expecting that next coffee to be free, your happiness levels will be infinitely higher. It's something I've found makes for even more loyal customers and gives them an overall much more personal experience. The surprise trumps every single stamp on your loyalty card.

It's one of the reasons why we send each of our customers a bag of coffee beans. It seems so unrelated to our business, but all of us care about good coffee. And what makes it special for the customer is the surprise, them not expecting anything like that from an internet business.

It's also why MailChimp sent out almost 30,000 t-shirts last year. After you've successfully launched your first campaign, they send an email to congratulate you and offer to send you a t-shirt. A great and unexpected gesture of customer love. It's worth noting that the shirts are of great quality, which definitely adds to the surprise.

The similarities between running a coffee shop and running an online business and maintaining software are pretty striking, and you'd think that's only natural, as lots of crafts and businesses are very similar.

Yet the subtleties are what make every single one of them special, and it's worth looking at them in more detail to see if you can improve your own skills, or your business's customer relationship efforts, based on what you learn.

Both the precision and the customer experience of a good barista and a great coffee shop serve one thing: the best possible value for the customer, a great cup of coffee. If you can get one cup of coffee right and make a customer happy, they'll come back again, and again, and again.

Getting a customer to stick around, turning them into your most loyal customer, that's the best thing any business, any developer building a customer-facing product can ask for.

Tags: coffee, customers

Two years ago, I wrote about the virtues of monitoring. A lot has changed, a lot has improved, and I've certainly learned a lot since I wrote that initial overview on monitoring as a whole.

There have been a lot of improvements to existing tools, and new players have entered the monitoring market. Infrastructure as a whole has become more and more interesting for the service businesses built around it.

Meanwhile, awareness of monitoring, good metrics, logging and the like has been rising significantly.

At the same time #monitoringsucks raised awareness that a lot of monitoring tools are still stuck in the late nineties when it comes to user interface and the way they work.

Independent of new and old tools, I've had the pleasure of learning a lot more about the real virtues of monitoring, about how it affects daily work and how it evolves over time. This post discusses some of these insights.

Monitoring all the way down

When you start monitoring even just small parts of an application, the need for more detail and for information about what's going on in a system arises quickly. You start with an innocent number of application-level metrics, add metrics for database and external API latencies, and start tracking system-level and business metrics.
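To make the first of those concrete, here's a minimal sketch of timing an external API call, the kind of application-level latency metric mentioned above. The metric name and the logging destination are made up for illustration; in practice the number would go to whatever metrics backend you use:

    import logging
    import time

    log = logging.getLogger("metrics")

    def timed(metric_name, fn, *args, **kwargs):
        """Run fn and record how long it took, in milliseconds."""
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            # In a real setup this number goes to your metrics backend;
            # here it's just logged.
            log.info("%s=%.1fms", metric_name, elapsed_ms)

    # Usage: wrap a database query or an external API call.
    # payload = timed("api.github.latency", requests.get, "https://api.github.com")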

As you add monitoring to one layer of the system, the need to get more insight into the layer below comes up sooner or later.

One layer has just recently been tackled in a way that's accessible to anyone: communication between services on the network. Boundary has built some pretty cool monitoring stuff that gives you incredibly detailed insight into how services talk to each other, by way of their protocol, and into how network traffic from inside and outside a network develops over time, all of that down to the second.

The real time view is pretty spectacular to behold.

If you go down even further on a single host, you get to the level where you can monitor disk latencies.

Or you could measure the effect of screaming at a disk array of a running system. dtrace is a pretty incredible tool, and I hope to see it spread and become widely available on Linux systems. It allows you to inject instrumentation into arbitrary parts of the host system, making it possible to measure any system call without a lot of overhead.

Heck, even our customer support tool allows us to track metrics for response times, and for how many tickets each staff member handled and for how long.

It's easy to start obsessing about monitoring and metrics, but there comes a time when you either realize that you've obsessed for all the right reasons, or you add more monitoring.

Mo' monitoring, mo' problems

The crux of monitoring more layers of a system is that with more monitoring, you can and will detect more issues.

Consider Boundary, for example. It gives you insight into a layer you haven't had insight into before, at least not at that granular level: round-trip times of liveness traffic in a RabbitMQ cluster, for instance.

This gives you a whole new pile of data to obsess about. It's good because that insight is very valuable. But it requires more attention, and more issues require investigation.

You also need to learn how a system behaving normally is reflected in those new systems, and what constitutes unusual behavior. It takes time to learn and to interpret the data correctly.

In the long run though, that investment is well worth it.

Monitoring is an ongoing process

When we started adding monitoring to Travis CI, we started small. But we quickly realized which metrics really matter and which parts of the application and the infrastructure around it need more insight, more metrics, more logging.

With every new component deployed to production, new metrics need to be maintained, and more logging and new alerting need to be put in place.

The same is true for new parts of the infrastructure. With every new system or service added, new data needs to be collected to ensure the service is running smoothly.

A lot of the experience of which metrics are important and which aren't is something that develops over time. Metrics can come and go; the requirements for metrics are subject to change, just as they are for code.

As you add new metrics, old metrics might become less useful, or you need more metrics in other parts of the setup to make sense of the new ones.

It's a constant process of refining the data you need to have the best possible insight into a running system.

Monitoring can affect production systems

The more data you collect, with higher and higher resolution, the more you run the risk of affecting a running system. Business metrics regularly pulled from the database can become a burden on the database that's supposed to serve your customers.

Pulling data out of running systems is a traditional approach to monitoring, one that's unlikely to go away any time soon. However, it's an approach that's less and less feasible as you increase resolution of your data.
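To make the pull approach concrete, here's a minimal sketch of a collector that periodically runs a count query against the production database. I'm assuming PostgreSQL here, and the table and column names are made up:

    import time
    import psycopg2  # assuming a PostgreSQL database

    def collect_business_metrics(dsn):
        """Pull a business metric straight from the production database.

        Every query here competes with the queries serving customers,
        which is why this gets expensive as the resolution goes up.
        """
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                # Hypothetical table and column names, for illustration only.
                cur.execute("SELECT count(*) FROM builds WHERE state = 'created'")
                (pending_builds,) = cur.fetchone()
            return {"builds.pending": pending_builds}
        finally:
            conn.close()

    if __name__ == "__main__":
        while True:
            print(collect_business_metrics("dbname=travis"))
            time.sleep(60)  # once a minute; higher resolution means more load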

Guaranteeing that this collection process is low on resources is hard. It's even harder to get a system up and running that can handle high-resolution data from a lot of services sent concurrently.

So new approaches have started to pop up to tackle this problem. Instead of pulling data from running processes, the processes themselves collect data and regularly push it to aggregation services which in turn send the data to a system for further aggregation, graphing, and the like.

StatsD is without a doubt the most popular one, and it has sparked a ton of forks in different languages.

Instead of relying on TCP with its long connection handshakes and timeouts, StatsD uses UDP. The processes sending data to it stuff short messages into a UDP socket without worrying about whether or not the data arrives.

If some data doesn't make it because of network issues, that only leaves a small dent. It's more important for the system to serve customers than for it to wait around for the aggregation service to become available again.
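The StatsD wire protocol is simple enough that the fire-and-forget idea fits in a few lines. Here's a minimal sketch of a client; the real client libraries add sampling rates, namespacing and more on top of this:

    import socket

    class StatsdClient:
        """Minimal fire-and-forget StatsD client over UDP."""

        def __init__(self, host="127.0.0.1", port=8125):
            self.addr = (host, port)
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

        def incr(self, bucket, count=1):
            # Counter, e.g. "requests.signup:1|c"
            self._send(f"{bucket}:{count}|c")

        def timing(self, bucket, ms):
            # Timer in milliseconds, e.g. "db.query.builds:42|ms"
            self._send(f"{bucket}:{ms}|ms")

        def gauge(self, bucket, value):
            # Gauge, e.g. "queue.depth:128|g"
            self._send(f"{bucket}:{value}|g")

        def _send(self, message):
            try:
                self.sock.sendto(message.encode("ascii"), self.addr)
            except OSError:
                # Fire and forget: if the packet is lost or the aggregator
                # is down, serving the customer still comes first.
                pass

    # statsd = StatsdClient()
    # statsd.incr("requests.signup")
    # statsd.timing("db.query.builds", 42)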

While StatsD solves the problem of easily collecting and aggregating data without affecting production systems, there's now the problem of being able to inspect the high-resolution data in meaningful ways. Historical analysis and alerting on high-resolution data becomes a whole new challenge.

Riemann has popularized looking at monitoring data as a stream, to which you can apply queries, and form reactions based on those queries. You can move the data window inside the stream back and forth, so you can compare data in a historical context before deciding on whether it's worth an alert or not.

Systems like StatsD and Riemann make it a lot easier for systems to aggregate data without having to rely on polling. Services can just transmit their data without worrying much about how and where it's used for other purposes like log aggregation, graphing or alerting.

The important realization is that with increasing need for scalability and distributed systems, software needs to be built with monitoring in mind.

Imagine a RabbitMQ that, instead of you having to poll data from it, sends its metrics as messages at a configurable interval to a configurable fanout exchange. You can choose to consume the data and submit it to a system like StatsD or Riemann, or you can ignore it and the broker will just discard the data.
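RabbitMQ doesn't do this out of the box, but the consuming side of that idea could look something like the following sketch, using the pika client and a made-up "broker.metrics" fanout exchange:

    import json
    import pika  # RabbitMQ client library

    # Hypothetical: the broker publishes its own metrics to a fanout
    # exchange called "broker.metrics" at a configurable interval.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange="broker.metrics", exchange_type="fanout")

    # An exclusive queue, deleted when the consumer goes away. If nobody
    # binds one, the metrics messages are simply discarded by the broker.
    result = channel.queue_declare(queue="", exclusive=True)
    channel.queue_bind(exchange="broker.metrics", queue=result.method.queue)

    def forward(ch, method, properties, body):
        metrics = json.loads(body)
        # Hand the numbers off to StatsD, Riemann, or whatever you use.
        print(metrics)

    channel.basic_consume(queue=result.method.queue,
                          on_message_callback=forward,
                          auto_ack=True)
    channel.start_consuming()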

Who's monitoring the monitoring?

Another catch with monitoring is that it needs to be reliable. For it to be fully reliable it needs to be monitored. Wait, what?

Every process that is required to aggregate metrics, to trigger alerts, to analyze logs needs to be running for the system to work properly.

So monitoring in turn needs its own supervision to make sure it's working at all times. As monitoring grows it requires maintenance and operations to take care of it.

Which makes it a bit of a burden for small teams.

Lots of new companies have sprung up to serve this need. Instead of having to worry about running services for logs, metrics and alerting yourself, that can be left to companies who are more experienced in running them.

Librato Metrics, Papertrail, OpsGenie, LogEntries, Instrumental, NewRelic, DataDog, to name a few. Other companies take the burden of having to run your own Graphite system away from you.

It's been interesting to see new companies pop up in this field, and I'm looking forward to seeing this space develop. The competition from the commercial space is bound to trigger innovation and improvements on the open source front as well.

We're heavy users of external services for log aggregation, collecting metrics and alerting. Simply put, they know better how to run that platform than we do, and it allows us to focus on delivering the best possible customer value.

Monitoring is getting better

Lots of new tools have sprung up in the last two years. While development on them started earlier than that, the most prominent tools are probably Graphite and Logstash. Cubism brings new ideas on how to visualize time series data; it's one of the several dozen dashboards that Graphite's existence and the flexibility of its API have sparked. Tasseo is another one of them, a successful experiment in having an at-a-glance dashboard with the most important metrics in one convenient overview.

It'll still be a while until we see the ancient tools like Nagios, Icinga and others improve, but the competition is ramping up. Sensu is one open source alternative to keep an eye on.

I'm looking forward to seeing how the monitoring space evolves over the next two years.

Over the last year, as we started turning Travis CI into a hosted product, we added a ton of metrics and monitoring. While we started out slow, we soon figured out which metrics are key and which are necessary to monitor the overall behavior of the system.

I built us a custom collector that rakes in metrics from our database and from the API exposed by RabbitMQ. It soon dawned on me that these are our core metrics, and that they need more than graphs: we need to be alerted when they cross thresholds.
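The RabbitMQ half of such a collector can be fairly small, since the management plugin exposes an HTTP API that lists every queue along with its message backlog. A sketch, with the host and credentials made up:

    import requests  # assuming the RabbitMQ management plugin is enabled

    def queue_depths(host="localhost", user="guest", password="guest"):
        """Ask the management API for every queue and its message backlog."""
        resp = requests.get(f"http://{host}:15672/api/queues",
                            auth=(user, password), timeout=5)
        resp.raise_for_status()
        return {q["name"]: q.get("messages", 0) for q in resp.json()}

    # These numbers make natural gauges, with alerts whenever a queue's
    # backlog stays above a threshold for too long.
    # for name, depth in queue_depths().items():
    #     statsd.gauge(f"rabbitmq.queues.{name}.messages", depth)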

The first iteration of that dumped alerts into Campfire. Given that we're a small team and the room might be empty at times, that was just not sufficient for an infrastructure platform that's used by customers and open source projects around the world, at any time of the day.

So we added alerting, by way of OpsGenie. It's set up to trigger alerts via iPhone push notifications and escalations via SMS, should an alert not have been acknowledged or closed within 10 minutes. Eventually, escalation needs to be done via voice calls so that someone really picks up. It's easy to miss a vibrating iPhone when you're sound asleep, but much harder when it keeps on vibrating until someone picks up.

A Pager for every Developer

Just recently I read an interview with Werner Vogels on architecture and operations at Amazon. He said something that stuck with me: "You build it, you run it."

That got me thinking. Should developers of platforms be fully involved in the operations side of things?

A quick survey on Twitter showed that there are some companies where developers are paged when there are production issues, others fully rely on their operations team.

There's merit to both, but I can think of a few reasons why developers should be carrying a pager just like operations does.

You stay connected to what your code does in production. When code is developed, the common tool to manage expectations is to write tests. Unfortunately, no unit test, no integration test will be fully able to reproduce circumstances of what your code is doing in production.

You start thinking about your code running. Reasoning about what a particular piece of code is doing under specific production circumstances is hard, but not entirely impossible. When you're the one responsible for having it run smoothly and serve your customers, this goes up to a whole new level.

Metrics, instrumentation, alerting, logging and error handling suddenly become a natural part of your coding workflow. You start making your software more operable, because you're the one who has to run it. While software should be easy to operate in any circumstances, it commonly isn't. When you're the one having to deal with production issues, that suddenly has a very different appeal.

Code beauty is suddenly a bit less important than making sure your code can handle errors, timeouts and increased latencies. There's an ironic twist in that. Code that's resilient to production issues might not have a pretty DSL, it might not be the most beautiful code, but it may be able to sustain whatever issue is thrown at it.

Last, when you're responsible for running things in production, you're forced to learn about the entire stack of an application, not just the code bits, but its runtime, the host system, hardware, network. All that turns into something that feels a lot more natural over time.

I consider that a good thing.

There'll always be situations where something needs to be escalated to the operations team, with their deeper knowledge of the hardware, network and the like. But if code breaks in production, and it affects customers, developers should be at the front of fixing it, just like the operations team.

Even more so for teams that don't have any operations people on board. At some point, a simple exception tracker just doesn't cut it anymore, especially when no one gets paged on critical errors.

Being On Call

For small teams in particular, there's a pickle that needs to be solved: who gets up in the middle of the night when an alert goes off?

When you have just a few people on the team, like your average bootstrapping startup, does an on call schedule make sense? This is something I haven't fully figured out yet.

We're currently in the fortunate position that one of our team members is in New Zealand, but we have yet to find a good way to assign on call when he's out or for when he's back on this side of the world.

The folks at dotCloud have written about their schedule, thank you! Hey, you should share your pager and on-call experiences too!

Currently we have a first come first serve setup. When an alert comes in and someone sees it, it gets acknowledged and looked into. If that involves everyone coming online, that's okay for now.

However, it's not an ideal setup, because being able to handle an alert means being able to log into remote systems, restart apps, inspect the database, look at the monitoring charts. Thanks to the iPhone and iPad, most of that is already possible today.

But to be fully equipped to handle any situation, it's good to have a laptop at hand.

This brings up the question: who's carrying a laptop and when? Which in turn means that some sort of on-call schedule is still required.

We're still struggling with this, so I'd love to read more about how other companies and teams handle it.

Playbooks

During a recent hangops discussion, there was a chat about developers being on call. It brought up an interesting idea, a playbook on how to handle specific alerts.

It's a document explaining things to look into when an alert comes up. Ideally, an alert already includes a link to the relevant section in the book. This is something operations and developers should work on together to make sure all fronts are covered.
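One lightweight way to do that is to wire the playbook link into the alert itself. Here's a sketch of that idea; the alert names, section anchors and URL are all made up:

    # Hypothetical mapping from alert names to playbook sections, so that
    # whoever gets paged lands directly on the steps to take.
    PLAYBOOK_URL = "https://wiki.example.com/playbook"

    PLAYBOOK_SECTIONS = {
        "rabbitmq.queue.backlog": "rabbitmq-queue-backlog",
        "api.error_rate": "api-error-rate",
    }

    def build_alert(name, message):
        """Attach the relevant playbook section to an outgoing alert."""
        section = PLAYBOOK_SECTIONS.get(name, "general-checklist")
        return {
            "message": message,
            "description": f"Playbook: {PLAYBOOK_URL}#{section}",
        }

    # build_alert("rabbitmq.queue.backlog",
    #             "builds queue above 1000 messages for 10 minutes")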

It takes away some of the scare of being on call, as you can be sure there's some guidance when an issue comes up.

It also helps refine monitoring and alerts and make sure there are appropriate measures available to handle any of them. If there are not, that part needs improving.

I'm planning on building a playbook for Travis as we go along and refine our monitoring and alerts; it's a neat idea.

Sleepless in Seattle

There's a psychological side to being on call that takes a lot of getting used to: the thought that an alert could go off at any time. While that's a natural thing, as failures do happen all the time, it can easily mess with your head. It certainly did for me.

Lying in bed, not being able to sleep, because your mind is waiting for an alert, it's not a great feeling. It takes getting used to. It's also why having an on-call schedule is preferable over an all hands scenario. When only one person is on call, team mates can at least be sure to get a good night's sleep. As the schedule should be rotating, everyone gets to have that luxury on a regular basis.

It does one thing though: it pushes you to make sure alerts only go off for relevant issues. Not everything needs to be fixed right away; some issues could be taken care of by improving the code, others are only temporary fluctuations caused by increased network latency and will resolve themselves after just a few minutes. Alerting someone on every other exception raised doesn't cut it anymore; alerts need to be concise and only be triggered when the error is severe enough and affects customers directly. Getting this right is the hard part, and it takes time.

All that urges you to constantly improve your monitoring setup, to increase relevance of alerts, and to make sure that everyone on the team is aware of the issues, how they can come up and how they can be fixed.

It's a good thing.

Tags: operations