Metrics are a very useful tool. They allow you to track variances in your application's and infrastructure's behaviour over time.

They allow you to see what's happening, and to try and predict future trends.

They help you investigate during an outage to figure out which events led to the problem at hand.

Once you've identified which causes have led to the problem, a common solution is to add more detailed metrics so you can see in even more detail what is happening. Should a similar problem come up again, you'll want to be prepared.

There's a fallacy in all this. It's the fallacy of too many metrics.

With every metric you add, you're not done by simply tracking more data.

You need to:

  • make sure the metric is visible and its meaning is documented and understood
  • decide whether this metric is critical
  • figure out the threshold at which it becomes critical
  • decide what kind of alert to send when the threshold is reached
  • decide who to alert
  • decide how to respond to an alert
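
One way to keep these decisions from being skipped is to write them down right next to the metric. Here's a minimal sketch in Python of what such a record might look like. Every name in it (MetricDefinition, its fields, the example metric) is made up for illustration and simply mirrors the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetricDefinition:
    """Hypothetical record of the decisions that come with one metric."""
    name: str
    meaning: str                       # what the number represents, in plain words
    critical: bool = False             # is it worth alerting on at all?
    threshold: Optional[float] = None  # value at which it becomes critical
    alert_kind: Optional[str] = None   # e.g. "page" or "email"
    notify: Optional[str] = None       # who gets the alert
    runbook: Optional[str] = None      # how to respond to it

    def open_questions(self) -> List[str]:
        """Return the decisions that still have not been made."""
        missing = []
        if not self.meaning:
            missing.append("meaning")
        if self.critical:
            for field_name in ("threshold", "alert_kind", "notify", "runbook"):
                if getattr(self, field_name) is None:
                    missing.append(field_name)
        return missing


# A critical metric with a threshold, but nobody has decided yet
# what alert to send, to whom, and what to do about it.
queue_depth = MetricDefinition(
    name="jobs.queue.depth",
    meaning="number of jobs waiting to be picked up",
    critical=True,
    threshold=500,
)
print(queue_depth.open_questions())
```

If the list of open questions never gets shorter for a metric, that's a hint it may not be worth tracking.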

With more metrics being added, business metrics too (think Google Analytics, cohort, retention, customer and funnel metrics), these decisions need to be made. If a metric isn't worth alerting on, is it worth tracking in the first place?

Humans are great at detecting patterns visually, but they're terrible as monitors constantly watching a pile of graphs. I don't know about you, but as much as I love pretty graphs, I don't want to watch them all day long to detect anomalies.

That's putting aside the fact that our current idea of graphs is far from being visually pleasing. I fondly remember Neil J. Gunther calling them "gothic graphs".

Metrics increase the cognitive effort required to make sense of them. The more you add, the more effort a human or a team of humans needs to put into extracting useful information from them. I highly recommend John Allspaw's Monitorama talk on this topic and the follow-up post.

The answer, quite frequently, is simplification. A bunch of metrics get rolled into a more abstract value and that is displayed and alerted on instead. Unfortunately with simplification, nuances in the data get lost easily.
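
To make the loss of nuance concrete, here's a tiny Python sketch with made-up response times. Both sets of numbers are hypothetical and were picked so that the rolled-up value comes out the same.

```python
from statistics import mean

# Hypothetical response times in milliseconds for two one-minute windows.
steady = [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
spiky = [50, 50, 50, 50, 50, 50, 50, 50, 50, 550]

# The rolled-up value (the mean) is identical for both windows...
print(mean(steady), mean(spiky))

# ...while the worst request a user actually saw is very different.
print(max(steady), max(spiky))
```

A dashboard that only shows the mean can't tell these two minutes apart, even though one of them contains a request that took more than half a second.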

Add to that our very own biases that make us interpret the data in ways that let us find the simplest explanation, and you've got a recipe for disaster.

Monitoring and finding the right metrics are an ongoing process. Don't be afraid to remove data just as you add new data.

Remember that as you add more information, you also need more cognitive power to process the data during daily operations and, more importantly, during the stressful times of an outage, when we're most prone to our biases taking over rather than rational decisions.

Tags: monitoring

Hi, I'm Mathias, and I'm a developer. Unlike a lot of you at this conference, I'm far from being a monitoring expert. If anything, I'm a user, a tinkerer of all the great tools we're hearing about at this conference.

I help run a little continuous integration service called Travis CI. For that purpose I built several home-baked things that help us collect metrics and trigger alerts.

I want to start with a little story. I spend quality time at coffee shops and I enjoy peeking over the shoulders of the guy who's roasting coffee beans. Next to the big roasting machine they commonly have a laptop with pretty graphs showing how the temperature in the roaster changes over time. On two occasions I found myself telling them: "Hey cool, I like graphs too!"

On the first occasion I looked at the graph and noticed that it'd update itself every 2-3 seconds. I mentioned that to the roaster and he said: "Yeah, I'd really love it if it could update every second." In just two seconds the temperature in the roaster can already drop by almost a degree (Celsius), so he was lacking the granularity to get the best insight into his system.

The second roaster did have one second resolution, and I swooned. But I noticed that every minute or so, he wrote down the current temperature on a sheet of paper. The first guy had done that too. I was curious why they'd do that. He told me that he took it as his reference sheet for the next roasting batch. I asked why he didn't have the data stored in the system. He replied that he didn't trust it enough, because if it lost the information he wouldn't have a reference for his next roasting sheet.

He also keeps a set of coffee bean samples around from previous roasts, roasts where the outcome is known to have resulted in a great roasting result. Even coffee roasters have confirmation bias, though to be fully fair, when you're new to the job, any sort of reference can help you move forward.

This was quite curious. They had the technology yet they didn't trust it enough with their data. But heck, they had one-second resolution and they had the technology to measure data from live sensors in real time.

During my first jobs as a developer touching infrastructure, five minute collection intervals and RRDtool graphs were still very much en vogue. My alerts basically came from Monit throwing unhelpful emails at me stating that some process just changed from one state to another.

Since my days with Munin a lot has changed. We went through the era of #monitoringsucks, which, fortunately, quickly turned into the era of #monitoringlove. It's been pretty incredible watching this progress as someone who loves tinkering with new and shiny tools and visualization possibilities. We've seen the emergence of crazy new visualization ideas, like the horizon chart, and we've seen the steady rise of using modern web technologies to render charts, while seeing RRDtool being taken to the next level to visualize time series data.

New approaches providing incredibly detailed insight into network traffic and providing stream analysis of time series data have emerged.

One second resolution is what we're all craving, looking at beautiful and constantly updating charts of 95th percentile values.

And yet, how many of you are still using Nagios?

There are great advances in monitoring at the moment, and I enjoy watching them as someone who greatly benefits from them.

Yet, I'm worried that all these advances still don't focus enough on the single thing that's supposed to use them: humans.

There's lots of work going on to solve problems to make monitoring technology more accessible, yet I feel like we haven't solved the first problem at hand: to make monitoring something that's easy to get into for people new to the field.

Monitoring still involves a lot of looking at graphs, correlating several different time series after the fact, and figuring out and checking for thresholds to trigger alerts. In the end, you still find yourself looking at one or more graphs trying to figure out what the hell it means.

Tracking metrics has become very popular, thanks to Coda Hale's metrics library, which inspired a whole slew of libraries for all kinds of languages, and tools like StatsD, which made it very easy to throw any kind of metric at them and have it pop up in a system like Graphite, Librato Metrics, Ganglia, etc.

Yet the biggest question that I get every time I talk to someone about monitoring, in particular people new to the idea, is: "what should I even monitor?"

With all the tools we have at hand, helping people to find the data that matters for their systems is still among the biggest hurdles that must be conquered to actually make sense of metrics.

Can we do a better job of educating people about what they should track, what they could track, and how they can figure out the most important metrics for their system? It took us six months to find the single metric that best reflects the current state of our system. I called it the soul metric, the one metric that matters most to our users and customers.

We started tracking the time since the last build was started and since the last build was finished.

On our commercial platform, where customers run builds for their own products and customer projects, the weekend is very quiet. We only run one tenth of the number of builds on a Sunday compared to a normal weekday. Sometimes we don't run any build in 60 minutes. Suddenly checking when a build was last triggered makes a lot less sense.

Suddenly we're confronted with the issue that we need to look at multiple metrics in the same context to see if a build should even have been started, as the fact itself is solely based on a customer pushing code. We're suddenly looking at measuring the absence of data (no new commits) and correlating it with data derived from several attributes of the system, like no running builds and no build request being processed.

The only reasonable solution I could come up with, and it's mostly thanks to talking to Eric from Papertrail, is this: if you need to measure something but it requires the existence of an activity, you have to make sure this activity is generated on a regular basis.
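
Here's a rough sketch in Python of that idea, assuming a canary that generates activity at a fixed interval. The class name, the interval and the threshold are all invented for illustration; this is not how our actual setup is implemented.

```python
from typing import Optional


class ActivityMonitor:
    """Tracks when activity was last seen and flags silence.

    Hypothetical sketch: `max_silence` must be larger than the interval
    at which the canary generates activity.
    """

    def __init__(self, max_silence: float) -> None:
        self.max_silence = max_silence
        self.last_seen: Optional[float] = None

    def record(self, timestamp: float) -> None:
        """Call for every activity, real or generated by the canary."""
        self.last_seen = timestamp

    def is_stale(self, now: float) -> bool:
        """True if nothing has been seen for longer than max_silence."""
        if self.last_seen is None:
            return True
        return now - self.last_seen > self.max_silence


# Assume a canary build is triggered every 10 minutes (600 seconds).
# 15 minutes of silence then means something is stuck, even on a
# quiet Sunday when no customer pushes code.
monitor = ActivityMonitor(max_silence=900)
monitor.record(timestamp=0)         # canary build started
print(monitor.is_stale(now=600))    # still within the allowed silence
print(monitor.is_stale(now=1000))   # silent for too long
```

With regular activity guaranteed, "time since the last build" stops depending on customers pushing code, so a plain threshold becomes meaningful again.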

In hindsight, it's so obvious, though it brings up a question: if the thing that generates the activity fails, does that mean the system isn't working? Is this worth an alert, is this worth waking someone up for? Certainly not.

This leads to another interesting question: if I need to create activity to measure it, and if my monitoring system requires me to generate this activity to be able to put a graph and an alert on it, isn't my monitoring system wrong? Are all the monitoring systems wrong?

If a coffee roaster doesn't trust his tools enough to give him a consistent insight into the current, past and future roasting batches, isn't that a weird mismatch between humans and the system that's supposed to give them the assurance that they're on the right path?

A roaster still trusts his instincts more than he trusts the data presented to him. After all, it's all about the resulting coffee bean.

Where does that take us and the current state of monitoring?

We spend an eternity looking at graphs, right after an alert was triggered because a certain threshold was crossed. Does that alert even mean anything, is it important right now? It's where a human operator still has to decide if it's worth the trouble or if they should just ignore the alert.

As much as I enjoy staring at graphs, I'd much rather do something more important than that.

I'd love for my monitoring system to be able to tell me that something out of the ordinary is currently happening. It has all the information at hand to make that decision at least with a reasonable probability.

But much more than that, I'd like our monitoring system to be built for humans, reducing the barrier of entry for adding monitoring and metrics to an application and to infrastructure without much hassle. How will we get there?

Looking at the current state of monitoring, there's a strong focus on technology, which is great, because it helps solve bigger issues like data storage, visualization and presentation, and stream analysis. I'd love to see this all converge on the single thing that has to make the call in the end: a human. Helping them make a good decision and getting there should be very high on our list.

There is a fallacy in this wish though. With more automation comes a cognitive bias to trust what the system is telling me. Can the data presented to me be fully trusted? Did the system actually make the right call in sending me an alert? This is only something a human can figure out, just as a coffee roaster needs to trust his instincts even though the variables for every roast are slightly different.

We want to avoid our users having to keep a piece of paper around that tells them exactly what happened the last time this alert was triggered. We want to make sure they don't have to look at samples of beans at different stages to find confirmation for the problem at hand. If the end user always looks at previous samples of data to compare them to the most recent one, the only thing they'll look for is confirmation.

Lastly, the interfaces of the monitoring tools we work with every day are designed to be efficient, they're designed to dazzle with visualization, yet they're still far from being easy to use. If we want everyone in our company to be able to participate in running a system in production, we have to make sure we provide them with interfaces that treat them as what they are: people.

But most importantly, I'd like to see the word spread on monitoring and metrics, making our user interfaces more accessible and telling the tale of how we monitor our systems and how other people can monitor theirs. There's a lot to learn from each other, and I love things like hangops and OpsSchool, they're great starts to get the word out.

Because it's easier to write things down to realize where you are, to figure out where you want to be.

Two years ago, I wrote about the virtues of monitoring. A lot has changed, a lot has improved, and I've certainly learned a lot since I wrote that initial overview on monitoring as a whole.

There have been a lot of improvements to existing tools, and new players entered the market of monitoring. Infrastructure as a whole got more and more interesting for service businesses built around it.

On the other hand, awareness for monitoring, good metrics, logging and the like has been rising significantly.

At the same time #monitoringsucks raised awareness that a lot of monitoring tools are still stuck in the late nineties when it comes to user interface and the way they work.

Independent of new and old tools, I've had the pleasure of learning a lot more about the real virtues of monitoring, about how it affects daily work and how it evolves over time. This post is about discussing some of these insights.

Monitoring all the way down

When you start monitoring even just small parts of an application, the need for more detail and for information about what's going on in a system arises quickly. You start with an innocent number of application level metrics, add metrics for database and external API latencies, start tracking system level and business metrics.

As you add monitoring to one layer of the system, the need to get more insight into the layer below comes up sooner or later.

One layer has just been tackled recently in a way that's accessible for anyone: communication between services on the network. Boundary has built some pretty cool monitoring stuff that gives you incredibly detailed insight into how services talk to each other, by way of their protocol, how network traffic from inside and outside a network develops over time, and all that down to the second.

The real time view is pretty spectacular to behold.

If you go down even further on a single host, you get to the level where you can monitor disk latencies.

Or you could measure the effect of screaming at a disk array of a running system. dtrace is a pretty incredible tool, and I hope to see it spread and become widely available on Linux systems. It allows you to inject instrumentation into arbitrary parts of the host system, making it possible to measure any system call without a lot of overhead.

Heck, even our customer support tool allows us to track metrics for response times, and for how many tickets each staff member handled and for how long.

It's easy to start obsessing about monitoring and metrics, but there comes a time when you either realize that you've obsessed for all the right reasons, or you add more monitoring.

Mo' monitoring, mo' problems

The crux of monitoring more layers of a system is that with more monitoring, you can and will detect more issues.

Consider Boundary, for example. It gives you insight into a layer you haven't had insight into before, at least not at that granular level. For example, round trip times of liveness traffic in a RabbitMQ cluster.

This gives you a whole new pile of data to obsess about. It's good because that insight is very valuable. But it requires more attention, and more issues require investigation.

You also need to learn how a system behaving normally is reflected in those new systems, and what constitutes unusual behaviour. It takes time to learn and to interpret the data correctly.

In the long run though, that investment is well worth it.

Monitoring is an ongoing process

When we started adding monitoring to Travis CI, we started small. But we quickly realized what metrics really matter and what parts of the application and the infrastructure around it need more insight, more metrics, more logging.

With every new component deployed to production, new metrics need to be maintained, more logging and new alerting need to be put in place.

The same is true for new parts of the infrastructure. With every new system or service added, new data needs to be collected to ensure the service is running smoothly.

A lot of the experience of which metrics are important and which aren't is something that develops over time. Metrics can come and go; the requirements for metrics are subject to change, just as they are for code.

As you add new metrics, old metrics might become less useful, or you need more metrics in other parts of the setup to make sense of the new ones.

It's a constant process of refining the data you need to have the best possible insight into a running system.

Monitoring can affect production systems

The more data you collect, with higher and higher resolution, the more you run the risk of affecting a running system. Business metrics regularly pulled from the database can become a burden on the database that's supposed to serve your customers.

Pulling data out of running systems is a traditional approach to monitoring, one that's unlikely to go away any time soon. However, it's an approach that's less and less feasible as you increase resolution of your data.

Guaranteeing that this collection process is low on resources is hard. It's even harder to get a system up and running that can handle high-resolution data from a lot of services sent concurrently.

So new approaches have started to pop up to tackle this problem. Instead of pulling data from running processes, the processes themselves collect data and regularly push it to aggregation services which in turn send the data to a system for further aggregation, graphing, and the like.

StatsD is without a doubt the most popular one, and it has sparked a ton of forks in different languages.

Instead of relying on TCP with its long connection handshakes and timeouts, StatsD uses UDP. The processes sending data to it stuff short messages into a UDP socket without worrying about whether or not the data arrives.

If some data doesn't make it because of network issues, that only leaves a small dent. It's more important for the system to serve customers than for it to wait around for the aggregation service to become available again.
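
To show how little is involved on the sending side, here's a minimal Python sketch of a fire-and-forget counter. It assumes a StatsD daemon listening on localhost on its default port 8125; the helper names and the metric name are made up, and real client libraries do more than this (sampling, batching, timers).

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter in StatsD's plain-text line format: name:value|c"""
    return f"{name}:{value}|c".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire and forget: no handshake, no reply, no waiting."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (host, port))
    except OSError:
        # Losing a data point is preferable to disturbing the application.
        pass


# Hypothetical metric name; nothing breaks if no daemon is listening.
send_metric(format_counter("builds.started"))
```

The sender never learns whether the datagram arrived, which is exactly the trade-off described above.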

While StatsD solves the problem of easily collecting and aggregating data without affecting production systems, there's now the problem of being able to inspect the high-resolution data in meaningful ways. Historical analysis and alerting on high-resolution data become a whole new challenge.

Riemann has popularized looking at monitoring data as a stream, to which you can apply queries, and form reactions based on those queries. You can move the data window inside the stream back and forth, so you can compare data in a historical context before deciding on whether it's worth an alert or not.
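
The general idea of comparing a new value against a window of recent history can be sketched in a few lines of Python. To be clear, this is not Riemann's API or query language, just an illustration; the class, the factor of two and the sample stream are all assumptions.

```python
from collections import deque
from statistics import mean


class SlidingWindow:
    """Keeps the last `size` values of a stream for comparison."""

    def __init__(self, size: int) -> None:
        self.values = deque(maxlen=size)

    def is_unusual(self, value: float, factor: float = 2.0) -> bool:
        """Compare a new value against the window's mean, then store it.

        Nothing is flagged until the window has filled up once.
        """
        window_full = len(self.values) == self.values.maxlen
        unusual = window_full and value > factor * mean(self.values)
        self.values.append(value)
        return unusual


window = SlidingWindow(size=5)
stream = [10, 12, 11, 9, 10, 11, 40, 10]  # made-up latencies
alerts = [value for value in stream if window.is_unusual(value)]
print(alerts)
```

Only the value that clearly sticks out from its recent history ends up in the list of alerts; everything else passes through silently.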

Systems like StatsD and Riemann make it a lot easier for systems to aggregate data without having to rely on polling. Services can just transmit their data without worrying much about how and where they're used for other purposes like log aggregation, graphing or alerting.

The important realization is that with increasing need for scalability and distributed systems, software needs to be built with monitoring in mind.

Imagine a RabbitMQ that, instead of you having to poll the data from it, sends its metrics as a message at a configurable interval to a configurable fanout. You can choose to consume the data and submit it to a system like StatsD or Riemann, or you can ignore it and the broker will just discard the data.

Who's monitoring the monitoring?

Another fallacy of monitoring is that it needs to be reliable. For it to be fully reliable it needs to be monitored. Wait, what?

Every process that is required to aggregate metrics, to trigger alerts, to analyze logs needs to be running for the system to work properly.

So monitoring in turn needs its own supervision to make sure it's working at all times. As monitoring grows it requires maintenance and operations to take care of it.

Which makes it a bit of a burden for small teams.

Lots of new companies have sprung into life serving this need. Instead of having to worry about running services for logs, metrics and alerting by themselves, it can be left to companies who are more experienced in running them.

Librato Metrics, Papertrail, OpsGenie, LogEntries, Instrumental, NewRelic, DataDog, to name a few. Other companies take the burden of having to run your own Graphite system away from you.

It's been interesting to see new companies pop up in this field, and I'm looking forward to seeing this space develop. The competition from the commercial space is bound to trigger innovation and improvements on the open source front as well.

We're heavy users of external services for log aggregation, collecting metrics and alerting. Simply put, they know better how to run that platform than we do, and it allows us to focus on delivering the best possible customer value.

Monitoring is getting better

Lots of new tools have sprung up in the last two years. While development on them started earlier than that, the most prominent tools are probably Graphite and Logstash. Cubism brings new ideas on how to visualize time series data; it's one of the several dozen dashboards that Graphite's existence, and its flexibility in offering an API, has sparked. Tasseo is another one of them, a successful experiment in having an at-a-glance dashboard with the most important metrics in one convenient overview.

It'll still be a while until we see the ancient tools like Nagios, Icinga and others improve, but the competition is ramping up. Sensu is one open source alternative to keep an eye on.

I'm looking forward to seeing how the monitoring space evolves over the next two years.

The need to measure everything that moves in a distributed system or even simple web apps is becoming the basis for thorough monitoring of an application.

However, there is one thing that's starting to get in the way of getting good measurements of all layers in a system: client libraries used to talk to network services, be it the database, an API, a message bus, anything that's bound to the intricate latency variances of the network stack.

Without full instrumentation of all parts of the application's stack, it's going to be very hard to figure out what exactly a problem boils down to. Measuring client access to a network service in addition to collecting data on the other end, e.g. the slow query log, allows you to pinpoint issues to the network, to increased latency, or to parsing responses.

If the other end is not under your control, it's just as important to have this data available. Having good metrics on request latencies to an external service, even a database hosted by a third party, gives you a minimum amount of confidence that while you maybe can't fix the underlying problem, you at least have the data to show where the problem is most likely to be. Useful data to have when approaching the third party vendor or hosting company about the issue.

Rails has set a surprisingly good example, by way of ActiveSupport::Notifications. Controller requests are instrumented just as database queries of any kind.

You can subscribe to the notifications and start collecting them in your own metrics tool. StatsD, Graphite and Librato Metrics are pretty great tools for this purpose.

There's not much a client library needs to do to emit measurements of network requests. The ones for Ruby could start by adding optional instrumentation based on AS::Notifications. That'd ensure that ActiveSupport itself doesn't turn into a direct dependency. I'd love to see the notifications bit being extracted into a separate library that's easier to integrate than pulling in the entire ActiveSupport ball of mud.

Node.js has EventEmitters, which are similar to AS::Notifications, and they lend themselves quite nicely for this purpose.

I've dabbled with this for riak-js, the Node.js library for Riak. There's an example that shows how to register and collect the metrics from the events emitted. The library itself just emits the events at the right spot, adds some timestamps so that event listeners can reconstruct the trail of a request.

It worked out pretty well and is just as easy to plug into a metrics library or to report measurements directly to StatsD.

The thing that matters is that any library for a network service you write or maintain, should have some sort of instrumentation built in. Your users and I will be forever grateful.
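
What that instrumentation can look like, sketched here in Python rather than Ruby or JavaScript: a tiny publish/subscribe hook that times a block and hands the duration to whoever is listening. This only borrows the idea behind AS::Notifications and EventEmitters; it is not their API, and the function names, the event name and the fake client call are all invented for the example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Callable, DefaultDict, Dict, List

Subscriber = Callable[[str, float, Dict], None]
_subscribers: DefaultDict[str, List[Subscriber]] = defaultdict(list)


def subscribe(event: str, callback: Subscriber) -> None:
    """Register a callback that receives (event, duration, payload)."""
    _subscribers[event].append(callback)


@contextmanager
def instrument(event: str, **payload):
    """Time the wrapped block and hand the duration to all subscribers."""
    started = time.perf_counter()
    try:
        yield
    finally:
        duration = time.perf_counter() - started
        for callback in _subscribers[event]:
            callback(event, duration, payload)


# Inside the (hypothetical) client library:
def fetch(key: str) -> str:
    with instrument("client.fetch", key=key):
        time.sleep(0.01)  # stands in for the network round trip
        return f"value-for-{key}"


# In the application: collect timings, or forward them to a metrics tool.
timings = []
subscribe("client.fetch", lambda event, duration, payload: timings.append((event, duration, payload)))
fetch("user:1")
print(timings)
```

The library only emits events; what happens with them is up to the application, so there's no hard dependency on any particular metrics tool.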

This goes both ways, too. Network servers need to be just as diligent in collecting and exposing data as the client libraries talking to them. Historically, though, a lot of servers already expose a lot of data, not always in a convenient format, but at least it's there.

Build every layer of your application and library with instrumentation in mind. Next time you have to tackle an issue in any part of the stack, you'll be glad you did.

Now go and measure everything!

Tags: monitoring

Over the last year I haven't only grown very fond of coffee, but also of infrastructure. Working on Scalarium has been a fun ride so far, for all kinds of reasons, one of them is dealing so much with infrastructure. Being an infrastructure platform provider, what can you do, right?

Since being responsible for deployment, performance tuning, monitoring, and infrastructure has always been a part of many of my jobs, I thought it'd be about time to sprinkle some of my thoughts and daily ops experiences over a couple of articles. The simple reason being that no matter how much you try, no matter how far away from dealing with servers you go (think Heroku), there will always be infrastructure, and it will always affect you and your application in some way.

On today's menu: monitoring. People have all kinds of different meanings for monitoring, and they're all right, because there is no one way to monitor your applications and infrastructure. I just did a recount, and there are no less than six levels of detail you can and probably should get. Note that these are my definitions, they don't necessarily have to be officially named, they're solely based on my experiences. Let's start from the top, the outside view of your application.

Availability Level

Availability is a simple measure to the user, either your site is available or it's not. There is nothing in between. When it's slow, it's not available. It's a beautifully binary measure really. From your point of view, any component or layer in your infrastructure could be the problem. The art is to quickly find out which one it is.

So how do you notice when your site is not available? Waiting for your users to tell you is an option, but generally a pretty embarrassing one. Instead you generally start polling some part of your site that's representative of it as a whole. When that particular page is not available, your whole application may as well not be.

What that page should do is get a quick measure of the most important components of your site, check if they're available (maybe even with a timeout involved so you get an idea if a specific component is broken) and return the result. An external process can then monitor that page and notify you when it doesn't return the expected result. Make sure the page does a bit more than just return "OK". If it doesn't hit any of the major components in your stack, there's a chance you're not going to notice that e.g. your database is becoming unavailable.
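
Here's a rough Python sketch of the logic behind such a page, without the web part. The check functions are placeholders for whatever your stack needs (a trivial database query, a ping to the queue), and the names and timeout values are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CheckTimeout
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]], timeout: float = 2.0) -> Dict[str, str]:
    """Run all component checks in parallel with one shared deadline."""
    results = {}
    pool = ThreadPoolExecutor(max_workers=max(len(checks), 1))
    futures = {name: pool.submit(check) for name, check in checks.items()}
    deadline = time.monotonic() + timeout
    for name, future in futures.items():
        remaining = max(deadline - time.monotonic(), 0)
        try:
            results[name] = "ok" if future.result(timeout=remaining) else "failed"
        except CheckTimeout:
            results[name] = "timeout"   # the component hangs
        except Exception:
            results[name] = "failed"    # the check itself blew up
    pool.shutdown(wait=False)
    return results


def database_ok() -> bool:
    return True  # stands in for e.g. running a trivial query


def queue_ok() -> bool:
    time.sleep(0.5)  # simulates a component that doesn't answer in time
    return True


status = run_checks({"database": database_ok, "queue": queue_ok}, timeout=0.1)
healthy = all(state == "ok" for state in status.values())
print(status, healthy)
```

The page would return something other than the expected result whenever healthy is false, and reporting the per-component status alongside it tells you where to look first.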

You should run this process from a different host, but what do you do if that host is not available? Even as an infrastructure provider I like outsourcing parts of my own infrastructure. Here's where Pingdom comes into play. They can monitor a specific URL, TCP ports and whatnot from some two dozen locations across the planet and they randomly go through all of them, notifying you when your site is unavailable or the result doesn't match the expectations.


Business Level

These aren't necessarily metrics related to your application's or infrastructure's availability, they're more along the lines of what your users are doing right now, or have done over the last month. Think number of new users per day, number of sales in the last hour, or, in our case, number of EC2 instances running at any minute. Stuff like Google Analytics or click paths (using tools like Hummingbird, for example) in general also fall into this category.

These kinds of metrics may be more important to your business than to your infrastructure, but they're important nonetheless, and they could e.g. be integrated with another metrics collection tool, some of which we'll get to in a minute. Depending on what kind of data you're gathering they're also useful to analyze spikes in your application's performance.

This kind of data can be hard to track in a generic way. Usually it's up to your application to gather it and turn it into a format that's acceptable to a different tool that collects it. It's also usually very specific to your application and its business model.

Application Level

Digging deeper from the outsider's view, you want to be able to track what's going on inside of your application right now. What are the main entry points, what are the database queries involved, where are the hot spots, which queries are slow, what kinds of errors are being caused by your application, to name a few.

This will give you an overview of the innards of your code, and it's simply invaluable to have that kind of insight. You usually don't need much historical data in this area, just a couple of days worth will usually be enough to analyze problems in retrospect. It can't hurt to keep them around though, because growth also shows trends in potential application code hot spots or database queries getting slower over time.

To get an inside view of your application, services like New Relic exist. While their services aren't exactly cheap (most monitoring services aren't, no surprise here), they're invaluable. You can dig down from the Rails controller level to find the method calls and database queries that are slowest at a given moment in time (most likely you'll be wanting to check the data for the last hours to analyze an incident), digging deeper into other metrics from there. Here's an example of what it looks like.

[Screenshot: New Relic]

You can also use the Rails log file and tools like Request-log-analyzer. They can help you get started for free, but don't expect a similar, fine-grained level of detail like you get with New Relic. However, with Rails 3 it's become a lot easier to instrument code that's interesting to you and gather data on runtimes of specific methods yourself.

Other means are e.g. JMX, one of the neat features you get when using a JVM-based language like JRuby. Your application can continuously collect and expose metrics through a defined interface to be inspected or gathered by other means. JMX can even be used to call into your application from the outside, without having to go through a web interface.

Application level monitoring also includes exception reporting. Services like Exceptional or Airbrake Bug Tracker are probably the most well known in that area, though in higher price regions New Relic also includes exception reporting.

Process Level

Going deeper (closer to inception than you think) from the application level we reach the processes that serve your application. Application servers, databases, web servers, background processing, they all need a process to be available.

But processes crash. It's a bitter and harsh truth, but they do, for whatever reason. Maybe they consumed too many resources, causing the machine to swap, or the process simply crashed because the machine didn't have any memory left to allocate. Think of a memory-leaking Rails application server process or the last time you used RMagick.

Someone must ensure that the processes keep running, and that they don't consume more resources than they're allowed to, to ensure availability on that level. These tools are called supervisors. Give them a pid file and a process, running or not, and they'll make sure that it is. Whether a process counts as running can depend on multiple metrics: availability over the network, a file size (think log files) or simply the existence of the process. Supervisors also let you set some sort of grace period, so they'll retry a number of times with a timeout before actually restarting the process or giving up monitoring it altogether.
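The simplest of those checks, the existence of the process combined with a grace period, could be sketched like this. It's not how Monit or any other supervisor is implemented, just an illustration; `process_alive?` and `supervise` are hypothetical helpers, and the pid would normally be read from the pid file.

```ruby
# Returns true if a process with the given pid exists.
# Signal 0 doesn't deliver anything, it only checks for existence.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false # no such process
rescue Errno::EPERM
  true  # the process exists, but belongs to someone else
end

# Checks the process up to `attempts` times, waiting `interval` seconds
# between checks. Returns :running as soon as one check succeeds, or
# :restart_needed once the grace period is used up.
def supervise(pid, attempts: 3, interval: 1)
  attempts.times do |i|
    return :running if process_alive?(pid)
    sleep interval if i < attempts - 1
  end
  :restart_needed
end

# The current process is certainly alive.
puts supervise(Process.pid)
```

A real supervisor would go on to restart the process or alert someone when it gets `:restart_needed` back; that part is left out here.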

A good supervisor will also let you alert someone when the expected conditions move outside of their acceptable perimeter and a process had to be restarted. A classic in this area is Monit, but people also like God and Bluepill. On a lower level you have tools like runit or upstart, but their capabilities are usually built around a pid file and a process, and they don't extend to higher-level checks of system resources.

While I find the syntax of Monit's configuration to not be very aesthetically pleasing, it's proven to be reliable and has a very small footprint on the system, so it's our default on our own infrastructure, and we add it to most of our cookbooks for Scalarium, as it's installed on all managed instances anyway. It's a matter of preference.

Infrastructure/Server Level

Another step down from processes we reach the system itself. CPU and memory usage, load average, disk I/O, network traffic, are all traditional metrics collected on this level. The tools (both commercial and open source) in this area can't be counted. In the open source world, the main means to visualize these kinds of metrics is rrdtool. Many tools use it to graph data and to keep an aggregated data history around, using averages for hours, days or weeks to store the data efficiently.

This data is very important in several ways. For one, it will show you what your servers are doing right now, or in the last couple of minutes, which is usually enough to notice a problem. Second, the data collected is very useful to discover trends, e.g. memory usage increasing over time, swap usage increasing, or a partition running out of disk space. Any value constantly increasing over time is a good sign that you'll hit a wall at some point. Noticing trends will usually give you a good indication that something needs to be changed in your infrastructure or your application.
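As a rough illustration of spotting such a trend, here's a sketch that fits a straight line through a handful of disk usage samples and estimates when the partition would be full. The least-squares fit is an assumption made for the example, not something the tools mentioned here necessarily do, and the sample numbers are made up.

```ruby
# Least-squares slope of y over x: how much y grows per unit of x.
def slope(xs, ys)
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  covariance = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  variance = xs.sum { |x| (x - mean_x)**2 }
  covariance / variance
end

# Estimates how many days remain until usage reaches capacity,
# or nil if usage isn't growing.
def days_until_full(days, used_gb, capacity_gb)
  growth_per_day = slope(days, used_gb)
  return nil unless growth_per_day > 0
  (capacity_gb - used_gb.last) / growth_per_day
end

# Made-up sample: disk usage in GB, measured once a day.
days    = [0, 1, 2, 3, 4]
used_gb = [50, 52, 54, 56, 58]

puts days_until_full(days, used_gb, 100) # prints 21.0
```

Real usage rarely grows in a straight line, so treat the result as a hint to go look at the graph, not as a deadline.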


There are countless tools in this area: Munin (see screenshot), Nagios, Ganglia and collectd on the open source end, CloudKick, Circonus, Server Density and Scout on the paid service level, and an abundance of commercial tools on the very expensive end of server monitoring. I never really bother with the commercial ones, because I either resort to the open source tools or pay someone to take care of the monitoring and alerting for me on a service basis. Most of these tools will run some sort of agent on every system, collecting data in a predefined cycle and delivering it to a master process, or the master process picks up the data from the agents.

Again, it's a matter of taste. Most of the open source tools available tend to look pretty ugly on the UI part, but if the data and the graphs are all that matters to you, they'll do just fine. We do our own server monitoring using Server Density, but on Scalarium we resort to using Ganglia as an integrated default, because it's much more cost effective for our users, and given the elastic nature of EC2 it's much easier for us to add and remove instances as they come and go. In general I'm also a fan of Munin.

Most of them come with some sort of alerting that allows you to define thresholds which trigger the alerts. You'll never get the thresholds right the first time you configure them. Constantly keep an eye on them to get a picture of which values are normal, and which are indeed problem areas and require an alert to be triggered.
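Stripped of everything tool-specific, a threshold check boils down to something like the following sketch. The warning and critical values are hypothetical and exactly the kind of numbers you'd expect to adjust after watching real data for a while.

```ruby
# Hypothetical thresholds, in percent of disk space used.
THRESHOLDS = { warning: 80, critical: 90 }

# Maps a measured value to :ok, :warning or :critical.
def alert_level(value, thresholds = THRESHOLDS)
  return :critical if value >= thresholds[:critical]
  return :warning  if value >= thresholds[:warning]
  :ok
end

[42, 85, 97].each do |used_percent|
  puts "#{used_percent}% used: #{alert_level(used_percent)}"
end
```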

The beauty of these tools is that you can throw any metric at them you can think of. They can even be used to collect business level data, utilizing the existing graphing and even alerting capabilities.

Log Files

The much dreaded log file won't go out of style for a long time, that's for sure. Your web server, your database, your Rails application, your application server, your mail server, all of them dump more or less useful information into log files. They're usually the most immediate and up-to-date view of what's going on in your application, if you chose to actually log something. Rails applications traditionally seem to be less of a candidate here, but your background services sure are, as is any other service running on your servers. The log is the first to know when there are problems delivering email or your web server is returning an unexpected number of 500 errors.
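To give an idea of how little it takes to pull a number like that out of a log, here's a small sketch that counts responses per status code. It assumes lines in the common log format used by Apache and nginx; the sample lines and the `access.log` file name in the comment are made up.

```ruby
# Matches the status code in a common-log-format line: it's the
# three-digit number right after the quoted request.
STATUS_PATTERN = /" (\d{3}) /

# Counts responses per status code in the given log lines.
def status_counts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    match = STATUS_PATTERN.match(line)
    counts[match[1].to_i] += 1 if match
  end
  counts
end

# Made-up sample lines; in practice you'd pass File.foreach("access.log").
sample = [
  '127.0.0.1 - - [10/Oct/2011:13:55:36 +0200] "GET / HTTP/1.1" 200 2326',
  '127.0.0.1 - - [10/Oct/2011:13:55:37 +0200] "GET /orders HTTP/1.1" 500 312',
  '127.0.0.1 - - [10/Oct/2011:13:55:38 +0200] "POST /login HTTP/1.1" 500 312'
]

counts = status_counts(sample)
puts "500 errors: #{counts[500]}" # prints "500 errors: 2"
```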

The biggest problem however is aggregating the log data, centralized logging if you will. syslog and all the alternative tools are traditionally sufficient, while on the larger scale end you have custom tools like Cloudera's Flume or Facebook's Scribe. There's also a bunch of paid services specializing in logging; the most noteworthy are Splunk and Loggly. Loggly relies on syslog to collect and transmit data from your servers, but they also have a custom API to transmit data. The data is indexed and can easily be searched, which is usually exactly what you want to do with logs. Think about the last time you grepped for something in multiple log files, trying to narrow down the data found to a specific time frame.

There are a couple of open source tools available too. Graylog2 is a syslog server with a MongoDB backend and a Java server to act as a syslog endpoint, and a web UI allowing nicer access to the log data. A bit more kick-ass is logstash, which uses RabbitMQ and ElasticSearch for indexing and searching log data. Almost like a self-hosted Loggly.

When properly aggregated, log files can show trends too, but aggregating them gets much harder the more log data your infrastructure accumulates.

ZOMG! So much monitoring, really?

Infrastructure purists would start by saying that there's a difference between monitoring, metrics gathering and log files. To me, they're a similar means to a similar end. It doesn't exactly matter what you call it, the important thing is to collect and evaluate the data.

I'm not suggesting you need every single kind of logging, monitoring and metrics gathering mentioned here. There is however one reason why eventually you'll want to have most if not all of them. At any incident in your application or infrastructure, you can correlate all the available data to find the real reason for a downtime, a spike or slow queries, or problems introduced by recent deployments.

For example, your site's performance is becoming sluggish in certain areas, and users start complaining. Application level monitoring indicates specific actions taking longer than usual, pointing to a specific query. Server monitoring for your database master indicates an increased number of I/O waits, usually a sign that too much data is read from or written to disk. The simplest reason could be a missing index, or that your data doesn't fit into memory anymore and too much of it is swapped out to disk. You'll finally be looking at MySQL's slow query log (or something similar for your favorite database) to find out what query is causing the trouble, eventually (and hopefully) fixing it.

That's the power of monitoring, and you just can't put any price on a good setup that will give you all the data and information you need to assess incidents or predict trends. And while you can set up a lot of this yourself, it doesn't hurt to look into paid options. Managing monitoring yourself means managing more infrastructure. If you can afford to pay someone else to do it for you, look at some of the mentioned services, which I have no affiliation with, I just think they're incredibly useful.

Even being an infrastructure enthusiast myself, I'm not shy of outsourcing where it makes sense. Added features like SMS alerts or iPhone push notifications should also be taken into account. Remember that it'd be up to you to implement all this. It's not without irony that I mention PagerDuty. They sit on top of all the other monitoring solutions you have implemented and just take care of the alerting, with the added benefit of on-call schedules, alert escalation and more.

The guys over at Pivotal Labs wrote a small piece on a neat tool called XRay. It hooks into your Ruby code to provide Java-like signal handlers which dump the current stack trace into whatever log file seems fit. Using Passenger that'll be your Apache's error log file.

Say what you want about Java and its enterprisey bloated-ness, but some features of it come in quite handy, especially when they allow looking into your processes without immediately having to turn to tools like NewRelic or FiveRuns.

Just add the following line to your application.

require "xray/thread_dump_signal_handler"

From then on you can use kill -QUIT <pid> to get a trace dump in your log file. Neat! The website says you need to patch Ruby, but for me it worked with the Ruby that comes with Leopard, and Ruby Enterprise Edition.
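To illustrate the general idea of a signal handler that dumps stack traces, here's a minimal sketch in plain Ruby. It's not XRay's implementation; it assumes a Ruby version that provides Thread#backtrace, and `install_thread_dump_handler` is a hypothetical name.

```ruby
# Installs a handler that writes the backtrace of every live thread
# to `io` whenever the process receives `signal`.
def install_thread_dump_handler(signal = "QUIT", io = $stderr)
  Signal.trap(signal) do
    Thread.list.each do |thread|
      io.puts "Thread #{thread.object_id} (#{thread.status})"
      (thread.backtrace || []).each { |frame| io.puts "  #{frame}" }
    end
  end
end

install_thread_dump_handler
```

With the handler installed, kill -QUIT <pid> writes the traces to standard error, which is where a setup like Passenger would hand them on to the web server's error log.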
