At this year's Scottish Ruby Conference, I gave a talk about EventMachine, slides are available. Amidst the hype around Node.js it's too easy to forget that Ruby has had evented I/O libraries for years: EventMachine, libebb, rev, cool.io, to name a few. As a general introduction I recommend reading Dan Kegel's article on the C10K problem, the problem of handling 10,000 server connections on a single machine. It introduces all the evented approaches that have been implemented in the different operating systems over the last 15 years or so.

In preparation for the talk I got curious about EventMachine's innards. So I thought it'd be nice to share my findings with you. Node.js kids, pay attention, this concerns you as well. It may be JavaScript, but in the end Node.js works in a similar fashion, though it builds on libev, which does most of the plumbing for the different operating system implementations of non-blocking I/O.

Most of the magic happens inside the C++ part of EventMachine, so now's as good a time as any to dig into it and find out how it works. There'll be code in here, not assembler, but I'll be throwing constants, standard library functions and TCP networking bits (from C, not from Ruby) at you. There's no magic however, and when in doubt, consult the man pages. You do know about man pages, right? They're awesome.

while(true): The Event Loop

EventMachine is based on the idea of an event loop, which is basically nothing more than an endless loop. The standard snippet of code you wrap all your evented code in is this:

EM.run do
  # go forth and handle events
end

You can look at the details of what the method does in its full glory. Other than initializing a few things, it dives down into the C++ layer immediately, which is where most of the magic happens from now on.

Three C/C++ extension files are of importance. ext/rubymain.cpp is the bridge between Ruby and the C code layer; it uses Ruby's C functions, mostly to convert datatypes for the layer below. It then calls into code defined in ext/cmain.cpp, which in turn bridges the C and the C++ code. The third file, ext/em.cpp, contains the actual event loop implementation in the EventMachine_t class.

When you call EM.run to start the event loop, it calls down into the C layer to t_run_machine_without_threads, which is called as run_machine, and which in turn calls EventMachine_t::Run(), whose interesting bits are shown below.

while (true) {
  _UpdateTime();
  if (!_RunTimers())
    break;

  _AddNewDescriptors();
  _ModifyDescriptors();

  if (!_RunOnce())
    break;
  if (bTerminateSignalReceived)
    break;
}

There's your event loop. Doesn't look that scary now, right? It basically does five things:

  • Update the current time (line 2)

    Used in the next step to determine whether a timer should be fired

  • Run configured timers (line 3)

    All the timers specified through either add_timer or add_periodic_timer are run here (there's a short usage example after this list). When you add a timer, EventMachine stores it in a map indexed with the time it's supposed to fire. This makes checking the list for the ones that should be fired in the current iteration a cheap operation.

    _RunTimers() iterates over the list of timers until it reaches one entry whose key (i.e. the time it's supposed to fire) is higher than the current time. Easy and efficient.

    On a side note, _RunTimers always returns true, so it's a bit weird that the return value is checked.

  • Add new descriptors (line 6)

    Whenever you open a new connection or server, EventMachine adds an object representing it and the associated callbacks to this list. All connections and descriptors created in the last iteration are handled here, which basically means setting additional options where applicable and adding them to the list of active connections.

    On the operating system level a descriptor represents a file handle or a socket connection. When you open a file, create a connection to another machine or create a server to listen for incoming connections, all of them are represented by descriptors, which are basically integers pointing into a list maintained by the operating system.

  • Modify descriptors (line 7)

    Modify existing descriptors, if applicable. This only has any effect when you're using epoll, which we'll get to later.

  • Run the I/O event handling (line 9)

    Check all open file descriptors for new input. Read whatever's available, run the associated event callbacks. The heart of the event loop, worth taking a closer look below.
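
Speaking of those timers, here's what they look like from the Ruby side; add_timer and add_periodic_timer are standard EventMachine API, and each block is filed under the absolute time it's due to fire:

  EM.run do
    # one-shot timer, fires once after roughly five seconds
    EM.add_timer(5) { puts "five seconds later" }

    # periodic timer, rescheduled with a new target time after every run
    EM.add_periodic_timer(1) { puts "another second passed" }
  end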

The event loop really is just an endless loop after all.

Open a Socket

When you call EM.connect to open a connection to a remote server, the connection is created immediately, but it may not be fully established until later. The resulting connection has a bunch of properties (roughly sketched in Ruby after this list):

  • The descriptor is configured not to block on input and output by setting the O_NONBLOCK flag. This way reads return immediately when there's no data instead of waiting for some to arrive, and writes don't necessarily write all the data they're given. It also means that a call to connect() to create a new connection returns before the connection is fully established.

  • The Nagle algorithm is disabled to prevent the TCP stack from delaying the sending of packets, by setting TCP_NODELAY on the socket. The operating system wants to buffer output to send fewer packets. Disabling Nagle causes any writes to be sent immediately. As EventMachine does its own internal buffering, it's preferable for the data to really be sent when it's eventually written to a socket.

  • Connections in TIME_WAIT state are reused before they're fully removed from the networking stack. TCP keeps connections around for a while, even after they're closed, to ensure that all data from the other side really, really made it to your end. Nice and all, but in environments with a high fluctuation of connections, in the range of hundreds to thousands per second, you'll run out of file descriptors in no time.
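
Translated into Ruby's socket API, those options look roughly like the following. This is only an illustrative sketch of what happens on the C++ level, not EventMachine's actual code, and SO_REUSEADDR is my assumption for the TIME_WAIT reuse:

  require 'socket'
  require 'fcntl'

  socket = Socket.new(Socket::AF_INET, Socket::SOCK_STREAM, 0)

  # don't block on reads and writes
  flags = socket.fcntl(Fcntl::F_GETFL, 0)
  socket.fcntl(Fcntl::F_SETFL, flags | Fcntl::O_NONBLOCK)

  # disable the Nagle algorithm, pushing writes out immediately
  socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)

  # reuse addresses still hanging around in TIME_WAIT
  socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_REUSEADDR, 1)

  # a non-blocking connect returns before the connection is established
  begin
    socket.connect_nonblock(Socket.sockaddr_in(80, 'example.com'))
  rescue Errno::EINPROGRESS
    # expected; the event loop finds out the connect finished when the
    # socket becomes writable
  end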

Opening a socket is an immediate event, it happens as soon as you create a new connection. Running any callbacks on it won't happen until the next iteration of the event loop. That's why it's safe to e.g. fire up a new HTTP request and then attach callbacks to it. Even if that weren't the case, EventMachine's Deferrables (not to be confused with EM.defer) ensure that callbacks are fired even when they're added after the original event already fired.

What is immediately called, though, is the post_init method on the connection object.

Opening a network connection is just one thing you can do with EventMachine, but as it's the one thing you're most likely to do when you're using it, let's leave it at that.

Don't call us, we'll call you

Working with asynchronous code in EventMachine usually involves callbacks, unless you work with your own connection class. Libraries like em-http-request rely on deferrables to communicate with your application. They're fired when an HTTP request has finished or failed. But how does a library keep track of data that only comes in bit by bit?

The answer is simply buffering. Which brings us to the core of the event loop, checking sockets for input, which is done from the ominous _RunOnce() method in the code snippet above. EventMachine can utilize three mechanisms to check descriptors for new input.

select(*)

The default is using select(), a standard system call to check a collection of file descriptors for input, by way of Ruby's implementation rb_thread_select(), which wraps the call to select() with a bunch of code ensuring thread safety.

Using select() pretty much works everywhere, and is perfectly fine up to a certain number of file descriptors. If you're simply serving an asynchronous web application or API using EventMachine, this may be totally acceptable.

Implementing this way of handling I/O is rather straightforward, if you look at the implementation. Collect all the file descriptors that may be of interest, feed them into select, read and/or write data when possible.

What makes using select() a bit cumbersome is that you always have to assemble a list of all file descriptors for every call to _RunOnce(), so EventMachine iterates over all registered descriptors on every loop. After select has run, it loops over all file descriptors again, checking whether select marked them as ready for reads and/or writes.

When select() marks a descriptor as ready for read or write operations that means the socket will not block when data is read from or written to it. In the case of reading that usually means the operating system has some data buffered somewhere, and it's safe to read that data without having to wait for it to arrive, which in turn would block the call.
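
In Ruby terms, the gist of a select()-based loop looks something like this. It's an illustrative sketch, not EventMachine's code; assume connections maps IO objects to connection objects, and that receive_data, data_pending? and flush_buffer live on those objects:

  loop do
    # assemble the full lists of interesting descriptors on every pass
    readers = connections.keys
    writers = connections.keys.select { |io| connections[io].data_pending? }

    readable, writable = IO.select(readers, writers, nil, 0.1)

    (readable || []).each do |io|
      data = io.read_nonblock(4096)      # select said it's ready, so this won't block
      connections[io].receive_data(data)
    end

    (writable || []).each do |io|
      connections[io].flush_buffer       # write out whatever's been buffered
    end
  end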

EventMachine could also use poll() instead of select(), which handles a bit nicer in general, but it's not available through the Ruby VM.

epoll

epoll is Linux' implementation for multiplexing I/O across a large number of file descriptors.

The basic steps of using epoll are simple:

  • Set up an epoll instance using epoll_create, done initially when the event loop is created. This creates a virtual file descriptor pointing to a data structure that keeps track of all real file descriptors associated with it in the next step.

    You only need to reference this single file descriptor later, so there's no need to collect a list of all file descriptors, as is the case when select() is used.

  • Register interest for events on a file descriptor using epoll_ctl on the epoll instance created above.

    This is used in _AddNewDescriptors and _ModifyDescriptors to register and update EventMachine's file descriptors with epoll. In fact, both methods only do anything noteworthy when epoll is used. Otherwise they just iterate over a list of descriptors, pretty much doing nothing with them.

  • Wait for input with epoll_wait for a specified duration. You can wait forever, return immediately if nothing happened, or wait for a specific amount of time.

    EventMachine seems to have chosen to return immediately if there's no activity. There's an alternative implementation that calculates the time to wait based on the likelihood of a specific event (e.g. a timer) firing on the next event loop iteration, but it doesn't seem to ever be used. It seems to be a relic from the time EventMachine could also be used as a C++ library.

epoll events are registered for both reads and writes, with epoll_wait returning the number of file descriptors that are ready for either event.

Using epoll has a big advantage, aside from being much more efficient than select in general for larger sets of file descriptors. It spares code using it the burden of constantly iterating over a list of file descriptors. Instead you just register them once, and then only iterate over the ones affected by the last call to epoll_wait.

So epoll requires a bit more work when you add or modify connections, but is a bit nicer on the eyes when it comes to actually polling them for I/O availability.

Note that epoll support must be explicitly enabled using EM.epoll.

kqueue

kqueue is the BSD equivalent of epoll, and is available on e.g. FreeBSD and Mac OS X. It works very similarly to epoll. If you want to know more details, I'd suggest reading the paper on it by Jonathan Lemon.

You can enable kqueue support using EM.kqueue, which is, just like EM.epoll, a noop on systems that don't support it. Hopefully future EventMachine versions will default to whatever's available on a particular system.

Call me already!

All three mechanisms have one thing in common: when data is read, receive_data is called immediately, which brings us back to the question of how a connection object collects data coming in.

Whenever data is ready to be consumed from a socket, EventMachine calls EventableDescriptor::Read(), which reads a bunch of data from the socket by calling read() on the file descriptor, and then immediately executes the callback associated with the descriptor, which usually ends up calling receive_data with the data that was just read. Note that the callback here refers to something defined on the C++ level, not yet a Ruby block you'd normally use in an asynchronous programming model.

receive_data is where you will usually either buffer data or run some action immediately. em-http-request feeds the data coming in directly into an HTTP parser. Whatever you do in here, make it quick, don't process the data for too long. A common pattern in libraries built on EventMachine is to use a Deferrable object to keep track of a request's state, firing callbacks when it either succeeded or failed.
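
Here's a minimal sketch of that buffering pattern in a connection class; the class and the line-based protocol are made up for illustration:

  class LineCollector < EM::Connection
    def post_init
      @buffer = ''
    end

    # called with every chunk of data read off the socket
    def receive_data(chunk)
      @buffer << chunk
      # act as soon as a full line has arrived, but keep it quick
      while line = @buffer.slice!(/.*\n/)
        receive_line(line.chomp)
      end
    end

    def receive_line(line)
      puts "got: #{line}"
    end
  end

  # EM.run { EM.start_server('0.0.0.0', 8081, LineCollector) }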

Which brings me to the golden rule of programming with libraries like EventMachine and Node.js: DON'T BLOCK THE EVENT LOOP!! Defer whatever work you can to a later run of the loop when it makes sense, or push it to another asynchronous processing facility, e.g. a message queue like RabbitMQ or Redis' Pub/Sub.
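
Two of the tools EventMachine gives you for that are EM.next_tick and EM.defer; the work being pushed around here is of course made up:

  # run the block on a later iteration of the event loop
  EM.next_tick { refresh_some_cache }

  # push longer running work onto EventMachine's internal thread pool,
  # with a callback that runs back on the event loop when it's done
  EM.defer(proc { crunch_numbers }, proc { |result| handle(result) })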

In a similar fashion, whenever you write data to a connection using send_data, it's first buffered, and not actually sent until the socket is ready for a non-blocking call to write(). Hence all three implementations check for both read and write availability of a descriptor.

Fibers vs. Spaghetti

Where do Ruby's Fibers come in here? Callbacks can easily lead to spaghetti code, especially when you have to nest them to run multiple asynchronous actions in succession.

Fibers can stop execution of a process flow at any time and yield control to some other, controlling entity or another Fiber. You could, for example, wrap a single HTTP request into a fiber and yield back control when all the callbacks have been assigned.

In the callbacks you then resume the Fiber, so the processing flow turns into a synchronous, procedural style again.

  # called from within a Fiber running inside EM.run,
  # e.g. Fiber.new { get(url) }.resume
  def get(url)
    current_fiber = Fiber.current
    request = EM::HttpRequest.new(url).get
    request.callback { current_fiber.resume(request) }
    request.errback  { current_fiber.resume(request) }
    # pause here until one of the callbacks resumes us with the request
    Fiber.yield
  end

Fiber.yield returns whatever object it was handed in Fiber.resume. Run this inside a Fiber and boom, there's your synchronous workflow: all you need to do is call get('http://paperplanes.de') and work with the return value. Many props to Xavier Shay for digging into the Goliath source to find out how that stuff works, it helped me a lot. If you never had a proper use case for Fibers in real life, you do now.
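
In context, and sticking with em-http-request, that might look something like this:

  EM.run do
    Fiber.new do
      response = get('http://paperplanes.de')
      puts response.response_header.status
      EM.stop
    end.resume
  end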

em-synchrony is a library doing just that for a lot of existing EventMachine libraries, and Goliath is an evented web server, wrapping a Rack-style API using Fibers.

Things you should be reading

Here's a bunch of free reading tips for ya. These books are pretty old, but have gone through several revisions and updates, and they're still the classics when it comes to lower-level Unix (network) programming and understanding TCP/IP, which I consider very important. TCP/IP Illustrated is one of the best books I've read so far; being aware of what happens under the networking hood is essential knowledge.

Also, read the fine man pages. There's a whole bunch of good documentation installed on every Unix-style system, and I linked to a couple of pages relevant to this post already. Read them.

yield

This concludes today's whirlwind tour through some of EventMachine's internals. There's actually not too much magic happening under the covers, it's just wrapped into a bit too much code layering for my taste. But you be the judge.

Play with EventMachine and/or Node.js if you haven't already, try to wrap your head around the asynchronous programming model. But for the love of scaling, don't look at evented and asynchronous I/O as the sole means of scaling, because it's not.

Last weekend I tweeted links to two tweets by a poor guy who apparently got his MongoDB database into an unrecoverable state during shutdown while upgrading to a newer version. Those tweets quickly made the rounds, and the next morning I found myself staring at replies stating that it was all his fault, because he 1) used kill -9 to shut it down because the process apparently hung (my guess is it was in the middle of flushing all data to disk) and 2) didn't have a slave, just the one database instance.

Others went as far as indirectly calling him an idiot. Oh interwebs, you make me sad. If you check out the thread on the mailing list, you'll notice a similar pattern in reasoning. The folks over at http://learnmongo.com seem to want to be the wittiest of them all, recommending that you always have a recent backup, a slave or replica set, and that you never kill -9 your database.

While you can argue that the guy should've known better, there's something very much at odds here, and it seems to be becoming a terrifying meme with fans of MongoDB: the idea that you need to do all of these things to get the assurance of your data being durable. Don't have a replica? Your fault. kill -9 on a database, any database? You mad? Should've read the documentation first, dude. This whole issue goes a bit deeper than just reading documentation. It's about the fundamental design decision of how MongoDB treats your data, and it's been my biggest gripe from the get-go. I can't help but be horrified by these comments.

I've heard the same reasoning over and over again, and also that it just hasn't happened so far, no one's really lost any considerable data. The problem is, most people never talk about it publicly because it's embarrassing; the poor guy above is the best proof. This isn't even specific to MongoDB, it's a general problem.

Memory-Mapped Persistence

But let me start at the beginning, MongoDB's persistence cycle, and then get to what's being done to improve its reliability and your data's durability. At the very heart, MongoDB uses memory-mapped files to store data. A memory-mapped file is a data structure that has the same representation on disk as it has when loaded into memory. When you access a document in MongoDB, loading it from disk is transparent to MongoDB itself; it can just go ahead and write to the address in memory, as every database in MongoDB is mapped to a dynamically allocated set of files on disk. Note that memory-mapped files are something you won't find in a lot of other databases, if any at all. Most do their own house-keeping and use custom data structures for that purpose.

The memory mapping library (in MongoDB's case the POSIX functions, and whatever Windows offers in that area) will take care of flushing back to disk every 60 seconds (configurable). Everything in between happens solely in memory. Database crashes one second before the next flush? You just lost most of the data that was written in the last 59 seconds. Just to be clear, the flushing cycle is configurable, and you should consider choosing a better value depending on what kind of data you're storing.

MongoDB's much-praised insert speed? This is where it comes from. When you write stuff directly to local memory, writes better be fast. The persistence cycle is simple: accept writes for 60 seconds, then flush the whole thing to disk. Wait another 60 seconds, then flush again, and so on. Of course MongoDB also flushes the data when you shut it down. But, and here's the kicker, that flush will of course fail when you kill the process without mercy, using the KILL signal, just like the poor guy above apparently did. When you kill something that's writing a big set of binary data to disk, all bets are off. One bit landing on the wrong foot and the database can get corrupted.

Database Crashes are Unavoidable

This scenario can and does happen in e.g. MySQL too, it even happens with CouchDB, but the difference is that in MySQL you usually only have a slightly damaged region, which can be fixed by deleting and re-inserting it. In CouchDB, all that happens is that your last writes may be broken, but CouchDB simply walks all the way back to the last successful write and runs happily ever after.

My point here is simple: even when killed using the KILL signal, a database should not be unrecoverable. It simply shouldn't be allowed to happen. You can blame the guy all you want for using kill -9, but consider the fact that it's the process equivalent of a server or even just the database process crashing hard. Which happens, believe it or not.

Yes, you can and probably will have a replica eventually, but it shouldn't be the sole precondition to get a durable database. And this is what horrifies me, people seem to accept that this is simply one of MongoDB's trade-offs, and that it should just be considered normal. They shouldn't, and it needs more guys like the one causing all the stir, bringing up these issues, even though it's partly his fault, to show the world what can happen when worse comes to worst.

People need to ask more questions, and not just accept answers like: don't use kill -9, or always have a replica around. Servers crash, and your database needs to be able to deal with it.

Durability Improvements in MongoDB 1.7/1.8

Now, the MongoDB folks aren't completely deaf, and I'm happy to report they've been working on improvements in the area of data durability for a while, and you can play with the new durability option in the latest builds of the 1.7 branch, and just a couple of hours ago, there was activity in improving the repair tools to better deal with corrupted databases. I welcome these changes, very much so. MongoDB has great traction, a pretty good feature set, and the speed seems to blow peoples' minds. Data durability has not been one of its strengths though, so I'm glad there's been a lot of activity in that area.

If you start the MongoDB server with the new --dur option, it will start keeping a journal. When your database crashes, the journal is simply replayed to restore all changes since the last successful flush. This is not a particularly special idea, because it's how your favorite relational database has been working for ages, and it's not unlike the storage model of other databases in the NoSQL space. It's a good trade-off between keeping good write speed and getting a much more durable dataset.

When you kill your database harshly in between flushes with a good pile of writes in between, you don't lose a lot of data anymore, maybe a second's worth (just as you do with MySQL when you use InnoDB's delayed flushing), if any at all, but not much more than that. Note that these are observations based on a build that's now already more than a month old; the situation may have improved since then. Operations are put into a buffer in memory, from where they're both logged to disk into the journal and applied to the dataset. By the time the data is written to memory, it has already been written to the journal. Journals are rotated once they reach a certain size, and it's ensured that all their data has been applied to the dataset.

A recovery process applies all uncommitted changes from the log when the database crashes. This way it's ensured that you only lose a minimal set of data, if any at all, when your database server crashes hard. In theory the journal could also be used to restore a corrupted database in a scenario as outlined above, so it's pretty neat in my opinion. Either way, the risk of losing data is now pretty low. In case you're curious for code, the magic happens in this method.

I for one am glad to see improvements in this area of MongoDB, and I'm secretly hoping that durable will become the default mode, though I don't see that happening anytime soon, for marketing reasons. Also, be aware that durability brings more overhead. In some initial tests, however, the speed difference between non-durable and durable MongoDB was almost not worth mentioning, though I wouldn't call them representative; in general there's no real excuse not to use it.

It's not yet production ready, but nothing should keep you from playing with it to get an idea of what it does.

Bottom Line

It's okay to accept trade-offs with whatever database you choose to your own liking. However, in my opinion, the potential of losing all your data when you use kill -9 to stop it should not be one of them, nor should accepting that you always need a slave to achieve any level of durability. The problem is less with the fact that it's MongoDB's current way of doing persistence, it's with people implying that it's a seemingly good choice. I don't accept it as such. If you can live with that, which hopefully you don't have to for much longer anyway, that's fine with me, it's not my data anyway. Or maybe I'm just too paranoid.

Over the last year I haven't only grown very fond of coffee, but also of infrastructure. Working on Scalarium has been a fun ride so far, for all kinds of reasons, one of them is dealing so much with infrastructure. Being an infrastructure platform provider, what can you do, right?

As deployment, performance tuning and monitoring have been part of many of my jobs, infrastructure has always played a role in them, so I thought it'd be about time to sprinkle some of my thoughts on daily ops across a couple of articles. The simple reason being that no matter how much you try, no matter how far away from dealing with servers you go (think Heroku), there will always be infrastructure, and it will always affect you and your application in some way.

On today's menu: monitoring. People have all kinds of different meanings for monitoring, and they're all right, because there is no one way to monitor your applications and infrastructure. I just did a recount, and there are no fewer than six levels of detail you can and probably should get. Note that these are my definitions; they don't necessarily have official names, they're solely based on my experience. Let's start from the top, the outside view of your application.

Availability Level

Availability is a simple measure to the user: either your site is available or it's not. There is nothing in between. When it's slow, it's not available. It's a beautifully binary measure, really. From your point of view, any component or layer in your infrastructure could be the problem. The art is to quickly find out which one it is.

So how do you notice when your site is not available? Waiting for your users to tell you is an option, but generally a pretty embarrassing one. Instead you generally start polling some part of your site that's representative of it as a whole. When that particular site is not available, your whole application may as well not be.

What that page should do is get a quick measure of the most important components of your site, check if they're available (maybe even with a timeout involved so you get an idea whether a specific component is broken) and return the result. An external process can then monitor that page and notify you when it doesn't return the expected result. Make sure the page does a bit more than just return "OK". If it doesn't hit any of the major components in your stack, there's a chance you won't notice that e.g. your database is becoming unavailable.
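
A minimal sketch of such a status page, using Sinatra just for brevity; the component checks (Database.ping, Cache.ping) are made-up placeholders for whatever matters in your stack:

  require 'sinatra'
  require 'timeout'

  get '/status' do
    checks = {
      'database' => lambda { Timeout.timeout(2) { Database.ping } },
      'cache'    => lambda { Timeout.timeout(2) { Cache.ping } }
    }

    # a check fails when it raises, times out or returns false
    failed = checks.reject { |name, check| check.call rescue false }.keys

    if failed.empty?
      'OK'
    else
      halt 500, "FAILED: #{failed.join(', ')}"
    end
  end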

You should run this process from a different host, but what do you do if that host is not available? Even as an infrastructure provider I like outsourcing parts of my own infrastructure. Here's where Pingdom comes into play. They can monitor a specific URL, TCP ports and whatnot from some two dozen locations across the planet and they randomly go through all of them, notifying you when your site is unavailable or the result doesn't match the expectations.

Pingdom

Business Level

These aren't necessarily metrics related to your application's or infrastructure's availability, they're more along the lines of what your users are doing right now, or have done over the last month. Think number of new users per day, number of sales in the last hour, or, in our case, number of EC2 instances running at any minute. Stuff like Google Analytics or click paths (using tools like Hummingbird, for example) in general also fall into this category.

These kinds of metrics may be more important to your business than to your infrastructure, but they're important nonetheless, and they could e.g. be integrated with another metrics collection tool, some of which we'll get to in a minute. Depending on what kind of data you're gathering, they're also useful to analyze spikes in your application's performance.

This kind of data can be hard to track in a generic way. Usually it's up to your application to gather it and turn it into a format another tool can collect. These metrics are also usually very specific to your application and its business model.

Application Level

Digging deeper from the outsider's view, you want to be able to track what's going on inside of your application right now. What are the main entry points, what are the database queries involved, where are the hot spots, which queries are slow, what kinds of errors are being caused by your application, to name a few.

This will give you an overview of the innards of your code, and it's simply invaluable to have that kind of insight. You usually don't need much historical data in this area, just a couple of days worth will usually be enough to analyze problems in retrospect. It can't hurt to keep them around though, because growth also shows trends in potential application code hot spots or database queries getting slower over time.

To get an inside view of your application, services like New Relic exist. While their services aren't exactly cheap (most monitoring services aren't, no surprise here), they're invaluable. You can dig down from the Rails controller level to find the method calls and database queries that are slowest at a given moment in time (most likely you'll be wanting to check the data for the last hours to analyze an incident), digging deeper into other metrics from there. Here's an example of what it looks like.

New Relic

You can also use the Rails log file and tools like Request-log-analyzer. They can help you get started for free, but don't expect the same fine-grained level of detail you get with New Relic. However, with Rails 3 it's become a lot easier to instrument code that's interesting to you and gather data on runtimes of specific methods yourself.
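
For example, ActiveSupport::Notifications, Rails 3's instrumentation API, lets you wrap the bits you care about yourself; the event name and payload here are made up:

  # somewhere in your application code
  ActiveSupport::Notifications.instrument('process.payment', :order_id => order.id) do
    charge_credit_card(order)
  end

  # somewhere in an initializer, collect the timings
  ActiveSupport::Notifications.subscribe('process.payment') do |name, start, finish, id, payload|
    duration = (finish - start) * 1000
    Rails.logger.info("#{name} took #{duration.round}ms #{payload.inspect}")
  end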

Other means are e.g. JMX, one of the neat features you get when using a JVM-based language like JRuby. Your application can continuously collect and expose metrics through a defined interface to be inspected or gathered by other means. JMX can even be used to call into your application from the outside, without having to go through a web interface.

Application level monitoring also includes exception reporting. Services like Exceptional or Airbrake are probably the most well known in that area, though in its higher pricing tiers New Relic also includes exception reporting.

Process Level

Going deeper (closer to inception than you think) from the application level we reach the processes that serve your application. Application servers, databases, web servers, background processing, they all need a process to be available.

But processes crash. It's a bitter and harsh truth, but they do, for whatever reason, maybe they consumed too many resources, causing the machine to swap or the process to simply crash because the machine doesn't have any memory left to allocate. Think of a memory leaking Rails application server process or the last time you used RMagick.

Someone must ensure that the processes keep running and that they don't consume more resources than they're allowed to, to ensure availability on that level. These tools are called supervisors. Give them a pid file and a process, running or not, and they'll make sure that it is. Whether a process is considered healthy can depend on multiple metrics: availability over the network, a file size (think log files) or simply the existence of the process. They also allow you to set some sort of grace period, so they'll retry a number of times with a timeout before actually restarting the process or giving up on monitoring it altogether.

A good supervisor will also let you alert someone when the monitored conditions move outside of their acceptable perimeter and a process had to be restarted. A classic in this area is Monit, but people also like God and Bluepill. On a lower level you have tools like runit or upstart, but their capabilities are usually built around a pid file and a process, without the ability to check system resources on a higher level.

While I find the syntax of Monit's configuration to not be very aesthetically pleasing, it's proven to be reliable and has a very small footprint on the system, so it's our default on our own infrastructure, and we add it to most of our cookbooks for Scalarium, as it's installed on all managed instances anyway. It's a matter of preference.
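
To give you an idea of what a supervisor configuration looks like, here's a rough sketch using God's Ruby DSL; the process, paths and thresholds are made up:

  God.watch do |w|
    w.name     = "worker"
    w.start    = "ruby /var/apps/myapp/worker.rb"
    w.pid_file = "/var/run/worker.pid"
    w.interval = 30.seconds

    # start it whenever it's not running
    w.start_if do |start|
      start.condition(:process_running) do |c|
        c.running = false
      end
    end

    # restart when it leaks memory, with a grace period of sorts:
    # the condition has to hold for 3 out of 5 intervals
    w.restart_if do |restart|
      restart.condition(:memory_usage) do |c|
        c.above = 150.megabytes
        c.times = [3, 5]
      end
    end
  end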

Infrastructure/Server Level

Another step down from processes we reach the system itself. CPU and memory usage, load average, disk I/O, network traffic, are all traditional metrics collected on this level. The tools (both commercial and open source) in this area can't be counted. In the open source world, the main means to visualize these kinds of metrics is rrdtool. Many tools use it to graph data and to keep an aggregated data history around, using averages for hours, days or weeks to store the data efficiently.

This data is very important in several ways. For one, it will show you what your servers are doing right now, or in the last couple of minutes, which is usually enough to notice a problem. Second, the data collected is very useful to discover trends, e.g. memory usage increasing over time, swap usage increasing, or a partition running out of disk space. Any value constantly increasing over time is a good sign that you'll hit a wall at some point. Noticing trends will usually give you a good indication that something needs to be changed in your infrastructure or your application.

Munin

There's a countless number of tools in this area: Munin (see screenshot), Nagios, Ganglia, collectd on the open source end, CloudKick, Circonus, Server Density and Scout on the paid service level, and an abundance of commercial tools on the very expensive end of server monitoring. I never really bother with the commercial ones, because I either resort to the open source tools or pay someone to take care of the monitoring and alerting for me on a service basis. Most of these tools will run some sort of agent on every system, collecting data in a predefined cycle and delivering it to a master process, or with the master process picking up the data from the agents.

Again, it's a matter of taste. Most of the open source tools available tend to look pretty ugly on the UI part, but if the data and the graphs are all that matters to you, they'll do just fine. We do our own server monitoring using Server Density, but on Scalarium we resort to using Ganglia as an integrated default, because it's much more cost effective for our users, and given the elastic nature of EC2 it's much easier for us to add and remove instances as they come and go. In general I'm also a fan of Munin.

Most of them come with some sort of alerting that allows you to define thresholds which trigger the alerts. You'll never get the thresholds right the first time you configure them, so constantly keep an eye on them to get a picture of which values are normal and which indeed point to problem areas and require an alert to be triggered.

The beauty of these tools is that you can throw any metric at them you can think of. They can even be used to collect business level data, utilizing the existing graphing and even alerting capabilities.
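
To illustrate that last point, a Munin plugin is just a small script: called with the argument config it describes the graph, otherwise it prints the current value. A hypothetical plugin graphing signups could look like this:

  #!/usr/bin/env ruby
  # /etc/munin/plugins/signups -- made-up metric, e.g. dumped to a file by your app
  if ARGV.first == 'config'
    puts 'graph_title Signups'
    puts 'graph_vlabel signups per day'
    puts 'signups.label signups'
  else
    puts "signups.value #{File.read('/var/run/myapp/signups').to_i}"
  end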

Log Files

The much dreaded log file won't go out of style for a long time, that's for sure. Your web server, your database, your Rails application, your application server, your mail server, all of them dump more or less useful information into log files. They're usually the most immediate and up-to-date view of what's going on in your application, if you choose to actually log something. Rails applications traditionally seem to be less of a candidate here, but your background services sure are, as is any other service running on your servers. The log is the first to know when there are problems delivering email or when your web server is returning an unexpected amount of 500 errors.

The biggest problem however is aggregating the log data, centralized logging if you will. syslog and all the alternative tools are traditionally sufficient, while on the larger-scale end you have custom tools like Cloudera's Flume or Facebook's Scribe. There's also a bunch of paid services specializing in logging, most noteworthy Splunk and Loggly. Loggly relies on syslog to collect and transmit data from your servers, but they also have a custom API to transmit data. The data is indexed and can easily be searched, which is usually exactly what you want to do with logs. Think about the last time you grepped for something in multiple log files, trying to narrow down the data found to a specific time frame.
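
Getting your own code to feed into that syslog pipeline is simple enough. From Ruby, for example, using the standard library's Syslog module (program name and message made up):

  require 'syslog'

  Syslog.open('myapp', Syslog::LOG_PID, Syslog::LOG_LOCAL0)
  Syslog.warning('email delivery to %s failed', 'user@example.com')
  Syslog.close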

There's a couple of open source tools available too, Graylog2 is a syslog server with a MongoDB backend and a Java server to act as a syslog endpoint, and a web UI allowing nicer access to the log data. A bit more kick-ass is logstash which uses RabbitMQ and ElasticSearch for indexing and searching log data. Almost like a self-hosted Loggly.

When properly aggregated, log files can show trends too, but aggregating them gets much harder the more log data your infrastructure accumulates.

ZOMG! So much monitoring, really?

Infrastructure purists would start by saying that there's a difference between monitoring, metrics gathering and log files. To me, they're a similar means to a similar end. It doesn't exactly matter what you call it, the important thing is to collect and evaluate the data.

I'm not suggesting you need every single kind of logging, monitoring and metrics gathering mentioned here. There is, however, one reason why you'll eventually want to have most if not all of them: when an incident hits your application or infrastructure, you can correlate all the available data to find the real reason for a downtime, a spike, slow queries, or problems introduced by recent deployments.

For example, your site's performance is becoming sluggish in certain areas, users start complaining. Application level monitoring indicates specific actions taking longer than usual, pointing to a specific query. Server monitoring for your database master indicates an increased number of I/O waits, usually a sign that too much data is read from or written to disk. Simplest reason could be an index missing or that your data doesn't fit into memory anymore and too much of it is swapped out to disk. You'll finally be looking at MySQL's slow query log (or something similar for your favorite database) to find out what query is causing the trouble, eventually (and hopefully) fixing it.

That's the power of monitoring, and you just can't put any price on a good setup that will give you all the data and information you need to assess incidents or predict trends. And while you can set up a lot of this yourself, it doesn't hurt to look into paid options. Managing monitoring yourself means managing more infrastructure. If you can afford to pay someone else to do it for you, look at some of the mentioned services, which I have no affiliation with, I just think they're incredibly useful.

Even being an infrastructure enthusiast myself, I'm not shy of outsourcing where it makes sense. Added features like SMS alerts and iPhone push notifications should also be taken into account; remember that otherwise it'd be up to you to implement all this. It's not without irony that I mention PagerDuty. They sit on top of all the other monitoring solutions you have implemented and just take care of the alerting, with the added benefit of on-call schedules, alert escalation and more.