Building an Infrastructure Service on Top of Infrastructure Services

For the last two years, I’ve been working on Travis CI, a hosted continuous integration and deployment platform. It first started out as a free service for open source projects on GitHub, but has since then evolved into a hosted product for private projects as well.

The fascinating bit for me was, right from the start, that the entire platform, inarguably an infrastructure product, is built and runs on top of other infrastructure products.

Travis CI, for the most part, runs on infrastructure managed and operated by other people.

Most of the code runs on Heroku, our RabbitMQ is hosted by CloudAMQP, our database is run by Heroku Postgres, our build servers are managed by Blue Box, our logs go into Papertrail, our metrics to Librato, our alerts come from OpsGenie, our status page is hosted on StatusPage.io, even our Chef server, the one bit that we use to customize some of our servers, is hosted.

In a recent Hangops, we talked about buying vs. building infrastructure. I thought it’s well worth elaborating more on these things, why we went on to build Travis CI on top of other infrastructure products rather than build and automate these things ourselves.

Operation Expenses vs. Time Spent Building

The most obvious reason why you’d want to buy rather than build is to save time.

In a young company trying to build out a product, anything that saves you time but adds value is worth paying for.

You can build anything, if you have the time to do it.

This is an important trade-off. You’re spending money, possibly on a monthly basis to use a service rather than spend the time building it yourself.

A status page is a classic example. Surely it should be doable to build it myself in just a few days, yes?

But then, to be really useful, your custom status page needs an API, for easy interaction with your Hubot. Maybe you also want a way to integrate metrics from external sources, and you want it to include things like scheduled maintenances.

On top of that, (hopefully) a pretty user interface.

That’s the better of two weeks, if not more than that. On top of that, you need to run it in production too. It’s one more distraction from running your core product.

Other people may be able to do this a lot better than you. They help you free up time to work on things that are relevant to your products and your customers.

In return, you pay a monthly fee.

Surely, you say, building it yourself is practically free compared to a monthly fee, isn’t it?

Your time is very valuable. It’s more valuable spent on your own product rather than build other things around it.

The problem with time spent building things is that you can’t put a number on it. You’re basically comparing a monthly fee to what looks like a big fat zero. Because heck, you built it yourself, it didn’t cost anything.

This is the classic tradeoff of using paid services. You could very well build it yourself, but you could also spend your time on other things. You could also use and run an open source solution, but that too needs to be maintained, operated and upgraded.

If this sounds one-sided, that’s unintentional. I have a history of building things myself, racking my own servers, provisiong them myself.

But there are things to keep in mind when it comes to spending time building things. There’s a non-zero cost attached to this, it’s just not visible as the monthly invoice you’re getting from a service. That cost is hard to fathom as it’s hard to put a numeric value on the time spent building it.

When you have the resources and can afford to, it makes sense to start pulling things in-house.

For us, not having to take care of a big chunk of our infrastructure ourselves is a big benefit, allowing us to focus on the more relevant bits.

But letting other folks run core parts of your infrastructure doesn’t come without risks either.

Risks of Downtime and Maintenance

When you buy into the idea of someone else maintaining more or less vital parts of your infrastructure, there’s a risk involved.

You’re bound to any problems they might have with their infrastructure, with their code. In multi-tenant systems, any operational issues tend to ripple through the system and affect several customers rather than just one.

You’re also bound to their maintenance schedules. Amazon’s RDS service, for this particular scenario, allows you to specify a maintenance window through their API for your database instances.

The full risk of how this affects your own application is hard, if not impossible, to calculate.

A part of your infrastructure could go down at any time, and it’s mostly out of your hands to respond to it. What you can and should do, is harden your code to work around them, if at all possible.

One question to ask is how vital this particular piece of infrastructure is to your application and therefore, your business.

If it’s in the critical path, if it affects your application significantly when it goes down, there are options. Not all multi-tenant systems are automatically multi-tenant. Some offer the ability to have dedicated but managed setups. Some even offer high availability capabilities to reduce the impact of single nodes going down.

Both our PostgreSQL database and our RabbitMQ setup are critical parts of Travis CI. Without the database, we can’t store or read any data. Without our message queue, we can’t push build logs and build jobs through the system, effectively leaving the system unable to run any tests for our customers.

We started out on multi-tenant setups for both. On our PostgreSQL database, the load was eventually way too high for the small size of the database setup.

For our RabbitMQ, we were easily impacted by other clients in the system. RabbitMQ in particular can be gnarly to work with when lots of clients share the same cluster. On client producing an unusual amount of messages can grind everyone else in the system to a grinding halt.

Eventually, we ran both parts on dedicated infrastructure, but still fully managed. There’s still a chance of things going down, of course. But the impact is less than if an entire multi-tenant cluster goes down.

Putting parts that were in the critical path on dedicated infrastructure has been working well for us. The costs certainly went up, but we just couldn’t continue telling excuses on why Travis CI was down.

When it comes to buying into other people running your infrastructure, don’t be afraid to ask how they manage it. Do they have a status page that is actively used? How do they handle downtimes?

Operational openness is important when other people manage parts of your infrastructure.

It’s inevitable that something bad will happen in their infrastructure that affects you. How they deal with these scenarios is what’s relevant.

Security and Privacy

With multi-tenant infrastructure, you’re confronted with curious challenges, and they can affect you in ways that only studying your local laws and the provider’s terms of service will fully reveal.

Security and privacy are two big issues you need to think about when entrusting your data to a third party. The recent MongoHQ security incident has brought up this issue in an unprecendented way, and we’ve had our own issues with security in the past.

Note that these issues could come up just the same when you’re running your own infrastructure. But just like outages, security and privacy breaches can have much wider ranging ripple effects on multi-tenant infrastructure.

How can you best handle this? Encrypting your data is one way to approach the situation. Encrypt anything that’s confidential, that you want to protect with at least one small extra layer of security to reduce the attack surface on it.

We encrypt SSH keys and OAuth tokens, the most private data that’s entrusted to our systems. Of course, the keys aren’t stored in the database.

When buying infrastructure rather than building it, keep a good eye on what your providers do and how they handle security and your data. This is just as important as handling outages, if not even more so.

Make sure that your privacy/security statements reflect which services you’re using and how you handle your customers’ data with them. It may not sound like much, but transparency goes a long way.

One unfortunate downside of infrastructure services, Heroku add-ons come to mind, is the lack of fine-grained access privileges. Only some of the addons we use allow us to create separate user accounts with separate permissions.

It’s one of the downsides of the convenience of just having a URL added to your application’s environment and start using an add-on.

Judging the impact of the trade-off is, again, up to you. Sometimes convenience trumps security, but other times (most times?), security is more important than convenience.

Your users’ data is important to your users, so it should be just as important to you.

Scaling up and out

We started out small, with just a few Heroku dynos and a small database setup, a shared RabbitMQ setup to boot.

In fact, initially Travis CI ran on just one dyno, then two, then just a few more when a second application was split out.

This worked up to a few thousand tests per day. But as we scaled up, that wasn’t sufficient.

I was sceptical at first whether we can scale up while remaining on managed infrastructure rather than build our own. Almost two years later, it’s still working quite well.

Important bits have been moved to dedicated setups, the databases (we have four clusters, eight database servers in total) and our RabbitMQ service, which we needed to move to a cluster setup.

Most hosted services give you means to scale up. For Heroku apps, you add more dynos, or you increase the capacity of a single dyno.

For their databases (or Amazon RDS, for that matter), upgrade the underlying server, simple enough to do. For RabbitMQ, go for a bigger plan that gives more dedicated resources, higher throughput, and the like.

Figuring out the limits of hosted infrastructure services is hard. If you send a log service, even by mistake, thousands of messages per second, how do they respond? Can they handle it?

Only one way to find out, ask them!

With most of the bits that we need to scale out, we’re confident that hosted services will give us the means to do so for quite some time. After that, we can still talk to them and figure out what we can do beyond their normal offerings.

Scaling up is a challenge, as Joe Ruscio put it on the aforementioned hangops episode: “Scaling is always violent.”

It was violent on occasion for us as well.

We may need more dedicated bits in the future for specialized use, things like ZooKeeper for distributed consensus. But most of our tools are still running nicely on hosted infrastructure.

Operational insight

One thing that’s been bugging me about a few of our core services originally was the lack of operational insight.

With infrastructure beyond your control, getting insight into what’s happening can be challenging.

We had to ask Heroku support quite a few times for insight into our database host machine. For figuring out whether or not an upgrade to a larger plan or instance is required, this can be essential. It certainly was for us. This situation has been improving and it will be even more in the future from what I’ve heard.

But for an infrastructure provider, offering this kind of insight can also be challenging. Heroku’s Postgres has improved quite a lot, and we get better insight into what’s happening in our database now thanks to datascope and their means of dumping metrics into the logs, which you can then aggregate with a service like Librato.

Most providers have great people working for them. When in doubt, ask them about anything that’s on your mind. The services we work with are usually very helpful and insightful. The Heroku Postgres team is a knowledge goldmine in itself.

Tags

Operation Expenses vs. Time Spent Building

Risks of Downtime and Maintenance

Security and Privacy

Scaling up and out

Operational insight