It seems to be common sense these days to throw a search engine, a database slave or an external cache at your application in even the earliest stages of a project. Let me just give you my two cents on the issue, and why I think that's a bad idea.

It all starts with the angst that the application won't scale when it's hit by whatever effect the cool kids use these days. Even having only several thousand records already seem to be an indication that there's not enough room to scale the database itself.

Let me just go ahead and say it: Your way of thinking about scaling is wrong. You're raising the bar of complexity without having a problem. With every service introduced to the project, a new error source needs to be handled. Your code needs to deal with potential MySQL slave replication lags, cache expiration, constant search index updates, and of course the administrative overhead.

Let's look at all three in a little more detail.

MySQL Slave/Sharding

The load on your database has reached levels you think aren't manageable without introducing a reading slave into your setup anymore. Most of the time, you're wrong.

Is your load too high? Find statements that require too much IO and make them fast. Does your database touch the hard drives too much? Increase the cache size and add more RAM. The more data and indexes you can store in-memory, the less hard drive activity is required. Caching sure is king, but you're winning the most when you can let the database do most, if not all of it.

Do inserts/updates take too long? Check your indexes, you might have too many of them. Increase the interval that InnoDB will flush transactions to the log in. Unless you have a really, really critical application, you can live with the prospect of losing one second worth of data.

Adding a slave looks easy. You dump the master, load the dump on the slave, enable replication, and you're golden. But what if your slave goes down? What if the replication lacks so far behind that it would impact what the user sees? Does your application handle the issue that data written to the master might not be visible on the slave immediately?

In the end, it's the same as with an external cache, you still need to tune your database. You'll gain a lot more by trying to find some of the big resource-hogging queries in your application, setting some indexes or fine-tuning one or two database parameters. What's that you're gaining, you ask? Why, it's simplicity.

Adding RAM and increasing the database's cache is a lot cheaper than trying to improve performance in other parts of your application. Calculate the cost of say 8 GB of RAM against the cost of one person working for a whole week on trying to tweak things in your application that could easily be improved with just a little bit more RAM. Whether you like it or not, the database is your bottleneck, but tuning it is not rocket science. Some solutions to common database problems (which you have, unless you're Big Company) are so simple it still hurts my brain how uncommon the knowledge of them is.

Don't believe me? Have a look at this recent post on the MySQL Performance Blog.

Sharding might eventually make sense when splitting up your application is the only way out. When you have five billion visits per months, then you can go ahead and shard your databases. But right from the start? Meh, throw RAM at it as long as you can, saves a lot of trouble.

Memcache

Your queries are becoming slow, and therefore your website's response time decreases. One thing that pops up will be to cache the hell out of your website. After all, Rails has neat support for caching.

That's all great, but you need to be aware of two things. First, caching is hard. Getting the data in and out is the easy part, knowing when to expire the data is what will break your neck, it not just might, it will.

Second, it will not spare you from tuning your database. Imagine your cache goes down. Sure, your application handles the missing cache gracefully, but all those slow queries you wanted to get rid of, they're in full effect. Caching is not the solution to slow database queries. You still need to invest time in fine-tuning slow statements and your database.

Memcache is awesome, and it's a great way to reduce your application's response time, but it's a level of complexity that needs to be understood, handled and tested correctly and gracefully.

All that said, I prefer turning to Memcached before I consider replication or sharding anytime.

Full Text Search

A common meme seems to be that you don't want to do full-text search using your database. Throw a separate full-text search engine at your application instead. But hold on, isn't the purpose of a database to find data easily? Yeah, I vaguely remember that. So why not use it? Because it's slow? Oh, come on. You have like what, some ten thousand records in your database? Are all of them using a largetext column? No? Good, then why not use a simple select using like? It's not horribly wrong, you know. Worry about setting up full text search when you hit a wall with that. But unless you have hundreds of thousands of rows or larger documents you want to query on, you'll be fine.

Don't get me wrong, I love me a good full-text search, but I don't throw it at every application just because I want to search for something, even in text. If you use PostgreSQL, look into tsearch2, using MySQL, maybe the included full-text search will do.

The bottom line is simple: Don't throw complex solutions at problems that can be solved either by using simple tweaks, or just because other solutions using the tools you already have might require some more thinking and analysis. The more detail you know about a specific problem, the better your judgement on the solution will be, and usually, the solution can be as simple as tweaking your database. That's your biggest bottleneck, so tune the heck out of it.

There's something I see in lots of projects is an overuse of self. Sure, it looks a lot nicer than this, but its overuse can clutter code quite easily. Here's a rather simple example.

def published?
 !self.deleted_at? && self.published_at?
end

Others use self even for just calling a method from another method. Why's that again? I just don't get it, it feels unnecessary, and is just five characters too much for every usage.

Sometimes I think that the programmer who wrote the code either doesn't understand the concept of the current scope of self and when using it is necessary, or is coming from Python (no offense Python, I still like you, but I'm not so fond of your usage of self). Let's remedy that, shall we?

Some also claim that it improves clarity, for the courtesy of other programmers working on the code, and to make it clearer in what context the method is called. If you need to make that clear, your method is simply too long.

The biggest confusion surrounding the usage of self from within a method stems from the different handling of local variables and method lookup. Consider the following code:

def publish
  self.published_at = Time.now
  save
end

It's a totally valid piece of code, and one I can totally get on board with, because it does what it's supposed to do. In my early days with Ruby, I used to write the code like this:

def publish
  published_at = Time.now
  save
end

Then I wondered why the instance variable published_at wasn't saved. The answer is simple. When you have something that looks like local variable, and you assign to it, Ruby will obey and create a new local variable called published_at, no matter if you have a write accessor defined that looks exactly like it. It will go out of scope as soon as the program's flow leaves the method.

But what about save? Ruby will first look for a local variable. Now, I'm not arguing that you shouldn't have a local variable called save anywhere in your program, and if you do you might want to rethink that. But since, for now, there is no local variable with that name, Ruby will turn to its method lookup mechanism to resolve the identifier in question. If it can't find anything you'll get the much loved error NoMethodError.

If clarity is what you're longing for, learn Ruby's rules of resolving local variables and methods. Will make your life much easier.

So what is Ruby doing in the former version of the method body? You're pretty much forcing it to go directly to the method lookup. With the accessor magic it will find a method published_at=(published_at) and call it. Easy.

I've also seen modifications of this one:

def publish
  self.published_at = Time.now
  self.save
end

Or how about this one:

def publish
  self.published_at = Time.now
  self.publisher = self.user
  self.save
end

What's up with that? Pretty self-involved if you ask me. It's like using self just for the sake of it. Using self when assigning to instance variables, so might as well use it everywhere in the method for consistency, right?

Now, imagine that piece of code also having a call to a private or protected method in it. Of course you can't call those directly on an object, only on an implicit self:

def publish
  self.published_at = Time.now
  update_trackbacks
  self.save
end

Gross! The code looks more and more confusing, and I don't appreciated confusing looking code.

Of course, if you're using ActiveRecord, why not save a full line?

def publish
  update_attribute(:published_at, Time.now)
end

So let's review the initial example. We're only doing method calls, and Ruby will figure out our intention all by itself. So how about the simple version:

def published?
 !deleted_at? && published_at?
end

Wow, so simple. Much easier on the eyes, and the intention is clear right from the start. My rule is simple: When assigning to an instance variable, use self, calling a method on the other hand should stand all by itself. Now, you could argue, that assigning to an instance variable using its accessor is also a method call, but if you really want to argue about that, you should really read this blog entry again.

Tags: ruby

I got a little sick of having to maintain an unmaintained blog engine (I'm looking at you SimpleLog), so I did what all the cool kids are doing, I switched the blog to using Tom Preston-Werner's excellent Jekyll. I threw in a few tweaks of my own, and tweaked my Rakefile and Apache to support things that SimpleLog can do, but the static nature of Jekyll can't. Thankfully, the URL format of both is pretty similar, so it was pretty easy to set up redirects from the old URLs to the new ones.

For comments, I welcome Disqus.

The full blog is on GitHub, feel free to look at the deployment scripts and the build file. The latter uses Jekyll as a library to generate pages for all topics used in the posts. Nothing short of rocket science, I'm telling you.

Why you ask? Because a simple text file wins over a database system any day, and using cap deploy to update a blog entry is just outright cool.

Tags: blog, miscellany