
Update: Read the comments and the update below. Thankfully, the issue is not as bad as the documentation and the original design suggested.

A lot has happened since I first wrote about MongoDB back in February. Replica Pairs are going to be deprecated and replaced by Replica Sets, there's a working Auto-Sharding implementation, including rebalancing shards, and lots more, all neatly wrapped into the 1.6 release.

The initial draft on how Replica Sets would turn out sounded good, but something struck me as odd, and it is once again one of those things that tend to be overlooked in all the excitement about the new features. Before we dive any deeper, make sure you've read the documentation, or check out this rather short introduction on setting up a Replica Set. I won't go into much detail on Replica Sets in general, I just want to point out one major issue I've found with them. Part of the documentation sheds some light on the inner workings of Replica Sets. It's not exhaustive, but to me it's more interesting than the rest of the documentation.

One part struck me as odd, the paragraph on resyncing data from a new primary (as in master). It's two parts actually, but they pretty much describe the same caveat:

When a secondary connects to a new primary, it must resynchronize its position. It is possible the secondary has operations that were never committed at the primary. In this case, we roll those operations back.

Also:

When we become primary, we assume we have the latest data. Any data newer than the new primary's will be discarded.

Did you notice something? MongoDB rolls back operations that were never committed to the primary, discarding the updated data, which is just a fancy term for silently deleting data without further notice. Imagine a situation where you just threw a bunch of new or updated data at your current master, and the data has not yet fully replicated to all slaves, when suddenly your master crashes. According to the protocol, the node with the most recent oplog entries takes over the primary's role automatically.

When the old master comes back up, it needs to resynchronize the changes from the current master before it can play any role in the set again, no matter whether it becomes the new primary or sticks to being a secondary, leaving the new master in place. During that resync it discards data that has not been synchronized to the new master yet. If the oplog on the new master was behind a couple of dozen entries before the old one went down, all that data is lost. I repeat: lost. Think about that.
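To make the scenario concrete, here's a toy simulation in Python. It is not MongoDB code and doesn't talk to a database. The oplogs are plain lists, and the function names are made up for illustration. It only mirrors the rule quoted from the documentation: whatever the old primary has beyond the point it shares with the new primary gets rolled back.

```python
# Toy model of the failover scenario; not MongoDB code.
# Each node's oplog is a plain list of operation names.

def common_prefix_length(a, b):
    """Number of leading entries two oplogs share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def resync(old_primary_oplog, new_primary_oplog):
    """Return (oplog after resync, operations rolled back)."""
    shared = common_prefix_length(old_primary_oplog, new_primary_oplog)
    rolled_back = old_primary_oplog[shared:]
    return list(new_primary_oplog), rolled_back

# The old primary accepted five writes; only three reached the secondary.
old_primary = ["w1", "w2", "w3", "w4", "w5"]
secondary = ["w1", "w2", "w3"]

# The old primary crashes, the secondary takes over and accepts a new write.
new_primary = secondary + ["w6"]

# The old primary comes back and resynchronizes from the new primary.
recovered, lost = resync(old_primary, new_primary)
print(recovered)  # ['w1', 'w2', 'w3', 'w6']
print(lost)       # ['w4', 'w5']
```

The two writes that never made it off the old primary are exactly the ones that disappear.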

There are ways to reduce the pain, and I appreciate that they're mentioned appropriately in the documentation. You can tell MongoDB to consider a write successful only once it has replicated to a certain number of secondaries. But you have to wait until that has happened, polling getLastError() for the state of the last operation. Or you could set maxLag accordingly, so that the master will fail or block a write until the secondaries catch up with the replication, though I couldn't for the life of me figure out (using the Googles) where and how to set it.

But I don't approve of this behavior as a default, and the fact that you need to go through the internals to find out about it. Everything else suggests that there's no point of failure in a MongoDB setup using sharding and Replica Sets, even comparing it to the Dynamo way of guaranteeing consistency, which it simply isn't when the client has to poll for a successful write.

It's one of those things that make me reconsider my (already improved) opinions on MongoDB all over again, just when I started to warm up with it. Yes, it's wicked fast, but I simply disagree with their take on durability and consistency. The tradeoff (as in: losing data) is simply too big for me. You could argue that these situations will be quite rare, and I would not disagree with you, but I'm not fond of potentially losing data when they do happen. If this works for you, cool! Just thought you should know.

Update: There have been some helpful comments by the MongoDB folks, and there's good news. Data is not silently discarded in 1.6 anymore. Apparently it's stored in some flat file, fixed with this issue, though it's hard for me to say from the commits what exactly happens. The documentation does not at all reflect these changes, but improvements are on the way. I'm still not happy about some of the design decisions, but they're rooted in the way MongoDB currently works, and changing that is unlikely to happen, but at least losing data doesn't seem to be an option anymore. If making a bit of a fool out of myself helped to improve on the documentation front, so be it. I can live with that.

Hi, I'm Mathias, and I'm a CouchDB user. I've been using it for almost a year now, and we have a project using it in production, with a side of Redis. I think it's an awesome database, some of its features are simply unrivaled. Offline replication, CouchApps, to name a few. CouchDB just hit version 1.0. It's been a long time coming, with CouchDB having probably one of the longest histories in the non-relational database space. I've heard about it first back in September 2008, when Jan Lehnardt talked about it at a local co-working space. I still blame him for getting me all excited about this whole NoSQL thing. Fun fact: I bookmarked the CouchDB website back in February 2008.

The features being added to it with every release are nothing short of exciting. CouchDB 0.11 got filtered replication, support for URL rewriting and vhosts, amongst other things. But there's still some things that annoy me, that somewhat bug me in my daily work with it.

The following things are not incredible pet-peeves I have with CouchDB. I think CouchDB is pretty awesome, and I really like using it. However, it doesn't come without the occasional oddity that will leave you scratching your head. These probably aren't the only things to be aware of, they're just the most annoying to me. Your mileage may vary. They may or may not be annoying to you, but they're things that are good to know when working with CouchDB. Whether CouchDB should or should not have what I'm listing here is a whole different story. It's my wishlist of improvements, if you will.

It's also stuff you're buying into when you move off the beaten path of relational databases. As always, some of these are not hard to find out, some of them do only get really annoying once you're moving into production, or when you get a deeper knowledge of the tool at hand. Nothing specific to CouchDB here, but some of the issues listed below stem from actively using it. Take them with a grain of salt. While they may seem annoying at first, they're things you can live with. Believe me, you can.

Views are updated on read access

You can dump in as many documents as you want, and you can create as many map/reduce views as you want. The truth is, they'll only come together to slow down your application when you're querying the view. Calling it a slow-down may be a stretch at times though, it really depends on how often your data is updated. Assume you have a good stash of documents in your database, and you decide you need a new view on your data. Throw in the JavaScript functions and go ahead and query the view.

CouchDB will notice that the B-tree for the view doesn't exist yet, so it goes ahead and builds it on the first read. Depending on how many documents you have in your database, that can take a while, putting a good work load on your database.

On every subsequent read, CouchDB will check if documents have changed since the view was last updated, and throw the changed documents at the map and reduce functions. So if you only query some views from time to time, but have lots of changes in between, expect some delays on the next read. A way around this would of course be to keep your views warm by reading them regularly, e.g. through a cron job.

When you add new views, be sure to pre-warm them before you first access them in your application. One way would be to add the views at a time when your database isn't accessed as much. Building them doesn't block all access to the documents, but it sure has a certain impact on your database's performance, and of course the first requests may time out because CouchDB is building the requested views in the background.

When it comes to just updating a view, and it might take too long, you can set the parameter stale=ok. That way, even if the view data needs to be updated, CouchDB won't update it and just return the last known state of the view's B-tree.
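Here's a minimal sketch of what such a query looks like, written in Python. It only builds the URL and doesn't send a request. The host, database, design document and view names are made up, and view_url is a hypothetical helper, not part of any CouchDB client library.

```python
from urllib.parse import urlencode

def view_url(base, db, design_doc, view, **params):
    """Build the URL for querying a CouchDB view."""
    path = f"{base}/{db}/_design/{design_doc}/_view/{view}"
    return f"{path}?{urlencode(params)}" if params else path

# Hypothetical names: database "blog", design document "posts", view "by_date".
url = view_url("http://localhost:5984", "blog", "posts", "by_date", stale="ok")
print(url)
# http://localhost:5984/blog/_design/posts/_view/by_date?stale=ok
```

Requesting the same view without the stale parameter, say from a cron job, is what makes CouchDB bring the view up to date.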

That's all fun and giggles, but when on earth are you supposed to actually update your view? Always reading stale data is not great. I've gotten some odd suggestions when I complained about this elsewhere, but in the end I just want to tell the database that I'm okay with stale data, but that it should update the view in the background.

No automatic compaction

As your database grows and data gets updated, CouchDB leaves old and stale data untouched, appending new data (inserted and updated documents are considered new data) to the end of its database files, a fact that's also true for view files. That has the neat advantage that you can still access old revisions of your documents, but it will also leave your database files growing constantly. Now, depending on the number of documents and updates on them, that might not be a big deal, but it's a good idea to start regular compaction earlier than later.

Riak's Bitcask file backend has a neat way of automatically compacting its files. It appends data in a similar manner as CouchDB, but can determine if a node in the cluster can run compaction on its data, and do so automatically, without much need for human intervention. It'd be nice to have something similar as part of CouchDB without having to run cron jobs to do that.

The append-only mechanism makes CouchDB bullet-proof, no doubt, you'll always have consistent data files on your hard disk, backups are as simple as copying the files elsewhere, or take an EBS volume snapshot at any time. But that level of data consistency comes with a price, and that's an ever-growing data file.
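To illustrate why append-only files keep growing and what compaction buys you, here's a toy store in Python. It is an illustration only and says nothing about CouchDB's actual file format. The class and its methods are made up for this sketch.

```python
# Toy append-only store; an illustration, not CouchDB's file format.

class AppendOnlyStore:
    def __init__(self):
        self.log = []  # every write is appended, nothing is overwritten

    def put(self, doc_id, body):
        self.log.append((doc_id, body))

    def get(self, doc_id):
        # The latest entry for an id wins.
        for entry_id, body in reversed(self.log):
            if entry_id == doc_id:
                return body
        return None

    def compact(self):
        # Keep only the latest entry per document, in original order.
        latest = {}
        for index, (doc_id, _) in enumerate(self.log):
            latest[doc_id] = index
        self.log = [self.log[i] for i in sorted(latest.values())]

store = AppendOnlyStore()
store.put("a", {"n": 1})
store.put("b", {"n": 1})
store.put("a", {"n": 2})
store.put("a", {"n": 3})

size_before = len(store.log)  # 4 entries, three of them for "a"
store.compact()
size_after = len(store.log)   # 2 entries, one per document
print(size_before, size_after, store.get("a"))
```

Compaction throws away the old entries, which is also why old revisions are gone afterwards.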

No partial updates

Whenever you update a document in CouchDB, you update it as a whole, there's nothing in between. That kind of makes sense with the way CouchDB works, but as a user it annoys me from time to time. It seems so pointless fetching and sending a whole document when I'm just updating one attribute. There's a neat RFC for the PATCH command in HTTP making the rounds, I'd love to see that end up in CouchDB at some point. No idea how likely that is, the makers of CouchDB have a weird aversion to using diffs to update data.

Note that I'm not talking about the MongoDB way of setting attributes atomically. I don't need that, because it simply doesn't scale well, especially not with the CouchDB storage model, and you're not updating data in-place like MongoDB. It's more about just being able to send a diff or a minor update than a whole document.
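A toy sketch in Python of what that round trip looks like. This is an in-memory stand-in, not a CouchDB client. The DocStore class is hypothetical, and revisions are simplified to plain integers, where real CouchDB revisions are strings.

```python
# Toy document store with CouchDB-style revision checks; not a CouchDB client.

class Conflict(Exception):
    pass

class DocStore:
    def __init__(self):
        self.docs = {}

    def get(self, doc_id):
        return dict(self.docs[doc_id])  # copy, as if fetched over HTTP

    def put(self, doc):
        current = self.docs.get(doc["_id"])
        # An update must carry the revision it was based on.
        if current is not None and current["_rev"] != doc.get("_rev"):
            raise Conflict(doc["_id"])
        stored = dict(doc)
        stored["_rev"] = (current["_rev"] if current else 0) + 1
        self.docs[doc["_id"]] = stored
        return stored["_rev"]

store = DocStore()
store.put({"_id": "post-1", "title": "Draft", "body": "A long body ..."})

# Changing one attribute still means fetching and sending everything.
doc = store.get("post-1")
doc["title"] = "Final"
store.put(doc)

print(store.get("post-1"))
```

The body never changed, but it travels both ways anyway.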

You can somewhat fake this using update handlers (look at the view called "in-place") from CouchDB 0.10 on. It's pretty neat, but it's just not the same.

No built-in way to scale up

CouchDB's replication is unrivaled, no doubt. Being able to replicate any database with any other database at any point in time makes CouchDB unique, some say it's the killer feature, and I concur. There's a lot of arguing whether or not that defines CouchDB as being distributed. In the most traditional sense, at least to me, it sure does, but I'm not here to nitpick about that. It's easy to scale out by adding more nodes and setting them up to constantly replicate with each other, make anyone a master or slave as you like. But there's no way to distribute write and read access across a cluster of nodes.

CouchDB-lounge has been the traditional way of approaching that, but I never really liked it, because it added more components to the infrastructure. Something like that should really be built in. The good news is that Cloudant is planning on open-sourcing their clustering solution Open Cloudant, which will then hopefully become part of CouchDB. A quorum based system for CouchDB would be neat, and it doesn't seem too far away.

Pagination is awkward

CouchDB's B-tree is a leaky abstraction, that's the conclusion I came to at some point. It has a pretty big impact on your application's code, and that's not necessarily a bad thing. Suddenly you deal with things like conflicts, or simply updating views on reads. But no other part of your web application will make that as obvious as pagination, a pretty common and natural part of a web application.

The path of least resistance to get pagination is to use the skip and limit parameters, but it's not recommended, as you'll still be walking the whole B-tree to determine the number of documents that must be skipped before it can collect the ones you're interested in.

The recommended way to do pagination is a bit awkward if you ask me. There's a good explanation in the CouchDB book, so I'll spare you repeating it here. But be sure to read it, because understanding that takes you half way to understanding the B-tree. It may be awkward, and very different from what you're used to, but that's how the B-tree works. It's not always unicorns and rainbows, sometimes it kinda gets in your way. Trade-offs, meh.

The simpler alternative would of course be to just use endless pagination, where you let the users just click a more button instead of clicking through the pages, because you know the last document displayed in your list, and the key that was used to fetch it. You simply use that key and the last document's id to step directly into the B-tree where you left off. You need to remember to fetch one additional document, as CouchDB will return the last document too, or you can just skip one document, which is acceptable, as skipping just one leaf in a tree is an operation of predictable performance.
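Here's a toy version of that in Python. The view index is modeled as a sorted list of (key, document id) pairs, and the lookup uses bisect where CouchDB would walk the B-tree. The data and the fetch_more function are made up for illustration, nothing here talks to CouchDB.

```python
# Toy model of a view index: rows sorted by (key, document id).
from bisect import bisect_left

ROWS = sorted([
    ("2010-07-01", "doc-a"),
    ("2010-07-01", "doc-b"),
    ("2010-07-02", "doc-c"),
    ("2010-07-03", "doc-d"),
    ("2010-07-03", "doc-e"),
])

def fetch_more(rows, page_size, last_seen=None):
    """Return the next page_size rows after last_seen, a (key, id) pair."""
    if last_seen is None:
        return rows[:page_size]
    # Step directly to where we left off, like startkey plus startkey_docid.
    index = bisect_left(rows, last_seen)
    # The row at index was already shown, so fetch one extra and drop it.
    chunk = rows[index:index + page_size + 1]
    return chunk[1:]

page = fetch_more(ROWS, 2)
while page:
    print(page)
    page = fetch_more(ROWS, 2, last_seen=page[-1])
```

Note that two documents sharing the same key is exactly why the document id is needed on top of the key.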

Range queries are awkward

To do a range, you have to specify a start and an end key. That's the simple part. It starts getting awkward when your keys get slightly more complex, e.g. when your map function emits arrays. Assume you want to fetch all elements where the first part of the array matches a particular key, and the second part doesn't matter, e.g. when you emitted a timestamp as the second part to keep a natural (in terms of last update for example) order.

Assume your keys look like this: ['123', '2010/07/21'], that's the key format SimplyStored uses to manage associations between documents. To get the range that only matches the first part of the key, your startkey has to look like this: ['123']. This will match all documents having the above key. If you don't specify an endkey, CouchDB will simply return all documents following that key, so you need to specify an endkey. The recommended way to do that is to use the following format: ['123', {}]. That way you'll get all documents matching the first part of the key, because {} is considered to be greater than any string you may have emitted. See the CouchDB wiki on more details on this technique called view collation.
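As a small sketch in Python, this is how the two keys for such a prefix range could be put together. View keys are sent as JSON in the query string. The range_params helper is made up for illustration.

```python
import json
from urllib.parse import urlencode

def range_params(prefix):
    """Query parameters selecting every row whose array key starts with prefix."""
    return {
        "startkey": json.dumps([prefix]),
        # {} sorts after any string, so it closes the range for this prefix.
        "endkey": json.dumps([prefix, {}]),
    }

params = range_params("123")
print(params["startkey"])  # ["123"]
print(params["endkey"])    # ["123", {}]
print(urlencode(params))   # URL-encoded form, ready for the query string
```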

Obviously it's not impossible to do range queries in CouchDB, but it's slightly awkward. It all goes downhill as soon as you want to fetch only a particular subrange of the original one, using startkey_docid or endkey_docid, say for pagination. With the above ranges, they simply don't work. Both need a startkey and endkey that is an exact match. The whole point of the above range query is not to care about the exact start and end key, isn't it?

No CommonJS available in MapReduce functions

With CouchDB 0.11, CommonJS and all its awesomeness became available in show, list and update functions. I was pretty excited about it, and I still am. However, map and reduce functions were left out in the cold. Every time I have to write the same piece of JavaScript in a map or reduce function that I've used elsewhere already, I get bitter about that. Sometimes it's just very basic stuff that I could easily solve by throwing an existing library at it, but instead I'm cluttering my view code with it over and over again. And yes, there's the !code placeholder, but it's not about throwing an undebuggable mess of code into my view function, it's about not repeating myself. !code doesn't really solve that problem well enough for me.

Word is that it's got something to do with determining whether files have updated or not, but hey CouchDB, why don't you let me worry about that and let me tell you when I think a file I've included through CommonJS has been updated? I would very much appreciate that.

No link-walking between documents

With CouchDB 0.11, map functions got a way to emit other documents using {_id: doc.other_id}, but that still doesn't allow full access to e.g. attributes of said documents. Sometimes that'd just be handy to have. Sure, you could use embedded documents, but in that case it'd just be a dumb workaround, where I could just have a way to fetch a document by its identifier and throw some of its attributes at the map function.

Say what you will though, just being able to emit other documents is still pretty cool. Makes querying and fetching associated documents a bit easier.

All reads go to disk

CouchDB doesn't cache anything. It does delay commits if you want it to, so that it doesn't hit the disk on every document update, but it sure as heck doesn't cache anything in memory. This is both curse and blessing. It keeps the memory footprint of CouchDB incredibly small, no doubt. Considering they're targeting mobile devices it makes a lot of sense, plus, accessing flash-based storage is a lot cheaper than spinning disks.

But, on the other hand, when I have the memory available, why not use it? I know caching is a hard problem to solve. CouchDB is also made for high concurrency, no doubt, but my disks aren't necessarily. Sure, I could buy faster disks, but if you really think about it, memory is the new disk, plus, tell Amazon to offer faster network storage for EC2, please do, maybe that'd already help. CouchDB somewhat relies on the file system cache doing its magic to speed up things, but I really don't want to rely on magic. You could put an HTTP-level reverse proxy like Varnish in front of CouchDB though, that'd be a feasible option, but that adds another layer to your infrastructure.

In all seriousness, I'd love to see some caching introduced in CouchDB. I won't say it's an easy feature to implement, because it sure isn't, but it doesn't need to be something fancy, I just would like to see CouchDB use some of my memory for data that's read more often than it's written. But until then, Varnish to the rescue!

Error messages are not helping

I'm just gonna post the following snippet from my CouchDB log file, and leave you to it. You tell me how useful it is. Suffice it to say, I just wish CouchDB would not dump all that Erlang trace into my log, but maybe a useful error message for a change. It works in some cases, but a lot of times, when the problem usually is as simple as a permissions problem, you're left scratching your head.

{<0.84.0>,supervisor_report,
 [{supervisor,{local,couch_secondary_services}},
  {errorContext,start_error},
  {reason,
      {'EXIT',
          {undef,
              [{couch_auth_cache,start_link,[]},
               {supervisor,do_start_child,2},
               {supervisor,start_children,3},
               {supervisor,init_children,2},
               {gen_server,init_it,6},
               {proc_lib,init_p_do_apply,3}]}}},
  {offender,
      [{pid,undefined},
       {name,auth_cache},
       {mfa,{couch_auth_cache,start_link,[]}},
       {restart_type,permanent},
       {shutdown,brutal_kill},
       {child_type,worker}]}]}}

The End

There you go, some annoying things about CouchDB. They're annoying, but I still like CouchDB a lot. It's stuff I can live with, it's stuff I can work around, it's stuff that doesn't have as big an effect in production as it may seem. The bottom line is, as always, evaluate your tools. The above list is not to be taken as a list of arguments purely against using CouchDB. Consider them a list of things you need to be aware of, that may or may not be acceptable compared to what you gain.

In the end, and any way you look at it, CouchDB still kicks butt.


By now it should be obvious that I'm quite fond of alternative data stores (call them NoSQL if you must). I've given quite a few talks on the subject recently, and had the honor of being a guest on the (German) heise Developer Podcast on NoSQL.

There's some comments and questions that pop up every time alternative databases are being talked about, especially by people deeply rooted in relational thinking. I've been there, and I know it requires some rethinking, and also am quite aware that there are some controversial things that basically are the exact opposite of everything you learned in university.

I'd like to address a couple of those with some commentary and my personal experience (Disclaimer: my experience is not the universal truth, it's simply that: my experience, your mileage may vary). When I speak of things done in practice, I'm talking about how I witnessed things getting done in Real Life™, and how I've done them myself, both good and bad. I'm focussing on document databases, but in general everything below holds true for any other kind of non-relational database.

It's easy to say that all the nice features document databases offer are just aiming for one thing, to scale up. While that may or may not be true, it just doesn't matter for a lot of people. Scaling is awesome, and it's a problem everyone wants to solve, but in reality it's not the main issue, at least not for most people. Also, it's not an impossible thing to do even with MySQL, I've had my fun doing so, and it sure was an experience, but it can be done.

It's about getting stuff done. There's a lot more to alternative databases in general, and document databases in particular, that I like, not just the ability to scale up. They simply can make my life easier, if I let them. If I can gain productivity while still being aware of the potential risks and pitfalls, it's a big win in my book.

What you'll find, when you really think about it, is that everything below holds true no matter what database you're using. Depending on your use case, it can even apply to relational databases.

Relational Databases are all about the Data

Yes, they are. They are about trying to fit your data into a constrained schema, constrained in length, type, and other things if you see fit. They're about building relationships between your data in a strongly coupled way, think foreign key constraints. Whenever you need to add data, you need to migrate your schema. That's what they do. They're good at enforcing a set of ground rules on your data.

See where I'm going with this? Even though relational databases tried to be a perfect fit for data, they ended up being a pain once that data needed to evolve. If you haven't felt that pain yet, good for you. I certainly have. Tabular data sounds nice in theory, and is pretty easy to handle in Excel, but in practice, it causes some pain. A lot of that pain stemmed from people using MySQL, yes, but take that argument to the guy who wrote it and sold it to people as the nicest and simplest SQL database out there.

It's easy to get your data into a schema once, but it gets a lot harder to change the schema and the data into a different schema at a later point in time. While data sticks around, the schema evolves constantly. Something relational databases aren't very good at supporting.

Relational Databases Enforce Data Consistency

They sure do, that's what they were built for. Constraints, foreign keys, all the magic tricks. Take Rails as a counter-example. It fostered the idea that all that stuff is supposed to be part of the application, not the database. Does it have trade-offs? Sure, but it's part of your application. In practice, that was correct, for the most part, although I can hear a thousand Postgres users scream. There's always an area that requires constraints on the database level, otherwise they wouldn't have been created in the first place.

But most web applications can live fine without it, they benefit from being free about their data, to shape it in whichever way they like, adding consistency on the application level. The consistency suddenly lies in your hands, a responsibility not everyone is comfortable with. You're suddenly forced to think more about edge cases. But you sure as hell don't have to live without consistent data, quite the opposite. The difference is that you're taking care of the consistency yourself, in terms of your use case, not using a generic one-fits-all solution.

Relationships between data aren't always strict. They can be loosely linked, what's the point of enforcing consistency when you don't care if a piece of data still exists or not? You handle it gracefully in your application code if you do.

SQL is a Standard

The basics of SQL are similar, if not the same, but under the hood, there's subtle differences. Why? Because under the hood, every relational database works differently. Which is exactly what document databases acknowledge. Every database is different, trying to put a common language on top will only get you so far. If you want to get the best out of it, you're going to specialize.

Thinking in Map/Reduce as CouchDB or Riak force you to is no piece of cake. It takes a while to get used to the ideas around it and what implications it has for you and your data. It's worth it either way, but sometimes SQL is just a must, no question. Business reporting can be a big issue, if your company relies on supporting standard tools, you're out of luck.

While standards are important, in the end it's important what you need to do with your data. If a standard gets in your way, how is that helpful? Don't expect a standard query language for document databases any time soon. They all solve different types of problems in different ways, and they don't intend to hide that from you with a standard query language. If on the other hand, all you need is a dynamic language for doing ad-hoc queries, check out MongoDB.

Normalized Data is a Myth

I learned a lot in uni about all the different kinds of normalization. It just sounded so nice in theory. Model your data upfront, then normalize the hell out of it, until it's as DRY as the desert.

So far so good. I noticed one thing in practice: Normalized data almost never worked out. Why? Because you need to duplicate data, even in e-commerce applications, an area that's traditionally mentioned as an example where relational databases are going strong.

Denormalizing data is simply a natural step. Going back to the e-commerce example, you need to store a lot of things separately when someone places an order: Shipping and billing address, payment data used, product price and taxes, and so on. Should you do it all over the place? Of course not, not even in a document database. Even they encourage storing similar data to a certain extent, and with some of them, it's simply a must. But you're free to make these decisions on your own. They're not implying you need to stop normalizing, it still makes sense, even in a document database.

Schemaless is not Schemaless

But there's one important thing denormalization is not about, something that's being brought up quite frequently and misunderstood easily. Denormalization doesn't mean you're not thinking about any kind of schema. While the word schemaless is brought up regularly, schemaless is simply not schemaless.

Of course you'll end up with having documents of the same type, with a similar set of attributes. Some tools, for instance MongoDB, even encourage (if not force) you to store different types of documents in different collections. But here's the kicker, I deliberately used the word similar. They don't need to be all the same across all documents. One document can have a specific attribute, the other doesn't. If it doesn't, just assume it's empty, it's that easy. If it needs to be filled at some point, write data lazily, so that your schema eventually is complete again. It's evolving naturally, which does sound easy, but in practice requires more logic in your application to catch these corner cases.

So instead of running migrations that add new tables and columns, and in the end pushing around your data, you migrate the data on the next access, whether that's a read or a write is up to your particular use case. In the end you simply migrate data, not your schema. The schema will evolve eventually, but first and foremost, it's about the data, not the constraints they live in. The funny thing: In larger projects, I ended up doing the same thing with a relational database. It's just easier to do and gentler on the load than running a huge batch job on a production database.
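A minimal sketch of such a lazy migration in Python. The attribute names and defaults are made up for illustration, and the document is a plain dict, the way it would come back from a JSON document store.

```python
# Hypothetical defaults for attributes that were added after older
# documents had already been written.
DEFAULTS = {"tags": [], "published": False}

def migrate(doc):
    """Return (document with missing attributes filled in, whether it changed)."""
    migrated = dict(doc)
    changed = False
    for name, default in DEFAULTS.items():
        if name not in migrated:
            # Copy list defaults so documents never share one list object.
            migrated[name] = list(default) if isinstance(default, list) else default
            changed = True
    return migrated, changed

old_doc = {"_id": "post-1", "title": "Hello"}
new_doc, changed = migrate(old_doc)
print(new_doc)
print(changed)
```

Whether you write the migrated document back right away when changed is true, or wait for the next regular write, is the part that depends on your use case.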

No Joins, No Dice

No document database supports joins, simple as that. If you need joins, you have two options: Use a database that supports joins, or adapt your documents so that they remove the need for joins.

Documents have one powerful advantage: It's easy to embed other documents. If there's data you'd usually fetch using a join, and that'd be suitable for embedding (and therefore oftentimes: denormalizing), there's your second option. Going back to the e-commerce example: Whereas in a relational database you'd need a lot of extra tables to keep that data around (unless you're serializing it into a single column), in a document database you just add it as embedded data to the order document. You have all the important data in one place, and you're able to fetch it in one go. Someone said that relational databases are a perfect fit for e-commerce. Funny, I've worked on a market platform, and I've found that to be a ludicrous statement. I'd have benefited from a looser data storage several times, joins be damned.

It's not always viable, sure. If joins are an important criterion for your particular use case, it'd be foolish to stick with a document database. Then it's no dice: relational data storage or bust.

Of course there's secret option number three, which is to just ignore the problem until it's a problem, just by going with a document database and seeing how you go, but obviously that doesn't come without risks. It's worth noticing though that Riak supports links between documents, and even fetching linked documents together with the parent in one request. In CouchDB on the other hand, you can emit linked documents in views. You can't be fully selective about the document data you're interested in, but if all you want is to fetch linked documents, there are one or two ways to do that. Also, graph databases have made it their main focus to make traversal of associated documents an incredibly cheap operation. Something your relational database is pretty bad at.
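To give an idea of what such an order document could look like, here's a sketch as a Python dict. All field names and values are invented for illustration. The point is that addresses, payment data and prices live inside the order as they were when it was placed.

```python
import json

# Hypothetical order document; all field names are made up for illustration.
order = {
    "_id": "order-1001",
    "type": "order",
    "shipping_address": {"name": "Jane Doe", "city": "Berlin"},
    "billing_address": {"name": "Jane Doe", "city": "Hamburg"},
    "payment": {"method": "invoice"},
    "items": [
        {"sku": "book-42", "quantity": 2, "unit_price_cents": 1990},
        {"sku": "pen-7", "quantity": 1, "unit_price_cents": 250},
    ],
}

def order_total_cents(doc):
    """Total computed from the prices stored with the order itself."""
    return sum(item["quantity"] * item["unit_price_cents"] for item in doc["items"])

print(order_total_cents(order))  # 4230
print(json.dumps(order["shipping_address"]))
```

One fetch gets you everything, and a later price change in the catalog doesn't touch the order.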

Documents killed my Model

There's this myth that you just stop thinking about how to model your data with document databases or key-value storage. That myth is downright wrong. Just because you're using schemaless storage doesn't mean you stop thinking about your data, quite the opposite, you think even more about it, and in different ways, because you simply have more options to model and store it. Embedding documents is a nice luxury to have, but isn't always the right way to go, just like normalizing the crap out of a schema isn't always the way to go.

It's a matter of discipline, but so is relational modelling. You can make a mess of a document database just like you can make a mess of a relational database. When you migrate data on the fly in a document database, there's more responsibility in your hands, and it requires good care with regards to testing. The same is true for keeping track of data consistency. It's been moved from the database into your application's code. Is that a bad thing? No, it's a sign of the times. You're in charge of your data, it's not your database's task anymore to ensure it's correct and valid, it's yours. With great power comes great responsibility, but I sure like that fact about document databases. It's something I've been missing a lot when working with relational databases: The freedom to do whatever the heck I want with my data.

Read vs. Write Patterns

I just like including this simply because it always holds true, no matter what kind of database you're using. If you're not thinking about how you're going to access your data with both reads and writes, you should do something about that. In the end, your schema should reflect your business use case, but what good is that when it's awkward to access the data, when it takes joins across several tables to fetch the data you're interested in?

If you need to denormalize to improve read access, go for it, but be aware of the consequences. A schema is easy to build up, migrating on the go, but if document databases force you to do one thing, and one thing only, it's to think about how you're reading and writing your data. It's safe to say that you're not going to figure it all out upfront, but you're encouraged to put as much effort into it as you can. When you find out you're wrong down the line, you might be surprised to find that they make it even easier to change paths.
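As a rough illustration of those consequences, here's a small Python sketch using plain dictionaries instead of a real database. The `authors` and `posts` structures and the `author_name` copy are assumptions made up for this example. Reading a post needs no second lookup, but every rename now has to touch all the copies.

```python
def post_summary(post):
    """Cheap read: everything needed is stored on the post itself."""
    return f"{post['title']} by {post['author_name']}"


def rename_author(authors, posts, author_id, new_name):
    """Expensive write: update the author and every denormalized copy.

    Returns the number of posts that had to be touched.
    """
    authors[author_id]["name"] = new_name
    updated = 0
    for post in posts.values():
        if post["author_id"] == author_id:
            post["author_name"] = new_name
            updated += 1
    return updated


authors = {"a1": {"name": "Jane"}}
posts = {
    "p1": {"title": "Hello", "author_id": "a1", "author_name": "Jane"},
    "p2": {"title": "Again", "author_id": "a1", "author_name": "Jane"},
}
```

If reads vastly outnumber renames, that trade is probably fine. If not, you may be better off keeping only the `author_id` and doing the extra lookup.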

Do your Homework

Someone recently wrote a blog post on why he went back to MySQL from MongoDB, and one of his reasons was that it doesn't support transactions. While this is a stupid argument to bring up in hindsight, it makes one thing clear: You need to do research yourself, no one's going to do it for you. If you don't want to live up to that, use the tools you're familiar with, no harm done.

It should be pretty clear up front what your business use case requires, and what tools may or may not support you in fulfilling these requirements. Not all tool providers are upfront about all the downsides, but hey, neither was MySQL. Read up, try and learn. That's the only thing you can do, and no one will do it for you. Nothing has changed here, it's simply becoming more obvious, because you suddenly have a lot more options to work with.

Polyglot Data Storage

Which brings me to the most important part of them all: Document databases (and alternative, non-relational data stores in general) are not here to replace relational databases. They're living alongside of them, with both sides hopefully somewhat learning from each other. Your projects won't be about just one database any more, it's not unlikely you're going to end up using two or more, for different use cases.

Polyglot persistence is the future. If there's one thing I'm certain of, this is it. Don't let anyone fool you into thinking that their database is the only one you'll need, they all have their place. The hard part is to figure out what place that is. Again, that's up to you to find out. People ask me for particular use cases for non-relational databases, but honestly, there is no real distinction. Without knowing the tools, you'll never find out what the use cases are. Other people can just give you ideas, or talk about how they're using the tools, they can't draw the line for you.

Back to the Future

You shouldn't think of it as something totally new, document databases just don't hide these things from you. Lots of the things I mentioned here are things you should be doing anyway, no matter if you're using a relational or a non-relational data store. They should be common sense really. We're not trying to repeat what went wrong in history, we're learning from it.

If there's one thing you should do, it's to start playing with one of the new tools immediately. I shouldn't even be telling you this, since you should hone your craft all the time, and that includes playing the field and broadening your personal and professional horizon. Only then will you be able to judge what use case is a good fit for e.g. a document database. I'd highly suggest starting to play with e.g. CouchDB, MongoDB, Riak or Redis.

About eighteen months ago I wrote about going back to Vim as my daily text editor. It was a bust, and I went back to TextMate after about a week.

Suddenly it's the year 2010, and I'm typing this in Vim. What happened? My itch was re-scratched if you will. I was weary of some of TextMate's perceived shortcomings, and honestly I missed having a command and insert mode. It may sound stupid, but I really prefer that way of working with text and code. TextMate is still a nice editor, but seeing its development coming to a perceived halt made me realize that Vim is simply forever, not being developed by just one guy, but a community.

It's also worth mentioning that I simply started from scratch. Last time I built upon a configuration that grew over the years, and that included things about whose purpose I just had no idea. I watched the Smash Into Vim PeepCode too, and started with the clean slate configuration set that comes with it. If you're thinking of getting (back) into Vim, it's highly recommended, it's sure to whet your appetite. There's also a collection of screencasts and a free book on Vim 7 available on the interwebs. I have some useful links in my bookmark collection too.

There have been a lot of developments around scripts for Vim that bring TextMate-like functionality, or that support things like Cucumber, smart quotes and auto-closing braces, or even the most awesome Git integration you'll find. But the nicest of them all is Pathogen, a script that allows you to keep all your other scripts in separate places, not losing overview of what's installed where, and in which version.

Coming from TextMate, you're gonna miss the "Go To File" dialog, I'm sure. Check out Command-T, which does exactly that, only with path-matching sprinkled on top. It's not as fast unfortunately, but a lot faster to use than the annoying fuzzy thing I used the last time I tried to live on Vim. There's also PeepOpen, but it always opens files in new tabs, and that can get quite annoying, as new Vim tabs are quite different from Vim buffers. For project views I use NERDtree, though LustyExplorer also seems acceptable.

As I said, I started from scratch, with a clean slate. So the decent thing to do was to put all my Vim configuration files on GitHub. They include all the scripts I'm using, and my configuration, all neatly separated into different bundles thanks to Pathogen. There's a couple of things that are still a bit wonky. Lusty Juggler doesn't work as advertised all the time, though it's a neat tool, allowing you to quickly select one of a list of the latest open buffers. RubyTest is quite weird, and I'm thinking of dumping it completely, and simply rolling my own commands to run tests based on it. The rails.vim script package does include some support to run tests too, but not to execute a single test case.

In general, I haven't found anything that works in TextMate that you can't somehow get to work in Vim. Yes, I've used the word somehow. It's not easy as pie all of the time, and it can be different, heck it's a different editor. But I willingly accept that, because as a text editor, I find Vim to be a lot better than TextMate.

I've been back on Vim for a month now, and I'm not looking back at all. It's like coming back to an old friend and learning what awesome things he's been up to. It's pretty much as exciting as playing with new technologies at the moment. Learning new things can be pretty exciting, even if it's just another text editor. But it's not all fun and giggles. I have some annoyances still, but no editor is perfect. I'm more willing to accept Vim's for the increased text surgeon skills than TextMate's, to be frank. TextMate is still a nice editor, don't get me wrong, my heart just always belonged to Vim.

Honestly, I'm more willing to invest my learning time in an editor that I know I can use everywhere than one I can only use on the Mac with a running user interface. I'm using Vim on every server I'm managing, so why not on my local machine? Vim makes me think about how I can edit text in the most efficient way possible, and I like that very much. It even made me map my caps-lock key to control, finally!

Update: Was just tipped off that PeepOpen can be made to behave properly and open files in the current MacVim tab. When you set your MacVim options like in the picture below (notice the part "Open files from applications"), it works a treat. Thanks, Mutwin!

MacVim Options

Tags: vim

I've attended my fair share of conferences this month alone, plus a Seedcamp, and I can safely say that I learned a lot about how to build slides, how to keep the audience engaged, and what one just shouldn't do in a talk or in slides. While I certainly don't claim to be an expert on the topic now, I just wanted to put all of my impressions and lessons learned into a post.

I'm definitely not the first person to write about this kind of stuff. A year ago Geoffrey Grosenbach wrote on presenting, and just recently John Nunemaker wrote a post on improving your presentations for less than $50. Both are well worth reading, but they don't cover everything I find annoying in presentations, so there you go.

Slides

Keep them small

Seven bullet points per slide is bullshit, that's way too much. One phrase per slide is a decent rule, though I'm not dogmatic about it. One phrase and a couple of short bullet points (not more than four) work from time to time, but not all the time. I usually go for a bigger slide set these days, with less content on each slide.

I can run through 80 slides in 45 minutes. I know that sounds like a lot, and I certainly go through them fast, but I'd rather give people something to think about than bore them to death. Slides with too much text on them also have the negative effect of distracting the audience. They shouldn't read the slide text, they should be listening to what you have to say. Even if you do talk slowly, less text on slides is always a good idea. The people should listen to you, not try to understand what your slides are saying.

What I usually do is just crank out slides with any text that I'd like to say, and then I go through them one or two times to refine and shorten the phrases I used to be no more than four or five words for the most part. I also throw out slides when I realize they're disrupting the flow or contain things I'm likely to talk about when I'm on a different slide.

Use a large font

Just do it. Not only does it make your slides more readable for everyone in the audience, it forces you to keep the information on a single slide short. My headlines are usually 60pt, my subheadings and bullet points around 45pt. The bigger the better.

While we're talking about fonts, avoid italic. It's a lot harder to read, especially when you mix it with a regular font. If you need to emphasize something, just make it bold. Italic fonts disrupt your slides' flow.

Avoid full sentences

Except when you're quoting someone. Short phrases or even just a single word are much easier to grasp for the audience, and they give you a better sense of flow.

Dark text on a bright background

A dark background only works for Steve Jobs, because his team does everything they can to adjust the lighting on location for his talk. You, on the other hand, have to assume the worst. If there's just a little too much light coming into the room, your slides will be unreadable when you use a dark background. I've even seen slides where people chose a dark background and just a slightly dark font.

You have no influence on the lighting in the room, and you'll pretty much just embarrass yourself when your slides are unreadable. There's just no excuse why you shouldn't just use a light background and a dark font.

Avoid dark photos

Photos are at a similar risk. The darker the photos you're using in your preso, the less likely people will be able to see them. I tend to not use a lot of photos in my slides anyway, but I just hate having to say: "Geee, that's a bit hard to see, isn't it?"

Slides are for the people attending the talk

Your slide set should not be focussed on being fully understandable by people who have not attended your talk. You end up with so-called slideuments, presentations that read like a document. You're talking for the people attending your talk, they probably paid to hear you speak, so focus your energy on giving them a good talk. If you want the rest of the world to know about details of your preso, write a blog post or put it into the presenter notes.

Video killed the conference star

I've seen video in presentations quite a few times, and honestly, it bores me to death, especially when there's a voiceover on the video. If you must include video, at least talk yourself, taking the audience through whatever happens on the screen, especially because you don't know how the audio is going to be at the venue. I'm well aware that live demos are a finicky thing, but so is video. Not always do you have the luxury of using your own computer to do the presentation.

Avoid long code snippets

Code is simply hard to grasp within just a couple of seconds, and it's awkward trying to explain larger chunks of it. Use short snippets instead. If you must include some longer examples, split them up into smaller bits, explaining them one by one. I tend to avoid overly complex code snippets. Trying to explain them properly just takes too much time.

Avoid flashy animations

They simply take up valuable time and distract the audience. Even though they're nice to look at in theory, in practice they're the bane of a well-built presentation. This is true for both transitions between slides and elements of a single slide appearing later. Just make them appear, not sparkle or fade in.

The Talk

Practice, practice, practice

I find practicing a talk by speaking to myself awkward, not because it's embarrassing, but simply because, thanks to the butterflies in my stomach, I always end up saying different things in the actual talk. Now, that's not to say you shouldn't think about what you want to say. I tend to go through my slides several times, going through the things I associate with every single one of them, giving me a rough idea and a line of thought on what I want to say. This definitely is a lot easier to do when it's a topic you've talked about before, but in general the above has worked much better for me.

Drink, drink, drink

It's a simple fact that talking a lot lets your mouth run dry. I need about half a liter of water to get through a talk. Or at least I make sure I have that amount ready. Before you run dry and faint in the midst of your talk, drink. It's not a shameful thing to do, it simply keeps you going. Shame on conference organizers not thinking about having drinks ready for their speakers. When in doubt, scout the talks before you and make sure you have a bottle ready should it not be taken care of.

Look at the audience, not the big screen

It should be so obvious, yet I've just seen people do it again at Cloud Expo. One guy's slide had 14 bullet points on it, and the font probably was too small for him to be able to read it from the laptop screen. Another reason why I keep my slides short: their purpose is to keep me in a flow, to give me short reminders of what I want to talk about.

Don't read your presenter notes

If you need presenter notes to run your talk, you need to practice more. They're surely useful for people just looking at your slides, but if it takes full sentences to keep your talk running, you'll end up wasting a lot of time trying to read what your notes say. Talking freely is a challenge, but the earlier you take it on, the faster you'll get used to it. I've seen people use index cards with their presenter notes on them, handwritten, trying to decipher what they've written on them.

If you know what you're talking about (at least the slightest bit), you'll be fine without them, trust me.

Two's not a company

Having more than one speaker is awkward, especially when one of them is just standing there for most of the time, waiting for his turn. Have one up in front at any one time, bring in the next person when it's his turn. Simple as that.

Don't ask questions

The audience simply won't answer. If you ask anything, make the audience raise their hands on a topic, but don't expect anyone to answer a specific question. That's your task. Involving the audience sounds like a good idea, but they're lazy, they want to learn something.

Jokes, tiny bits and stories

Stories and jokes can really lighten up a presentation. Sure, you shouldn't tell jokes all the time, but something sarcastic thrown in from time to time sure can help to wake up the audience. Stories are even better, people love benefitting from real life experiences in any way. If it has a happy ending, even better.

Talking slowly is for wimps

The rule of spending two minutes on a slide is bullshit. It would only mean you'd have seven bullet points on a particular slide. You shouldn't rush through anything, and I certainly try to avoid doing that, and it definitely depends on the topic you're talking about, but when I talk about technical things I expect the audience to be curious about it and try to keep up. If they can't, they can always come back to my slides or ask questions. But as always, it depends.

Talking fast is for the impatient

If it's on more generic things that involve higher level topics, or some sort of longer-running workshop, it's only appropriate to walk the people through it and take your time doing so. Usually in these situations it's a lot easier to focus on a single topic. It just depends on how broad your talk's topic is.

Take tiny breaks

Should you realize you're sort of losing track, simply bring yourself back on the rails. Take a tiny break or just stop talking. You don't need to apologize for that. It's easy to start blabbering on about a certain topic which you didn't even intend to cover in your talk. On the other hand, that's what makes every talk unique, and is exactly why shorter phrases on slides are so much better. They keep your brain engaged, making up associations with certain things as you go, and they help keep a talk interesting.

Avoid longer breaks though as people end up being bored, and you're losing precious time. Longer breaks are usually a sign that you're not as prepared as you should be. If you need to switch in between e.g. slides and a live demo, make sure that everything is prepared before the talk.

Talking in front of others is a challenge, no doubt about it, but there's really no point trying to avoid it, because the only way to improve your skills is to simply talk in front of people. This is my view of the talking world. I constantly try to improve on my slides and think about what I'm doing wrong during talks to improve on that. I'll never lose the excitement right before a talk, and that's a good thing. When it becomes routine, you tend to bore people instead of engaging them. It's about constantly improving yourself to simply become better at talking in front of others.

This is my view of giving presentations. Feel free to throw in your ideas, or even to disagree. These guidelines probably aren't for everyone, and they might even change for me within just a couple of months, but most of them simply make sense to me. I do need to get me a good remote though, since with my larger slide sets, I find myself hitting the space bar a lot.
