I thought that I'd post a pointer to this analysis of the foursquare/mongodb outage (for completeness, here is foursquare's description of their outage). I'm not a foursquare user, so I didn't know anything about this outage until I was catching up on my RSS reading last week, but I think the lessons are broadly applicable.
The key point is that even on a modern, cutting-edge platform, low-level details of how your data (and processing) map onto the underlying hardware can be the source of major headaches.
In general, the number of possible details that could bring a system to its knees is too large to keep in one's head at all times. In practice, what is required is the ability to track down why the system isn't responding to your changes as expected. In the foursquare case the question became: "why didn't RAM usage decrease when 5% of the data was moved to another shard?"
The solution(s) are always obvious in retrospect, and the foursquare/MongoDB team appear to have determined the root cause with admirable speed. In my mind the key to this speed was their ability to quickly get from the question "why didn't my system speed up when I added another shard?" to the question "why didn't my RAM usage decrease?"
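One plausible mechanism behind "RAM usage didn't decrease" is page-level fragmentation: with memory-mapped storage, small documents share 4 KB pages, and a page stays resident unless every document on it is gone. The toy simulation below (illustrative numbers, not foursquare's actual figures) shows why migrating 5% of documents chosen at random frees essentially no whole pages.

```python
import random

# Toy model of page-level fragmentation in memory-mapped storage.
# Numbers are illustrative assumptions, not foursquare's real data.
PAGE_SIZE = 4096                         # typical OS page size
DOC_SIZE = 300                           # a small check-in-sized document
DOCS_PER_PAGE = PAGE_SIZE // DOC_SIZE    # ~13 documents share each page
NUM_PAGES = 100_000

random.seed(1)

# Migrate 5% of documents, chosen uniformly at random, to a new shard.
# A page leaves the resident set only if *every* document on it moved.
resident_pages = 0
for _ in range(NUM_PAGES):
    live_docs = sum(1 for _ in range(DOCS_PER_PAGE) if random.random() >= 0.05)
    if live_docs > 0:
        resident_pages += 1

print(f"pages still resident: {resident_pages / NUM_PAGES:.1%}")
# The chance a 13-document page empties out is 0.05**13, i.e. essentially
# zero -- so the resident set, and hence RAM usage, barely moves.
```

The fix in cases like this is compaction: rewriting the surviving documents densely so that freed space coalesces into whole pages the OS can actually reclaim.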
At some level this goes back to my post last year on the Instruments performance analysis tool. A good substrate of such tools is critical: if you don't have a performance meter that's visibly pegged (in this case, the swapping/paging meter), getting a handle on what's going on is difficult, to say the least.
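For a sense of what such a meter measures, here is a minimal, Unix-only sketch (my own illustration, not foursquare's tooling) using Python's standard `resource` module: sample the process's page-fault counters before and after touching a chunk of memory. Sustained growth in major faults is exactly the swapping/paging signal at issue here.

```python
import resource

# A minimal "paging meter" sketch: read the process's page-fault
# counters from getrusage(). Minor faults are pages faulted in from
# memory; major faults required disk I/O (the expensive kind).
def fault_counts():
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_minflt, ru.ru_majflt

minor_before, major_before = fault_counts()
buf = bytearray(50 * 1024 * 1024)   # writing 50 MB of zeroes faults in pages
minor_after, major_after = fault_counts()

print(f"minor faults: +{minor_after - minor_before}, "
      f"major faults: +{major_after - major_before}")
```

A real meter would sample system-wide counters on an interval; the point is simply that once the paging meter is part of your dashboard, "why didn't RAM usage decrease?" is a question you can ask of data rather than of intuition.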
I'm glad I didn't personally experience that particular debugging session.