This paper showed up in my inbox from the Morning Paper this morning1, and I had two initial thoughts:

  1. Wow, this is cool as hell
  2. Oh no, somebody is going to try to implement this before they really need it.

We’ve seen a common pattern develop with infrastructural testing tools, with chaosmonkey being the one with the most mind-share. Tools like this are totally necessary at Facebook or Netflix scale, where outages can be measured in millions of dollars, but they add complexity to your infrastructure. This complexity, however, does not scale linearly with application complexity or traffic. For simple or low-traffic applications, tooling like this will be a high percentage of your overall system complexity, and complexity comes with costs at multiple stages - opportunity cost during implementation and velocity cost over time. Before adding complexity on this scale it’s important to at least roughly calculate the return on investment for your added complexity - you might find out that 30-60 minutes of downtime actually costs you far less at your product’s stage than complexity would.

  1. You subscribe to the morning paper, right? If not, you should!