At Jana, we’ve started to outgrow our infrastructure. For all the right reasons, it’s time to revisit how the computers that run the company are set up so we can continue to operate during the next big jumps in growth. We’re handling the high end of what we’d expected when we first set out.
There are a few major guidelines we've followed when designing the next stage. When making decisions, adhering to these guidelines, or at least acknowledging why we need to deviate from them, keeps us honest:
Respect the One True Place for configuration
Not all configuration needs to be stored in the same place, but each piece of it should have a canonical source. Any time there are multiple sources for a piece of information, you can be sure that they'll eventually differ. We keep configuration data in a few places, depending on the need: S3, AWS metadata, DNS, project configuration files, and Salt states are the primary ones. But for each parameter, there should be exactly one place to go to find its value.
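One way to make the One True Place rule concrete is a small registry that maps each parameter to its single canonical source and refuses reads from anywhere else. This is a minimal sketch, not our actual code; the parameter names, source names, and stubbed fetchers are all hypothetical stand-ins for the real backends (S3, DNS, AWS metadata, and so on).

```python
# Hypothetical stub fetchers standing in for real backends.
FETCHERS = {
    "dns": lambda param: "db.internal.example.com",  # stub DNS lookup
    "project_config": lambda param: 8,               # stub config-file read
}

# Exactly one canonical source per parameter.
CANONICAL_SOURCE = {
    "db_host": "dns",
    "worker_count": "project_config",
}

def lookup(param, source=None):
    """Fetch a parameter from its canonical source.

    An explicit source is allowed only when it matches the canonical one,
    so two call sites can never silently read divergent copies.
    """
    canonical = CANONICAL_SOURCE.get(param)
    if canonical is None:
        raise KeyError("no canonical source registered for %r" % (param,))
    if source is not None and source != canonical:
        raise ValueError("%r lives in %s, not %s" % (param, canonical, source))
    return FETCHERS[canonical](param)
```

The point isn't the lookup itself but the failure mode it removes: a caller who believes a value lives in S3 when its canonical home is DNS gets a loud error instead of a stale copy.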
Unless there's an overwhelming advantage, the system with the fewest pieces is best
There are a lot of fantastic operations-as-a-service offerings out there that will do some piece of your infrastructure amazingly well. And every one has a nonzero failure rate. Worse, each one you add must play well with all the others, both now and when they push out the next seemingly benign change that probably won't affect anything else. For us, AWS does a lot of things quite adequately that several well-known third parties do better. We lose some bells and whistles, but gain uptime in exchange.
Humans screw up
Human intervention should be a last resort. Any time a single human being is directly affecting the state of our servers, there's an opportunity for an error. That's not to say code can't seriously screw things up, but code can be reviewed and audited. The most concrete cases of this are ssh'ing into a production machine or making changes in the AWS console. Neither has any review, and if things go awry, retracing the steps afterward is difficult to impossible. I have a half-joking plan to record a server error in our metrics every time someone ssh's into a server.
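That half-joking plan could look something like the sketch below: a tiny script meant to be invoked at the start of each ssh session (for example from /etc/ssh/sshrc, which sshd runs on login) that emits a "server error" metric. The metric name and the print-based transport are assumptions; a real version would post to whatever metrics collector you actually run.

```python
#!/usr/bin/env python
"""Sketch: count every interactive ssh login as a server error.

Intended to be called from /etc/ssh/sshrc at session start. The metric
name and output format are hypothetical, not a real pipeline's API.
"""
import getpass
import json
import socket
import time

def ssh_login_event(user=None, host=None, now=None):
    """Build the metric payload for one ssh login."""
    return {
        "metric": "server.error.human_logged_in",  # hypothetical metric name
        "value": 1,
        "user": user or getpass.getuser(),
        "host": host or socket.gethostname(),
        "timestamp": now if now is not None else time.time(),
    }

if __name__ == "__main__":
    # Printing stands in for shipping the event to a metrics collector.
    print(json.dumps(ssh_login_event()))
```

Even as a joke, the alerting pressure is real: if human logins show up on the same dashboards as genuine errors, everyone has an incentive to automate whatever keeps dragging people onto boxes.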
Everything belongs in source control
This is closely related to the last one. The entire configuration of your systems should live somewhere any engineer can see the whole setup: nothing in someone's head or on their local machine. Manual edits can easily be lost or improperly recorded after the fact. Driving configuration from code forces everyone to commit their process before it's applied. It also guarantees that you can recover to your current state in the face of any failure.
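The recovery guarantee comes from treating the committed files as the desired state and making the apply step a pure function of that state. Here's a minimal, hypothetical sketch of that idea; real tools like Salt do this far more thoroughly, and the key names here are invented for illustration.

```python
# Sketch of config-as-code: the desired state is whatever is committed to
# the repo, and applying it means computing the diff against live state.
# Because the diff is derived entirely from the committed file, the repo
# alone is enough to rebuild the current state after any failure.

def plan_changes(desired, live):
    """Return the changes needed to bring live state to the desired state."""
    changes = {}
    for key, value in desired.items():
        if live.get(key) != value:
            changes[key] = value
    return changes
```

Running the same plan twice against an already-converged system yields an empty diff, which is what makes re-applying after a failure safe.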
Plan two steps ahead
There's a lot of flexibility in this one, since you can define what your own steps are. The main point is that you're not planning only for the short term, where you'd need to redesign everything again as soon as you're done. But planning for the far-off future can be far too large a task, and it typically involves assumptions that may not hold. Two steps for us means 50-100x our current load. As we approach those numbers, I expect small adjustments to our infrastructure will no longer keep us going. I also know that I can't predict what our needs will be at that point, so we can just plan to get there as smoothly as possible.
Want to learn more? We’re doing a free talk on these topics Tuesday December 16th, 2014: sign up.