Outage July 26, 2016 Post-Mortem

Today at around 1:50pm Pacific time we had a brief outage where the site would display as a white screen and the API was unreachable. The cause of this was some new code added to the site that worked fine in development, but caused errors in production. Thankfully, our server monitoring caught the issue immediately and we were able to rollback to the previous version of the site within minutes.

Some Background

As part of the work we did to write the 3rd version of the API, we adopted Babel as a build tool, which allows us to write using features that have been confirmed for use in the Javascript language, but are not yet present in it.

In development mode, we transpile the code on the fly. This is great for working locally, because we don't have to wait for a transpilation step to occur, but it's not great for running a production application, so we transpile all the code at once before the application starts.

However, there is one particularly tricky part of our code base, the files that are shared between the server and the client. In order for this dual set up to work, the server must require these common files from a single entry point so it's referencing the right versions of the files.

So What Happened?

We put out some code in the server that referenced a particular file in the common/script directory. It worked fine in development, because we were transpiling all those files on the fly, but in production, the Node server choked on some syntax it didn't recognize. The fix was as easy as changing the path.

Only a few of us were aware of this quirk of our setup, and those of us who knew about it did not catch the mistake until it was too late.

Will this Happen Again?

I've added a sanity check that no code in the server will reference any of the files in the common/script directory directly. So if the mistake does happen again, our automated tests will catch it before it is deployed.

Thanks for reading and being awesome Habiticans!