Post-Mortem: Android Outage

tl;dr

A change to how the API handles collection quests caused the Habitica Android app to stop working. The issue is now resolved.

Stats

  • Outage Time: June 6th, 2016 @ 12:56pm UTC - June 7th, 2016 @ 2:05am UTC
  • Percentage of Active Android Users Affected: 57%

Why did this happen?

In an effort to fix a pernicious quest collection bug, we changed the API to determine which items were found in a collection quest at cron rather than when a task is scored. This allowed us to use the party's data, rather than the user's (which could get out of sync), to determine which quest the user was on.

Part of this change involved converting the user object's user.party.quest.progress.collect property from an Object to a Number. Since the property was only used by the server at cron, we didn't think much about changing it.
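
To illustrate why a plain Number could be enough here, the following is a minimal sketch. It assumes that a found item is only counted when a task is scored, and that the count is turned into concrete items at cron using the party's quest. The function names, the item names, and the random allocation are illustrative assumptions and are not taken from Habitica's code.

```python
import random


def score_task(user, found_item):
    """At scoring time, only count the find; do not decide which item it is."""
    if found_item:
        user["party"]["quest"]["progress"]["collect"] += 1


def cron(user, party_quest_items, rng):
    """At cron, use the party's quest to turn the count into concrete items."""
    progress = user["party"]["quest"]["progress"]
    found = {item: 0 for item in party_quest_items}
    for _ in range(progress["collect"]):
        found[rng.choice(party_quest_items)] += 1
    progress["collect"] = 0  # reset the counter once the finds are allocated
    return found


# Hypothetical user and quest items, for illustration only.
user = {"party": {"quest": {"progress": {"collect": 0}}}}
for _ in range(3):
    score_task(user, found_item=True)
found = cron(user, ["itemA", "itemB"], random.Random(0))
```

In this sketch the user object never needs to know which quest is active while scoring; only cron consults the party's quest.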

Unfortunately, although the Android app never used that property, it still mapped the attribute and declared its type as Object, not Number. When the data arrived in a form it did not expect, the app showed a blank profile, which made it unusable.
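
The failure mode can be sketched roughly as follows. This is not the Android app's code (the app is not written in Python). It is a simplified model that assumes a strict mapper which rejects a value of the wrong type, and a caller that falls back to an empty profile when mapping fails. QuestProgress, parse_progress, and load_profile are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class QuestProgress:
    collect: dict  # the client declares this field as an Object


def parse_progress(payload):
    """Strictly map a JSON-like dict; reject a value of an unexpected type."""
    value = payload["collect"]
    if not isinstance(value, dict):
        raise TypeError(f"collect: expected object, got {type(value).__name__}")
    return QuestProgress(collect=value)


def load_profile(payload):
    """Return the mapped data, or None (a blank profile) if mapping fails."""
    try:
        return parse_progress(payload)
    except TypeError:
        return None


old_shape = load_profile({"collect": {}})  # maps fine
new_shape = load_profile({"collect": 3})   # wrong type, so the profile is blank
```

The point of the sketch is that a field can break a client even when the client never reads it: the mapping step fails before any of the data is used.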

We have Fabric installed on the Android app, and we thought it would have reported to us when this many failures were happening, but for some reason it did not. We are still investigating why that happened. As a result, we did not know about the issue until around 9:00pm UTC on Sunday.

How we fixed it

Alys was the first one to notice, and she immediately began working on a fix. In short, we set user.party.quest.progress.collect back to an Object and created a new property to track collection items, user.party.quest.progress.collectedItems, which is a Number.

This lets us deprecate user.party.quest.progress.collect without breaking backwards compatibility. The progress.collect property will be fully removed in the next major version of the API.

After extensive testing, we deployed the change and ran a migration file to copy over everyone's collection progress to the new property and set the old property to an empty object.
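
A per-user version of that migration might look like the sketch below. It assumes the old property held a Number at migration time. How the real migration treated any other value is not stated here, so this sketch simply leaves an existing collectedItems value alone in that case and defaults it to 0. migrate_user is a hypothetical name.

```python
def migrate_user(user):
    """Copy numeric collection progress to collectedItems and reset collect."""
    progress = user["party"]["quest"]["progress"]
    old = progress.get("collect")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(old, (int, float)) and not isinstance(old, bool):
        progress["collectedItems"] = old
    else:
        # Assumption: nothing numeric to copy, so keep any existing value.
        progress.setdefault("collectedItems", 0)
    progress["collect"] = {}  # back to an Object, as older clients expect
    return user


user = {"party": {"quest": {"progress": {"collect": 5}}}}
migrate_user(user)
migrate_user(user)  # running it a second time leaves the result unchanged
```

Writing the step so that it can be run twice without changing the result makes it safer to re-run if a migration is interrupted.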

What we are doing to prevent this in future

  • We're investigating why we were not notified via email by Fabric
  • We've set up better alerting for errors like these, so the whole team can see when there are problems and help troubleshoot together
  • We're going to work on setting up automated integration tests that run the mobile apps against the latest version of the server code to catch issues like this before they reach users
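
As one possible shape for the last item, here is a minimal sketch of a contract check that compares a server response with the types a mobile client expects. It is not the planned test suite. EXPECTED_TYPES and find_contract_violations are hypothetical names, and the two paths listed are only the properties discussed in this post.

```python
# Types the (hypothetical) mobile client expects, keyed by property path.
EXPECTED_TYPES = {
    ("party", "quest", "progress", "collect"): dict,
    ("party", "quest", "progress", "collectedItems"): int,
}


def find_contract_violations(user, expected=EXPECTED_TYPES):
    """Return a list of human-readable mismatches between payload and contract."""
    violations = []
    for path, expected_type in expected.items():
        name = ".".join(path)
        value = user
        try:
            for key in path:
                value = value[key]
        except (KeyError, TypeError):
            violations.append(f"{name}: missing")
            continue
        if not isinstance(value, expected_type):
            violations.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return violations


good = {"party": {"quest": {"progress": {"collect": {}, "collectedItems": 2}}}}
bad = {"party": {"quest": {"progress": {"collect": 2, "collectedItems": 2}}}}
good_result = find_contract_violations(good)
bad_result = find_contract_violations(bad)
```

A check like this could run against a response from the latest server build and fail the build whenever the list of violations is not empty.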

An Apology

If you were affected by this outage, we're very sorry. We know that having Habitica unavailable is frustrating and disruptive. We want to thank you for your continued support. We're working hard to prevent this from happening again.