2020-10-24
For roughly 4 hours between 9:00 AM and 1:00 PM Eastern Daylight Time, we experienced higher than usual API response times and intermittent unavailability.
This affected parts of our API and application, making them slow or unavailable at times.
In the very early morning hours of Saturday, October 24th, when usage of our APIs was at its lowest point, we performed routine maintenance to increase the amount of storage available to our primary database.
Our database relies on a block-level caching layer in front of this storage so that database reads are very fast. As a side effect of increasing the database's available storage, that caching layer is cleared. Because we expected this, the routine maintenance included a process to repopulate the caching layer with our most frequently accessed data.
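For illustration, a pre-warm step of this kind can be as simple as reading the hottest data once so that it passes through the caching layer before normal traffic returns. The sketch below is hypothetical: the table names, the PostgreSQL driver, and the connection details are assumptions for the example, not a description of our actual setup.

```python
# Hypothetical sketch of a cache pre-warm step run after a storage resize.
# Table names and connection details are illustrative only.
import psycopg2  # assumed PostgreSQL driver; the real database may differ

FREQUENT_TABLES = ["events", "accounts", "api_keys"]  # assumed "hot" tables

def prewarm(conn):
    """Read the hottest tables once so the block cache is repopulated."""
    with conn.cursor() as cur:
        for table in FREQUENT_TABLES:
            # A full scan pulls the table's blocks through the caching
            # layer, warming it before real traffic arrives.
            cur.execute(f"SELECT count(*) FROM {table}")
            cur.fetchone()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=app")  # placeholder connection string
    prewarm(conn)
    conn.close()
```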
As API usage rose throughout Saturday morning, we started to see higher than expected latency and error rates from our APIs. Investigation made it apparent that our data access patterns had changed enough since our last storage upgrade that the cache repopulation process had not sufficiently warmed the cache. As a result, many database read queries missed the cache and fell through to the slower underlying storage.
We began populating the cache with the data that matched our current access patterns, and as the cache filled, our response times and error rates started to decrease.
In the future, we will determine which data to pre-warm into the cache based on current data access patterns, rather than relying on the patterns observed before previous upgrades.
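One way to do that, sketched below under an assumed log format and file name, is to derive the pre-warm list from a recent access log rather than from a fixed list of tables.

```python
# Hypothetical sketch: choose what to pre-warm from recent access patterns
# instead of a hard-coded list. The log format and column names are assumptions.
from collections import Counter
import csv

def hot_tables(access_log_path, top_n=10):
    """Count table reads in a recent access log and return the busiest ones."""
    counts = Counter()
    with open(access_log_path, newline="") as f:
        for row in csv.DictReader(f):  # assumed columns: timestamp, table
            counts[row["table"]] += 1
    return [table for table, _ in counts.most_common(top_n)]

if __name__ == "__main__":
    # Feed the result into the same pre-warm step used after maintenance.
    print(hot_tables("reads_last_7d.csv"))
```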