Slow API Response Times
Incident Report for iFit
Postmortem

Elevated API Response Times and Error Rates

Incident Date

2020-10-24

Summary

For roughly four hours, between 9:00 AM and 1:00 PM Eastern Daylight Time, we experienced elevated API response times and intermittent unavailability.

Impact

This affected parts of our API and application, making them slow or unavailable at times.

Root Cause

In the very early morning hours of Saturday, October 24th, when usage of our APIs was at its lowest point, we performed routine maintenance to increase the amount of storage available to our primary database.

Our database relies on a block-level caching layer in front of this storage layer so that database reads are very fast. As a side effect of increasing the database's available storage, this caching layer is cleared. Expecting this, our routine maintenance included a process to repopulate the caching layer with our most frequently accessed data.
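
Our actual repopulation tooling is internal, but a warming process of this general shape can be sketched as follows. This is a minimal illustration only: it assumes a PostgreSQL-style primary, and the table names, DSN, and full-scan warming trick are all hypothetical, not a description of our real system.

```python
# Hypothetical cache pre-warming sketch. The table list, the DSN, and
# warming via full-scan reads are illustrative assumptions.
import psycopg2

# Assumed static list of "hot" tables, captured around a previous
# upgrade. A list like this is exactly what goes stale as access
# patterns drift, which is the root cause described below.
HOT_TABLES = ["workouts", "users", "sessions"]

def warm_cache(dsn: str) -> None:
    """Read each hot table end to end so the block-level cache in
    front of the storage layer is repopulated with its pages."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            for table in HOT_TABLES:
                # A full sequential read pulls every block of the table
                # through the cache; count(*) discards the rows cheaply.
                cur.execute(f"SELECT count(*) FROM {table}")
                print(table, "warmed:", cur.fetchone()[0], "rows")

if __name__ == "__main__":
    warm_cache("dbname=app host=primary-db")
```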

As API usage rose throughout Saturday morning, we began to see higher-than-expected latency and error rates from our APIs. Investigation revealed that our data access patterns had changed enough since our last storage upgrade that the cache repopulation process had not sufficiently warmed the cache. As a result, many reads missed the cache and fell through to the slower storage layer, which made our database read queries slow.
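
The telltale signal here is the cache hit ratio, which the updates below also track. A toy check, where the counter values and the 0.9 alert threshold are assumptions for illustration:

```python
# Illustrative hit-ratio check; counter values and the 0.9 threshold
# are assumptions, not our real monitoring configuration.
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of reads served from the block cache."""
    total = hits + misses
    return hits / total if total else 1.0

# e.g. counters scraped from the cache layer after the maintenance window
print(cache_hit_ratio(hits=870, misses=130))  # 0.87, below an assumed 0.9 threshold
```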

Resolution

We began populating the cache with the data that matched our new data access patterns, and as the cache filled, our response times and error rates began to decrease.

Action Items

Going forward, we will derive the set of data to pre-warm into the cache from current data access patterns, rather than relying on patterns observed before previous upgrades.
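
As one hedged illustration of that direction, the warm set could be ranked from recent query logs instead of a fixed list. The log format and the helper below are hypothetical, not our production tooling:

```python
# Hypothetical: build the pre-warm list from recent query logs instead
# of a static list. The log format and table names are illustrative.
from collections import Counter

def hot_tables(query_log: list[str], top_n: int = 10) -> list[str]:
    """Rank tables by how often recent queries touched them, so the
    warm set tracks current access patterns rather than a snapshot
    taken at the previous storage upgrade."""
    counts = Counter()
    for line in query_log:
        # Assume each log line looks like "<timestamp> <table> <operation>".
        _, table, _ = line.split(" ", 2)
        counts[table] += 1
    return [table for table, _ in counts.most_common(top_n)]

recent_log = [
    "2020-10-23T08:00 workouts read",
    "2020-10-23T08:01 workouts read",
    "2020-10-23T08:02 users read",
]
print(hot_tables(recent_log))  # ['workouts', 'users']
```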

Posted Oct 26, 2020 - 12:53 MDT

Resolved
This incident has been resolved.
Posted Oct 24, 2020 - 12:44 MDT
Update
Our cache seems to be holding up well, and our error rates and latencies remain healthy.
Posted Oct 24, 2020 - 12:44 MDT
Update
We are continuing to monitor for any further issues.
Posted Oct 24, 2020 - 12:42 MDT
Update
We are seeing large improvements in our cache hit ratio. Error rates and latency continue to look healthy as well. We will continue our cache-warming efforts and will closely monitor affected resources during this time.
Posted Oct 24, 2020 - 12:07 MDT
Monitoring
An immediate fix has been implemented and we are seeing a reduction in latency along with a drop in error rates. We are continuing cache warming efforts at this time and will continue to closely monitor affected resources.
Posted Oct 24, 2020 - 11:34 MDT
Identified
We have identified the cause of the latency issues that users have been experiencing since early this morning. The issue was caused by unexpectedly long cache re-population times following routine maintenance on our databases last night. We are working to mitigate the issue and improve database response times.
Posted Oct 24, 2020 - 10:50 MDT
Investigating
We are currently investigating this issue.
Posted Oct 24, 2020 - 08:11 MDT