Workout Load Failure
Incident Report for iFit
Postmortem

Workout Load Failure Postmortem

Date

2019-10-17

Summary

An issue with a caching server caused some API responses to return error messages instead of cached responses.

Impact

Roughly one third of HTTP requests for certain API routes would have received an error instead of the expected response. This would have affected several distinct services and applications, mostly involving consumer facing equipment.

Root Causes

A hard disk on one of a set of servers in a caching layer in front of our API servers malfunctioned, which caused that server to respond to HTTP requests with errors instead of cached responses.

Resolution

The server with the problematic hard disk was replaced with a server with a properly functioning hard disk.

Action Items

  • Update our periodic health check of the caching layer servers to detect hard disk failures.
  • Fix some improperly configured monitoring on the number of errors being returned from the caching layer, so that we receive an earlier notification if this type of issue happens again.
Posted Oct 17, 2019 - 11:36 MDT

Resolved
This incident has been resolved.
Posted Oct 17, 2019 - 10:49 MDT
Monitoring
A fix has been implemented and we are currently monitoring the results.
Posted Oct 17, 2019 - 09:39 MDT
Identified
We have identified the issue and are working on implementing a solution.
Posted Oct 17, 2019 - 09:36 MDT
Update
We are continuing to investigate elevated error rates on workout-related API routes.
Posted Oct 17, 2019 - 08:51 MDT
Investigating
We are currently investigating an issue that is preventing workouts from loading properly. This is affecting workout-related pages, including the dashboard. We will provide updates as soon as possible.
Posted Oct 17, 2019 - 08:10 MDT