Resolved -
We're marking this issue as resolved.
We identified the root cause as a breaking change released last night in setuptools, a core Python packaging library, which broke one of our downstream packages. The error occurred only when new resources were spun up and didn't affect existing ones. As a result, our service remained up, but latency increased significantly across a few key endpoints until our monitors caught the issue this morning.
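For teams wondering how to guard against this class of failure: pinning build dependencies to known-good versions prevents newly provisioned resources from picking up an unvetted release. A minimal sketch, assuming pip-based builds (the file names and version number here are illustrative, not our exact setup):

    # constraints.txt -- hold setuptools at a known-good release
    setuptools==69.0.3

    # Apply the constraint when provisioning new resources
    pip install -c constraints.txt -r requirements.txt

    # If dependencies are built in isolated build environments, export the
    # constraint so it applies there as well
    export PIP_CONSTRAINT=constraints.txt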
We've made a few key operational changes as a result of this incident. In this case, our monitors didn't flag the issue as an outage, which delayed our response. We've lowered our monitor thresholds so the team is alerted on smaller latency spikes, and backtested them to confirm they would have triggered earlier for an issue like this one that affects new resources only.
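For illustration, a lowered latency alert of the kind described above might look like the following Prometheus-style rule. This is a sketch, not our actual monitoring configuration; the metric names, labels, and thresholds are hypothetical:

    # alert-rules.yml -- illustrative only
    groups:
      - name: api-latency
        rules:
          - alert: ApiLatencyElevated
            # Fire when p95 latency over 5 minutes exceeds 500 ms for 10 minutes,
            # low enough to catch smaller spikes on individual endpoints
            expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)) > 0.5
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "p95 latency elevated on {{ $labels.endpoint }}"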
Feb 9, 08:25 PST
Monitoring -
A fix has been rolled out for this incident. We're continuing to monitor the affected resources before marking it as resolved and posting an RCA.
Feb 9, 08:06 PST
Update -
We are continuing to work on a fix for this issue.
Feb 9, 07:51 PST
Identified -
We've identified this as a partial outage affecting recently created auto-scaling resources. Some requests are still being served properly.
We're currently rolling out a fix to our resources that we expect will address the root cause.
Feb 9, 07:51 PST
Investigating -
We're currently investigating an API outage. We've isolated the issue to an overnight upgrade of a downstream package in the Python ecosystem, and we're working on a hotfix.
Feb 9, 07:30 PST