Delays in write propagation
Incident Report for Oso
Postmortem

On September 28, 2023, we experienced a degradation in service resulting in a delay in write propagation from 12:04 to 12:34 UTC. Below is a brief description of the timeline, cause, and next steps.

For background, we distribute data to our edge nodes in order to optimize latency on the read path. Part of that involves using event streaming to propagate writes.
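To make that concrete, here is a minimal in-process sketch of the pattern in Go. The channel stands in for the event stream, and an edge-node consumer applies each write to a local copy so reads can be served at the edge; the names and structure here are illustrative assumptions, not our actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// WriteEvent is one propagated write: a key and its new value.
type WriteEvent struct {
	Key   string
	Value string
}

// edgeNode consumes events from the stream and applies them to a
// local store, so reads at that edge can be served with low latency.
// A real deployment would run one consumer per edge node.
func edgeNode(name string, events <-chan WriteEvent, wg *sync.WaitGroup) {
	defer wg.Done()
	local := map[string]string{}
	for ev := range events {
		local[ev.Key] = ev.Value // apply the write locally
	}
	fmt.Printf("%s applied %d keys\n", name, len(local))
}

func main() {
	// The buffered channel plays the role of the event stream.
	events := make(chan WriteEvent, 8)

	var wg sync.WaitGroup
	wg.Add(1)
	go edgeNode("edge-1", events, &wg)

	// The write path publishes events rather than updating every
	// edge node synchronously; propagation happens asynchronously.
	for i := 0; i < 3; i++ {
		events <- WriteEvent{Key: fmt.Sprintf("user:%d", i), Value: "role=member"}
	}
	close(events)
	wg.Wait()
}
```

This design trades a small propagation delay for low read latency, and that delay is exactly the window in which reads can return stale data when the stream backs up.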

We received an alert at 12:05 UTC about delayed writes and began to investigate. By 12:30 UTC, the system had started to recover on its own. We pushed a fix, and the issue was fully resolved by 12:40 UTC.

There was no data loss or impact to our data storage, and no action is needed by impacted parties.

Upon completing an internal post-mortem, we concluded that:

  • Our write infrastructure is not sufficiently robust to handle unexpected load or unexpectedly slow processing jobs.
  • We did not have good enough internal tooling to unblock writes earlier.

To address these gaps, we have added backpressure to write operations going through our event streaming infrastructure to limit the impact of unexpectedly slow operations. This throttles the slow operations themselves rather than letting them degrade the pipeline as a whole.
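As a rough illustration of the mechanism (a sketch in Go, not our production code), a bounded queue provides backpressure naturally: once the buffer fills behind a slow consumer, the producer's sends block, so the slow path is throttled instead of the backlog growing without bound:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Small bounded buffer: once it is full, producers must wait.
	queue := make(chan int, 2)
	done := make(chan struct{})

	// A deliberately slow consumer, standing in for an
	// unexpectedly slow processing job.
	go func() {
		for job := range queue {
			time.Sleep(100 * time.Millisecond)
			fmt.Println("processed", job)
		}
		close(done)
	}()

	for i := 0; i < 5; i++ {
		start := time.Now()
		// Blocks while the buffer is full: this wait is the
		// backpressure applied to the producer.
		queue <- i
		fmt.Printf("enqueued %d after waiting %v\n",
			i, time.Since(start).Round(time.Millisecond))
	}
	close(queue)
	<-done
}
```

The first few sends complete immediately; after that, each send waits for the consumer to make room. The delay lands on the producer that is outrunning its consumer, which bounds queue growth and keeps other writes flowing.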

Additionally, we are investigating changes to our write infrastructure to make it more resilient to this kind of issue.

Posted Oct 13, 2023 - 17:38 UTC

Resolved
On September 28, 2023, we experienced a degradation in service resulting in a delay in write propagation from 12:04 to 12:34 UTC.

During that time period, calls to Oso may have been returning stale data.

At 12:40 UTC we rolled out a fix to prevent the issue from recurring.

There was no data loss or impact to our data storage, and no action is needed.
Posted Sep 28, 2023 - 12:00 UTC