On September 28, 2023, we experienced a degradation in service resulting in a delay in write propagation from 12:04 to 12:34 UTC. Below is a brief description of the timeline, cause, and next steps.
For background, we distribute data to our edge nodes in order to optimize latency on the read path. Part of that involves using event streaming to propagate writes.
We received an alert at 12:05 UTC about delayed writes and begun to investigate. By 12:30 UTC, the system started to recover by itself. We pushed a fix and the issue was resolved by 12:40 UTC.
There was no data loss or impact to our data storage, or any other action needed by impacted parties.
Upon completing an internal post-mortem, we concluded that:
To address these gaps, we have added backpressure to write operations going through our event streaming infrastructure to limit the impact of unexpectedly slow operations. This will have the effect of throttling those slow operations rather than impacting the overall architecture.
Additionally, we are investigating changes to our write infrastructure to make it more durable to this kind of issue.