Post-mortem: Blob Storage Outage
Val Town provides blob storage for use within Vals. Many Vals use it as a quick key-value store or as a place to keep larger files such as images and logs. Here’s a basic example:
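A minimal sketch of the key-value pattern, assuming a blob client that exposes `setJSON` and `getJSON` methods. The `BlobStore` interface, the in-memory `MemoryBlobStore`, and `incrementCounter` are hypothetical stand-ins so that the sketch runs on its own; inside a Val you would use the platform's blob client instead.

```typescript
// Hypothetical interface: a small subset of what a blob client might expose.
interface BlobStore {
  setJSON(key: string, value: unknown): Promise<void>;
  getJSON<T>(key: string): Promise<T | undefined>;
}

// In-memory stand-in (an assumption for illustration, not Val Town's implementation).
class MemoryBlobStore implements BlobStore {
  private data = new Map<string, string>();

  async setJSON(key: string, value: unknown): Promise<void> {
    this.data.set(key, JSON.stringify(value));
  }

  async getJSON<T>(key: string): Promise<T | undefined> {
    const raw = this.data.get(key);
    return raw === undefined ? undefined : (JSON.parse(raw) as T);
  }
}

// Quick key-value usage: read a counter, bump it, and write it back.
async function incrementCounter(store: BlobStore, key: string): Promise<number> {
  const existing = await store.getJSON<{ count: number }>(key);
  const next = (existing ? existing.count : 0) + 1;
  await store.setJSON(key, { count: next });
  return next;
}
```

Each call to `incrementCounter(store, "visits")` performs one read and one write against the store, which is the kind of traffic that failed during the outage described below.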
Under the hood, blob storage is backed by Cloudflare’s S3-compatible object storage offering: R2.
On April 30th at 5:05pm UTC, while offboarding a Val Town employee, we removed their user account from our Cloudflare organization. The API credentials that we had been using to communicate with R2 were tied to that user account, and they stopped working once the account was removed. By 8:00pm UTC we had received a handful of user reports about unexpected blob storage errors. Over the next hour, we raised an alert internally, identified the issue, and replaced the credentials.
Impact
Blob storage was unavailable from 5:05pm UTC until 9:05pm UTC. During that time, 3,492 read requests and 135 write requests failed to communicate with blob storage.
Cause
Cloudflare’s API credentials are tied to user accounts. Cloudflare’s R2 dashboard has an API Tokens section where you can issue S3-compatible API credentials, but it is not entirely clear that this page is scoped to an individual user rather than to the Val Town account itself. We were unaware of the scope of the credentials and did not ensure that they were tied to a long-lasting account.
Cloudflare recommends that organizations create a system account for credential creation.
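One way to surface this class of problem is to keep an inventory of credentials and flag any that are tied to a person rather than to a system account. The sketch below is illustrative only: the `CredentialRecord` shape, its field names, and the sample entries are assumptions, not data from our systems or from Cloudflare's API.

```typescript
// Hypothetical inventory record; field names are assumptions for illustration.
interface CredentialRecord {
  service: string; // which provider the credential talks to
  purpose: string; // what the credential is used for
  ownerKind: "user" | "system"; // what kind of account the credential is tied to
  owner: string; // identifier of that account
}

// Return every credential that would stop working if a person's account were removed.
function findUserScopedCredentials(records: CredentialRecord[]): CredentialRecord[] {
  return records.filter((record) => record.ownerKind === "user");
}

// Made-up sample inventory.
const inventory: CredentialRecord[] = [
  { service: "object-storage", purpose: "blob storage", ownerKind: "user", owner: "employee@example.com" },
  { service: "object-storage", purpose: "log archive", ownerKind: "system", owner: "system-account@example.com" },
];

for (const record of findUserScopedCredentials(inventory)) {
  console.log(`${record.service} (${record.purpose}) is tied to user ${record.owner}`);
}
```

Any record this reports is a candidate for reissuing under a system account before the owning user is offboarded.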
Next Steps
Val Town is taking the following steps to ensure issues like this do not happen in the future:
- As Cloudflare recommends, create a system account and associate our Cloudflare API credentials with that account.
- Audit our API credentials used with other services and ensure they don’t face similar issues.
- Set up continuously running integration tests for core Val Town features to make sure we’re alerted about downtime like this promptly and directly.
- Email all affected users to work with them to recover any lost data, or draw their attention to any error logs that might have resulted from the outage.
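The integration-test step above could take the form of a round-trip probe: write a sentinel value, read it back, delete it, and notify someone if any step fails. This is a sketch under stated assumptions, not our actual monitoring code. The `BlobClient` interface, the `healthcheck/probe` key, and the `notify` callback are hypothetical names introduced for illustration.

```typescript
// Hypothetical minimal blob client surface that the probe needs.
interface BlobClient {
  set(key: string, value: string): Promise<void>;
  get(key: string): Promise<string | undefined>;
  delete(key: string): Promise<void>;
}

interface ProbeResult {
  ok: boolean;
  reason?: string; // populated only when ok is false
}

// Write a sentinel value, read it back, and clean up. Any thrown error
// (for example, rejected credentials) or a mismatched read counts as a failure.
async function probeBlobStorage(client: BlobClient, key: string = "healthcheck/probe"): Promise<ProbeResult> {
  const expected = `probe-${Date.now()}`;
  try {
    await client.set(key, expected);
    const actual = await client.get(key);
    await client.delete(key);
    if (actual !== expected) {
      return { ok: false, reason: `read back ${String(actual)}, expected ${expected}` };
    }
    return { ok: true };
  } catch (err) {
    return { ok: false, reason: err instanceof Error ? err.message : String(err) };
  }
}

// Run the probe once and call `notify` on failure. Scheduling (interval, cron)
// is left to whatever runner is available.
async function runProbe(client: BlobClient, notify: (message: string) => void): Promise<boolean> {
  const result = await probeBlobStorage(client);
  if (!result.ok) {
    notify(`Blob storage probe failed: ${result.reason}`);
  }
  return result.ok;
}
```

Because the probe exercises both a write and a read with the production credentials, a revoked token would show up on the next run rather than hours later through user reports.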