My memorable fail

July 5, 2024

It was past 6pm on a Friday of 2018. I hit enter, and watch the script run the tests for database migration. My team-mates were busy enjoying the Friday beers while leadership delivered the week's highlights.

The terminal output showed all tests passed. I smirked before glancing at the production database. Usernames like Erlich Bachman, and Richard Hendricks stood out on my screen.

These fictional characters from Silicon Valley... how did they end up being our customers?

Panic.

DROP TABLE users

In the backdrop of tipsy engineers, I sobered to my realisation that my tests earlier ran against production.

All user records dropped in place of dummy ones, just like that.

“Hey guys, I think I might have just dropped the user database on production...”

Context on my mistake

What simply happened there, was I have used the wrong environment file, and ended up pointing to production instead of our development environment.

My main project then was to migrate the Firebase database to Postgres (GCP CloudSQL) in phases. I had been context-switching between development, staging and production environments prior to the mistake.

Murphy's Law strikes

“Thank goodness we have scheduled backups on the Firebase database. We can restore from there!” my colleague rushed to my rescue.

We restored the latest backup, and watched in horror as random usernames showed up again.

It turned out that we did have regular backups. However, these backups were mistakenly made against staging, not production.

Luck

Luck came in the form of a local backup of the production database I had on my machine. This was 1. not compliant; I should not have a local copy 2. a few weeks old so there are missing users still.

It also happened to be a 3-day weekend.
This allowed us more time to try to recover the data over the weekend.

We were able to recover about 80% of the users, and recreate user records for the “missing” customers (our account managers helped a lot).

Mistakes and corrections thereafter

All engineers then had “god mode” across all environments then. We tightened our IAM permissions, and also leveraged role-based accounts.
Our scheduled backups were never verified until then. We performed regular checks on our database restore process.

Aftermath

Today, many of us have left that startup but still catch up from time to time.
We often reminisce about this incident, in good spirits.

For me, this remains my go-to “Tell me about your toughest time” story.

#failure #production #horror

buy Kelvin a cup of coffee ☕