Saw a ton of traffic on Write.as this morning, around the same time we do most days, but an order of magnitude more than usual. It was concentrated on blogs hosted under the write.as domain, and it eventually affected the web application.
I realized that the bottleneck was most likely in the database: too many connections were being made, which slowed down queries, which left the application servers waiting too long, which in turn let the backlog of connections into the site pile up.
To solve this, I added another database replica, and then dug into the application code. While requests were dragging, they were still being fulfilled by the server and database, even long after a visitor might’ve reasonably expected a response, or completely left the page. So I took advantage of Go’s context package to put some limitations on these database queries, particularly the ones on read-heavy pages, like blogs and posts. I tracked commits on T882. The traffic subsided by around noon, and I deployed these changes a bit after.
The changes won’t totally prevent this from happening again in the future, but they should reduce the likelihood (since database connections won’t pile up so much), and should give people a better experience — now they’ll see a clear message saying that we’re under heavy load, instead of a blank browser error.