Incidents and Postmortems

Recently, I was consulting at a company that used CircleCI as part of its CI/CD setup, and while working there I signed up to receive email from them. I’ve used CircleCI in the past and knew it as a company that built its product in Clojure, which I thought was quite interesting.

These days, you don’t hear much about Clojure. Rich Hickey, the creator of Clojure, who used to give a lot of insightful and thought-provoking talks, has largely been quiet the last few years. I tried using it myself here and there, but found working on top of Java and the JVM too frustrating, with enormous stack traces and all the quirks of the Java ecosystem peeking through the Clojure layer. I still think Clojure is the most ergonomic and “modern” of the Lisp variants I’ve tried, and it would be nice if someone built something like it on LLVM, for example. But I digress…

CircleCI recently had a serious security incident and published a postmortem report that I found fascinating and well written, so I’m recommending it here as an example of how I think postmortems should be written.

CircleCI incident report, Jan 4, 2023

Some of the highlights include great security recommendations for working with access tokens (make sure they expire), 2FA (didn’t help in this case) and IP range limits, among other things. The level of sophistication in the attack is also striking. Not only was an individual employee targeted as the attack vector, but in order to actually get access to production systems the attackers had to generate new production tokens and exploit running services for decryption keys. Nightmare fuel for anyone who has deployed services on the internet, ever.
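To make two of those recommendations concrete, here is a minimal sketch of what “tokens that expire” and “IP range limits” can look like when a request is checked. It is not taken from CircleCI’s report; the AccessToken record, the issue_token and is_request_allowed helpers, and all field names are hypothetical and only meant to illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class AccessToken:
    """Hypothetical token record; the field names are illustrative only."""
    value: str
    expires_at: datetime      # every token carries an expiry, there is no "forever"
    allowed_networks: tuple   # CIDR strings the token may be used from


def issue_token(value, ttl, allowed_networks, now=None):
    """Create a token that stops working once ttl has elapsed."""
    now = now or datetime.now(timezone.utc)
    return AccessToken(value, now + ttl, tuple(allowed_networks))


def is_request_allowed(token, client_ip, now=None):
    """Reject expired tokens and requests from outside the allowed ranges."""
    now = now or datetime.now(timezone.utc)
    if now >= token.expires_at:
        return False
    addr = ip_address(client_ip)
    return any(addr in ip_network(cidr) for cidr in token.allowed_networks)
```

With a check like this, a stolen token is only useful until it expires, and only from a network the owner has explicitly allowed.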

I’ve thought a lot about documentation and what kinds of documentation I find effective and worthwhile. Making a push towards better documentation can easily lead to large amounts of writing that either goes unread or quickly becomes stale. The question is how we can concentrate our documentation efforts in areas that are effective and in ways that age well.

Using an RFC process for new feature development is one area where I think writing good documentation is essential, and postmortems are another. By postmortem I mean, first, writing down a summary of the process as part of a release, for internal use as well as for external publication. Reflecting on what went well, what was accomplished and what was learned is a great practice, and the result remains valuable even as the project evolves. The same goes for incident reports. Even if no one reads them later, the process of writing an incident report and doing proper root cause analysis is incredibly valuable in itself.

Reference manuals are useful only if they are generated from the code, and preferably shouldn’t require manual markup in the code to be useful.
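As a rough illustration of what “generated from the code” can mean, the sketch below builds reference entries from nothing more than function signatures and the first line of each docstring, with no special markup. The reference_for helper and the retry function are made up for this example; real documentation generators do considerably more.

```python
import inspect


def reference_for(functions):
    """Build plain-text reference entries from signatures and docstrings."""
    entries = []
    for func in sorted(functions, key=lambda f: f.__name__):
        signature = inspect.signature(func)
        # Use the first docstring line as the summary, if there is one.
        summary = (inspect.getdoc(func) or "(undocumented)").splitlines()[0]
        entries.append(f"{func.__name__}{signature}\n    {summary}")
    return "\n\n".join(entries)


def retry(action, attempts: int = 3, delay_seconds: float = 1.0):
    """Call action until it succeeds or the attempts run out."""
    raise NotImplementedError("placeholder used only to demonstrate the generator")


if __name__ == "__main__":
    print(reference_for([retry]))
```

Because the entry is derived from the signature itself, renaming a parameter or changing a default updates the reference the next time it is generated.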

Guides and tutorials don’t age well, that’s true. But they are also very useful, especially for public/free software projects. The great thing about an installation/setup/usage guide is that it makes the intent of the software explicit. A guide tells you a lot about how the thing is meant to be used and which scenarios the authors had in mind when they wrote it. If you are deploying a project in an environment or for a use case that no guide covers, or in a way that contradicts the provided guides, you can expect to hit parts of the software that either don’t work or don’t work as intended.

My feeling is that a lot of developers either don’t spend enough time writing documentation or, when they do, write documentation that doesn’t age well or ends up less useful than it could be.