
Two cheers for the file system

The file system is a fundamental part of the API provided by your operating system. Yet it’s also getting long in the tooth. Last year The Verge reported that college professors are struggling to explain the concept of files and file systems to students who grew up on smartphones and clouds:

It’s the idea that a modern computer doesn’t just save a file in an infinite expanse; it saves it in the “Downloads” folder, the “Desktop” folder, or the “Documents” folder, all of which live within “This PC,” and each of which might have folders nested within them, too. It’s an idea that’s likely intuitive to any computer user who remembers the floppy disk.

More broadly, directory structure connotes physical placement — the idea that a file stored on a computer is located somewhere on that computer, in a specific and discrete location. That’s a concept that’s always felt obvious to Garland but seems completely alien to her students.

The concept that data is a thing stored in a location is not just an artifact of the era of personal computers and slow 56k internet. It is the essence of computation as such. When Alan Turing proposed the Universal Turing Machine, he created an abstract mathematical model of computation, not a physical object. The API of a Turing Machine (its tape) is equivalent to the API of the Unix filesystem, which is in turn equivalent to the API of the Python interpreter: each lets a program read and write data at locations it can address.
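
To make the tape-as-API idea concrete, here is a minimal sketch of a one-tape Turing machine in Python. The names (`run`, `INCREMENT`, the `"carry"` state) are my own illustrative choices, not standard notation. The point is only that the whole interface is "read the symbol at a location, write a symbol at a location, move to a neighbouring location."

```python
from collections import defaultdict

BLANK = "_"

# Transition table for binary increment (hypothetical example machine).
# Key: (state, symbol read). Value: (symbol to write, head move, next state).
INCREMENT = {
    ("carry", "1"): ("0", "L", "carry"),    # 1 + carry = 0, keep carrying left
    ("carry", "0"): ("1", "N", "halt"),     # 0 + carry = 1, done
    ("carry", BLANK): ("1", "N", "halt"),   # ran off the left edge: new digit
}

def run(transitions, tape_input, state, head, max_steps=10_000):
    """Run the machine until it halts; return the non-blank tape contents."""
    # The tape maps an integer location to the symbol stored there.
    tape = defaultdict(lambda: BLANK, enumerate(tape_input))
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("machine did not halt within max_steps")
        symbol = tape[head]                              # read a location
        write, move, state = transitions[(state, symbol)]
        tape[head] = write                               # write a location
        head += {"L": -1, "R": 1, "N": 0}[move]          # move the head
        steps += 1
    cells = [i for i, s in tape.items() if s != BLANK]
    if not cells:
        return ""
    return "".join(tape[i] for i in range(min(cells), max(cells) + 1))
```

Starting with the head on the rightmost digit, `run(INCREMENT, "1011", state="carry", head=3)` returns `"1100"`.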

To be able to compute is to be able to arbitrarily read, write, and manipulate data. You cannot compute on your iPhone or on Google Drive. The college freshmen who know only the new locked-down APIs have been deprived of the expansive experience of general-purpose computing. They have known only the narrow experience of restricted consumption. To be deprived of this experience is to lack the intuitions that make it easier to learn programming, and the intuitions that make it possible to envision world-changing innovations applying the power of computation.

So why only two cheers for the file system?

First, while operating systems (thinking especially of Unix here) offer a decent API for reading and writing objects, their API for manipulating objects is painfully bad. The Unix file system offers an amazing API for storing and reusing intermediate results, but shell scripts, the native way to drive it, are terrible. So people write their data-processing programs in Python instead, and this creates a new set of problems: a Python program tends to keep its intermediate results in memory, where they vanish when the process exits. In data science (especially bioinformatics), storing and reusing intermediate results via the file system saves computing time. It also helps you inspect and debug complex pipelines, and recover (partially) when programs are killed by unhandled exceptions.
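
As a rough sketch of what "storing and reusing intermediate results via the file system" can look like from Python, consider the helper below. The name `cached` and the choice of JSON as the on-disk format are assumptions made for illustration. A real pipeline would pick a format suited to its data.

```python
import json
from pathlib import Path

def cached(path, compute):
    """Return the result stored at `path`, computing and saving it if absent.

    `compute` is a zero-argument function whose return value is
    JSON-serializable. (Hypothetical helper, for illustration only.)
    """
    path = Path(path)
    if path.exists():
        # A previous run already produced this intermediate result: reuse it.
        return json.loads(path.read_text())
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file and rename it into place, so that a run
    # killed mid-write does not leave a half-written result behind.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(path)
    return result
```

Each stage of a pipeline can then be wrapped as `cached("work/counts.json", lambda: count_kmers(reads))`, where `count_kmers` and `reads` stand in for whatever the stage actually does. If a later stage crashes, rerunning the script skips every stage whose file already exists, and the files themselves can be opened and inspected while debugging.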

Second, I fudged about filesystems being “the essence of computation as such.” Alonzo Church invented the \(\lambda\)-calculus, which defines the same set of computable functions as that defined by Turing. Functional programming offers an alternate API for specifying programs, one which may be better suited to networks of interacting users. The Urbit operating system and ecosystem is currently being developed on this premise, and I’m extremely excited about its prospects for making the Internet personal and computable again.

Despite these drawbacks, I believe that the Turing Machine paradigm will endure. Scientific simulations, ML model training, and data analytics will continue to grow in importance. These tasks (which Turing himself helped pioneer) are well-suited to filesystem APIs. Packages like Snakemake are helping unite the benefits of Python and the filesystem. In contrast, systems that are customizable but not Turing-complete are inappropriate as end-to-end solutions for non-trivial modeling scenarios. You inevitably bump into the pre-ordained limitations of GUIs, configuration files, and SQL queries, so you end up writing code. To paraphrase Greenspun’s tenth rule, any sufficiently complicated declarative language contains an ad hoc, informally-specified, bug-ridden, slow implementation of \(1/\infty\) of a Turing-complete language.
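
To illustrate the kind of bookkeeping that tools like Snakemake take off your hands, here is a make-style sketch in plain Python: rebuild an output file only when it is missing or older than one of its inputs. This is not Snakemake's API. The function names `is_stale` and `build` are hypothetical, and real workflow tools track considerably more than file timestamps.

```python
from pathlib import Path

def is_stale(output, inputs):
    """True if `output` is missing or older than any of its inputs."""
    output = Path(output)
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(Path(i).stat().st_mtime > out_mtime for i in inputs)

def build(output, inputs, action):
    """Run `action(inputs, output)` only if `output` is stale.

    Returns True if the action ran, False if the existing file was reused.
    """
    if is_stale(output, inputs):
        action(inputs, output)
        return True
    return False
```

The filesystem does the remembering here: the modification times record which results are still valid, so the Python side only has to declare what depends on what.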

Because imperative programming is friendly to beginners, and because the filesystem matters so much for data-processing tasks, familiarity with the filesystem will remain essential.
