Writing about open & equitable product development

Big AI Commons

I never cease to be amazed by how accepting we are of the exact same multinational corporations who in no uncertain terms spent the last few decades diminishing our personal agency, unraveling our communities and strangling our nascent democracies in the cradle.

The last trick the software oligarchs pulled on us was the idea of Big Data as something that magically appeared behind the fortified walls of their data centers, as if organically home-grown and lovingly tended to. And only they, with their unparalleled wits and computing power, were fit to manage all this data at scale.

Except the only thing that was special about these data troves was how much of them they’d been able to collect and trade amongst themselves without our explicit consent. That was the era of surveillance capitalism. With the emergence of so-called artificial intelligence, powered by non-consensual data mining, the corporations are moving on from the surveillance trade to straight-up spycraft, the society-controller of choice for authoritarian regimes.

So up next is control capitalism, which is just fascism with the toothbrush mustache grown out for a more fun, twirly aesthetic.

We are regressing back to the ugliest kind of class divide, wherein the owner class commands your will not merely because they own things you do not, but because they own you. They’ve already laid claim to our collective land, labor and attention. With AI, they want to own our thoughts and the last shred of agency that comes with them. If we fail to defend our personal sovereignty at this juncture, a dark age of the corporate singularity awaits us.

This article, which turned out way different than I expected, was first sparked by Mike Masnick's reporting on AI critics employing copyright law as their weapon of choice against extractive data hoarders. As an open source advocate I wholeheartedly share Mike’s fear of IP maximalism. The problem this legal tactic is attempting to solve, however, is as real as it is harmful, so to refute the tactic raises the question: What, then?

Commons Maximalism

LLMs and their ilk, or what Emily M. Bender calls Synthetic Media Machines, are premised on large libraries of data. Without big data, they can’t function. Arguably their collection and mass-synthesis of this data is fair use, and I won’t dispute that.

The weird thing about these contraptions is that they aren’t libraries you can go to and ask for specific items to be retrieved according to some query, like ‘books on insects’. An SMM will be able to give you a list of books on this subject (with varying degrees of truthfulness), on account of the SMM having actually consumed these books for its own edification.

But what it would much rather have you do is ask it to write something more specific about insects of its own accord, made for you and you alone, thus making you reliant on the synthesizing automaton as your primary source of knowledge. And to be clear, the contraption in question here has no will of its own. Its incentives and motivations are purely an extension of the corporate master that controls it.

Designed for the betterment of society, an automated synthesizer would happily (there’s that anthropomorphic slip again) tell you about every single piece of information it has ingested. When outputting a synthesized information blob it may not be able to tell you the exact sources from which this output was derived (because that’s not how SMMs work), but it absolutely could do a reverse-search on its own corpus of data and tell you which articles / books / images / films are most similar to this “new” thing you now have in your possession.
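To make that reverse-search idea a little more concrete, here is a minimal sketch of what it could look like in miniature. It is not how any actual SMM works: the tiny corpus, the titles and the function names are all made up for illustration, and plain word-count cosine similarity stands in for whatever far richer similarity measure a real system would need.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(generated, corpus, top_n=3):
    """Rank corpus works by similarity to a generated text.

    `corpus` maps a title to the full text of that work.
    Returns a list of (title, score) pairs, best match first.
    """
    target = Counter(tokenize(generated))
    scored = [
        (title, cosine_similarity(target, Counter(tokenize(text))))
        for title, text in corpus.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical three-item corpus, invented purely for illustration.
corpus = {
    "Field Guide to Beetles": "beetles are insects with hardened wing cases",
    "A History of Bread": "bread is baked from flour water and yeast",
    "Pollinators at Work": "bees and other insects pollinate flowering plants",
}

# Stand-in for a machine-generated "new" text.
output = "insects such as beetles have hardened wing cases"

for title, score in most_similar(output, corpus):
    print(f"{score:.2f}  {title}")
```

The point of the sketch is only that the lookup runs from the generated output back to the human-made works it most resembles, which is a question of similarity, not of exact provenance.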

If this type of backward-looking similarity search were standard practice, you would always learn of some original, human-made media that is remarkably similar to what has been machine-generated for you as if by magic. The truth of art making is that there is no such thing as a truly original creation. Every new thing is a remix of a prior one.

(Steal Like An Artist makes that case beautifully.)

The infinite riches of media that we continue to share freely on the internet aren’t put there for the purpose of capture and capitalization. We share our art so yet more art can be made from it, under a social contract of mutual reciprocity.

Big Tech doesn’t reciprocate. Our public data isn’t for them to do with as they wish, especially not when their wish is to subordinate us into a brave new world of techno-feudalism. But ownership is tricky. I can claim some ownership over this article I’ve written, but I cannot possibly lay claim to the impression it has on its various readers, nor can I claim ownership of new art that only to a vague and partial degree is derived from it.

Our public data doesn’t belong to the corporations, but it doesn’t belong to us either. Not when it has been converted from data-contents to data-impressions. At that point, your ideas ‘live rent-free’ in any willing or even unwilling recipient’s mind. Like the air we breathe and the water we drink, freely available data doesn’t belong to anyone. What belongs to no one belongs to The Commons.

Attack their bigness

From a simplistic point of view, an SMM is just another thinking agent going around consuming content and forming its own impression thereof. If we try to combat the harms of AI companies from this vantage point, we’ll only end up harming individual creators. Attacking how the machines work is an aimless swing at their most ethereal form, destined to find no target to make contact with but our own sorry faces.

To land a real blow, look for where the machines are at their most materialized. Take aim at their massive bodies of data and strike there with conviction. The Large Language/media Models rose to prominence through their unfettered bigness, and that in turn shall be their downfall.

Pacify the profit incentive

Here then is my very simple policy proposal: Big Data AI is by definition a product of our global data commons, and as such any product derived from it should only be allowed for non-commercial purposes.

Commercial applicability should shrink relative to the size of data vaults. Much like a wealth tax on data, this aligns neatly with the EFF’s recommendation of a Privacy First approach to addressing online harms.

Regulators have an innate understanding of bigness and scale. Some AI regulation in the USA already stipulates special restrictions for AI operations that exceed a certain compute threshold. Regulating by data mass is probably an even more tangible metric to enforce by.

Furthermore, the doomers who are concerned with the rampant development of AGI should be very happy with this*, because a lack of commercial incentive would undoubtedly slow the unchecked pace of AI among the most unscrupulous for-profit actors, leaving academic researchers and CERN-like international collaborations to lead the way.

(*Unless, god forbid, they weren’t actually sincere in their ethical trepidation and were actually just angling for a competitive advantage.)

Our public libraries are shining examples of our social ingenuity. “Knowledge wants to be free” we said, and collected it all in these massive repositories made by the people, for the people. For a while, we did the same thing with the internet, at global scale. The AI renaissance could still turn out to be a good thing, but only if we reject its co-option by the few who are already the most powerful.

The art of knowledge work is inherently relational and referential. The way we make sense of new information and transmute it into lasting wisdom is by following the trail left behind us by the knowledge workers of old. If that historical chain of attribution to prior art is severed and we lose sight of where our current state of knowledge comes from, we may as well start all over again from scratch, and we just don't have that kind of time.

Done right, AI assistants of the LLM variety ought to be like a library and a librarian fused together. And doing that right means we would have actual human librarians still in the loop to mediate between mortal knowledge seekers and the god-like but far from infallible super librarian.

Such an interaction would likely feel much less like being on the receiving end of a bullshitter’s behind, and more like making, eating and digesting your very own food for thought in the company of your peers, both past and present.