Contents
  1. Lines drawn from labels: supervised classification
  2. Lines drawn by similarity: unsupervised classification
  3. Lines learned from training: foundation models
  4. Reading the categories back out: mechanistic interpretability
  5. The loop at industrial speed
  6. What remains for the algorithmic case
  7. Footnotes
May 9, 2026 · 14 min ESSAY

When Algorithms Carve: Classification at Machine Speed

A sequel on classification, this time at industrial scale: the labels of supervised learning, the impossibility theorems of algorithmic fairness, the latent geometries of foundation models, the contested project of mechanistic interpretability, and the looping effects of LLMs deployed at the speed of a search engine.

In 2024, Anthropic researchers looked inside Claude 3 Sonnet and found roughly 34 million internal categories the model had built without being told to. Some had labels you would expect, like “the Golden Gate Bridge”. Others were less expected, like “sycophantic praise”, “inner conflict”, or “code vulnerabilities”. No human had drawn these lines. They emerged from training on a corpus of human writing, propagated to every downstream task the model performs, and remained, until very recently, almost impossible to read out.

This entry picks up where a previous note on classification stopped. There I traced the philosophical arc, from Plato’s joints through Mill’s natural kinds, Goodman’s grue, Hacking’s looping effects, Foucault’s institutional categories, and Bowker and Star’s infrastructures. The argument settled on a discipline of attention, the practice of keeping the line in sight as a line. Here the question narrows. What changes when the carver is a machine?

[Figure: Cycle time of category revision across six configurations, from decade-scale DSM revisions to seconds-scale chatbot identity loops]

Lines drawn from labels: supervised classification

The technique the iris dataset made canonical, supervised classification, has not changed shape since Ronald Fisher’s 1936 paper. Take a body of examples, attach each to a category label, train a model to predict the label from the features, deploy it on examples it has not seen. The label is given. The boundary is learned. Most industrial machine learning runs on this template. Its weakness is that the labels someone, somewhere decided to assign carry forward into everything the model does afterwards.
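In code, the whole template fits in a few lines. A minimal sketch using scikit-learn on Fisher’s iris data; the estimator and split here are illustrative choices, not a canonical recipe:

```python
# Supervised classification: the labels are given, the boundary is learned.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # features plus human-assigned species labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # learn the boundary
print(clf.score(X_test, y_test))   # accuracy, measured against the given labels

# Everything the model "knows" is relative to the label set it was handed:
# the three species categories were fixed before training started.
```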

Kate Crawford and Trevor Paglen’s 2019 audit of ImageNet is the canonical demonstration. ImageNet is the foundational computer-vision benchmark, with 14 million labelled images across 21,841 categories inherited from the WordNet lexical database. The label set was assembled by automated scraping plus low-paid crowd workers tagging images, and pushed straight into public release without any institutional review of whether the categories themselves were acceptable. Crawford and Paglen showed it carrying racial, gendered, and pejorative classifications, including taxonomies of human “kinds” that sit somewhere between psychiatric diagnoses and slurs. The labels shaped what every downstream computer-vision model learned to see, and every downstream system trained on those models inherited the framing. In Atlas of AI (Yale University Press, 2021), Crawford pushes the argument further. The infrastructure that produces machine learning, from labelled training corpora to data-labelling labour markets to the geographies of compute and the rare-earth mines that supply it, is shaped by political and economic choices that the systems themselves render invisible.

The 2016 ProPublica audit of COMPAS, the recidivism risk tool used in US sentencing, ran the same logic at higher stakes. An algorithmic decision boundary produced different false positive rates by race, drawn against a target (“high risk”) that was itself a category built from court records already encoding selective enforcement. The line came back to a judge as a sentencing recommendation. By the time anyone in the loop paused on it, the line had a body on the other side.

The technical follow-up went deeper. Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2017) independently proved a mathematical impossibility result. Imagine two populations with different actual rates of the outcome the model is trying to predict, say, two demographic groups with different historical reoffending rates in the training data. Any classifier trying to score risk in both groups must trade off against three natural definitions of fairness, and cannot satisfy more than two at once. First, a risk score should mean the same thing for everyone, so a 70% score corresponds to a 70% reoffending rate whether the defendant is black or white (called calibration). Second, innocent people should be flagged as high-risk at the same rate across groups (equal false positive rates). Third, genuinely high-risk people should be missed at the same rate across groups (equal false negative rates). All three feel like fairness. The proof shows they pull in incompatible directions when the underlying rates differ. A model that calibrates equally across groups will end up flagging more innocents in one group than the other. A model that equalises false positives will end up calibrated unequally. The three criteria define a Pareto frontier in the space of fairness definitions, and any classifier with non-trivial accuracy must pick a position on that frontier. COMPAS was not a buggy model that better engineering could fix. The disagreement about whether the tool was racially biased reduces to which of the three definitions one chooses, and that choice is irreducibly normative.
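The conflict can be made concrete with a few lines of arithmetic. Chouldechova’s proof rests on an identity linking the three quantities, with calibration operationalised as equal positive predictive value; the numbers below are illustrative, not COMPAS’s:

```python
# Chouldechova's identity: FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR),
# where p is a group's base rate of the predicted outcome. Hold PPV
# (calibration) and FNR equal across two groups with different base
# rates, and their false positive rates are forced apart.

def implied_fpr(p: float, ppv: float, fnr: float) -> float:
    """False positive rate implied by base rate, PPV, and FNR."""
    return (p / (1 - p)) * ((1 - ppv) / ppv) * (1 - fnr)

ppv, fnr = 0.7, 0.3                  # held equal across both groups
for p in (0.3, 0.5):                 # two groups, different base rates
    print(f"base rate {p}: implied FPR = {implied_fpr(p, ppv, fnr):.3f}")
# base rate 0.3: implied FPR = 0.129
# base rate 0.5: implied FPR = 0.300
```

Equalising the false positive rates instead would force the PPVs apart, which is the other horn of the same dilemma.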

What can be done about this, given the impossibility, is partly a documentation problem. Timnit Gebru and colleagues’ Datasheets for Datasets (2018, formal publication 2021) and Margaret Mitchell and colleagues’ Model Cards for Model Reporting (2019) are attempts to make the upstream choices traceable before deployment. A datasheet asks for the provenance, composition, intended use, and known limitations of a training dataset. A model card asks for the same about a trained model. Neither resolves the impossibility theorems. Both make the choices behind the line visible, which is at least a precondition for arguing about them.
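As a sketch of what that documentation asks for, here is a skeleton of a model card’s sections, paraphrasing Mitchell and colleagues rather than quoting their exact schema:

```python
# Skeleton of a model card; section names paraphrase Mitchell et al. (2019).
# The exact fields and their wording here are illustrative.
model_card = {
    "model_details": {"developer": "...", "version": "...", "model_type": "..."},
    "intended_use": {"primary_uses": [...], "out_of_scope_uses": [...]},
    "factors": ["demographic groups", "instrumentation", "environment"],
    "metrics": {"performance_measures": [...], "decision_thresholds": [...]},
    "evaluation_data": {"datasets": [...], "preprocessing": "..."},
    "training_data": {"provenance": "...", "known_limitations": "..."},
    "ethical_considerations": "...",
    "caveats_and_recommendations": "...",
}
```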

Lines drawn by similarity: unsupervised classification

Supervised classification at least keeps its labels visible to audit. Unsupervised classification does not. k-means and its descendants partition a dataset into clusters of points that look similar in feature space, but “similar” is itself a choice the algorithm cannot make on its own. The engineer picks the algorithm’s hyperparameters¹, the upstream choices that shape what “similar” will mean before the algorithm runs. Which features to include in the analysis. Which distance metric to use (Euclidean, cosine, Manhattan). How many clusters to look for. How to scale the variables. Where to seed the initial centroids. The algorithm then finds the partition that best satisfies those constraints. Run k-means twice with a different random seed and the clusters can shift. Drop a feature or change the metric and they shift more. In high-dimensional data the geometry breaks down further, with all points becoming roughly equidistant from each other and “similar” losing most of its grip (the curse of dimensionality). The boundary is no less drawn, only drawn through the joint action of the engineer’s choices and the data’s geometry, with neither party in full control of the outcome.
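The sensitivity is easy to demonstrate. A minimal sketch with scikit-learn on synthetic data, comparing the partitions produced by two random seeds and by a rescaled copy of the same points (an adjusted Rand index of 1.0 would mean identical clusterings):

```python
# Same data, clustered three ways: two seeds, then a rescaled copy.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * [1.0, 50.0]  # second feature on a much larger scale

kwargs = dict(n_clusters=3, n_init=1, init="random")
a = KMeans(random_state=0, **kwargs).fit_predict(X)
b = KMeans(random_state=1, **kwargs).fit_predict(X)
c = KMeans(random_state=0, **kwargs).fit_predict(StandardScaler().fit_transform(X))

print("seed 0 vs seed 1:", adjusted_rand_score(a, b))  # often well below 1.0
print("raw vs rescaled: ", adjusted_rand_score(a, c))  # the partition shifts again
```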

Lines learned from training: foundation models

The newest twist comes from foundation models. A trained transformer represents text, images, or molecules as points in a high-dimensional embedding space, where proximity encodes a similarity learned from billions of examples. The model has not been told what the categories are. It has built them in the geometry of its latent representations, with regions of the space corresponding to clusters of meaning the training data implicitly contained. When the model is then prompted to classify a sentence as positive or negative, or an image as a cat, it imposes the requested partition on its own private geometry. The categories are no longer drawn by humans at all. They emerge from training, propagate to downstream tasks, and remain almost impossible to audit by reading the code.
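Mechanically, a prompted classification can be read as a nearest-neighbour query against that latent geometry. A sketch, with embed() as a hypothetical stand-in for a trained encoder (the random vectors below carry no meaning; only the mechanism is the point):

```python
# Classification as a partition imposed on an embedding space.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a trained encoder mapping text to a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=768)
    return v / np.linalg.norm(v)

# Label anchors: the regions the request carves the space into.
labels = {"positive": embed("a glowing review"),
          "negative": embed("a scathing review")}

def classify(text: str) -> str:
    v = embed(text)
    # Cosine similarity against each anchor: the requested partition is
    # drawn through whatever geometry the encoder already built.
    return max(labels, key=lambda name: float(v @ labels[name]))

print(classify("I loved every minute of it"))
```

With a real encoder, proximity would reflect the similarities the training corpus taught it; here the point is only that the partition is imposed on a geometry the model brought to the encounter.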

Reading the categories back out: mechanistic interpretability

The most recent line of work tries to recover those categories from the inside. Inside a foundation model, every input passes through layers of artificial neurons, each producing a numerical activation. The activations are dense and distributed, with any single concept spread across many neurons and any single neuron typically responding to several apparently unrelated concepts at once (a property called polysemanticity). Mechanistic interpretability is the project of decomposing this tangle back into clean, named concepts a human can read. The most-used technique is the sparse autoencoder (SAE), a smaller network trained to express each activation as the sum of just a few features drawn from a much larger dictionary, on the principle that the underlying concepts are sparse even if the neuron-level patterns look dense. The intuition is that of a code book. Even though the original signal looks like noise spread across thousands of neurons, only a handful of meaningful concepts are firing at any given moment, and the SAE tries to identify which ones. Train the autoencoder on the model’s activations under a sparsity penalty, and it will learn a vocabulary, often tens of thousands or even millions of entries, where each entry corresponds to a recognisable concept and only a small subset is active for any given input. The Anthropic mapping mentioned at the start of this entry was produced this way, with each of those 34 million features a candidate name for one of the model’s internal categories.
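A minimal sketch of the architecture, with illustrative dimensions and a plain L1 penalty; Anthropic’s production runs use far larger dictionaries and more elaborate training recipes:

```python
# Sparse autoencoder: reconstruct dense activations from a few entries
# of a large learned dictionary.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_dict: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activation -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)  # few features -> reconstruction

    def forward(self, x):
        feats = torch.relu(self.encoder(x))        # sparse, non-negative features
        return self.decoder(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                    # sparsity pressure; needs tuning

for step in range(1000):
    acts = torch.randn(256, 512)                   # stand-in for model activations
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```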

The technique is contested. A January 2025 consensus paper from twenty-nine researchers across eighteen organisations laid out the open problems. Two cut at the heart of the method.

The first is reconstruction loss. The standard test of whether the SAE has captured what the model is doing runs in three steps. Take the model’s original activations on some input. Pass them through the trained SAE, which extracts a sparse list of named features. Reconstruct an approximation of the original activation by adding the discovered features back together. This rebuilt activation is what the SAE thinks the model “really” represents, stripped down to its meaningful components. The diagnostic is to plug this reconstruction into the model in place of the original activation and see whether the model still produces the same answer (a code sketch of this patch-in test closes this section). It does not. The model’s accuracy on downstream tasks drops by 10 to 40 percent. Whatever the SAE has not captured is doing real work, which means the named features are not a complete description of what the model is computing.

The second problem is the random-network finding. The same SAE methods, applied to networks whose weights have never been trained at all (still set to their random initial values), still extract apparently coherent features with apparently meaningful labels. The technique is producing named clusters whether or not the underlying model contains real structure.

Google DeepMind has publicly pivoted away from sparse autoencoders toward what it calls pragmatic interpretability, looking for whatever interpretive method actually predicts the model’s downstream behaviour, even at the cost of clean human-readable features. Pragmatic methods include probing classifiers (small models trained to detect whether a concept is present in a hidden state), causal interventions (deliberately editing activations and measuring how outputs change), and circuit-level analysis that traces the path from input to output through specific attention heads and computations rather than trying to enumerate features. The recovered partitions are real enough that the embedding space is no longer pure noise, but whether those joints cut the world or only the training corpus is the question that remains open.
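The patch-in test behind the first problem is mechanically simple. A sketch, with model, layer, and sae as stand-ins (the SAE returning a (reconstruction, features) pair as in the sketch above):

```python
# Run the model once normally, once with a chosen layer's output spliced
# out for the SAE's reconstruction, and compare the answers.
import torch

@torch.no_grad()
def patched_output(model, layer, sae, inputs):
    def splice(module, inp, out):
        recon, _feats = sae(out)   # rebuild the activation from sparse features
        return recon               # returning a value replaces the layer's output
    handle = layer.register_forward_hook(splice)
    try:
        return model(inputs)
    finally:
        handle.remove()

# Agreement between original and patched predictions; the gap is the
# unreconstructed remainder doing real work the named features miss.
# original = model(inputs).argmax(-1)
# patched  = patched_output(model, layer, sae, inputs).argmax(-1)
# print((original == patched).float().mean())
```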

The loop at industrial speed

Hacking’s looping effects, in a 2026 deployment, run faster than the institutions designed to govern them. The loop he described, where a category reaches the people it labels and they then change because of it, used to run through scientific journals, talk shows, psychiatric handbooks, and the slow accumulation of self-advocacy movements, across decades. The current configuration runs through foundation models, in seconds.

The mechanism is amplified by automation bias, the well-documented tendency of human users to weight machine-generated outputs more heavily than equivalent human judgements. A radiologist trusts the AI second-read. A loan officer defers to the credit model. A content moderator accepts the classifier’s verdict. Decades of human-factors research, from aviation autopilots to clinical decision support, show that systems perceived as “the machine’s answer” routinely override professional intuition even when the machine’s judgement is wrong. The implication for classification is heavy. A foundation model trained on historical data inherits whatever biases that data encoded, produces categorisations that look authoritative because they came from a system, and feeds those categorisations into institutional decisions where their machine origin makes them harder to challenge than the human classifications they replaced. The categories that emerge from the model thus become more naturalised, not less, than the categories that preceded them. Hacking’s loop now has a sycophantic intermediary that lends the appearance of objectivity to whatever pattern the training data already contained.

When the classifier is a foundation model that hundreds of millions of people consult for advice, diagnosis, summary, and judgement, the mechanics of the loop become concrete. A user brings an ambiguous experience to a chatbot, the chatbot returns a category, the user updates their self-description, the user’s writing then enters the next training corpus, and the next generation of model trains on a corpus already saturated with the prior generation’s classifications. Each cycle is a few seconds long, and the original namers (the model trainers, the prompt engineers, the deploying companies) are further from the people the categories now reach than any DSM committee ever was. The looping effects of human kinds, in 2026, are no longer mostly produced by professionals naming a syndrome. They are produced by autoregressive models predicting the next token, deployed at the scale of a search engine.

The cases are already visible. The TikTok-driven self-diagnosis wave around ADHD and autism, documented since 2020 by clinicians in the US, UK, and France, has migrated to chatbots. Users present clusters of symptoms, the model returns the matching DSM-style label, the label gets adopted as identity. Practitioner surveys in 2024-2025 report a sharp rise in patients arriving at first appointments with chatbot-confirmed self-diagnoses already in place. A parallel pattern around mental-health and gender categorisation has surfaced in clinical and policy discussions, with chatbots functioning as the first interlocutor in identity formation for adolescents who would once have spoken to a peer, a teacher, or a clinician. The category returned by the model becomes formative in proportion to the absence of competing voices, and the loop tightens further when users feed their reframed self-descriptions back into subsequent prompts.

What remains for the algorithmic case

The discipline of attention I argued for in the previous entry is harder to maintain when the classifier is a machine. The line is no longer drawn by an expert or a committee whose members can be questioned, on a timeline that admits revision, against alternatives that remain visible. It is drawn through the joint action of training data, optimisation objective, hyperparameter choice, and architectural quirk, by parties most of whom cannot be reached, let alone challenged. The features the model has built are recovered, when they are recovered at all, by techniques that are themselves contested. And the loop running through hundreds of millions of users compresses the temporal scale of human-kind feedback from decades to seconds.

The law of the instrument cuts harder when the instrument is a model. The model’s hammer has the largest catalogue of nails ever assembled, refines its catalogue with each new generation of training, and has no other relation to the world than the categories it brought to the encounter. The diagnostician with the manual at least could leave the manual at the office. The model is the manual.

Three modest moves earn their keep, given the constraints.

  1. Audit the labels. Crawford and Paglen for ImageNet, ProPublica for COMPAS, the documentation movement of datasheets and model cards trying to make label provenance traceable before deployment.
  2. Name the impossibilities. The Chouldechova-Kleinberg result is the canonical example, and the argument it forced (that the choice of fairness criterion is normative, not technical) is more useful than any individual fix.
  3. Treat mechanistic interpretability as research, not yet as audit. The features sparse autoencoders surface are real findings, but they should not be confused with a complete account of what the model is doing. Notice especially when the recovered categories are alien rather than familiar. The model has built joints we do not have words for, and the temptation to translate them into our existing vocabulary is the move the previous entry warned about.

It is also important to acknowledge where machine classification has earned its place. Medical imaging is the example with the strongest empirical track record, with deep learning systems now outperforming radiologists on several specific detection tasks. Raidium², the Paris-based foundation model company, ranked first across 18 CT diagnostic targets at the CVPR 2026 Workshop Challenge with its Curia-1 model, ahead of comparable systems from Stanford, Harvard, and Toronto. The clinical value of high-precision, consistently performing classifiers on bounded medical tasks is real and growing. Fraud detection, accessibility tools, drug-target identification, and protein structure prediction (AlphaFold’s now-canonical case) sit on the same earned-categories side of the ledger. The audit’s job is to discriminate between cases where machine classification works and cases where it ports human bias forward at scale.

Machine learning will not solve the underlying problem of classification. Categories remain constructed, performative, and political even when produced by a model. What the technology offers, used intentionally, is a different shape of opportunity. The categories of an institution like the DSM ossify between editions and revise on decade timescales; the categories of a well-instrumented model can be re-examined, re-trained, and re-deployed in a continuous cycle. Where DSM-III imposed discrete operational checklists because the mid-century alternative was unfalsifiable narrative, a modern dimensional model can carry the resolution that HiTOP and RDoC are reaching for without forcing a return to either categorical buckets or psychoanalytic prose. The architectures that gave us mechanistic interpretability give us the audit tools, in principle, to inspect and contest the categories the system has built.

The condition is intentionality³. The opportunity is real only if the categories are continuously audited, the impossibilities continuously named, the interpretability work continuously pushed, and the institutional incentives aligned to make explainability a deployment requirement rather than a research aspiration. The default trajectory, in the absence of those conditions, is the one this entry has spent its length describing. Looping effects at the scale of a search engine, with a sycophantic intermediary lending the appearance of objectivity to whatever pattern the training data already contained.

Carving the world is what produces the kinds we then think with. The classifier who keeps the carving visible does less harm than the one who has stopped seeing it. The institution that builds explainability into its deployments does less harm than the one that ships a black box. The discipline does not become impossible when the classifier is a machine. It becomes the precondition for the technology to build a desirable future.

Footnotes

  1. [Definition] In machine learning, the parameters chosen by the engineer before training, as opposed to the parameters the model itself learns from data. The number of clusters in k-means, the learning rate of a neural network, the depth of a decision tree, the regularisation strength, the random seed: each is a hyperparameter. They sit one level above the parameters of the model, hence the prefix. ↩

  2. [Disclosure] Raidium is a portfolio company of Galion.exe, the venture firm I co-founded. ↩

  3. [Expansion] The contemporary AI-policy literature increasingly uses agency for what I am calling intentionality here, often in connection with debates around AI agents and human oversight. I prefer intentionality because it carries the older sense of deliberate, attentive practice without the ambient confusion of “agency” in current AI discourse, where the same word is used for autonomous-system capability and for human deliberative control. ↩

#AI #classification #machine-learning #fairness #interpretability