In 1981, working as an electronic design intern at the headquarters of the New Zealand Forest Service, I was given exclusive access to a personal computer for the first time in my life. The hours I spent playing with the Tandy Radio Shack TRS-80 that summer extended far beyond the work I was paid to do. Among other experiments, I used a random number generator to combine lists of nouns, adjectives and verbs into unexpected sentences, with occasionally entertaining results. Applying (I hope) due irony, I called this experiment an "automatic novel generator". I had forgotten this episode, in part from embarrassment. But it did slightly impress my new girlfriend - enough that she remembered it after 40 years (of marriage), and suggested it might be relevant to this book.
My sophomoric novel generator was by no means a demonstration of AI, and if I’d been better educated, I would have been able to relate it to centuries of experimentation in which new sequences of words had been mechanically constructed, as a kind of aid to creativity or source of oracular guidance.
Twenty years later, to prepare for an urgent call to let me know our first child was on the way, I’d recently bought my first cellphone. Entering SMS text on a tiny keyboard was a pain, and my late friend David MacKay showed me the first prototype of his Dasher software which promised to make text entry far easier, especially for people with disabilities. David had invented Dasher after a conversation on a bus journey at the annual conference on Neural Information Processing Systems (NIPS, now NeurIPS), lamenting the inadequacy of those text interfaces. Based on animated diagrams illustrating the relative likelihood of the next letter in a sequence of English text, David realised that the animation could become a user interface for faster text entry. For example, after typing the letters t, h, and i, the next letter is quite likely to be “s”. In the animation, the screen would then zoom in so that “s” can be entered very easily indeed.
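The prediction mechanism behind Dasher can be sketched in a few lines of Python. This is a minimal character-level model over an invented toy sample, not the actual Dasher code: it simply counts which letters have followed a given context in the text seen so far.

```python
from collections import Counter

def next_letter_counts(text, context):
    """Count which letters follow a given context string in the text."""
    n = len(context)
    return Counter(text[i + n] for i in range(len(text) - n)
                   if text[i:i + n] == context)

# Invented sample text, for illustration only.
sample = "this is the thing and this is this"
counts = next_letter_counts(sample, "thi")
print(counts.most_common(2))  # [('s', 3), ('n', 1)]
```

In a Dasher-style interface, these counts would set the relative sizes of the zooming letter targets, so that the most frequent continuation is the easiest to select.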
David and I failed to persuade phone manufacturers that this approach to faster text entry (in the dark ages before the iPhone and other touch-screen devices) would become attractive. However, the basic principle of “predictive text” has since become a familiar everyday experience. In the early days, text prediction was little more than an in-built dictionary, able to complete a word once the first few letters had been entered, or auto-correct a sequence of letters that was close enough to a known word. Systems like Dasher were not constrained by word boundaries, meaning that even after a word was complete, the next few letters of the following word could be predicted. Nowadays, we are all familiar with document editors and email applications that will predict the next word in a sentence, or the rest of a phrase, all based on a language model that has been statistically trained to know the probability of a given letter or word coming next in a sequence of text.
Another 20 years after my experiments with David MacKay, I was invited to contribute to a training course in Digital Humanities, teaching graduate students in language, history and media studies how to process text as statistical sequences. In the training examples prepared by my colleague Anne Alexander, students wrote Python programs to compile statistical tables of how often any given word is seen to follow other words in different kinds of text. One example we experimented with was the collected Tweets of then-president Donald Trump. Another was the collected works of William Shakespeare. Those different frequency tables could be used to generate predictive text in the style of either Trump or Shakespeare, completing sentences using the particular kinds of words those men might use. These simple (but entertaining) student exercises, like David MacKay’s Dasher, are statistical language models, capturing particular ways of speaking, and using their statistical data to produce new text that resembles what the model was trained on.
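Exercises of the kind Anne prepared can be sketched as a minimal word-bigram model in Python. The toy corpus below is invented for illustration, but the principle is the same as with the Trump or Shakespeare texts: tabulate which words follow which, then generate new text by repeatedly sampling a likely next word.

```python
import random
from collections import defaultdict

def train_bigrams(text):
    """Record every word seen to follow each other word."""
    table = defaultdict(list)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        table[prev].append(nxt)
    return table

def generate(table, start, length=10):
    """Generate text by sampling followers in proportion to their frequency."""
    word, out = start, [start]
    for _ in range(length):
        followers = table.get(word)
        if not followers:
            break  # no recorded continuation for this word
        word = random.choice(followers)
        out.append(word)
    return " ".join(out)

corpus = "to be or not to be that is the question"
model = train_bigrams(corpus)
print(generate(model, "to"))
```

Because duplicated followers appear multiple times in each list, `random.choice` automatically samples in proportion to observed frequency, which is exactly the statistical behaviour the student exercises demonstrated.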
I was naturally curious to see whether my own habits of writing could be captured in a language model. While I wouldn’t like to place my own writing in a precise location between the extremes of Shakespeare and Trump, in my academic career I have probably written a comparable number of words to both. I spent a few hours collecting manuscripts spanning more than 10 years, and Anne used them to construct a language model by fine-tuning the experimental platform GPT-2 (later versions of this OpenAI software, including GPT-3, ChatGPT, and GPT-4, have since become far more famous than the version we used in 2021). It turned out that this “Auto-Alan”, as we called the language model, was able to write convincingly in the same style as me, just as we had previously demonstrated automated output that sounded like Trump or Shakespeare. Auto-Alan “wrote” a short academic paper that we presented to an audience under the title Embracing the Plagiarised Future, anticipating the widespread soul-searching that would arise within the next couple of years. The following year, I read Auto-Alan's paper to a research workshop attended by my PhD supervisor Thomas Green, who invited his friends to consider how much of my previous work might also have been automated.
I don’t think these three little stories about my own experiments at 20-year intervals are particularly impressive. Any enthusiastic amateur at those times might have been doing the same kinds of things. However, my personal experiences provide useful landmarks as an overview of how language generation technology has developed over 40 years. Although there have been impressive advances, these can largely be attributed to the increase in computing power described by Moore’s Law. Typical computers in 2021 had become about a million times more powerful than typical computers were in 1981. This is similar to the difference in body weight between me and an ant, so you might say the intellectual advances of AI in these 40 years are disappointing by comparison to the huge scale of physical infrastructure and investment they have required.
Nevertheless, there are some principles that have stayed the same over 40 years, and are therefore quite likely to continue for another 40 years. It’s worth paying more attention to these.
Firstly, all these kinds of experiment have involved a language model. In the simplest case - my novel generator of 1981 - the model consisted of some lists of words and a few simple grammar rules. As I explained in the last chapter, that GOFAI era involved collecting facts and writing down rules, an activity often described as “knowledge engineering”. In contrast, the language model of Dasher was trained statistically from example texts, not by writing out rules, and was therefore completely determined by the examples of text it had been shown. If shown biased and offensive text it would suggest that kind of vocabulary, but if trained with the complete works of Jane Austen (as we did in one experiment), it would predict phrases from the 18th century. David Ward, who developed Dasher for his PhD, originally trained the model with a large archive of his own email messages. This might have been an effective productivity aid when chatting to his friends, but could be embarrassing when potential sponsors watched the model predict his habit of swearing in emails.
The basic language models of GPT-2 and its various successors are trained using huge amounts of text collected from the internet. It is so expensive to process this much text, and the commercial investment so sensitive, that the companies building these models are secretive about exactly where they get the training text from. We can imagine that, for example, it would be easy to include the whole of Wikipedia as part of the language model. We can gain some insight into the behaviour of such models by “fine-tuning” them as I did, when I trained GPT-2 to give more statistical weight to the kinds of language in my own academic writing.
Secondly, an important principle is that these language models are used for prediction. If I train a language model with a statistical survey of the kinds of words Donald Trump used in the past, I can use this to predict what he might say in the future. So far, this particular prediction has been fairly reliable, because Trump is still saying the same kind of things. Similarly, Auto-Alan could be quite useful to me as a kind of predictive text keyboard, because its language model includes a dictionary that gives more priority to the kinds of words I’ve used in the past, and that I’ll quite likely continue to use in future.
The origin of the training text is really important, when a language model runs without further human intervention. Imagine that, when composing a message on my phone, I decided simply to accept the next word suggested by my predictive text keyboard, whatever that was. This would result in a message containing the kinds of words that people use on phones, arranged in plausibly the same order that people do arrange those words, but it would not be a message from me at all! It would simply be an exhibition of what is contained in the language model of my phone. I quite often did this experiment with Dasher - just pressing its accelerator to continue generating text, outputting whatever letters it considered most likely on the basis of its training, which was entertaining but seldom useful.
The third general property we can see over these 40 years is that randomised text output, even from simple language models, can sometimes be entertaining or interesting. My “automatic novel generator”, although a trivial program, still created valid English sentences. The sentences were repetitive, and every word a non-sequitur, but occasionally the results were funny or thought-provoking. The predictive text output from Dasher, even when suggesting words I hadn’t been intending to write, was also often interesting. Many people have similar experiences with predictive text. While most suggestions are boringly predictable, random mistypings might result in entertaining alternatives. The same continued to be true when I saw the output from Auto-Alan. Some sentences were almost directly copied from things I had written before, while others included words that I do like to use, but combined in ways I hadn’t seen before. Although I haven’t done this (yet), I can imagine that some of those unexpected combinations of words might suggest ideas for future research projects, just like William S. Burroughs cutting up his manuscripts. I was amused to see that the text created by Auto-Alan included references to previous academic work, including a paper supposedly written by two friends of mine, despite the fact that those two people had never worked together. The suggestion by Auto-Alan made me wonder whether I should introduce them, because any result of their collaboration would certainly be interesting.
In a classic publication that first alerted many readers to the now-familiar problems of large language models, Emily Bender, Timnit Gebru and colleagues described these systems as “stochastic parrots”. One of the main concerns of the stochastic parrots authors was to draw attention to the fact that these experiments were being conducted at massive environmental cost, through the amount of computation required to extract and model so much of the Internet. The energy usage of large language models continues to be a pressing problem, both practically and ethically, as we ask how much benefit really comes from such experiments and at what cost. But the most immediate impact of that paper may well have been its brilliant title, which so effectively summarised the fundamental principles of operation. The text that LLMs so fluently produce creates the illusion that it might have been directly written by a human author. But the description “stochastic parrot” drew attention to the way they really work.
As shown in my own experiments over the course of 40 years, text produced mechanically from language models can be very engaging. But the content of such text is only ever a re-ordering of the text that was previously used to train the model (they are parrots), spiced up with some random elements that will be more unexpected (“stochastic” is simply a technical term for randomness).
In the next chapter, I will say more about the capabilities and opportunities that come from encoding so much of the Internet in a language model. At the time I write this, newspapers and broadcast journalists describe new advances and controversies about LLMs almost every day. Perhaps the excitement will have died down by the time this book is printed - many of my friends think me both brave and foolish, for trying to explain such a fast-moving field via the old-fashioned medium of book publication. Some of my less critical students suggest that by then, an AI-based LLM could write the whole book automatically, saving both me and my publisher a great deal of effort.
It would be foolish, at a time of such rapid advance, to make confident predictions about what can’t be done in the next few years. However, one reason for starting this chapter with a 40-year history is to emphasise that the human interpretation of computer systems changes relatively slowly. This is not because technical advance has been slow: a million-fold increase in computing power over four decades is impressive. But four decades is only two human generations, and social acceptance of new technologies happens on a scale of multiple generations, not product seasons, in part because science famously advances one funeral at a time.
Before going on to say more about the opportunities of human interpretation, it is worth reinforcing some of the technical limitations that will continue to define the capabilities of any predictive text models, based on the principles of information theory developed by Claude Shannon at Bell Labs. Shannon’s theory is now recognised as a fundamental principle of all information technologies, well beyond communication systems such as the AT&T telephone network that originally funded Bell Labs. Information theory is perhaps the most significant practical advance in mathematical physics since Isaac Newton’s laws of gravity. It is not yet taught in high schools, but certainly will be before long. And as far as LLMs are concerned, I believe that trying to explain their capabilities without the mathematics of information theory is like attempting a proper theory of how balls move on a pool table without using Newton’s equations.
In its simplest form, information theory measures how much information is transferred over a communication channel. Although this seems straightforward, there are subtle, almost paradoxical implications. Seventy years after Shannon, we know how to measure data in megabits or gigabits when we stream a movie or email a photo. But the paradoxical aspect is that not all data counts as information. If you send me an email message (please do), then immediately send the same message again (please don’t), the second one uses more data, but without giving me any more information – the new data is redundant in the technical jargon of information theory. On the other hand, if you send the same email to a different person, now the same message is not redundant, because that recipient hasn’t seen it before. While it is easy to measure how many bits are transmitted through a cable, information content is hard to measure objectively, because it depends on who receives the message. If the receiver already knows the contents, no information has been transferred. Information theory measures how much information the receiver did not already know. Information theory is a measure of surprise!
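Shannon’s measure of surprise can be computed directly: an event with probability p carries -log2(p) bits of information, and the average surprise over a whole distribution is called its entropy. A minimal sketch in Python, with probabilities invented purely for illustration:

```python
from math import log2

def surprisal(p):
    """Information content, in bits, of an event with probability p."""
    return -log2(p)

def entropy(probs):
    """Average surprise over a probability distribution."""
    return sum(p * surprisal(p) for p in probs if p > 0)

# A fair coin toss is maximally surprising: one full bit per toss.
print(entropy([0.5, 0.5]))  # 1.0

# A heavily predictable source carries far less information per symbol.
print(entropy([0.9, 0.1]))  # about 0.47
```

The second source uses the same two symbols as the coin, but because the receiver can usually guess what comes next, each symbol delivers less than half a bit of genuine news.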
As I discuss in more detail in chapter 12, what we call “creativity”, when done by a machine, is more precisely a measure of how much we are surprised by what the machine does. So information theory can be used in some way as a measure of creativity.
An apparent paradox of information theory is that the most expensive message to send over a communication channel is a completely random sequence of numbers. Imagine a message that is composed of zeros and ones, where every bit is chosen by flipping a coin. The person receiving that message would have no way to predict what the next bit is going to be, meaning that every bit that arrives is a complete surprise. That message is as surprising as it could possibly be, but we don’t perceive it as being creative. On the contrary, a sequence of random bits is heard as noise (another technical term in information theory).
Machine learning systems, by definition, learn to replay information that they have received (their training data). Shannon and Weaver, in their original publications on information theory, paid close attention to the concept of a language model in understanding how much information is being transferred in human communication. When a language model is used for text prediction, each new letter is chosen on the basis of information theory: the principle of least surprise. The word your phone predicts is the one that is least surprising, or most expected to come after what you just entered. The least surprising word is also the least creative. Indeed, who would want a more creative predictive text algorithm on their phone? A creatively “intelligent” phone, which surprised you with words completely different to what you were expecting to say, might be quite a liability in everyday use!
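The “principle of least surprise” amounts to choosing the most probable continuation. A sketch in Python, using an invented toy frequency table rather than any real phone’s dictionary:

```python
from collections import Counter

# Invented frequency data: how often each word followed "good"
# in some hypothetical training text.
followers_of_good = Counter({"morning": 40, "luck": 25, "grief": 5})

# Predictive text suggests the most probable, i.e. least surprising, word.
prediction = followers_of_good.most_common(1)[0][0]
print(prediction)  # morning
```

A “creative” keyboard would instead sample from the unlikely tail of this table, which is exactly why nobody wants one.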
I’ve already explained that the most surprising message (in an information theory sense) is a sequence of completely random numbers or coin tosses, since that would mean there was no way to predict any number from the ones that came before. Completely random messages are very surprising, but paradoxically very uninteresting, because they are just noise, communicating nothing at all. There is no hidden message in a series of coin tosses or dice throws, no matter how much we might want to find one. Random information is perceived as surprising precisely because there is no message that could have been predicted. An AI system can produce surprising output if its design includes random elements. But this is not creative in the human sense, because it is not a message from anywhere. Random elements in a digital sequence are not a signal, but noise. As described by philosophers of mind, a random message has no meaning because there is no intention behind it.
I hope this explanation has made it clear why “stochastic parrots” is such an accurate and succinct summary of the nature and capabilities of large language models, and also of my own experiments with text generation over the previous 40 years. A system that generates text from a language model is a Shannon information channel that can output only information that has been put into it - a parrot. No language model has ever created new information, and none ever will, any more than we will ever see a perpetual motion machine. The output of language models does become more interesting when we add random noise, making them stochastic. This can never be new information, but might seem interesting because it is unexpected, giving the human reader the challenge of how to interpret what they see, as I will discuss in the next section.
Predictive text from language models is certainly useful, especially if the language model contains enough information that the text you were already wanting to write is contained in the model. I use simple predictive text every day, to help write familiar words quickly (it’s very rare that I want to write a word that is not in any dictionary, which is why it is usually helpful to have words completed or my spelling corrected automatically), and to complete sequences of words (the English language includes many small words that do not have interesting meanings in themselves, but have to be included in the right order to make grammatical sentences - in this sentence, they included “the”, “that”, “do”, “to” etc).
I’ll say more about the benefits (and dangers) of relying on predictive text in the next chapter. This is the “parrot” part of the story, and is all about automating routine actions - the useful capability of computers to “do the rest” for us, by repeating, parrot-like, variations on things they have seen before. But this has also been recognised as a shortcoming of computers for over 70 years, prompting a letter to the Times of London that defended the reputation of parrots against a Professor Jefferson whose Lister Oration on the “mechanical brain” had recently argued: “It was not enough to build a machine which could use words; it would have to be able to create concepts and find for itself suitable words in which to express them. Otherwise … it would be no cleverer than a parrot”. The letter-writer complained that this was unfair to parrots, and that “Unless it can also lay eggs, hang upside down from its perch, scratch itself in unlikely places, and crush the fingers of unwary visitors in its powerful beak until they scream in agony, no machine can start drawing comparisons between its own intellect and a parrot’s.”
Before saying any more about parrots, I want to spend a little more time considering the value of the “stochastic” element. These are the times when systems built using language models add random noise, mixed in with the useful signals that have been parroted from their training data. As I’ve already described, it can be entertaining to see random stuff that is output from a computer. Sometimes random output is interesting, while sometimes it’s not. But the difference between interesting and not interesting really depends on what you were expecting, or how you interpret it.
Interpreting random information is an ancient game, entertainment, and even tool for decision making in many societies. A familiar example is the toss of a coin at the start of a football match. Although all kinds of factors could be used to decide which team kicks the ball first, the number and variety of contextual questions that might be considered is so large and complex that the debate could last longer than the game. Making a decision on a random basis is therefore an effective decision strategy. Many professional and scientific decisions are made in the same way, such as the Randomised Controlled Trial that is routine in the pharmaceutical industry.
A heads/tails toss or an A/B test leaves little room for interpretation, but it’s useful to compare large language models to other randomised processes where individual symbols have more meaning. Fortune-telling games like horoscopes or tarot cards are interesting examples. The signs of the zodiac, and the elaborate symbols of the tarot deck, have evolved over generations to include many layers of meaning. A skilled “reader” of horoscopes or tarot cards is able to weave together these richly ambiguous symbols into a story about someone’s life and imagined future destiny. The combination of complex symbols with randomised shuffling allows the reader to interpret them in an evocative improvised performance.
There is little opportunity for creative interpretation when tossing a coin, especially after a sequence of tosses simply evens out into similar numbers of heads and tails. But randomised juxtaposition of meaningful symbols becomes an opportunity for interpretation. When combined with interpretation, the “signal” comes from the combination of the traditional meanings those symbols have acquired, together with the skill of the interpreter, while the randomisation process of shuffling is “noise” that has been intentionally introduced for much the same purpose as the coin toss at the start of a football game - if you don’t have all the information needed to make a fair judgement, it might be fairer to make a random one.
It’s important to understand the value of noise in these performances, since this is the basis on which the whole “system” (cards, symbols, reading etc) provides outputs that are unexpected. The whole point of a divination performance is to tell you something that you didn’t expect. In terms of information theory, this is precisely what noise is good for. For example, there are situations where robot control can be improved by adding noisy “jitter” to prevent the control algorithm getting stuck. However there is room for misinterpretation, if anybody confuses the random noise with the signal itself. Some traditional divination practices tell interpretive stories in which the random outcome is described as a message from a spirit, ancestor or god, while astrologers and tarot readers might describe the random elements as a message from “fate” or “destiny”. These supernatural characters are an important part of the interpretive story, combined with the cultural associations of the symbols, but the interesting message is the interpretive performance, not the random noise.
These familiar traditional practices can be compared to the apparently supernatural or magical powers of the latest large language models. The “stochastic” part of the stochastic parrot output is noise, not signal. It does not carry a message from anywhere, because it is not a message. But because the symbolic contents of the model (nearly the whole of human language) are so rich, the combination of these symbols might be interpreted as if the random noise were a message from somewhere mysterious. Some popular writing on large language models describes this mysterious capability as the emergence of “consciousness,” “sentience” or “artificial general intelligence”, but a fortune teller might prefer the terms “fate”, “the spirit world”, “the cosmos” or whatever. From a technical perspective, the terms are pretty much interchangeable, since none would make any difference to the mathematical construction of the model itself.
It’s worth mentioning that there is also an historical tradition of performance associated with purely mechanical AI - the “parrot” part of the stochastic parrot formulation. These are performances where a human actor controls some kind of mechanical puppet or costume, so that real messages from a real human (the actor) appear to be coming from a machine. One of the most famous examples, often mentioned in histories of AI, is the “Mechanical Turk” created by Wolfgang von Kempelen in the 18th century. This chess-playing automaton was a sensational success with audiences across Europe for decades. Presented as a magic trick or fairground-style attraction, it supposedly demonstrated the wondrous advances of the mechanician’s art, along with the frightening prospect that machines might become the intellectual superiors of humanity. Much the same as the AI messages of today, in fact.
Of course, as educated audiences at the time were well aware, these performances were magic tricks, involving an expert chess player of small stature who hid inside the robot costume. Although the automata of the 18th century were impressively elaborate moving sculptures, they had none of the computing elements that would be necessary to calculate the rules or strategies of chess. The most impressive part of the Mechanical Turk was the costume, not the intelligence.
In my opinion, the same is true of the AI demonstrations of today. One of the world’s largest AI companies, Amazon Web Services, actually calls one of its products Amazon Mechanical Turk, so this is not even a secret they are trying to hide. The whole point of AMT (as it is widely known) is to commission humans to hide inside the impressive mechanical costumes of modern computation. AMT workers, or “Turkers”, are available on-call for any moment that a computer AI system encounters a hard problem - called a Human Intelligence Task or HIT. Many AI companies have to decide which problems they could solve with an expensive algorithm, and which would be more cheaply completed by sending them as a HIT to one of the Turkers.
It’s not difficult to build a fraudulent “AI” system, just like the original Mechanical Turk, where the software does nothing of any complexity, and almost all the interesting behaviour is implemented by hiring a Turker (or a contractor from one of many competing “crowd-sourcing” brokerages) behind the scenes. There have been numerous cases where AI companies simulated the advertised capabilities of their software with hidden human labour. In one of the most embarrassing, the hidden humans were not even Turkers, but research PhDs at a start-up who had been told by their bosses they were fine-tuning an AI prototype, but eventually figured out they were responding directly to customer requests.
Feeding a question from a customer directly to a hidden human, while pretending the answer comes from an AI, might seem fraudulent, but there are more subtle cases. AMT is often used, not to answer customers’ problems in real time, but to provide the answers to typical historical cases of the problem as “labels” defining how that case ought to be treated in future. AMT work is so cheap that companies can afford to present many thousands of hypothetical problems to the Turkers, storing every answer as a “label” that might be replayed in future when a similar case is seen again.
The creation of huge labelled datasets, to be used as examples of intelligent behaviour, is known as “supervised machine learning”, and was the underlying practice that started the deep learning revolution with the publication of the ImageNet dataset in 2009. The creators of this training dataset, led by Fei-Fei Li, collected immense numbers of photographs from the Internet, and employed thousands of AMT Turkers to create labels for each one by selecting words from the WordNet dictionary. The resulting database of pictures and human labels was used to train “neural” networks that could replay the appropriate labels when shown new pictures resembling, in some way, the training examples that shared the same label.
Although widely celebrated as a revolution in AI, these image classifier systems were clearly mechanical puppets in the tradition of the original Mechanical Turk. Although the human judgements might have been commissioned and stored in advance, rather than fraudulently redirected in real-time, a puppet whose motions are based wholly on stored behaviour does not seem too much different in principle from a puppet that is controlled by mechanical linkages and recorded cam profiles, such as the amazing handwriting automaton created by Pierre Jaquet-Droz in the late 18th century.
We have centuries of history in which the performances of machine intelligence are most impressive for the variety and sophistication of ways in which the real human intelligence has been stored and hidden from the audience. Although the presenter might emphasise the wondrous spectacle by describing the machine as moving of its own accord, and with its own intelligence, this is all showmanship. If a presenter of Jaquet-Droz’s writing automaton were to insist that the machine was an author that had composed the text by itself, this would ultimately be an act of plagiarism, since the original author was the person whose text had been copied (or perhaps the “author” of the handwritten letter forms, so artfully encoded into the shapes of the cams that would replicate them).
When AI researchers build supervised machine learning systems and then claim that the system itself, rather than the AMT workers who provided the training labels, is the author, this would also seem a fairly clear case of plagiarism if the labelling process required any kind of original judgement. AI researchers do not very much like to talk to the general public about the training datasets that they use, and certainly don’t like to talk about the real humans who created that data. In fact, the whole point of AMT is to make the Turkers anonymous - it is against the AMT rules to ask any question that might reveal the Turker’s identity. This all seems consistent with so many other business practices, including the warehouses of Amazon’s shipping business. The company prefers the whole system to look like a magical robot, and for its customers not to think too hard about the lives or working conditions of the people making and packing the products.
This is why it is so significant to describe LLMs as “parrots”. Rather than focusing on the impressive performance of the mechanical parrot, we ought to remember that this is just a costume or puppet, hiding real humans who provide the actual intelligence. Pretending that the message comes from the parrot itself, and not from the human authors who taught it, is a kind of institutionalised plagiarism.
As LLM-based products get deployed more widely, we will learn a great deal more about the changing status of text, and of textual labour. I won’t spend more time trying to guess what that future may look like. But before ending this chapter, I’d like to consider the historical change from the clearly technical nature of early programming languages to the more ambiguous status of “natural” language text when used for purposes that might previously have required programming. I will discuss the evolution of programming languages in chapter 13, but it’s worth remembering that for many years, any kind of human interaction with computers was considered to be “programming”, for example in Gerald Weinberg’s classic 1971 book The Psychology of Computer Programming. His title did not really refer to programming as we understand it today, but to issues around information systems, human-computer interaction (a term that wasn’t widely used until the influential work of Card, Newell and Moran a decade later), or even the broad context of interaction design and user experience (as it is called now).
The statistical methods of processing natural language that have led to the current LLM boom are distinctive in treating natural language text primarily as data about word sequences and frequencies, rather than as the embodied practice of humans sharing a sound world with other humans. The science and technology scholar and engineering educator Louis Bucciarelli, in his ethnographic studies of engineers, observes how those with primarily technical training must navigate between, on one hand, the object-world of the artefacts they are building and, on the other, the world of social process within which products will be deployed and collaborative work must be coordinated. Engineers who work on natural language processing must act both as humans, using language as part of their professional social processes, and as observers, measurers and theorists of language-as-data in their particular specialised object-world.
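The language-as-data stance can be made concrete with a minimal sketch. The following is not the architecture of any actual LLM (which would use a neural network over subword tokens), but a toy bigram model of my own devising that shows the underlying move: text reduced to nothing more than counts of which word follows which.

```python
from collections import Counter, defaultdict

# A toy corpus: language reduced to a bare sequence of tokens,
# with no speaker, no body, and no context.
corpus = (
    "the parrot repeats the words it has heard "
    "the parrot repeats the sounds it has heard"
).split()

# Count how often each word follows each other word (bigram frequencies).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict(word):
    """Return the most frequent continuation of `word` in the corpus."""
    return following[word].most_common(1)[0][0]

print(predict("the"))  # → parrot
```

Everything the model "knows" is frequency data; nothing in it corresponds to what the words were for, or to the humans who said them.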
It would be quite easy, if speaking casually, to confuse the world of language-as-data with the practicalities of being human language-users, and indeed the original Turing Test invited a juxtaposition of the two in order to investigate questions about the nature of mind. However, the Turing Test, like much of the framing of AI, relies centrally on placing language outside the human body, with the disembodied text-processing capabilities of keyboards and screens substituted for embodied human voices. I noted in chapter 1 the observation by Devlin and Belton that treating intelligence as independent of the human body is a position generally proposed by those whose own bodies are privileged as the invisible ideal or norm. People of colour, women, and those oppressed by religion, class or birthplace know how often the words they say will be judged first by their bodies, and only second by supposedly objective and abstract principles of intelligence.
When human speech, with all the knowledge and context of a human lifetime, is turned into disembodied data, it gets reduced to components in a technological object-world. Pioneering anthropologist of AI Diana Forsythe identified the many ways in which human knowledge was appropriated by technoscience in a way that was fundamentally gendered, assimilating the voices of women into a technical construction of male expertise. Feminist historian of AI Alison Adam documents just how many of the origins of the discipline are founded in a stereotypically male-centred set of perspectives and priorities. In my own town, the pioneering work of Margaret Masterman, a delightfully maverick former student of Wittgenstein, is one example of an original thinker on language who was never included in the research mainstream of AI. Masterman’s independent Cambridge Language Research Unit continued for many years outside the formal structures of Cambridge University, and her student Karen Spärck Jones, now celebrated as one of the most renowned female computer scientists in the UK, was already making significant critiques of statistical language models 20 years ago. Finally, many will have noticed that all four authors of the Stochastic Parrots paper (which cites Spärck Jones’s own critique as an early point of reference) happen to be women, while those currently arguing for public regulation of dangerously disembodied AGI appear to be almost universally men (who almost universally ignore the Stochastic Parrots critique in their pronouncements).
These deeper questions, about the structure and purpose of human language in relation to its statistical representation, deserve far more sophisticated unpacking, including both gender- and race-critical perspectives. But the limitations of treating word prediction as content are a central concern, as recently explained by another Wittgenstein scholar, Murray Shanahan. When systems are designed to predict words, rather than to describe facts or satisfy goals, they become able to simulate human conversation, but without explicitly incorporating the human context that makes ordinary conversation meaningful. What are they really for? Is mechanical language processing a Moral Code, in the terms of this book? When we consider the cost-benefit calculations of attention investment as introduced in chapter 2, why would I choose to spend my time interacting with a computer using human language, rather than the many alternative kinds of code? What is human language actually for? When we spend time “doing language” with other humans, what do we get out of it? What makes this investment of attention meaningful? Is it possible that LLMs might have been designed, not to reward attention, but, as Herb Simon warned, to consume it? These questions are the focus of the next chapter.
 Jo L. Walton, "A brief backward history of automated eloquence", In Ghosts, Robots, Automatic Writing: An AI Level Study Guide, ed. Anne Alexander, Caroline Bassett, Alan Blackwell and Jo Walton (Cambridge UK: Cambridge Digital Humanities, PREA, 2050/2021), 2-11
 David J. Ward, Alan F. Blackwell and David J.C. MacKay. "Dasher - a Data Entry Interface Using Continuous Gestures and Language Models". In Proceedings of UIST 2000: 13th Annual ACM Symposium on User Interface Software and Technology. San Diego, CA, (2000), 129-137.
 Malcolm Longair and Michael Cates, "Sir David John Cameron MacKay FRS. 22 April 1967—14 April 2016". Biographical Memoirs of Fellows of the Royal Society 63 (2017): 443-465. https://doi.org/10.1098/rsbm.2017.0013
 Textual descriptions of Dasher are very hard to imagine, if you have not seen this unusual system operating. A demonstration by David MacKay, excerpted from a Google TechTalk, is available online (YouTube video, created 26 Oct 2007) https://youtu.be/0d6yIquOKQ0
 Anne Alexander, Caroline Bassett, Alan Blackwell and Jo Walton, Ghosts, Robots, Automatic Writing: an AI Level Study Guide. (Cambridge UK: Cambridge Digital Humanities/PREA, 2050/2021).
 Anne Alexander, Caroline Bassett, Alan Blackwell and Jo Walton, “Embracing the Plagiarised Future”. Panel presentation at Critical Borders: Radical (Re)visions of AI, Jesus College Cambridge, 18-21 October 2021.
 From Jo and Thomas Green’s Christmas circular of 2022: "At a recent conference someone started his paper with a long piece written by an AI machine imitating his style. It sounded just like all his other papers. Far be it from anyone in this house to conjecture that maybe the others were written by AI too.”
 Diana E. Forsythe, “Engineering knowledge: The construction of knowledge in artificial intelligence”. Social studies of science 23 no. 3 (1993): 445-477.
 Kevin Schaul, Szu Yu Chen and Nitasha Tiku, “Inside the secret list of websites that make AI like ChatGPT sound smart”. Washington Post, April 19 2023.
 Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell, "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?🦜," in Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (March 2021), 610-623.
 Max Planck, as cited in Thomas S. Kuhn, The Structure of Scientific Revolutions. (Chicago: University of Chicago Press, 1970), 151. Planck's hypothesis has been studied empirically: Pierre Azoulay, Christian Fons-Rosen, and Joshua S. Graff Zivin, "Does science advance one funeral at a time?" American Economic Review 109, no. 8 (2019): 2889-2920. doi: 10.1257/aer.20161574.
 Claude E. Shannon, "A Mathematical Theory of Communication". Bell System Technical Journal 27 no. 3 (1948): 379–423. See also Claude E. Shannon and Warren Weaver, The Mathematical Theory of Communication (Champaign, IL: University of Illinois Press, 1949).
 This exclamation point, like the excessive use of italics throughout the paragraph, was not really very surprising, or indeed very creative – this is another example of redundancy, and might easily have been suggested by an especially clichéd predictive text editing program, in which case I would not need to take responsibility for it.
 Shannon and Weaver, The Mathematical Theory of Communication
 Daniel C. Dennett, "Intentional Systems", The Journal of Philosophy 68, no. 4 (February 1971): 87-106; Daniel C. Dennett, The Intentional Stance, (Cambridge, MA: MIT Press, 1987)
 "Umbrage of Parrots," The Times (London), June 16, 1949, p. 5. The Times Digital Archive, link.gale.com/apps/doc/CS84625616/. Accessed 5 Aug. 2021. Thanks to Willard McCarty for drawing my attention to this piece.
 Simon Schaffer, “Enlightened Automata,” in The Sciences in Enlightened Europe, ed. William Clark, Jan Golinski, and Simon Schaffer. (Chicago: University of Chicago Press, 1999), 126–166.
 Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. "ImageNet: A large-scale hierarchical image database," in Proceedings of CVPR 2009: IEEE Conference on Computer Vision and Pattern Recognition (2009), 248-255.
 George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. "Introduction to WordNet: An on-line lexical database," International Journal of Lexicography 3, no. 4 (1990): 235-244.
 Couldry and Mejias, The costs of connection.
 Gerald. M. Weinberg, The psychology of computer programming. (New York: Van Nostrand Reinhold, 1971).
 Stuart K. Card, Allen Newell, and Thomas P. Moran. The Psychology of Human-Computer Interaction. (Hillsdale, NJ: Lawrence Erlbaum Associates, 1983).
 Yvonne Rogers, Helen Sharp, and Jennifer Preece. Interaction design: beyond human-computer interaction. (Hoboken, NJ: John Wiley & Sons, 2023).
 Louis L. Bucciarelli, Designing engineers (Cambridge MA: MIT press, 1994).
 Kate Devlin and Olivia Belton, "The Measure of a Woman: Fembots, Fact and Fiction".
 Forsythe, Studying those who study us.
 Alison Adam, Artificial knowing: Gender and the thinking machine (Abingdon UK: Routledge, 2006)
 Yorick Wilks, ed., Language, Cohesion and Form: Margaret Masterman (1910-1986). (Cambridge UK: Cambridge University Press, 2005). Another British AI pioneer, Margaret Boden, describes her studies with the maverick Masterman in the preface to Margaret A. Boden, Mind as Machine: A History of Cognitive Science (Oxford, UK: Oxford University Press, 2008).
 Karen Spärck Jones, Language modelling’s generative model: Is it rational? (unpublished manuscript, 2004). https://www.cl.cam.ac.uk/archive/ksj21/langmodnote4.pdf (accessed 13 May 2023)
 Shanahan, Talking About Large Language Models.
 Simon, “Designing organizations for an information-rich world”.