Chapter 11: How can stochastic parrots help us code?

Published on Jun 18, 2023

This book argues that the world needs to give less priority to AI, and more to programming languages. In previous chapters I have suggested that new kinds of code, including spreadsheets, visualisations and diagrams, will be an alternative to AI, enhancing human experience by allowing us to control and create with our computers, rather than subjecting us to meaningless repetition, pastiche and cliche.

Large language models (LLMs) do have a role to play in this world, but it’s important not to be distracted by one particularly deceptive fallacy - the idea that they will in future learn to code themselves. This would be especially damaging, the exact opposite of Moral Codes, if the result was that users no longer had any opportunity to directly control computers or to see the code that did so.

The idea that AI might magically learn to code itself is the basic fallacy underlying many of the philosophical speculations that I dismissed in the first chapter of the book. These include the idea that self-coding AIs could define themselves, taking control of their own evolution to become super-intelligent, and so raising the speculative challenge of “value alignment” - how we could know whether the hidden goals of the self-coded AI were compatible with our own. As I explained, if an argument relies on a central concept that changes its own definition, it’s more likely to be found in fiction than in engineering. A device that changes its own definition can’t be criticised for mathematical consistency, engineering feasibility or business logic, because the definition could always be changed to avoid the critique. The magical results are entertaining in speculative fiction, but not a useful basis for practical plans or forecasts.

LLMs may not be able to code themselves, but they certainly will become valuable tools for human programmers, complementary to the new coding approaches I have already described. This is a focus of continuing research, central to much of the work in my own group, and likely to result in significant advances over the next 5 to 10 years. Before going on to explain what is practical today, and what we can expect in the immediate future, it’s useful to consider some of the fundamental ways that LLMs can make useful contributions to Moral Codes.

Making life easier for programmers

I’ve explained what I don’t expect to happen - but what will? My first prediction is a safe one: applying LLMs to programming will make life easier for professional programmers. In the same way that changes to bank regulations usually make bankers richer, history shows that new software inventions benefit programmers first. This isn’t (only) indefensible self-interest. It is just that problems close to home are easier for programmers to see and understand, meaning those problems get fixed first.

Programmers don’t like typing more than they have to, which is why many of the programming languages and operating systems that were designed by programmers for programmers, including C, UNIX, ML, APL and Perl, use inscrutably short command names rather than longer words that would be easier to understand and remember. Because programmers don’t like to do a lot of typing, programming languages and editors also have excellent predictive text built in. This is particularly helpful when your boss, your colleagues, or the suppliers of the APIs your code is built on insist on making the labels longer than seems to you necessary (usually so that other people can understand what you are doing). For many years, programming editors have auto-completed all of these things, predicting what characters need to be typed next, and automatically fixing any accidental typos.

Some of these facilities seem intelligent, but only because another programmer anticipated what you might want to do, and intelligently added a heuristic rule to save you some effort. My favourite example was a smart editor that would recognise a line like “screen_reference.x = mouse_event.get_coordinates().x”, and automatically suggest the next line might be “screen_reference.y = mouse_event.get_coordinates().y”. I used that editor (IntelliJ IDEA from JetBrains) for the last big programming project I did. Compared to my early career, I estimated that the amount of time I spent typing was about 80% less than it would have been on comparable projects 30 years earlier.

LLMs are already trained with a large amount of program source code, obtained from public repositories like GitHub. As a result, new programming editors such as CoPilot from GitHub Labs are becoming really good at predicting “boilerplate” code like the example I have just given (and many others far longer than this) - common ways of doing things that always look pretty much the same, and require little thought but a lot of typing. I’m looking forward to the next big project I do myself, when predictive text in my code editor ought to be even better than the predictive text editor for English prose that I am using to type this chapter right now.

It’s important to note that predictive text for programming raises exactly the same problems as any other kind of predictive text, but with more worrying consequences. When a programmer is planning to type one line of code, but the system suggests another, they need to spend time thinking about whether the run-time execution of that alternative will still be what they wanted. Sometimes very small changes in code can have very large effects, to a far greater extent than with natural language, where common sense generally overrides the silliest interpretations, and a human reader would consider the intended meaning rather than the actual words.

Silly interpretation of small details is a major problem in technical work. Experiments with LLMs for formal academic writing show that they do an acceptable job at producing text with the general gist of accepted knowledge as reported on the Internet, but fail at the level of detail by “hallucinating” formal academic citations[1]. In conventional programming, the general gist is often expressed in natural language “comments” that are completely ignored by the machine and have no effect on program behaviour. The precise interpretation relies on mathematical or algebraic details that are hard for humans to read. If an LLM assists in such a way that the gist is right but the detail is wrong, this is the very opposite of the behaviour that would be helpful.
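A minimal sketch may make this concrete. The function names and data below are hypothetical, written by hand for illustration rather than taken from the output of any LLM. Both functions carry exactly the same natural language gist in their documentation comment, but moving a single character changes what the machine actually computes.

```python
# Hypothetical illustration: the comments (the "gist") are identical,
# but the executable detail differs by the position of one colon.

def total_first_n(prices, n):
    """Sum the first n prices."""
    return sum(prices[:n])   # slice up to index n: the first n items


def total_first_n_suggested(prices, n):
    """Sum the first n prices."""
    return sum(prices[n:])   # colon moved: everything after the first n items


prices = [1, 2, 3, 4]
print(total_first_n(prices, 2))            # adds 1 + 2
print(total_first_n_suggested(prices, 2))  # adds 3 + 4
```

A human reader skimming the second function would probably trust the comment. The interpreter ignores the comment entirely and follows the slice.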

The comparative literalism of computers means the programmer will have to check automatically generated code very carefully, especially if the program is one where there would be any significant consequences from faulty operation. The attention investment judgement may not be worth it, if the alternative was to simply continue typing code where you are already clear about what you want to do. Of course, there is also a possibility that you might have made a mistake in your own code, and substitution of a more standardised cliché will correct it, just as predictive text keyboards have improved the general standard of spelling in the population by encouraging clichéd (but correct) spelling of most words rather than the eccentric spelling variations common in previous centuries.

Code reuse for the postmodern programmer

The only thing better, in my experience, than writing code more quickly, is not having to write it at all. I spent some years of my career creating and marketing new programming tools to realise the benefits of “software reuse”, where software developers solve a problem by simply using the same code someone else already wrote. If a logical operation or algorithm is rather conventional, it is quite likely that somebody, somewhere in the world, has coded it before. Software engineering researchers James Noble and Robert Biddle, in their Notes on Postmodern Programming[2], drew attention to how much contemporary programming work involves mashing up pieces of code found online, rather than typing long passages of original code into a text editor.

The most challenging problem, in this way of working, is not to write new code, but to find the existing piece of code that you need. The ObjectReuser system that I developed nearly 30 years ago at the Hitachi Europe Advanced Software centre paid as much attention to the problems of searching and managing documentation as to the code itself. In the days before the World Wide Web, we built a complete hypertext architecture to support these social and managerial implications of the software reuse philosophy. Now that the search and content management facilities of the Web have become universal tools, postmodern programmers find code to reuse on online advice forums like Stack Overflow, and in code repositories like GitHub, among many other resources.

Human-centred AI advocate Ben Shneiderman observes that any predictive text entry interface can also be regarded as a recommender system[3], where the recommendation being offered relates to opportunities for time saving, improved textual “productivity” (for those paid by the word), or in the analytic terms of my attention investment framework, attention savings. Rather than presenting tools like GitHub’s CoPilot as if they were Turing test-like collaborations between a human and AI programmer, it is more productive to think of any natural language input to a programming editor as a search query that could be used by the system to recommend existing code for reuse. The main decision for the programmer is whether they trust code found on the internet more than what they could write themselves. That’s an attention investment decision, and depends on many factors.

The practices of postmodern programming, whether or not supported by LLMs, introduce challenges for the managerial processes of the modern software industry. Program source code is copyrighted text, and the whole industry relies on a kind of negotiated truce, where it is well-known that every company relies on pieces of code that were probably written by programmers who work for their competitors. Open source free software created under the GNU “copyleft” licence obliges anyone using any part of that code to publish their own contributions on the same terms, with the result that many companies ban their programmers from even looking at code from these idealistic initiatives. But if future predictive code editors are trained on very large databases of source code that have been found online, as already done with natural language generators, who is to know whether the original code was created under one of those licences? Current LLM-based coding tools include guardrails to check their output against existing code, rejecting anything that is obviously exact plagiarism. Nevertheless, the original training data has all come from actual humans, so somebody’s work is clearly being used.

Automating student exercises

Perhaps the most frequently repeated pieces of code are the self-contained programming exercises assigned to learners, which help the student understand the basic mechanics of programming, but seldom achieve any practically useful function beyond that. There has been excellent progress in using machine learning models to replicate these trivial exercises, mainly because class exercises are simple and precisely specified, and perhaps also because students avoid real-world problems where the context would need to be explained. Introductory programming classes are much the same everywhere in the world, meaning that the same pieces of code can be found all over the Web (including places like Stack Overflow and GitHub, where students look for easy answers to their class exercises). It is unsurprising that these kinds of class exercise have already found their way into the training data for LLMs, and that the predictive models are easily able to output the correct answers[4].

Of course, automating student exercises is particularly pointless. Instead of learning multiplication tables at primary school, a child could easily type 1x5=, 2x5=, 3x5= and so on into a pocket calculator. They could even write a simple program to output 5, 10, 15, 20 (or ask an LLM to write that program for them). But none of these would have any value. The point of reciting a multiplication table is not to efficiently generate a sequence of numbers, but to internalise a mental skill. Similarly, nobody needs an automatic piano that plays up and down the notes of major and minor scales - piano students repeat those exercises in order to learn the skill, not because the scales are worth listening to. It is just as pointless for an LLM to “solve” a problem from an introductory programming textbook - the only purpose of those exercises was for the student to acquire craft skills, not because anyone thought the problem had practical significance.

Writing code in natural language

The postmodern programmer is interested in finding code that already works, not typing large quantities of new code themselves. But for those situations where a completely new piece of code does need to be written, it is interesting to consider the potential for LLMs to “translate” natural language descriptions of the required functionality directly into program source code. Current experiments with LLMs of the GPT-n series, including CoPilot from GitHub Labs, demonstrate that functionally and syntactically correct source code can be created from a natural language description of the kind that a programmer might write in documentation comments or as a “pseudocode” specification[5].

However, to fully address complex practical problems, there needs to be a complete and precise specification of the whole problem. It often turns out that this specification is more complicated than the code to solve it. Years of research into “formal specification languages” demonstrated firstly that it is possible to automatically generate a working program, given a sufficiently detailed specification of what that program needs to do, and secondly, that it is usually harder to write such complete and formal specifications than it is to write programs themselves. Such approaches continue to be demonstrated using modern machine learning methods, for example in the Barliman editor[6] that will help a programmer writing new source code, but only after somebody has created detailed test algorithms to precisely define what the program is supposed to do.
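The imbalance between specification and code can be seen even in a toy case. The following Python sketch is an assumption-laden illustration, not a description of how Barliman or any formal specification language works: the names `is_sorted_copy` and `sort_list` are hypothetical, and the "specification" is simply an executable check of what it means to sort a list.

```python
from collections import Counter


def is_sorted_copy(original, result):
    """Hypothetical executable specification for 'sort this list'.

    The result must contain exactly the same items as the original
    (including duplicates), and each item must be no larger than the next.
    """
    same_items = Counter(original) == Counter(result)
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return same_items and in_order


def sort_list(items):
    """The program itself: shorter than its own specification."""
    return sorted(items)


data = [3, 1, 2, 1]
print(is_sorted_copy(data, sort_list(data)))  # the real program meets the spec
print(is_sorted_copy(data, [1, 2, 3]))        # dropping a duplicate violates it
```

Even here, leaving out either half of the check would admit wrong programs: without `same_items`, returning an empty list would count as "sorted", and without `in_order`, returning the input unchanged would pass.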

More effective use of LLMs in professional programming is a tremendously active area of current research, at the time I’m writing this. I expect great progress, including from my own collaborators and graduate students, in the near future. Quite likely there will be exciting advances within the next few years, including some before this book is even printed (some of the things I described in the last few pages had already been deployed in Microsoft’s Visual Studio product between the time that I first drafted the chapter, and when I came back to proof-read it a couple of months later). The rapid pace of current progress places some constraints on what it is useful to say in a book like this, but there are still some useful observations to be made about the human factors in professional programming, which have retained some consistent characteristics even when the tools themselves are changing.

The first of these is to note that previous generations of programming technology have always involved translation between different ways of describing the same problem. If LLMs are used in future to translate from English prompts to (for example) Python source code, we can expect that some of those earlier dynamics will be repeated. In previous generations, the purpose of many translators, compilers and interpreters has been to automatically generate text in an older programming language, based on input provided in a new and different language. The new language has always been intended to be more accessible or convenient for a wider range of users than the old one was. The essential insights needed to design that new and improved language were often provided by those who had been responsible for training programmers to use the old method, and could see its failings more clearly. It is often teachers, rather than engineers, who see how problems could be described in a more human-centric way.

Code (s)witching in a different voice

In earlier generations of human-centric programming technologies, the innovators who could see how the languages might be improved have often been women who did the work of programming, or who taught others to do it. In the early decades of large-scale computing, the technical practice of programming took place in gendered settings where the operation of machines by women was not perceived as a skilled activity, and programming was not perceived as fundamentally different to operating a machine, or even to typing[7]. Although developments in programming language theory have often been attributed to men, insights into the practice of programming have come from women who were rendered invisible by a focus on the machine rather than its operators[8].

This dynamic might be traced back to Ada Lovelace, often described as the first programmer for her work investigating how to instruct Babbage’s machine. Feminist histories of computing record the achievement of Adele Goldstine, who wrote the first Operators’ Manual for ENIAC, codifying the methods for programming by circuit configuration that had been developed by the six women “computers” led by Jean Bartik and Betty Holberton. During the same period that ENIAC was being developed for ballistics calculations, the British Colossus machine, with its emphasis on the symbolic and linguistic operations of code-breaking, employed even larger numbers of women[9]. Kathleen Booth of Birkbeck College in London, who taught programming in the 1950s and published an early textbook in 1958[10], is credited with the creation of one of the earliest symbolic assembly languages.

Following this era of machine-level programming, the development of “high-level” languages to enable more widespread access to computing was again led by women innovators, including Jean Sammet, whose 1969 book Programming Languages: History and Fundamentals was the first authoritative comparative text. Sammet campaigned in her early career for the importance of programming languages to be recognised among computer scientists, essential leadership work that was recognised by her election as the first woman president of the Association for Computing Machinery.

Most famous of these women innovators was Grace Hopper, whose early FLOW-MATIC and MATH-MATIC languages allowed more natural mathematical descriptions to be translated into assembly language code. Hopper was recruited to a committee initially convened by Mary K. Hawes, before becoming chair (and working with Jean Sammet) on the ground-breaking COBOL language. The recognisably LLM-like goal of COBOL was for programs to be specified using more natural business vocabulary, rather than the mathematical and engineering vocabularies of many previous languages. COBOL has declined in popularity in recent decades, and business computing leader IBM came to advocate Smalltalk-derived models as more appropriate to business computing. However, it is worth noting that the Smalltalk project itself benefited greatly from the human-centric insights of Adele Goldberg, who jointly directed much of the work attributed to Alan Kay, paying particular attention to how Smalltalk would be taught and used in schools.

Every generation of programming languages has involved the introduction of a new notation intended to support a wider community of practice by being more naturally accessible. This new notational code is accompanied by a translator of some kind that often outputs the old notation. In this respect, use of LLMs to generate source code would be simply another generation of translator. When used by professional programmers, or those experienced in the old notation, such tools become a labour-saving device to be used judiciously. The greatest challenges for human-centric computing come when improved accessibility of the new notation extends to people who may never have used the old one. Rather than switching between alternative codes, these users have to formulate a complete specification in the new way. In use of LLMs, the challenge of formulating a prompt text to get the result you want has become known as “prompt programming” or “prompt engineering”. It will be interesting to see how this is taught, and potentially supported with specialist prompt editors and prompt debugging tools that are designed in a human-centric way for new audiences.

The boring parts: software engineering and maintenance

This chapter has paid more attention to the tools used by professional programmers, and the potential for LLMs to provide future productivity improvements, but has not focused on one rather surprising aspect of professional programming work, which is that professional programmers do not spend very much time writing new code. The majority of programmers in the world spend most of their time “maintaining” code written by themselves or others - removing bugs, adding features, or making adjustments to accommodate constant changes to hardware, operating systems, databases, web services and so on. If software systems are not maintained by dedicated professionals, they suffer from “software rot” - parts just stop working for one reason or another. This may seem surprising, since there is no obvious part of software that ought to wear out. However, the outside world doesn’t stop changing, so although this situation is disappointing, the same is true of a shiny new house or car that does not last for ever, especially if we don’t preserve interfaces to the outside world by painting, waterproofing, changing worn tyres and so on[11].

LLMs don’t yet offer much help with the maintenance work that professional programmers actually spend their time on. Rather than translating a natural language specification into a completely new piece of code, the more important challenge for AI is known as “refactoring” - reorganising the code you already have, to accommodate cumulative changes as well as new ways of thinking about the problem. In a mature deployed software application, the types of detail that must be managed are so widely distributed, while individually trivial, that models of the English language are neither sufficiently large nor sufficiently precise to be helpful.

Even localised changes may be hard to maintain, if the code was originally created using a language model. A routine challenge in software engineering projects that involve translating between multiple levels of design notation is the need for “round trip” modifications and maintenance. If a high-level notation is used to generate a lower-level one, and a detailed change is then made at the lower-level, will it be possible to update the original high-level specification to reflect how the system actually works now? Without that capability, many promising design notations have become effectively obsolete. Programs specified in natural language, and “compiled” to executable source code using an LLM, are likely to suffer the same fate.

One promising avenue for enquiry might be the implementation of live programming environments in which a natural language prompt would be continually re-evaluated while the generated source code keeps running. This would allow the programmer to experimentally modify the way they have expressed their idea, in response to the effects they are observing. As explained in a historical survey by programming educator and innovator Steve Tanimoto, many previous advances in programming languages have benefited as much from live execution as they have from more accessible notations[12]. Steve himself anticipated how such liveness might be extended to code prediction and generation of source code using machine learning methods[13].

An especially unwelcome liability for software engineers is code that works, but that you do not understand. In the kinds of software project that involve hundreds or even thousands of programmers, this is a constant problem, which has led to a whole research field of “code comprehension”. One of the important craft disciplines to be learned by software engineers early in their career is to avoid solutions that may be clever, but hard to understand. Adding code like that to a large code base quickly becomes a liability, or technical debt that will result in additional cost of effort and attention for future maintainers.

This is an obvious problem in situations where an LLM might be used to generate code that the original author does not fully understand. A recent study of user experience with LLM-based programming editors shows that professional programmers are already very alert to this problem[14]. To see why they might be concerned, consider a senior software engineer who is responsible for reviewing code produced by less skilled junior colleagues. Inexperienced staff often produce code that looks dodgy in some way, and that does not work as required. Such problems can be identified in routine code inspections, and their consequences mitigated through quality control and training. It is quite possible that LLMs will help quality teams to identify some novice faults automatically in future, perhaps even preventing them at edit time, while the incorrect code is being entered. An alternative, also not unusual in professional situations, is for code to be correct, but to be laid out in a way that is not consistent with company standards or project conventions. This problem, of code that is correct but does not look correct, is relatively easy to fix with training and automated formatting tools, and these could include LLM-based predictive text.

In contrast to these common situations that are routinely handled in software quality management, the worst problem for a software manager is code that seems OK at first glance, looking as if it might be correct, but actually has subtle flaws or inconsistencies in the details. That kind of problem is the hardest to spot during code quality inspections, leading to significant technical debt. Unfortunately, this is precisely the kind of code that LLMs are best at generating. Since this kind of code incurs the worst software debt, this is likely to be a major obstacle to adoption of LLM-generated code in serious professional settings.
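To illustrate what “seems OK at first glance” can mean, here is a small hand-written Python sketch. It is a hypothetical example, not actual LLM output, and the function names are my own. The first function reads naturally and gives the right answer for every year a reviewer is likely to try from memory, but it omits the century rule of the Gregorian calendar.

```python
def is_leap_year_plausible(year):
    """Return True if year is a leap year.

    Looks reasonable and passes casual checks, but ignores the rule
    that century years are leap years only when divisible by 400.
    """
    return year % 4 == 0


def is_leap_year(year):
    """Return True if year is a leap year in the Gregorian calendar."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# The two functions agree on the years most people would think to test,
# and disagree only on century years such as 1900.
for year in (2023, 2024, 2000, 1900):
    print(year, is_leap_year_plausible(year), is_leap_year(year))
```

A code inspection that spot-checks recent years would approve the first version, and the flaw would surface only when a date such as 1900 or 2100 eventually passed through the system.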

While code output by an LLM has already been fascinating in the way it emulates the coding exercises prescribed to students, and is likely to be a helpful starting point for composing new code outlines, saving typing by predicting cliched boilerplate, or retrieving existing code for reuse, software engineers are already recognising the dangers of reliance on code that is plausible but incorrect[15].

LLMs for end-user developers

I explained at the start of this chapter that professional programmers are easily able to look after themselves, and will quickly recognise any opportunity to save effort by using new software technologies. Through comparison to the history of programming languages and software engineering practice, I’ve identified a number of ways in which LLMs are likely to result in further improvements to programming tools, although not perhaps the ones that optimistic promoters of AI currently pay most attention to.

Making such advances relevant to new audiences is more of a challenge. Just as with earlier techniques of example-based programming to “do the rest”, it is a mixed blessing to have automatically-generated code that you can’t read, don’t understand, and wouldn’t be able to adjust or modify[16]. The introduction of LLMs that have been trained to output Python code means that anyone with access to a system like ChatGPT can produce plausible-looking code. But how much use is this, if you don’t read Python and are not sure what the code might do? Should you run it on your own computer, just to see what happens? What if there is some subtle flaw that has been disguised by the use of otherwise plausible variable names? Is there any chance that mischievous individuals somewhere on the Internet may have created intentionally destructive code as a joke or an artwork[17], or that state-sponsored actors have written code intended to vandalise or cause harm to others, or that people with extremist views might have expressed those views in biased, illegal or dangerous software that has somehow been incorporated into the language model?

These considerations highlight the ways in which automatically generated code could be the very opposite of the Moral Codes I am advocating. If the prompt given to an LLM was a complete specification of the program behaviour, this might be of some value to an inexperienced programmer. However, a genuinely complete specification is likely to be far more verbose than the eventual program, even if it does successfully avoid the dangerous ambiguities that could be introduced into the software through careless use of English vocabulary, syntax, and even punctuation. Just as with the early success of the COBOL language, a notation that is easily readable but lacking in concision may eventually be abandoned in the interests of efficiency.

A more exciting research agenda is to create new kinds of notation that are easily readable, interpretable, and modifiable, where new users can start from working examples that are recommended with guidance from English-language prompts. LLMs are not currently trained to output useful notations like spreadsheets, visual programming languages or diagrams, but in principle they can be, and rapid progress is being made. Perhaps even more useful is the sharing of such visual formalisms among a community of users, as described by end-user programming advocate Bonnie Nardi[18]. Finding a role for LLMs within real social structures would be a more worthwhile opportunity for human-centred AI and programming research.

[1] In Chapter [XX 5 XX] I described some of my own recent experiences with the “formal” aspects of academic writing, including details like precise page numbers and exact quotes, which are cheerfully ignored by LLMs. In programming languages, precise numbers and exact words are absolutely critical, even more so than in the formal practices of scientific citation.

[2] Noble and Biddle “Notes on postmodern programming”

[3] Ben Shneiderman, Human-Centered AI. (Oxford University Press, 2022).

[4] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le and Charles Sutton. "Program synthesis with large language models." arXiv preprint (2021). arXiv:2108.07732

[5] Although it is worth noting that, in my own experiments, these models have also produced source code in other languages that completely misinterprets the syntax of that language, replacing it with alternative punctuation which the LLM seems to have acquired from Python or C++, but is invalid in the programming language I have asked for.

[6] William Byrd and Greg Rosenblatt. Barliman smart editor prototype (GitHub repository)

[7] Mar Hicks reproduces an advertisement for an office computer called Susie (Stock Updating and Sales Invoicing Electronically), with text claiming that it could be “programmed in plain language from tape or by the typist”. See Mar Hicks, Programmed inequality: How Britain discarded women technologists and lost its edge in computing. (Cambridge, MA: MIT Press, 2017), 124-125.

[8] Hicks, Programmed inequality, 233-238

[9] A fact obscured in the historical record by the greater secrecy of the British code-breaking projects, as noted by Hicks, Programmed inequality, 34

[10] A more detailed and authoritative account of this period can be found in Martin Campbell-Kelly, "The development of computer programming in Britain (1945 to 1955)," Annals of the History of Computing 4, no. 2 (1982): 121-139.

[11] Thanks to Anica Alvarez Nishio for suggesting this explanation.

[12] Steven L. Tanimoto, "VIVA: A visual language for image processing," Journal of Visual Languages and Computing 1, no. 2 (1990): 127-139. See also Blackwell, Cocker et al Live Coding: A user’s manual

[13] Steven L. Tanimoto, "A perspective on the evolution of live programming," in Proceedings of the First International Workshop on Live Programming (LIVE) (2013), 31-34.

[14] Advait Sarkar, Andrew D. Gordon, Carina Negreanu, Christian Poelitz, Sruti Srinivasa Ragavan and Ben Zorn, “What is it like to program with artificial intelligence?,” in Proceedings of the 33rd Annual Conference of the Psychology of Programming Interest Group (PPIG) (2022).

[15] Sarkar et al. “What is it like to program with artificial intelligence?”

[16] Alan F. Blackwell, "SWYN: A Visual Representation for Regular Expressions," in Your wish is my command: Giving users the power to instruct their software, ed. Henry Lieberman. (San Francisco, CA: Morgan Kaufmann, 2001), 245-270.

[17] We could imagine that someone might accidentally create source code for a piece of auto-destructive software art, such as Alex McLean’s, winner of the Transmediale software art award in 2001. The original program, with notes by Alex McLean on its invention, can be found at The text of the book you are reading now also includes a dangerous and destructive piece of code, flagged with a footnote warning readers not to try it. I have just confirmed that the same piece of code can be generated by an LLM (Google Bard v1.0.0, last updated: 2023-06-18 01:27:26 PST), which does helpfully follow it with the advice “WARNING: This command is destructive and will delete all files below the root directory, so be sure to use it with caution.”

[18] Nardi, A small matter of programming.
