In November 2022, a new chatbot named ChatGPT seemed to start speaking English, producing reasonably coherent and grammatically correct sentences that many human English speakers found at least superficially convincing. Since then, ChatGPT, Google’s Gemini and Meta’s Llama have given the impression of knowing about 100 major languages, to varying degrees. The makers of these bots, perhaps conscious that their supposedly all-purpose tools cover only a fraction of human communication systems, are now setting their sights on the world’s 7,000 other languages — primarily oral, minority and Indigenous.
Overheated headlines like “Harnessing AI to Preserve the World’s Endangered Languages,” themselves now often generated by artificial intelligence, promise that this push will help the cause of linguistic diversity. By learning to “speak” these languages, the bots will supposedly ensure their survival, in one form or another, at a time when more and more languages are endangered. Some people, including many language activists, hope that the bots will serve as conversation partners, teachers, translators and even creators in these languages. For others, what matters is the symbolism of digital inclusion: the consolation that a language at least “lives” on a server somewhere.
But are these bots really “speakers” in the first place? And what is lost when we grant them that status, for the sake of convenience or out of desperation?
The large language models, or LLMs, that power tools like ChatGPT are so called not because they have a knack for languages but because they train on text — particularly the enormous, flawed, co-created text known as the internet. In other words, they first take in the digitally available things that humans have previously written. They then remix and regurgitate this material in seemingly novel and sometimes useful ways.
When you ask an LLM a question, it runs calculations to predict which characters, words or sentences should go where to produce the semblance of a response. Call it auto-complete on steroids. A language model, as the linguist Emily M. Bender and her co-authors write, “is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.”
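To see that stitching in miniature, consider a toy sketch written for this essay, not the architecture of any real chatbot, which is vastly larger and more elaborate: a bigram model that predicts each next word purely from how often words followed one another in its training text, with no reference whatsoever to meaning.

```python
# Toy illustration only: a bigram "language model" that stitches word
# sequences together from observed co-occurrence counts. It is
# auto-complete in miniature; no meaning is involved at any step.
import random
from collections import defaultdict

corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat saw the dog ."
).split()

# Count how often each word follows each other word in the "training data."
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word(prev):
    """Sample a next word in proportion to how often it followed `prev`."""
    candidates = counts[prev]
    words = list(candidates)
    weights = [candidates[w] for w in words]
    return random.choices(words, weights=weights)[0]

# Generate "text": each step is a probabilistic lookup, nothing more.
word = "the"
output = [word]
for _ in range(8):
    word = next_word(word)
    output.append(word)
print(" ".join(output))  # e.g. "the cat sat on the rug . the dog"
```

Scale the counts up by many orders of magnitude, swap the lookup table for a neural network, and the principle is unchanged: prediction over observed forms, not communication.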
To invoke language when talking about LLMs is to misunderstand the nature of language and miss its fundamentally lived and embodied character. LLMs may get better and better at sourcing certain kinds of information or completing certain kinds of tasks, but they are finders, not creators; they are mimics, not conversation partners; they are machines, not people. It is striking to see bots recognized as “speakers” when we still fail to recognize as legitimate so many living, sophisticated communication practices. Many Indigenous languages are still ignorantly dismissed as not being “proper” languages; most hearing people are unaware that the world’s approximately 200 sign languages are fully independent linguistic systems; most animal communication practices are overlooked and misunderstood.
Combinations of characters on a screen mean nothing without agency and intention, which bots lack but are designed to give the illusion of having. This illusion is the problem because it hollows out meaning. (“Agentic” AI, currently in the works, seeks only to deepen the illusion.) Meaning is neither “in” things nor in the mind, as the linguist Alton L. Becker wrote, but emerges “in the interaction of a living thing with its context.” Bots break the very idea of meaning by giving us text without context.
✺
All too slowly and haphazardly, I write these words (and you read them) in a contemporary, standardized form of American English, a language from the Germanic branch of the Indo-European family. From English to Russian to Hindi, hundreds of members of this far-flung family, for all their differences, share historical connections and similarities that demonstrably reach back several thousand years. This in turn is just what is recoverable — given the lack of written records and the current state of historical linguistics — of a continuous history of human language that stretches back at least 100,000 and possibly up to 1 million years. The traces of this unbroken linguistic inheritance, if you know where to look, are everywhere.
Languages are not born overnight but for the most part grow seamlessly out of other languages over generations and centuries. In fact, many linguists now doubt whether countable, neatly defined “languages” exist at all, at least without our forcing them into existence. There is no universal definition of what constitutes a language (versus a dialect, most famously). Nor can the fluid and variable linguistic practices of any community, or even the complex repertoire of any individual, be boiled down to a single fixed code of words and grammar.
Instead we all use multiple interacting codes paired with social meaning, as the linguist Jeff Good puts it, and it would be better to focus on documenting “the linguistic behavior and knowledge of individuals.” Writing with Michael Cysouw, he proposes that linguists abandon “language” for terms like “languoid” and “doculect” that can be more narrowly defined. At best a name like “English” (or “Estonian” or “Ewe”) “suffices as an informal communicative designation,” in the words of Cysouw and Good. We can use these designations, but should remember how much they oversimplify the fluid, layered and multimodal nature of human communication.
For centuries, the practice of “speaking English” existed among a limited group of people bound by ties of place and kinship. Apart from neighboring peoples, few others had ever heard it, let alone learned it as a second language. Old English (as later scholars periodized it) evolved into Middle English and eventually Modern English, changing gradually across time and space beyond what any early speaker would have recognized. Forms of Norse, French, Latin and Greek used by more powerful peoples exerted a massive influence. Then English speakers began conquering Celtic, Native American, African, Australian and other peoples, pressuring or forcing them to give up their languages.
For the most part, I learned English by interacting with the small set of people around me during my first three or four years alive. My family had entered the language about a century earlier, after emigrating from Eastern Europe to New York City. Likewise, most of the world’s approximately 400 million native English speakers had other mother tongues in their families until relatively recently. Of course there is a hierarchy of dialects, with the U.K., the U.S., Canada, Australia and New Zealand often seen as an “inner circle,” crosscut by privilege and prejudice along racial, ethnic, class, regional and social lines. Then there are all the gradations and variations among the more than 1 billion people who have learned English as a second (or third or fourth) language very recently indeed.
The more it becomes a global lingua franca, the less English “belongs” to anyone in particular. Though still present, its historic layers and contemporary variants have been and are still being ruthlessly flattened, especially in writing. No single entity is in charge, but governments, companies, schools, publishers, media outlets and others have all been generating, spreading and standardizing English. The result is now a massive, interconnected and ever more predictable blob — the perfect training set for an LLM.