
LLMs are Glorified Translators

By Marcos Cooper, 2026-02-08

I’ve been working with large language models (LLMs) for a few years now, and like many others, I was initially amazed by how good the responses were. At times, it genuinely felt as if these models were intelligent, just one step away from artificial general intelligence.

But here’s the uncomfortable part: despite rapid progress in results and usability, the core idea behind these models hasn’t changed as much as many people assume.

Model scale, training methods (such as reinforcement learning from human feedback), multimodal capabilities, and fine-tuning strategies have all improved drastically. However, the underlying modeling approach itself hasn’t evolved at the same pace.

A common explanation for why LLMs behave the way they do is emergent behavior, and that explanation is unsatisfying. If key capabilities simply emerge with scale, it becomes difficult to reason about how to improve them in a controlled way. Will adding more data increase accuracy, or could it degrade it? And where, exactly, does this so-called emergent behavior come from?

I have a theory.

Why Was the Transformer Architecture Invented?

The Transformer architecture, which underpins modern LLMs, was first introduced in the context of machine translation. Its key contribution was a new way to model relationships between tokens in a sequence, regardless of their distance from one another. This capability turned out to be especially effective for translation, where meaning often depends on long-range dependencies within a sentence.
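
To make that attention idea concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of the original Transformer; the matrices and dimensions below are made up purely for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted mix of the value rows V, where the
    weights depend on how well a query matches every key, regardless of
    how far apart the corresponding tokens are in the sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # pairwise query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over key positions
    return weights @ V

# Toy example: 4 tokens with 8-dimensional representations (arbitrary numbers).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)         # (4, 8)
```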

Consider BERT as an early example. It was designed to process language bidirectionally, meaning it could condition on both the left and right context of a token at the same time. This made it particularly effective for tasks such as filling in missing words or modeling relationships between sentences, and it highlighted how powerful attention-based sequence modeling could be beyond simple left-to-right generation.
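
One way to see that masked-word objective in action is the snippet below, which assumes the Hugging Face transformers library and the bert-base-uncased checkpoint are available in your environment; the example sentence is made up.

```python
from transformers import pipeline

# Masked-language-model pipeline; downloads bert-base-uncased on first run.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT conditions on both the left and right context of the [MASK] token.
for candidate in fill_mask("The doctor told the patient to take the [MASK] twice a day."):
    print(f"{candidate['token_str']:>12}  {candidate['score']:.3f}")
```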

What’s important here isn’t machine translation itself, but the underlying pattern. At a fundamental level, translation is a sequence-to-sequence problem: one stream of tokens is transformed into another. Once that mechanism exists, there’s nothing particularly special about translating between languages versus transforming text in other ways.

What’s Really Improving with Scale?

I think we often misunderstand what actually improves as these models scale. It’s commonly assumed that adding more data fundamentally changes a model’s behavior. I’m not convinced that’s true. The underlying mechanism remains the same; what changes is the breadth and density of information the model can draw from.

At a fundamental level, machine translation takes an input sequence (text in one language) and transforms it into an output sequence (text in another). Conceptually, this isn’t very different from other sequence-to-sequence problems. Deriving a question from an answer, generating a response from a prompt, or rephrasing text all follow the same basic pattern: mapping one sequence of tokens to another.

In essence, the original Transformer performed a sophisticated form of token transformation: it mapped an input stream of tokens (words or sub-word units) to an output stream of tokens. Modern LLMs, despite their incredible versatility, operate on this same fundamental principle. When an LLM “predicts the next token”, it’s effectively performing a continuous sequence of transformations, incrementally constructing an output based on patterns learned from large amounts of text.
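
A minimal sketch of that token-by-token transformation, assuming the Hugging Face transformers library and the small GPT-2 checkpoint are available (any causal language model would do): at each step the current sequence is mapped to a distribution over the next token, and the most likely token is appended.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("A list of healthy foods:", return_tensors="pt").input_ids

# Greedy decoding: repeatedly transform the current token sequence into a
# distribution over the next token and append the argmax.
with torch.no_grad():
    for _ in range(20):
        logits = model(ids).logits                     # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)

print(tokenizer.decode(ids[0]))
```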

So while modern LLMs aren’t literal translation systems in the narrow sense of converting between languages, their conceptual lineage traces back to the same idea. They still take an input sequence and “translate” it into a desired output sequence, just on a far more abstract and expansive scale than simple language pairs.

Adding more high-quality, diverse data does improve LLM performance, though not in a predictable or linear way. Behavior doesn’t emerge from architecture alone; it’s shaped by data, scale, and training dynamics. More data can improve generalization, but not necessarily in the ways we expect or desire.

People often draw parallels between LLMs and the human brain, but the similarities are mostly superficial. The brain’s structure, learning signals, and constraints are fundamentally different, and those differences matter far more than the sheer volume of data.

LLMs are trained primarily on language, which is already a highly abstracted and compressed representation of human experience. Language captures conclusions, descriptions, and categories, but not the full process by which those concepts are formed. Expecting a model trained on this data alone to outperform the systems that generated it in the first place is, at best, questionable.

The same issue applies to audio and image data. Even in computer vision, models rely heavily on labeled examples. We tell them “this is a bird” or “this is a cat”, feed them pixels or audio samples, and expect understanding to emerge. But whatever the model learns is still anchored in human-defined categories and annotations.

The human brain operates under very different constraints and learning signals.

The Real Value of Today’s LLMs

Despite these limitations, LLMs are incredibly valuable. Through regular use, it becomes clear that we’re primarily using them as translators, systems that map inputs to outputs based on learned patterns.

And when I say “translator”, I don’t just mean between languages such as English and Japanese. I mean:

  • A natural language request such as “Give me a list of healthy foods” translated into an actual list of foods.
  • A command like “create a calendar event for tomorrow at 3 PM” translated into a structured format, such as JSON or an API call, that a system can execute (sketched below).
  
  • Translating one text style into another, for example simplifying a scientific paper so that a child can understand it.
  • Translating a descriptive prompt into token sequences that a diffusion model can use to generate an image.

In every case, the LLM performs a high-dimensional mapping from one token sequence to another, based on statistical patterns learned during training on massive datasets.
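
For the second bullet above, here is a minimal sketch of that translation into a structured format. The call_llm helper is a hypothetical stand-in for a real model call, and the prompt wording is just one plausible way to ask for JSON.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; it returns a canned
    completion here so the sketch runs end to end."""
    return '{"title": "Meeting", "date": "2026-02-09", "time": "15:00"}'

PROMPT = """Translate the user request into JSON with keys
"title", "date" (YYYY-MM-DD) and "time" (HH:MM, 24-hour). Reply with JSON only.

Request: create a calendar event for tomorrow at 3 PM"""

raw = call_llm(PROMPT)
event = json.loads(raw)          # fails loudly if the model strays from pure JSON
print(event["title"], event["date"], event["time"])
```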

The “Stochastic Parrot”

This translator role aligns well with the idea of the “stochastic parrot”. The term suggests that LLMs excel at reproducing and recombining language patterns to generate statistically likely sequences of text, without possessing understanding in any meaningful sense. They produce outputs that are often novel and useful, but still grounded in learned correlations rather than intent or comprehension.

Why Do They Look Smart?

The output looks smart because it is relevant, coherent, and follows instructions. But if “smartness” implies understanding meaning, holding beliefs, forming intentions, or possessing consciousness, then these models fall short. The model doesn’t know what a “healthy food” is in a biological or nutritional sense. It knows that the phrase “healthy foods” frequently appears near lists containing items like broccoli, chicken breast, or quinoa.
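
A toy illustration of learned correlation rather than comprehension: counting which words co-occur with “healthy” in a tiny made-up corpus. Real models learn vastly richer statistics, but the signal is of the same kind.

```python
from collections import Counter

# Tiny made-up corpus; real training data spans billions of documents.
corpus = [
    "healthy foods include broccoli quinoa and chicken breast",
    "a healthy diet contains broccoli spinach and lentils",
    "fried snacks are tasty but not healthy",
]

co_occurrences = Counter()
for sentence in corpus:
    words = sentence.split()
    if "healthy" in words:
        co_occurrences.update(w for w in words if w != "healthy")

# Words that most often appear near "healthy": association, not nutrition.
print(co_occurrences.most_common(5))
```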

What About Agentic Systems?

Agentic systems, where LLMs serve as central components, further reinforce this translator role. In these setups, the model often takes a human goal expressed in natural language and translates it into structured actions or commands that external tools can execute.

Any intelligence these systems exhibit comes less from the model itself and more from the surrounding system design: the tools available, the constraints imposed, and how everything is orchestrated. The LLM primarily functions as a natural language interface, converting human intent into machine-readable operations.
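
A minimal sketch of that interface role, with entirely hypothetical tool names: the model’s only contribution is a JSON action, and the surrounding code decides what that action is allowed to do.

```python
import json

# Hypothetical tool registry; the useful behavior lives in what these
# functions do and in the constraints placed around them.
def create_event(title: str, time: str) -> str:
    return f"event '{title}' scheduled at {time}"

def send_email(to: str, body: str) -> str:
    return f"email sent to {to}"

TOOLS = {"create_event": create_event, "send_email": send_email}

def dispatch(llm_output: str) -> str:
    """The LLM translated a goal into JSON; this code validates and executes it."""
    action = json.loads(llm_output)
    tool = TOOLS[action["tool"]]          # anything not registered is rejected
    return tool(**action["arguments"])

# Pretend the model emitted this after reading the user's request.
print(dispatch('{"tool": "create_event", "arguments": {"title": "Dentist", "time": "15:00"}}'))
```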

So, Are LLMs Intelligent?

A useful way to evaluate claims about LLM intelligence is to ask what would actually falsify them. For me, the clearest test would be the ability to reliably control hallucinations.

Hallucination is often framed as a defect that can be fixed with better alignment, more data, or improved prompting. I’m skeptical of that framing. As long as a model’s core objective is to predict the next token in a sequence, it must always produce something. Even “I don’t know” is just another learned pattern, not an expression of uncertainty grounded in knowledge or belief.

A system that could truly avoid hallucinating would need the ability to distinguish between what it knows, what it doesn’t know, and what cannot be inferred from the given context. That would require internal mechanisms for representing uncertainty and ignorance, mechanisms that go beyond probabilistic sequence continuation. I suspect this would demand architectural changes fundamentally incompatible with the current Transformer-based approach, rather than incremental improvements layered on top of it.
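
One way to see what probabilistic sequence continuation does and does not provide: the sketch below computes the entropy of a made-up next-token distribution. A flat distribution shows the model has no strong preference for any token, but that is a property of the learned statistics, not a representation of what the model knows or does not know.

```python
import numpy as np

def next_token_entropy(logits: np.ndarray) -> float:
    """Entropy (in bits) of the softmax over next-token logits.
    High entropy means the probability mass is spread out."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log2(p + 1e-12)).sum())

# Made-up logits over a 5-token vocabulary.
confident = np.array([9.0, 1.0, 0.5, 0.2, 0.1])    # one token dominates
uncertain = np.array([1.1, 1.0, 0.9, 1.0, 1.05])   # nearly uniform

print(next_token_entropy(confident))   # low entropy
print(next_token_entropy(uncertain))   # close to log2(5) ≈ 2.32 bits
```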

This doesn’t make LLMs less useful. On the contrary, it clarifies where their strengths lie. They excel at transforming information, intent, and format. They are exceptionally good at mapping one representation to another. But that same strength comes with inherent limits.

Understanding those limits matters. Not because LLMs are disappointing, but because mischaracterizing what they are leads to unrealistic expectations about what they can become.
