How Large Language Models Actually Work: The Science Behind ChatGPT, Claude And Gemini

Posted by Baryon — June 17, 2026

⏱ 24 minutes

There is an explanation most people have been given for how ChatGPT works. It goes something like this: it has read a lot of text on the internet, and it predicts the next word. Technically, this is not wrong. But it is about as complete as saying the human brain is a collection of cells that send electrical signals. Accurate at the most literal level. Entirely useless for understanding what is actually happening.

In June 2017, eight researchers at Google Brain — Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin — published a paper called Attention Is All You Need. It was presented at NeurIPS, the world’s most prestigious machine learning conference. It was 15 pages long. The abstract was modest. The title was almost playful.

It changed everything.

The architecture they described — the transformer — is the engine inside every large language model alive today. ChatGPT, Claude, Gemini, Llama, Mistral, Grok. Every model that has reshaped how humans interact with machines in the past three years runs on a direct descendant of what those eight researchers built. And in April 2026, Anthropic revealed just how far that architecture had come: Claude Mythos Preview, a model so capable at finding software vulnerabilities that they refused to release it publicly. Eight weeks later, on June 9, 2026, its guardrailed sibling Claude Fable 5 became the most powerful AI model ever made widely available to the public.

Understanding what a transformer actually does — at the level of the science — is understanding what this moment in history actually is.

This article covers all of it: the science, the key researchers, the landmark papers, the biology connections, the 2026 frontier, and what it means for the world being built around these systems.

Table of Contents

Before the Transformer: The Long Road from Neurons to Neural Networks

The story of large language models does not begin in 2017. It begins in 1943, in a collaboration between a neuroscientist and a logician that most people have never heard of.

Warren McCulloch, a neurophysiologist at the University of Illinois, and Walter Pitts, a mathematical prodigy who had taught himself formal logic as a teenager, published a paper in the Bulletin of Mathematical Biophysics titled A Logical Calculus of the Ideas Immanent in Nervous Activity. In it, they proposed the first mathematical model of a neuron: a simple threshold device that either fires or does not fire, depending on whether the sum of its inputs exceeds a threshold value. Binary. All or nothing. Just like the biological neuron they were modelling.

This was the founding act of artificial neural networks. A biological mechanism, abstracted into mathematics. To understand how DNA encodes the genetic instructions that build and wire those biological neurons is to understand the real starting point of every AI system running today.

In 1958, Frank Rosenblatt at Cornell Aeronautical Laboratory built the Perceptron — the first trainable neural network, described in Psychological Review. It could learn to classify simple patterns by adjusting connection weights based on errors. It caused enormous excitement. It also hit a wall: the Perceptron could only solve linearly separable problems. XOR — a basic logical operation — defeated it entirely. By the early 1970s, AI funding had collapsed into what became known as the first “AI winter.”

The thaw came in 1986. Geoffrey Hinton at the University of Toronto (later Google), David Rumelhart at UC San Diego, and Ronald Williams published a paper in Nature — volume 323, pages 533–536 — demonstrating that backpropagation could be used to train multi-layer networks. The algorithm computed how much each weight in the network contributed to the final error, then adjusted every weight accordingly, propagating corrections backwards through the layers. Multi-layer networks could now learn complex, non-linear patterns.

Hinton shared the 2018 Turing Award — computing’s Nobel Prize — with Yann LeCun (Bell Labs, now Meta AI) and Yoshua Bengio (Université de Montréal). LeCun developed convolutional neural networks (CNNs) in 1989, which proved extraordinarily powerful for image recognition. Bengio’s group pioneered language modelling with neural networks and developed key ideas about word representations that would lead directly to modern language AI.

In 2024, Hinton was awarded the Nobel Prize in Physics for his foundational contributions to the field he helped create — and then spent part of his speech warning the world about the dangers it presents.

By the 2010s, the field was dominated by recurrent neural networks (RNNs) and specifically Long Short-Term Memory (LSTM) networks, developed by Sepp Hochreiter (Technical University of Munich) and Jürgen Schmidhuber (IDSIA) in 1997 and published in Neural Computation. LSTMs introduced memory cells that could preserve information across long sequences — critical for language, where the meaning of a word can depend on context from dozens of words earlier.

But LSTMs had a fundamental constraint: they processed sequences word by word, one token at a time, carrying a hidden state forward. This sequential processing could not be parallelised efficiently on modern hardware. Long sequences caused the hidden state to become “saturated” — earlier context was progressively lost. And training was slow. The field needed something fundamentally different. The solution arrived in June 2017.

The 2017 Paper That Changed Everything: Attention Is All You Need

The full author list of the paper that launched the modern AI era is worth recording: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. All were at Google Brain or Google Research at the time of publication. The paper was submitted on June 12, 2017, accepted at NeurIPS that same year, and has since been cited over 130,000 times — making it one of the most influential scientific papers in the history of computer science.

The core proposal was radical in its simplicity: abandon recurrence entirely. Rather than processing language sequentially, process the entire sequence simultaneously. Rather than carrying a hidden state forward through time, use a mechanism called self-attention that lets every word in a sequence directly attend to every other word at once, computing how relevant each is to understanding the current word’s meaning.

The immediate results were stunning. On the WMT 2014 English-to-German translation benchmark, the transformer achieved a BLEU score of 28.4 — more than 2 points above the previous state of the art, using a fraction of the training time. The architecture was also, crucially, massively parallelisable: because all positions in a sequence were processed simultaneously, training could be distributed across thousands of GPU cores in ways that sequential RNN processing could not.

What followed was a cascade of breakthroughs, each building on the transformer architecture:

BERT (2018) — Bidirectional Encoder Representations from Transformers, Google. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Published in the proceedings of NAACL-HLT 2019. BERT introduced the idea of pre-training a transformer on a massive text corpus using masked language modelling — randomly hiding words and training the model to predict them from context. It then fine-tuned this pre-trained model on specific downstream tasks. BERT demolished eleven NLP benchmarks simultaneously when it was released. The pre-train-then-fine-tune paradigm it established became the template for everything that followed.

GPT (2018) and GPT-2 (2019) — OpenAI. Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever. Where BERT used the transformer encoder (bidirectional context), GPT used the decoder — processing text left to right and training to predict the next token. This “autoregressive” approach turned out to be the key to generation. GPT-2 generated such fluent prose that OpenAI initially refused to release it, fearing misuse. In retrospect, it was a modest system. But it demonstrated the generative power latent in the architecture.

GPT-3 (2020) — 175 billion parameters. Trained on approximately 300 billion tokens. Published in Advances in Neural Information Processing Systems by Tom Brown and 29 co-authors. GPT-3 produced human-quality writing, answered factual questions, wrote code, and performed few-shot learning — tasks it had never been explicitly trained on — from just a handful of examples in the prompt. The age of the large language model had arrived.

What “Attention” Actually Means — and Why It Is the Key to Everything

The word “attention” in machine learning was not invented by the transformer paper. It was pioneered by Dzmitry Bahdanau at Université de Montréal (now at Google Brain), working with Kyunghyun Cho and Yoshua Bengio, in a 2015 paper submitted to ICLR titled Neural Machine Translation by Jointly Learning to Align and Translate. Their attention mechanism allowed a neural machine translation system to dynamically focus on different parts of the input sentence when generating each word of the output. It was a breakthrough for translation. The transformer took this idea and made it the entire architecture, discarding everything else.

Here is what attention actually computes, in plain language.

Consider the sentence: “The animal didn’t cross the street because it was too tired.” What does “it” refer to? The animal or the street? A human reader resolves this instantly by connecting “it” to “animal” based on the semantic relationship between “tired” and a living creature. A sequential model has to carry this information forward through hidden states across multiple timesteps, and over long distances that information degrades.

Attention solves this by allowing “it” to directly look at “animal” and “street” simultaneously and compute which is more relevant, in a single operation.

Technically, for each token (word or word fragment) in the sequence, the transformer computes three vectors:

Query (Q): What this token is looking for in others
Key (K): What information this token offers to others
Value (V): The actual content this token contributes

The attention score between two tokens is computed by taking the dot product of the first token’s Query vector with the second token’s Key vector, scaling it, and passing it through a softmax function to produce a probability distribution over all other tokens. Each token then receives a weighted sum of all other tokens’ Value vectors, where the weights are determined by those attention scores.

This happens not once but in parallel across multiple attention heads — each head learning to attend to different types of relationships simultaneously. One head might learn syntactic dependencies (subject-verb agreement). Another might track coreference (what “it” refers to). Another might capture semantic similarity. The outputs of all heads are concatenated and projected into a final representation. This is multi-head attention.

Think of it the way a researcher reads a scientific paper. When reading the conclusion, they are simultaneously holding in mind the methodology, the results tables, the original hypothesis stated in the abstract, and the caveats in the discussion section — not reading the conclusion in isolation from everything that came before. Attention gives the model the same ability: to understand any part of a text in the context of every other part, simultaneously, in a single computational step.

This is why, when you ask ChatGPT a question that depends on something you said twenty messages ago, it can still answer accurately. The attention mechanism ranges across the entire context window — everything in the conversation — at once. No sequential state degradation. No forgetting.

Tokens, Embeddings and the Strange Way Language Models Read

Large language models do not read words. They read tokens — chunks of text that are often, but not always, complete words or parts of words. The tokenisation scheme is determined at training time and fixed thereafter. GPT-4 uses a tokeniser called Byte Pair Encoding (BPE) with a vocabulary of approximately 100,000 tokens. The word “genetics” might be one token. “Epigenetics” might be two: “epi” and “genetics.” “CRISPR” might tokenise differently again.

The concept of representing words as dense numerical vectors was formalised by Tomas Mikolov and colleagues at Google in 2013, in a paper titled Efficient Estimation of Word Representations in Vector Space, presented at ICLR. Their system, Word2Vec, trained on a large text corpus and produced vectors of typically 100–300 numbers for each word. Words with related meanings ended up geometrically close to each other in this vector space.

The now-famous example: King − Man + Woman ≈ Queen. The model had learned semantic relationships as geometry, without any human labelling.

In a modern transformer, each token is represented as an embedding vector of typically 768 to 12,288 numbers, depending on model size. These embeddings are learned during training. But unlike Word2Vec, which gives each word a fixed representation, transformer embeddings are contextual: the embedding for “bank” in “river bank” is a different vector from “bank” in “central bank,” because the attention mechanism has incorporated context from surrounding tokens into each representation.

Because transformers process all tokens simultaneously rather than sequentially, they need a way to encode the order of tokens. Without this, “dog bites man” and “man bites dog” would produce the same representations — the same tokens in different order.

The original transformer solved this with positional encodings: sine and cosine functions of different frequencies added to each token’s embedding, encoding its position in the sequence. Modern models use learned positional encodings or more sophisticated schemes like RoPE (Rotary Position Embedding) and ALiBi, which allow models to generalise better to sequences longer than those seen during training.

The strangeness worth dwelling on is this: the embedding space is not designed by any human. Nobody sat down and decided that “Paris” and “France” should be related, or that “mitosis” and “meiosis” should be nearby but not identical. The geometry of meaning builds itself, entirely from training. The model learns what things mean because it learns to predict what comes next — and meaning, in the statistical structure of human language, leaves a recoverable fingerprint in patterns of co-occurrence.

Training: How an LLM Learns from 10 Trillion Words

The training of a large language model happens in two phases. The first is pretraining. The second is alignment. Both matter enormously. And they operate on radically different scales.

Pretraining: The Scale of It

During pretraining, the model is given a massive corpus of text and trained on a single objective: predict the next token. The text corpus for GPT-3 included Common Crawl (filtered web text), WebText2 (Reddit-linked pages), Books1, Books2, and English Wikipedia.

Total size: approximately 300 billion tokens. For GPT-4 and subsequent models, the training data is estimated in the range of several trillion tokens. For Gemini Ultra, Google describes training on a dataset that includes web documents, books, code, and multimodal data.

For each token in the training data, the model sees all preceding tokens and must predict the next one. The prediction is compared to the true next token using cross-entropy loss. The error is then propagated backwards through the network using the backpropagation algorithm that Hinton and Rumelhart demonstrated in 1986, and all the model’s weights are nudged in directions that reduce the error. This is done using a variant of stochastic gradient descent, most commonly the Adam optimiser (Kingma and Ba, 2015).

The numbers are difficult to hold in mind. GPT-3 has 175 billion parameters. A parameter is a single floating-point number. 175 billion of them. Each forward pass through GPT-3 performs approximately 350 billion floating-point operations. Training GPT-3 required an estimated 3.14 × 10²³ floating-point operations in total — consuming roughly 190,000 GPU-hours on V100 hardware. The estimated compute cost of training GPT-4 exceeds $100 million.

Alignment: Making the Model Useful and Safe

A pretrained model is competent but uncontrolled. It generates plausible text continuations, which means it can write harmful content as readily as a recipe. It might answer the same question differently depending on how the question is phrased. It has no concept of what a helpful, honest, harmless response looks like — it has only learned statistical patterns.

RLHF (Reinforcement Learning from Human Feedback) was the technique that turned pretrained models into useful assistants. Developed at OpenAI and described in a landmark paper by Long Ouyang and colleagues, published at NeurIPS 2022 — Training language models to follow instructions with human feedback — RLHF works as follows:

Human raters compare pairs of model outputs and indicate which is better.
A separate reward model is trained to predict which outputs humans prefer.
The main language model is then fine-tuned using reinforcement learning to maximise the reward model’s score.

This is why ChatGPT feels collaborative and helpful rather than raw. RLHF is what makes the difference between a statistical next-token predictor and an assistant that maintains a coherent, useful persona across a conversation.

Anthropic, the company behind Claude, developed an alternative alignment approach: Constitutional AI (CAI), described by Yuntao Bai and colleagues in an arXiv paper published December 2022. Rather than relying entirely on human feedback, Constitutional AI gives the model a set of written principles — a “constitution” — and trains it to critique and revise its own outputs against those principles.

The model learns to evaluate itself. The result is an alignment approach that is more scalable and, Anthropic argues, more interpretable than pure RLHF.

Emergence: The Abilities Nobody Programmed In

In August 2022, a team of researchers at Google Brain and Stanford — Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus — published a paper in Transactions on Machine Learning Research titled Emergent Abilities of Large Language Models.

It catalogued something that had been noticed informally but never formally characterised: certain capabilities of language models are essentially absent in small models and present in large models, appearing apparently suddenly as model scale increases.

Examples of emergent abilities they catalogued include:

Chain-of-thought reasoning: Breaking a multi-step problem into intermediate steps, dramatically improving accuracy on maths and logic. Not present in models below a certain scale; appears reliably above it.
Multi-step arithmetic: Solving problems requiring several sequential calculations. Small models fail entirely; large models succeed.
Word unscrambling: Recognising and correcting scrambled words. Below approximately 100 billion parameters: near-random performance. Above: near-perfect performance.
BIG-bench tasks: A benchmark suite of 204 tasks specifically designed to be beyond current model capabilities — abilities on many of these tasks show sharp discontinuous improvements at particular model scales.

The concept echoes something biologists know well from studying complex systems: that emergent properties of a large system cannot be predicted by studying its components in isolation. The wetness of water cannot be derived from studying a single H₂O molecule.

The consciousness of a brain cannot be deduced from examining a single neuron. At what point do statistical patterns of language prediction give rise to something that looks, functionally, like reasoning? This is one of the most contested questions in contemporary AI science, and it connects directly to the questions raised in cancer genetics research about how complex systems — whether tumours or neural networks — develop new capabilities through the accumulation of changes at the component level.

The Scientific Counterargument

Not everyone accepts that emergence is real rather than an artefact of measurement. Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo from Stanford published a paper at NeurIPS 2023 titled Are Emergent Abilities of Large Language Models a Mirage? They argued that the apparent sharpness of emergent transitions is an artefact of using discontinuous, threshold-based metrics. When you replace pass/fail benchmarks with continuous metrics, many “emergent” abilities look like smooth improvements that have been present all along, merely invisible to the binary measurement scheme.

This debate has not been resolved. It matters practically: if emergent abilities are smooth and predictable, then the behaviour of future models can in principle be extrapolated from current ones. If they are genuinely sudden and discontinuous — real phase transitions in capability — then larger models may surprise us in ways we cannot anticipate.

Given that Claude Mythos’s benchmark results represented discontinuous jumps of 10–55 percentage points over its predecessor, the debate over emergence is no longer merely academic.

How ChatGPT, Claude and Gemini Differ — and Why It Matters

All three major frontier models share the same foundational architecture — the transformer, as defined in the 2017 paper. But the choices made in training, alignment, and design philosophy produce models with genuinely different characteristics, strengths, and failure modes.

ChatGPT / GPT-4o (OpenAI)

OpenAI’s GPT-4 series uses a decoder-only transformer and was trained with RLHF alignment, building on the methodology described by Ouyang et al. (2022). GPT-4o, released in May 2024, introduced native multimodality — the ability to process images, audio, and text in a unified model rather than piping inputs through separate specialised systems.

Key figures: Sam Altman (CEO), Greg Brockman (co-founder), Ilya Sutskever (Chief Scientist until departure in May 2024). Scientific groundwork: John Schulman (co-developed PPO, the reinforcement learning algorithm underlying RLHF at OpenAI, left for Anthropic in 2024). Context window: up to 128,000 tokens in GPT-4 Turbo. Estimated parameters: over 1 trillion, though OpenAI has not officially confirmed this.

GPT-4 demonstrated remarkable performance on standardised tests — scoring in the 90th percentile on the bar exam, the SAT, and numerous medical licensing examinations — results reported in OpenAI’s technical report published March 2023.

Claude (Anthropic)

Anthropic was founded in 2021 by Dario Amodei and Daniela Amodei, along with several other former OpenAI researchers including Tom Brown (lead author of the GPT-3 paper) and Chris Olah (who leads Anthropic’s mechanistic interpretability research programme). The company’s stated mission is the responsible development of AI for the long-term benefit of humanity.

Claude models use Constitutional AI alignment rather than pure RLHF, giving the model explicit ethical principles to reason against. Anthropic has invested unusually heavily in mechanistic interpretability — the attempt to understand, at the level of individual circuits and components, what is actually happening inside the model. Chris Olah’s team published A Mathematical Framework for Transformer Circuits (2022) in the Transformer Circuits Thread, identifying the actual algorithms implemented by attention heads — such as induction heads, which enable in-context learning. Claude 3 Opus achieved a context window of 200,000 tokens, at the time one of the largest of any publicly available model.

Gemini (Google DeepMind)

Google DeepMind was formed in April 2023 by the merger of Google Brain (the team that wrote Attention Is All You Need) and DeepMind (the team behind AlphaGo, AlphaFold, and AlphaGenome). It is led by Demis Hassabis, co-founder and CEO. Notably, Noam Shazeer — second author of the original transformer paper — left Google to found Character.AI in 2021 and was effectively re-acquired by Google in 2024 for $2.7 billion, returning to the company whose foundational paper he co-wrote.

Gemini Ultra was designed with native multimodality from the ground up — trained simultaneously on text, images, audio, video, and code rather than adding visual capability as an afterthought.

The Gemini team published their technical report in December 2023 (Gemini: A Family of Highly Capable Multimodal Models, arXiv:2312.11805). Gemini Ultra outperformed GPT-4 on 30 of 32 academic benchmarks at launch, including the Massive Multitask Language Understanding (MMLU) benchmark at which it became the first model to exceed human expert performance.

Breaking 2026: Claude Mythos, Project Glasswing and Claude Fable 5

📡 Latest developments — June 2026

The events of April and June 2026 mark the most significant moment in AI deployment history since ChatGPT launched in November 2022. Here is what happened, what it means, and why it matters for anyone trying to understand where this technology is heading.

Claude Mythos Preview — April 7, 2026

On April 7, 2026, Anthropic announced Claude Mythos Preview — internally codenamed “Capybara” — the most advanced AI model it had ever built, positioned a full capability tier above its public Opus and Sonnet model lines. The announcement came with an unusual decision: Anthropic would not release Mythos to the public.

The reason was stark. During internal testing, Anthropic’s red team found that Mythos Preview could identify and exploit zero-day vulnerabilities — previously unknown software flaws — in every major operating system and web browser, at a level matching the best human security researchers.

The model autonomously wrote a remote code execution exploit against FreeBSD’s NFS server from a 17-year-old bug. It found a 27-year-old flaw in OpenBSD, an operating system known specifically for its security practices. Anthropic’s system card for Mythos Preview — a 244-page document — described these findings in detail and concluded that the model’s cybersecurity capabilities were “surprisingly advanced,” sufficient to represent a material risk if made widely available.

The benchmark results were not incremental improvements. They were a discontinuity:

97.6%

USAMO 2026 Mathematics Olympiad (vs 42.3% for Opus 4.6)

93.9%

SWE-bench Verified (real software engineering tasks)

77.8%

SWE-bench Pro (vs 57.7% for GPT-5.4)

64.7%

Humanity’s Last Exam with tools (vs 58.7% for GPT-5.4)

Instead of a public release, Anthropic launched Project Glasswing: an invitation-only consortium of approximately 50 organisations granted controlled access to Mythos Preview for the purpose of finding and fixing vulnerabilities in critical software infrastructure. Founding members included Apple, Amazon Web Services, Microsoft (which immediately began running Mythos against its own codebases through the Microsoft Security Response Center), Google, Cisco, CrowdStrike, Palo Alto Networks, and NVIDIA.

The results from the first month of Project Glasswing were significant: over 10,000 high- or critical-severity vulnerabilities discovered across partner software. Of those, 530 had been disclosed to maintainers and 75 patched with public advisories. The true-positive rate on assessed findings: 90.6%.

“We formed Project Glasswing because of capabilities we have observed in a new frontier model trained by Anthropic that we believe could reshape cybersecurity. AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities.”
— Anthropic, Project Glasswing announcement, April 7, 2026

Claude Fable 5 and Mythos 5 — June 9, 2026

On June 9, 2026 — eight weeks after the Mythos Preview announcement — Anthropic made a second move: the simultaneous release of Claude Fable 5 and Claude Mythos 5.

Fable 5 and Mythos 5 share the same underlying model and architecture. They are, in Anthropic’s framing, the same system with a different layer of safety classifiers. Claude Fable 5 — the publicly available version — runs with always-on adaptive thinking, a 1-million-token context window, and 128,000 output tokens. In high-risk domains (cybersecurity, biology, chemistry, distillation), it automatically falls back to Claude Opus 4.8. Claude Mythos 5 — the restricted version — lifts those classifiers and remains available only to Project Glasswing partners.

The release came five days after Anthropic published a paper titled When AI builds itself, calling for a globally coordinated slowdown on frontier AI development due to safety concerns, specifically citing the risk of recursive self-improvement (RSI) — the possibility that AI systems could soon autonomously improve themselves without human oversight. Releasing the most capable publicly available AI model days after warning the world about AI safety reflects the tension every frontier lab faces: the safest option and the competitive reality do not always point in the same direction.

Fable 5 is available on the Claude API, AWS Bedrock, Amazon Bedrock, Vertex AI, Microsoft Foundry, and GitHub Copilot. Pricing: $10 per million input tokens and $50 per million output tokens. For Claude.ai subscribers (Pro, Max, Team, Enterprise plans), Fable 5 was included at no extra cost through June 22, 2026, after which usage credits are required.

Within three days of launch, a US government export-control directive temporarily forced both Fable 5 and Mythos 5 offline — an unprecedented regulatory intervention that underscores how seriously governments are now treating frontier AI capability. All other Claude models (Opus, Sonnet, Haiku) remained unaffected. Anthropic has stated it is working to restore access as quickly as the export directive permits.

The Fable 5 and Mythos 5 generation represents something genuinely new: the first time that Mythos-class AI capability — the tier that Anthropic itself said was too dangerous to release eight weeks earlier — crossed from controlled research preview into the hands of enterprise teams and individual developers. How that transition unfolds will be one of the defining stories of the coming years in AI.

The Biology Connection: How LLMs Are Already Reshaping Genetics Research

The architecture developed to understand human language turns out to be extraordinarily powerful for reading another kind of sequence: the sequences of nucleotides in DNA, the sequences of amino acids in proteins, the regulatory grammar of the genome. This is not a metaphor. The transformer architecture is being applied directly to biological sequences, with results that are transforming molecular biology.

AlphaFold and the Protein Language Model Revolution

In November 2020, DeepMind’s AlphaFold 2 — led by John Jumper (now at Google DeepMind) — solved the protein folding problem: predicting the three-dimensional structure of a protein from its amino acid sequence. The problem had been unsolved for 50 years. AlphaFold used a transformer-based architecture trained on known protein structures. The 2021 paper in Nature (Jumper et al., volume 596, pages 583–589) described a system that predicted protein structures with accuracy matching experimental methods. In 2024, Jumper and Hassabis shared the Nobel Prize in Chemistry for this work.

Meta AI’s ESMFold took this further: Lin et al. (2023, Science, vol. 379, 1123–1130) trained a pure language model on 250 million protein sequences, treating amino acid letters exactly as a language model treats text tokens. The model learned protein structure prediction as a downstream task from sequence alone — no evolutionary information required. ESMFold can predict protein structures at comparable accuracy to AlphaFold at a fraction of the computational cost. Protein sequences are a language. LLMs can read them.

AlphaGenome: The Transformer Reading Non-Coding DNA

In 2025, Google DeepMind released AlphaGenome — a transformer model trained to decode what happens in the 98% of the human genome that does not code for proteins: the regulatory regions, enhancers, silencers, and non-coding RNA sequences that determine when, where, and how much of each gene is expressed. The model uses the same attention mechanism described in Attention Is All You Need to understand the regulatory context of DNA sequences — how a regulatory element far upstream influences the expression of a gene thousands of base pairs away.

This is exactly the kind of long-range dependency that attention was designed to capture. The biology of the genome and the architecture of the transformer are, at a fundamental level, solving the same problem: understanding meaning at a distance.

This has direct implications for understanding how CRISPR gene editing tools can be designed to target specific genomic sequences with greater precision, since AlphaGenome can predict the downstream effects of edits in regulatory regions that were previously opaque to researchers.

Nucleotide Transformers and Genomic Foundation Models

In 2023, Hugo Dalla-Torre and colleagues at InstaDeep and Nvidia published The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics in Nature Methods. They trained large language models on 3,202 human genomes and 850 additional genomes from diverse species, treating DNA as a sequence of nucleotide tokens.

The resulting models could predict gene expression levels, chromatin accessibility, and the functional consequences of genetic variants — tasks that previously required separate, specialised models for each. A single foundation model, trained on the grammar of DNA, generalised across the biology of the genome.

The implications extend to cancer genetics: genomic foundation models can be fine-tuned to identify somatic mutations, predict which variants affect splicing, and model the effect of specific DNA changes on protein function — all from raw sequence, at a scale that manual analysis could never approach. What was once a months-long research programme can now be run in minutes.

AI Drug Discovery

Insilico Medicine used generative transformer models to design a novel drug candidate from scratch — identifying a target, generating a molecule to hit it, and moving it to Phase 2 clinical trials in under 30 months, a timeline that would typically take a decade and cost hundreds of millions of dollars. This connects directly to how RNA biology is being leveraged in medicine: the same AI systems that understand language can be directed to understand the molecular language of drug-target binding, opening therapeutic avenues that were previously too expensive to explore.

What Scientists and Researchers Say

“The surprising thing is not that attention works. The surprising thing is that attention is enough. We threw away everything else — all the recurrence, all the convolutions — and what remained was sufficient to surpass every previous approach.”
— Łukasz Kaiser, Google Brain, co-author, Attention Is All You Need (2017)

“We don’t know why large language models work as well as they do. That should be the central fact about the field right now. We are deploying systems we cannot fully explain.”
— Geoffrey Hinton, Nobel Laureate in Physics 2024, former VP and Engineering Fellow, Google

“The question is not whether these models understand language. The question is what ‘understanding’ means — and whether the answer we assumed was right.”
— Emily M. Bender, Professor of Linguistics, University of Washington; co-author, On the Dangers of Stochastic Parrots (FAccT 2021)

“We found that language models trained only to predict text develop internal representations of the world — representations of space, time, and entities — that nobody put there.”
— Jack Lindsey, Research Scientist, Anthropic; Biology of a Large Language Model (2025)

“Treating protein sequences as a language — and applying the same transformer architecture that reads text — turns out to be one of the most powerful ideas in computational biology of the past decade.”
— Alexander Rives, Meta AI Research; lead on the ESM protein language model series

“AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities. The implications for both offensive and defensive cybersecurity are profound.”
— Anthropic, Claude Mythos System Card, April 7, 2026

What LLMs Cannot Do — and the Honest Limits of the Science

The capabilities of large language models are real and rapidly expanding. But the honest scientific account requires a clear-eyed description of what these systems cannot do — because the gap between what they appear to do and what they are actually doing has generated more confusion than almost any other topic in contemporary science communication.

They Hallucinate

LLMs produce confident, fluent, factually incorrect statements. This is not a bug to be patched; it is a structural consequence of the architecture. The model generates plausible-sounding token sequences. It has no internal mechanism for verifying that a statement it generates is true. It does not consult an internal database of facts; it computes a probability distribution over possible next tokens based on patterns in training data. When those patterns mislead it, the output is wrong — and stated with the same fluency and confidence as a correct answer.

They Have No Persistent Memory

Without external tooling, each conversation starts from zero. A language model has no memory of previous conversations. It cannot learn from interactions after training has ended. Its knowledge is frozen at the training data cutoff. This is why Claude will tell you it has a knowledge cutoff, and why it may be unaware of events that occurred after that date.

They Cannot Do Reliable Arithmetic Natively

This surprises many users. LLMs are pattern matchers trained on text. They predict numerals that tend to follow the patterns of arithmetic — which means they often get simple sums right, because those patterns are strongly represented in training data. But they do not perform arithmetic in the way a calculator does: by applying precise algorithmic rules. Multi-step arithmetic without a code interpreter produces errors at a rate that makes native calculation unreliable. This is why every major LLM is given access to a code interpreter or calculator tool.

Nobody Fully Understands Why They Work

This is the most important limitation to acknowledge. The field of mechanistic interpretability — the attempt to understand, at the level of individual neurons, attention heads, and circuits, what is actually happening inside transformer models — is still in its early stages. Anthropic’s interpretability team has made significant progress: their 2022 paper A Mathematical Framework for Transformer Circuits identified specific, interpretable algorithms implemented by attention heads, such as “induction heads” that enable in-context learning. Their 2025 paper Biology of a Large Language Model found that models develop internal representations of space, time, emotion, and intention — concepts that emerge from training without being explicitly defined.

But the gap between understanding individual circuits and understanding the behaviour of a 1-trillion-parameter model is vast. The Stochastic Parrots paper (Bender et al., FAccT 2021) argued that LLMs produce the form of language without grounded meaning — sophisticated pattern matching that mimics understanding without possessing it.

Whether this critique is correct, partially correct, or substantially wrong remains one of the most actively contested questions in AI science. The honest answer, as of June 2026, is: we do not fully know. And this uncertainty, in a world where Mythos-class models are being deployed against critical infrastructure, matters enormously.

These questions — about consciousness, understanding, and what it means for a system to “know” something — connect directly to the deepest unsolved problems in neuroscience. The same questions that make it difficult to define consciousness in biological organisms make it equally difficult to determine whether artificial ones possess anything analogous. For a deeper look at that problem, see our article on the hard problem of consciousness and what modern neuroscience says about it.

Frequently Asked Questions

What is a large language model, in plain English?

A large language model is a neural network — a system of billions of mathematical parameters — trained on massive amounts of text to predict what word or token comes next in a sequence. Through this training process, it develops the ability to generate coherent, contextually appropriate text across a wide range of tasks: answering questions, writing code, translating languages, summarising documents, and much more.

The “large” in LLM refers to the number of parameters (typically hundreds of billions to over a trillion) and the scale of the training data (typically trillions of words).

What is a transformer and why does it matter?

The transformer is the neural network architecture described in the 2017 Google Brain paper Attention Is All You Need (Vaswani et al.). It replaced earlier recurrent neural networks by processing entire sequences simultaneously using a mechanism called self-attention, which allows every word in a sequence to directly attend to every other word at once.

The transformer is faster to train, handles long-range dependencies better, and scales more effectively than its predecessors. Every major AI language model in use today — ChatGPT, Claude, Gemini, Llama — is a transformer or a close descendant of one.

Is ChatGPT actually understanding language, or just predicting words?

This is one of the most contested questions in AI science. The honest answer is: we do not know, and the answer depends on how you define “understanding.” LLMs process language using internal representations that demonstrably capture semantic relationships — meaning, not just surface pattern.

Anthropic’s 2025 Biology of a Large Language Model paper found that Claude develops internal representations of space, time, emotion, and intent. But critics like Emily Bender (University of Washington) argue that this is sophisticated pattern matching that produces the form of understanding without the grounded meaning a speaker of a language possesses. Both positions have serious scientific support. The question remains open.

Why do language models make things up (hallucinate)?

Because they are trained to produce plausible next tokens, not to verify truth. The architecture generates what tends to follow from what came before, based on patterns in training data. It has no internal fact-checker, no database of verified claims, no mechanism for distinguishing high-confidence knowledge from low-confidence inference.

When the training data patterns mislead it — for example, because a plausible-sounding but incorrect claim appears frequently in training text — the model confidently produces the incorrect claim. This is a structural property of the architecture, not a solvable bug, though tool use (code interpreters, web search, retrieval-augmented generation) significantly reduces its impact in practice.

What is Claude Mythos and why was it not released to the public?

Claude Mythos Preview, announced April 7, 2026, is Anthropic’s most advanced AI model, positioned a tier above its public Opus models. It was withheld from public release because internal testing found it could identify and exploit zero-day software vulnerabilities at a level matching the best human security researchers.

Anthropic instead deployed it through Project Glasswing, a controlled consortium of approximately 50 cybersecurity and critical-infrastructure organisations. On June 9, 2026, Claude Fable 5 — a version of the same model with safety classifiers for high-risk domains — became publicly available, while Claude Mythos 5 (classifiers removed) remained restricted to Project Glasswing partners.

How does AI actually read DNA — what does that mean technically?

Genomic language models treat DNA sequences exactly as text language models treat sentences: each nucleotide or codon is a token, and the transformer architecture learns from billions of base pairs of genome data what sequence patterns predict about downstream biology — gene expression, protein structure, regulatory activity, and the functional consequences of mutations.

Meta AI’s ESMFold trained on 250 million protein sequences and predicted protein 3D structure with near-experimental accuracy. Google DeepMind’s AlphaGenome learns the regulatory grammar of non-coding DNA using the same attention mechanism that ChatGPT uses to understand context in a conversation. The grammar of the genome and the grammar of human language are both sequence-based, context-dependent systems — and the transformer handles both.

What is the difference between Claude Fable 5 and Claude Mythos 5?

They share the same underlying model and the same performance specifications. The difference is a layer of safety classifiers. Claude Fable 5 (publicly available) includes always-on safety classifiers that automatically fall back to Claude Opus 4.8 for requests in high-risk domains: cybersecurity, biology, chemistry, and distillation.

Claude Mythos 5 (restricted to Project Glasswing partners) removes those classifiers, making its full capabilities available to vetted cybersecurity organisations. Both have a 1-million-token context window and are priced at $10 per million input tokens and $50 per million output tokens.

What are model parameters and why does their number matter?

Parameters are the individual numerical weights in a neural network, adjusted during training to minimise prediction error. Every attention weight, every embedding value, every feed-forward network weight is a parameter. More parameters give the model more capacity to represent complex relationships in its training data. GPT-3 has 175 billion parameters. GPT-4 is estimated at over one trillion.

However, the relationship between parameter count and capability is not linear — training data quality, architecture choices, and alignment procedures matter as much or more. Recent work on “small language models” has shown that models of 7–70 billion parameters, trained with higher-quality data and better techniques, can match the performance of much larger models on specific tasks.

Sources

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. arxiv.org/abs/1706.03762
Devlin, J., Chang, M.W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019. arxiv.org/abs/1810.04805
Brown, T. et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33. arxiv.org/abs/2005.14165
Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35. arxiv.org/abs/2203.02155
Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic. arxiv.org/abs/2212.08073
Wei, J. et al. (2022). Emergent Abilities of Large Language Models. Transactions on Machine Learning Research. arxiv.org/abs/2206.07682
Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? Advances in Neural Information Processing Systems, 36. arxiv.org/abs/2304.15004
Jumper, J. et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589. doi.org/10.1038/s41586-021-03819-2
Lin, Z. et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123–1130. doi.org/10.1126/science.ade2574
Dalla-Torre, H. et al. (2023). The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics. Nature Methods. doi.org/10.1101/2023.01.11.523679
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015. arxiv.org/abs/1409.0473
Bender, E.M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of FAccT 2021. doi.org/10.1145/3442188.3445922
McCulloch, W.S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
Hinton, G.E., Rumelhart, D.E., & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR 2013. arxiv.org/abs/1301.3781
Elhage, N. et al. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread. transformer-circuits.pub
Lindsey, J. et al. (2025). Biology of a Large Language Model. Anthropic Research. transformer-circuits.pub
Gemini Team, Google. (2023). Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805. arxiv.org/abs/2312.11805
OpenAI. (2023). GPT-4 Technical Report. arxiv.org/abs/2303.08774
Anthropic. (2026, April 7). Claude Mythos Preview System Card. anthropic.com (System Card PDF)
Anthropic. (2026, June 9). Introducing Claude Fable 5 and Claude Mythos 5. Claude API Documentation. platform.claude.com
CrowdStrike. (2026). Anthropic Claude Mythos Preview — CrowdStrike Founding Member. crowdstrike.com
Centre for Emerging Technology and Security. (2026, May). Claude Mythos: What Does Anthropic’s New Model Mean for the Future of Cybersecurity? The Alan Turing Institute. cetas.turing.ac.uk
Kingma, D.P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015. arxiv.org/abs/1412.6980

Share on Facebook

Post on X

Save

Discover more from Web News For Us

Subscribe to get the latest posts sent to your email.

How Large Language Models Actually Work: The Science Behind ChatGPT, Claude and Gemini

Before the Transformer: The Long Road from Neurons to Neural Networks

The 2017 Paper That Changed Everything: Attention Is All You Need

What “Attention” Actually Means — and Why It Is the Key to Everything

Tokens, Embeddings and the Strange Way Language Models Read