How AI Actually Works: What Happens Between Your Prompt and the Answer
No math required. An illustrated walk through how a large language model turns your words into tokens, vectors, and attention — and predicts its answer one word at a time.
Ask a chatbot a question and it answers in fluent, confident sentences — almost as if it understood you. So what is really happening in the half-second between your prompt and its reply? There is no little person inside, and no database of pre-written answers. Underneath the magic is one surprisingly simple idea, repeated at enormous scale. This article walks through that idea and the machinery that makes it work, using the kind of model behind ChatGPT, Claude, and Gemini — a large language model, or LLM.
We will follow your words all the way through: how they are chopped into pieces, turned into numbers, read in context, pushed through a giant network, and finally turned back into the next word of an answer. No math required.
The one idea behind it all: predict the next word
At its core, an LLM does one thing: given some text, it estimates how likely each possible next word is. That is it. Given "The sky is ___", the model assigns a high probability to blue, a lower one to clear, and a tiny one to banana. It is autocomplete, but trained on a huge slice of the internet, books, and code, so its sense of "what comes next" captures grammar, facts, tone, and reasoning patterns. Everything else here is just the apparatus that makes that one prediction astonishingly good.
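To make "predict the next word" concrete, here is a deliberately tiny stand-in: a model that only counts which word follows which in a few made-up sentences. It is not how an LLM works inside (the corpus and the function names are invented for illustration), but the output has the same shape: a probability for each candidate next word.

```python
from collections import Counter, defaultdict

# A made-up "training set"; a real model reads trillions of words.
corpus = "the sky is blue . the sky is clear . the sky is blue . the sea is blue ."

def train_bigram(text):
    """Count how often each word follows each other word."""
    follow = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        follow[prev][nxt] += 1
    return follow

def next_word_probs(follow, word):
    """Turn the raw counts for one word into probabilities."""
    counts = follow[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

model = train_bigram(corpus)
probs = next_word_probs(model, "is")
print(probs)  # {'blue': 0.75, 'clear': 0.25}
```

Because this toy only looks at the single previous word, it never sees "sky" when it predicts what follows "is". The rest of the machinery exists largely to let the prediction depend on everything that came before.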
Step 1: your words become tokens
Computers do not work with letters; they work with numbers. So the first thing that happens is tokenization: your text is split into chunks called tokens — often whole words, sometimes pieces of words — and each token is swapped for a number (an ID) from a fixed vocabulary. "How are you?" might become four tokens with four IDs. From this point on, the model is only ever shuffling numbers around.
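A rough sketch of that step, assuming a tiny hand-written vocabulary and a simple "take the longest piece that matches" rule. Both are illustrative assumptions: real tokenizers learn vocabularies of tens of thousands of pieces from data, and their splitting rules differ.

```python
# Hypothetical toy vocabulary: token text -> token ID.
TOY_VOCAB = {"How": 0, " are": 1, " you": 2, "?": 3, " token": 4, "ization": 5}

def tokenize(text, vocab):
    """Greedy longest-match split of text into token IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids

print(tokenize("How are you?", TOY_VOCAB))       # [0, 1, 2, 3]
print(tokenize("How tokenization?", TOY_VOCAB))  # [0, 4, 5, 3]
```

The second call shows a word being split into two pieces, which is why token counts and word counts rarely match exactly.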
Step 2: tokens become vectors of meaning
A token ID by itself is just a label. To reason about meaning, the model converts each token into an embedding: a long list of numbers (a vector) that acts like coordinates in a vast "meaning space." The trick is that words used in similar ways end up near each other: cat, dog, and puppy cluster together, while king and queen sit in their own neighborhood. Direction matters too, which is why these vectors can capture analogies. In the classic example, the step from man to woman points in roughly the same direction as the step from king to queen. This is how raw symbols start to carry meaning the model can compute with.
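Here is one way to picture "near each other" in code. The three-number vectors below are hand-picked for illustration (real embeddings are learned and have hundreds or thousands of dimensions), and cosine similarity is one common closeness measure, used here as an assumption.

```python
import math

# Hypothetical, hand-picked 3-number embeddings.
EMBEDDINGS = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.1],
    "puppy": [0.7, 0.9, 0.2],
    "king":  [0.1, 0.2, 0.9],
    "queen": [0.2, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Close to 1.0 when two vectors point the same way, near 0.0 when unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word):
    """Return the other word whose vector is most similar to this one."""
    others = (w for w in EMBEDDINGS if w != word)
    return max(others, key=lambda w: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[w]))

print(nearest("cat"))   # dog
print(nearest("king"))  # queen
```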
Step 3: attention reads the context
Words only mean something in context: "bank" by a river is not "bank" with your money. The breakthrough that powers modern AI is a mechanism called attention, the heart of the Transformer architecture introduced in 2017. For every token, attention lets the model look at the other tokens in the text (in chatbot-style models, the ones that came before it) and decide which ones matter for interpreting it. In "The cat sat because it was tired," attention is how the model figures out that "it" refers to the cat. Every token's representation is updated by mixing in the tokens it pays attention to.
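A minimal sketch of one attention step for the token "it", assuming the usual recipe: compare a "query" vector against a "key" vector for every token, turn the scores into weights that sum to 1, then blend the tokens' "value" vectors using those weights. All the numbers are hand-picked so that "cat" wins. A real model learns separate query, key, and value vectors, whereas this toy reuses the keys as values to stay short.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Blend the value vectors, weighted by how much attention each token gets.
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, mixed

tokens = ["The", "cat", "sat", "because", "it"]
# Hypothetical 2-number key vectors, one per token.
keys = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.5], [0.0, 0.1], [0.5, 0.5]]
values = keys            # simplification: reuse keys as values
query_it = [2.0, 1.0]    # hand-picked so "it" looks for something cat-like

weights, new_it = attend(query_it, keys, values)
most_attended = tokens[weights.index(max(weights))]
print(most_attended)  # cat
```

The vector `new_it` is the updated representation of "it": mostly "cat", with a little of everything else mixed in.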
Inside the model: layers that refine a guess
A single round of attention is not enough. The model stacks dozens to hundreds of layers that all share one design, each with its own weights and each made of an attention step and a small neural network (a "feed-forward" step). Information flows up through the stack, and at every layer the representation of each token gets a little richer: from raw words, to phrases, to grammar, to gist. The numbers that do all this adjusting are the model's parameters, or weights. There are billions of them, and they are what the model "learned." After the final layer, the model outputs a probability for every possible next token in its vocabulary.
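The shape of that stack can be sketched in a few lines. Everything here is a toy assumption: the sizes are tiny, the weights are random rather than learned, and the mixing step is a crude average over earlier tokens standing in for real attention. The point is only the flow: mix, feed forward, repeat, then produce one probability per vocabulary entry.

```python
import math
import random

random.seed(0)
DIM, VOCAB_SIZE, NUM_LAYERS = 4, 6, 3  # toy sizes, purely illustrative

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mix_step(vectors):
    """Stand-in for attention: each token averages itself with the tokens before it."""
    mixed = []
    for i in range(len(vectors)):
        seen = vectors[: i + 1]
        mixed.append([sum(col) / len(seen) for col in zip(*seen)])
    return mixed

def feed_forward(vec, weights):
    """Small per-token network: one weight matrix, then zero out negatives."""
    return [max(0.0, x) for x in matvec(weights, vec)]

# Same design in every layer, but each layer has its own weights.
layers = [rand_matrix(DIM, DIM) for _ in range(NUM_LAYERS)]
to_vocab = rand_matrix(VOCAB_SIZE, DIM)  # final step: vector -> one score per token

def forward(vectors):
    for weights in layers:
        vectors = mix_step(vectors)
        vectors = [feed_forward(v, weights) for v in vectors]
    # The last position's vector decides the next token.
    return softmax(matvec(to_vocab, vectors[-1]))

token_vectors = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]
next_token_probs = forward(token_vectors)
num_parameters = NUM_LAYERS * DIM * DIM + VOCAB_SIZE * DIM
print(num_parameters)  # 72
```

This toy has 72 parameters. The models discussed here have billions, but the loop over layers looks much the same.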
Where the "knowledge" comes from: training
The model is not programmed with facts; it learns them. During pretraining, it reads trillions of words and plays one endless game: cover the next word, guess it, check the real answer, and nudge its billions of weights to be a little less wrong. Repeat at vast scale and the weights gradually encode grammar, facts, and patterns of reasoning. A second stage, fine-tuning — including learning from human feedback (RLHF) — teaches the raw predictor to be a helpful, honest, harmless assistant that follows instructions instead of just continuing your text. That is the difference between a model that completes "Why is the sky" with another question and one that actually answers it.
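The guess, check, nudge game can be shown with a model small enough to read in full. This sketch assumes a standard recipe (gradient descent on a cross-entropy loss); the article does not name the exact method, and the three-word vocabulary and the observed data are made up. The "model" is just one adjustable score per word for a single context, "The sky is ___".

```python
import math

VOCAB = ["blue", "clear", "banana"]
# Made-up training data: what actually followed "The sky is" each time.
observed_next_words = ["blue", "blue", "clear", "blue"]

scores = {w: 0.0 for w in VOCAB}  # the weights; they start out knowing nothing
LEARNING_RATE = 0.1

def predict(scores):
    """Turn the current scores into probabilities."""
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

epoch_losses = []
for epoch in range(200):
    total_loss = 0.0
    for actual in observed_next_words:
        probs = predict(scores)                 # guess
        total_loss += -math.log(probs[actual])  # check: how surprised were we?
        for w in VOCAB:                         # nudge every weight a little
            target = 1.0 if w == actual else 0.0
            scores[w] -= LEARNING_RATE * (probs[w] - target)
    epoch_losses.append(total_loss / len(observed_next_words))

final_probs = predict(scores)
print(max(final_probs, key=final_probs.get))  # blue
```

After training, blue ends up most likely, clear second, and banana close to zero, and the average surprise per guess (`epoch_losses`) is lower at the end than at the start. Nobody typed in "the sky is usually blue"; the weights drifted there.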
How it actually answers you
Here is the part that surprises people: the model does not write a whole answer at once. It generates one token at a time. It takes everything so far, scores every possible next token, picks one, appends it, and then feeds the longer text back in to predict the token after that. This loop, called autoregressive generation, repeats until the answer is complete. A setting called temperature controls how boldly it samples: a temperature near zero picks the most likely token almost every time, while a higher temperature allows more variety and surprise.
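The loop itself is short. In this sketch the "model" is a hypothetical hard-coded table of scores that looks only at the last token, whereas a real LLM looks at everything so far. Temperature is applied in the common way, by dividing the scores before turning them into probabilities, and a temperature of zero is treated as "always take the top choice."

```python
import math
import random

# Hypothetical scores: last token -> raw score for each possible next token.
TOY_SCORES = {
    "<start>": {"The": 3.0, "A": 1.0},
    "The": {"sky": 3.0, "sea": 1.5},
    "A": {"sky": 2.0, "sea": 2.0},
    "sky": {"is": 4.0},
    "sea": {"is": 4.0},
    "is": {"blue": 3.0, "clear": 2.0, "banana": -2.0},
    "blue": {"<end>": 5.0},
    "clear": {"<end>": 5.0},
    "banana": {"<end>": 5.0},
}

def sample_next(scores, temperature, rng):
    """Pick one next token; lower temperature means safer picks."""
    if temperature <= 0:
        return max(scores, key=scores.get)
    scaled = {t: s / temperature for t, s in scores.items()}
    m = max(scaled.values())
    weights = [math.exp(s - m) for s in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]

def generate(temperature, seed=0, max_tokens=10):
    """Autoregressive loop: predict, pick, append, repeat."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        nxt = sample_next(TOY_SCORES[tokens[-1]], temperature, rng)
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return " ".join(tokens[1:])

print(generate(temperature=0))  # The sky is blue
print(generate(temperature=1.5, seed=1))
```

At temperature 0 the output is the same every run. At higher temperatures, different seeds can wander toward "sea", "clear", or very occasionally "banana".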
What this means in practice
Seeing the machinery demystifies the behavior. The model is a brilliant predictor of plausible text, not a truth oracle — so when it is unsure it can still produce fluent, confident, and wrong answers, which we call hallucinations. It only "knows" what fits in its context window (the prompt plus the conversation so far) and the patterns frozen into its weights at training time; it has no live memory of you between sessions unless the app adds one. And because everything hinges on the text you provide, a clearer, more specific prompt genuinely produces a better answer.
None of this makes it less remarkable. That "just predict the next word," scaled to billions of parameters and trained on much of what humanity has written, can draft an email, debug code, or explain itself is one of the more startling results in modern computing. But it is engineering, not sorcery — and now you can see the gears turning.