Research · 6 min read · BitsMinds

How AI Actually Works: What Happens Between Your Prompt and the Answer

No math required. An illustrated walk through how a large language model turns your words into tokens, vectors, and attention — and predicts its answer one word at a time.

Ask a chatbot a question and it answers in fluent, confident sentences — almost as if it understood you. So what is really happening in the half-second between your prompt and its reply? There is no little person inside, and no database of pre-written answers. Underneath the magic is one surprisingly simple idea, repeated at enormous scale. This article walks through that idea and the machinery that makes it work, using the kind of model behind ChatGPT, Claude, and Gemini — a large language model, or LLM.

We will follow your words all the way through: how they are chopped into pieces, turned into numbers, read in context, pushed through a giant network, and finally turned back into the next word of an answer. No math required.

The one idea behind it all: predict the next word

At its core, an LLM does exactly one thing: given some text, it predicts what word is most likely to come next. That is it. "The sky is ___" and the model assigns a high probability to blue, a lower one to clear, a tiny one to banana. It is autocomplete — but trained on a huge slice of the internet, books, and code, so its sense of "what comes next" captures grammar, facts, tone, and reasoning patterns. Everything else here is just the apparatus that makes that one prediction astonishingly good.
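To make "predict the next word" concrete, here is a minimal sketch that learns next-word probabilities by counting word pairs in a tiny, made-up corpus. This is only an analogy: a real LLM uses a neural network that looks at all the preceding text, not a table of word-pair counts, and the corpus and function name here are invented for illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model trains on vastly more text.
corpus = "the sky is blue . the sky is clear . the sky is blue . the sea is blue .".split()

# Count which word follows each word.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def next_word_probs(word):
    """Return the observed probability of each word that followed `word`."""
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

probs = next_word_probs("is")
print(probs)                       # {'blue': 0.75, 'clear': 0.25}
print(max(probs, key=probs.get))   # blue
```

Even this crude counter shows the shape of the output: not one answer, but a probability for each candidate.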

Step 1: your words become tokens

Computers do not work with letters; they work with numbers. So the first thing that happens is tokenization: your text is split into chunks called tokens — often whole words, sometimes pieces of words — and each token is swapped for a number (an ID) from a fixed vocabulary. "How are you?" might become four tokens with four IDs. From this point on, the model is only ever shuffling numbers around.

Figure, Step 1 (Text becomes tokens): "How are you?" splits into How (4521), are (553), you (481), and ? (30).
Tokenization splits your text into tokens and maps each to a numeric ID. The model reads the IDs, not the letters. Diagram: BitsMinds
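The tokenization step can be sketched in a few lines. The vocabulary below reuses the four IDs from the diagram, which are illustrative rather than taken from any real model; the fallback ID and the splitting rule are assumptions made for this example. Real tokenizers learn their word pieces from data and often split rare words into several sub-word tokens.

```python
import re

# Toy vocabulary using the IDs from the article's diagram. Real vocabularies
# hold tens of thousands of entries or more, and real IDs differ per model.
VOCAB = {"How": 4521, "are": 553, "you": 481, "?": 30}
UNKNOWN_ID = 0  # hypothetical fallback for text missing from the toy vocabulary

def tokenize(text):
    """Split text into word and punctuation pieces and look up each ID."""
    pieces = re.findall(r"\w+|[^\w\s]", text)
    return [(piece, VOCAB.get(piece, UNKNOWN_ID)) for piece in pieces]

tokens = tokenize("How are you?")
ids = [token_id for _, token_id in tokens]
print(ids)  # [4521, 553, 481, 30]
```

From here on, only the list of IDs travels through the model.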

Step 2: tokens become vectors of meaning

A token ID by itself is just a label. To reason about meaning, the model converts each token into an embedding: a long list of numbers (a vector) that acts like coordinates in a vast "meaning space." The trick is that words used in similar ways end up near each other: cat, dog, and puppy cluster together, while king and queen sit in their own neighborhood. Direction matters too, which is why these vectors can capture analogies: in the classic example, the step from man to woman points in roughly the same direction as the step from king to queen. This is how raw symbols start to carry meaning the model can compute with.

Figure, Step 2 (Tokens become vectors of meaning): points labeled animals (cat, dog, puppy), royalty (king, queen, prince), and car. Distance and direction encode meaning.
Each token becomes a vector. Words with related meaning land close together, so geometry stands in for meaning. Diagram: BitsMinds
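"Close together" can be measured. One common measure is cosine similarity, which compares the directions of two vectors. The sketch below uses invented three-number vectors purely for illustration; real embeddings have hundreds or thousands of dimensions and are learned during training, not written by hand.

```python
import math

# Invented 3-number vectors, chosen so related words point the same way.
EMBEDDINGS = {
    "cat":   [0.90, 0.80, 0.10],
    "dog":   [0.85, 0.75, 0.20],
    "king":  [0.10, 0.20, 0.95],
    "queen": [0.15, 0.25, 0.90],
}

def cosine_similarity(a, b):
    """Return 1.0 for vectors pointing the same way, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"])
cat_king = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["king"])
print(f"cat vs dog:  {cat_dog:.2f}")
print(f"cat vs king: {cat_king:.2f}")
```

With these toy numbers, cat and dog score far higher than cat and king, which is the "neighborhood" idea expressed as arithmetic.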

Step 3: attention reads the context

Words only mean something in context — "bank" by a river is not "bank" with your money. The breakthrough that powers modern AI is a mechanism called attention, the heart of the Transformer architecture introduced in 2017. For every word, attention lets the model look at all the other words and decide which ones matter for interpreting it. In "The cat sat because it was tired," attention is how the model figures out that "it" refers to the cat. Every token's representation is updated by mixing in the tokens it pays attention to.

Figure, Step 3 (Attention reads the context): in "The cat sat because it was tired," the word "it" weighs every other word and attends most strongly to "cat."
Attention lets each word look at every other word and weigh which ones matter. Here, it resolves what "it" refers to. Diagram: BitsMinds
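For the curious, here is a minimal sketch of the attention calculation for a single word, in the scaled dot-product form used by Transformers: score every word against a "query," turn the scores into weights that sum to 1, then blend the words' vectors using those weights. All vectors below are hand-picked so that "it" lands on "cat"; in a real model the queries, keys, and values are produced by learned weights, and reusing the keys as values is a simplification for this example.

```python
import math

def softmax(xs):
    """Turn scores into positive weights that sum to 1."""
    highest = max(xs)
    exps = [math.exp(x - highest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, mixed

words = ["The", "cat", "sat", "because", "it", "was", "tired"]
# Hand-picked toy vectors, one per word (hypothetical, not learned).
keys = [[0.1, 0.0], [2.0, 1.5], [0.3, 0.2], [0.0, 0.1], [0.5, 0.5], [0.2, 0.1], [0.4, 0.6]]
values = keys              # toy simplification: reuse keys as values
query_for_it = [2.0, 1.0]  # what the word "it" is "looking for"

weights, new_it = attend(query_for_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word:>8}: {weight:.2f}")
```

The largest weight falls on "cat," so the updated vector for "it" is mostly a copy of the "cat" vector: the mixing described above, in miniature.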

Inside the model: layers that refine a guess

A single round of attention is not enough. The model stacks dozens of layers, sometimes more than a hundred, each with the same structure but its own weights: an attention step followed by a small neural network (a "feed-forward" step). Information flows up through the stack, and at every layer the representation of each token gets a little richer, moving roughly from raw words to phrases to grammar to gist. The numbers that do all this adjusting are the model's parameters, or weights: there are billions of them, and they are what the model "learned." After the final layer, the model outputs a probability for every possible next token in its vocabulary.

Figure, Inside the model (from prompt to prediction): tokens (How, are, you) become vectors, pass through Transformer layers (self-attention plus feed-forward, repeated 30–100×, where the weights live), and yield next-token probabilities: blue 0.62, clear 0.14, grey 0.07.
Tokens become vectors, flow through a deep stack of attention and feed-forward layers, and emerge as a probability for every possible next word. Diagram: BitsMinds
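The overall shape of that computation can be sketched with a toy stack. Everything here is an assumption made for illustration: the sizes are tiny, each "layer" is a single random weight matrix standing in for a full attention-plus-feed-forward block, and the weights are untrained, so the resulting probabilities mean nothing. The point is only the flow: a vector goes in, each layer adds a refinement, and the final scores become probabilities.

```python
import math
import random

def softmax(xs):
    highest = max(xs)
    exps = [math.exp(x - highest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def make_matrix(rows, cols, rng):
    """Random, untrained weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def apply_layer(weights, x):
    """Keep the input and add a transformed version of it (a residual update)."""
    update = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return [xi + ui for xi, ui in zip(x, update)]

rng = random.Random(0)
SIZE, NUM_LAYERS = 4, 6             # toy sizes; real models are far larger
VOCAB = ["blue", "clear", "grey"]   # toy vocabulary

layers = [make_matrix(SIZE, SIZE, rng) for _ in range(NUM_LAYERS)]
output_weights = make_matrix(len(VOCAB), SIZE, rng)

x = [0.2, -0.1, 0.4, 0.3]           # made-up vector for the last token
for layer in layers:
    x = apply_layer(layer, x)       # each layer refines the representation

scores = [sum(w * xi for w, xi in zip(row, x)) for row in output_weights]
probs = dict(zip(VOCAB, softmax(scores)))
num_parameters = NUM_LAYERS * SIZE * SIZE + len(VOCAB) * SIZE
print(probs)
print("parameters:", num_parameters)  # parameters: 108
```

This toy has 108 adjustable numbers. A production model follows the same pattern with billions.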

Where the "knowledge" comes from: training

The model is not programmed with facts; it learns them. During pretraining, it reads trillions of words and plays one endless game: cover the next word, guess it, check the real answer, and nudge its billions of weights to be a little less wrong. Repeat at vast scale and the weights gradually encode grammar, facts, and patterns of reasoning. A second stage, fine-tuning — including learning from human feedback (RLHF) — teaches the raw predictor to be a helpful, honest, harmless assistant that follows instructions instead of just continuing your text. That is the difference between a model that completes "Why is the sky" with another question and one that actually answers it.
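The "guess, check, nudge" game can be shown with a toy that has just three adjustable numbers instead of billions. This is a minimal sketch under stated assumptions: the three-word vocabulary, the learning rate, and the single training example are all invented, and the loss used is cross-entropy, the standard measure of how surprised the model was by the right answer.

```python
import math

VOCAB = ["blue", "clear", "banana"]

def softmax(xs):
    highest = max(xs)
    exps = [math.exp(x - highest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# One adjustable score per candidate word: a stand-in for billions of weights.
scores = [0.0, 0.0, 0.0]
LEARNING_RATE = 0.5            # hypothetical step size
target = VOCAB.index("blue")   # the real next word, revealed after the guess

history = []
for step in range(20):
    probs = softmax(scores)                # guess
    loss = -math.log(probs[target])        # check: large when "blue" got little probability
    history.append(loss)
    for i in range(len(scores)):           # nudge each score to be a little less wrong
        grad = probs[i] - (1.0 if i == target else 0.0)
        scores[i] -= LEARNING_RATE * grad

final_probs = softmax(scores)
print(f"loss: {history[0]:.2f} -> {history[-1]:.2f}")
print(f"P(blue) after training: {final_probs[target]:.2f}")
```

The model starts with no preference (each word at one in three) and, nudge by nudge, shifts probability toward the word that actually appeared.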

How it actually answers you

Here is the part that surprises people: the model does not write a whole answer at once. It generates one token at a time. It takes everything so far, scores every possible next token, picks one, appends it, and then feeds the longer text back in to predict the token after that. This loop, called autoregressive generation, repeats until the answer is complete. A setting called temperature controls how boldly it samples: a very low temperature picks the single most likely token almost every time, while a higher temperature allows more variety and surprise.

Figure (How the answer is built, one token at a time): the text so far ("The sky is ___") goes into the model, which scores every word (blue 0.62, clear 0.14, grey 0.07); the pick is appended and the loop runs again until the reply is finished. "Temperature" controls how adventurous the pick is.
Generation is a loop: predict the next token, append it, and run again until the reply is done. Diagram: BitsMinds
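The loop itself is short enough to write out. In this sketch the "model" is a hand-written lookup table (an assumption for illustration) that only looks at the last word, whereas a real LLM considers all the text so far. The "<end>" marker and the function names are likewise hypothetical. Temperature divides the scores before sampling, so low values sharpen the preference for the top word and high values flatten it.

```python
import math
import random

# Hypothetical stand-in for a trained model: scores for what may follow each word.
TOY_MODEL = {
    "The": {"sky": 3.0, "sea": 1.0},
    "sky": {"is": 4.0},
    "sea": {"is": 4.0},
    "is": {"blue": 3.0, "clear": 1.5, "grey": 0.8},
    "blue": {"<end>": 5.0},
    "clear": {"<end>": 5.0},
    "grey": {"<end>": 5.0},
}

def sample_next(scores, temperature, rng):
    """Pick the next word; temperature 0 means always take the top score."""
    if temperature == 0:
        return max(scores, key=scores.get)
    words = list(scores)
    scaled = [scores[w] / temperature for w in words]
    highest = max(scaled)
    weights = [math.exp(s - highest) for s in scaled]
    return rng.choices(words, weights=weights, k=1)[0]

def generate(prompt, temperature=0.0, seed=0, max_tokens=10):
    rng = random.Random(seed)
    tokens = prompt.split()
    for _ in range(max_tokens):
        scores = TOY_MODEL[tokens[-1]]        # score the candidates
        nxt = sample_next(scores, temperature, rng)
        if nxt == "<end>":                    # the model signals it is done
            break
        tokens.append(nxt)                    # append and run again
    return " ".join(tokens)

greedy = generate("The", temperature=0.0)
print(greedy)                                  # The sky is blue
print(generate("The", temperature=1.5, seed=1))
```

At temperature 0 the output is the same every time. Raise the temperature and different seeds start producing "clear," "grey," or "sea" now and then.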

What this means in practice

Seeing the machinery demystifies the behavior. The model is a brilliant predictor of plausible text, not a truth oracle — so when it is unsure it can still produce fluent, confident, and wrong answers, which we call hallucinations. It only "knows" what fits in its context window (the prompt plus the conversation so far) and the patterns frozen into its weights at training time; it has no live memory of you between sessions unless the app adds one. And because everything hinges on the text you provide, a clearer, more specific prompt genuinely produces a better answer.
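The context window limit can be pictured as a budget. The sketch below shows one simple strategy an app might use, keeping the most recent messages that fit and dropping older ones. This is an assumption for illustration, not a description of how any particular product works, and counting words stands in for counting real tokens.

```python
def fit_to_context(messages, max_tokens):
    """Keep the newest messages whose combined size fits the budget."""
    kept, used = [], 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = len(message.split())      # crude stand-in for a token count
        if used + cost > max_tokens:
            break                        # everything older is forgotten
        kept.append(message)
        used += cost
    return list(reversed(kept))          # restore chronological order

conversation = ["a b c", "d e", "f g h i"]
visible = fit_to_context(conversation, max_tokens=6)
print(visible)  # ['d e', 'f g h i']
```

Whatever falls outside the budget simply is not part of what the model reads on that turn.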

None of this makes it less remarkable. It is one of the more startling results in modern computing that "just predict the next word," scaled to billions of parameters and trained on much of what humanity has written, can draft an email, debug code, or explain itself. But it is engineering, not sorcery, and now you can see the gears turning.
