
Large Language Models

Massive neural networks based on the Transformer architecture, trained on enormous text corpora to understand and generate natural language with emergent capabilities like reasoning, translation, and code generation.

evergreen · #llm #transformer #gpt #claude #foundation-models #deep-learning #nlp

What it is

A large language model (LLM) is a neural network with billions of parameters, trained on massive amounts of text to predict the next word in a sequence. This seemingly simple task — predicting what word comes next — produces surprising emergent capabilities when scaled sufficiently.
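
The prediction loop itself is simple. Here is a minimal sketch in Python using the Hugging Face Transformers library, with the small gpt2 checkpoint standing in for a much larger model (the checkpoint, prompt, and greedy decoding are illustrative choices, not a reference implementation):

```python
# Greedy next-token prediction loop: the task an LLM is trained on.
# Sketch only; "gpt2" is a small stand-in for a much larger model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(10):                           # generate 10 tokens, one at a time
    with torch.no_grad():
        logits = model(input_ids=ids).logits  # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()          # most likely next token (greedy)
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```

Everything an LLM produces comes out of this loop: one token at a time, each conditioned on all the tokens that came before it.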

Modern LLMs don't just complete text: they follow complex instructions, reason step by step, write code, translate between languages, and maintain coherent long-context conversations.

How they work

The Transformer architecture

Introduced in the paper "Attention Is All You Need" (2017), the Transformer architecture replaced recurrent networks with an attention mechanism that allows the model to consider all words in a sequence simultaneously, capturing long-range relationships.
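
As a rough sketch, the core operation is scaled dot-product attention; a single head without masking can be written in a few lines (the NumPy formulation and shapes are illustrative):

```python
# Scaled dot-product attention for one head, no masking:
#   Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how much each query matches each key
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mix of value vectors

# 4 tokens with an 8-dimensional head: every position attends to all 4 positions at once.
Q = K = V = np.random.randn(4, 8)
print(attention(Q, K, V).shape)  # (4, 8)
```

Because every query attends to every key, the model can relate a token to any other token in the window, regardless of distance.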

Key components:

  • Tokenization: text is split into tokens (subwords) that the model processes numerically; this step and the embedding step are sketched in code after this list
  • Embeddings: each token is converted into a dense vector capturing its semantic meaning
  • Attention layers: multiple layers that learn which parts of context are relevant for each prediction
  • Context window: the maximum number of tokens the model can process in a single inference
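
The first two components are easy to see directly. A minimal sketch with the Hugging Face Transformers library, again using gpt2 purely as an example tokenizer and embedding table:

```python
# Tokenization and embedding lookup, sketched with the "gpt2" tokenizer and model
# (chosen only for illustration; any causal language model works the same way).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Large language models predict the next token."
token_ids = tokenizer(text).input_ids
print(tokenizer.convert_ids_to_tokens(token_ids))   # the subword pieces

# Each token id is looked up in the embedding table before any attention layer runs.
embeddings = model.get_input_embeddings().weight[token_ids]
print(embeddings.shape)   # (number_of_tokens, hidden_size); hidden_size is 768 for gpt2
```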

Two-phase training

  1. Pre-training: the model learns general language patterns by processing trillions of text tokens. This phase is extremely compute-intensive; the objective itself is sketched in code after this list
  2. Fine-tuning: the model is specialized to follow instructions, align with human preferences (RLHF), or adapt to specific domains
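
The pre-training objective is compact: shift the token sequence by one position and minimize cross-entropy between the model's predictions and the actual next tokens. A toy sketch of a single training step, with a deliberately tiny stand-in model and random token ids in place of real text:

```python
# One step of the next-token pre-training objective, with a tiny stand-in model;
# real pre-training repeats this over trillions of tokens on large clusters.
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, hidden),  # token ids -> vectors
                      nn.Linear(hidden, vocab_size))     # stand-in for the Transformer stack
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

batch = torch.randint(0, vocab_size, (8, 128))   # 8 sequences of 128 random token ids
inputs, targets = batch[:, :-1], batch[:, 1:]    # shift by one: the target is the next token

logits = model(inputs)                           # (8, 127, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
loss.backward()
optimizer.step()
print(float(loss))                               # roughly log(vocab_size) before training
```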

Emergent capabilities

As models scale, capabilities emerge that weren't explicitly programmed:

  • Chain-of-Thought reasoning: ability to decompose complex problems into intermediate steps
  • In-Context Learning: learning from examples provided in the prompt without updating weights (see the prompt sketch after this list)
  • Tool use: invoking APIs, executing code, or querying databases when configured to do so
  • Instruction following: interpreting and executing complex natural language instructions
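
In-context learning is the easiest of these to see directly. A hedged sketch of a few-shot prompt; the task, reviews, and labels are invented for illustration:

```python
# Few-shot prompt for in-context learning: the model infers the pattern from the
# examples in the prompt itself, with no weight updates. Reviews and labels are invented.
prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day."          Sentiment: positive
Review: "It stopped working after a week."    Sentiment: negative
Review: "Setup took five minutes, flawless."  Sentiment:"""

# Sent to a capable base or instruction-tuned model, the expected completion is "positive".
print(prompt)
```

Chain-of-Thought prompting builds on the same idea: including worked, step-by-step examples in the prompt pushes the model to reason through intermediate steps before answering.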

Relevant models

| Model | Organization | Characteristics |
|---|---|---|
| GPT-4o | OpenAI | Multimodal, advanced reasoning |
| Claude | Anthropic | Long context (200K tokens), safety |
| Gemini | Google | Native multimodal, search integration |
| Llama | Meta | Open-source, active community |
| Mistral | Mistral AI | Efficient, competitive open models |
| Command R | Cohere | Optimized for RAG and enterprise |

Model selection

Choosing the right model depends on the use case, not on which model tops the benchmarks:

| Criterion | Large model (GPT-4o, Claude Sonnet) | Small model (Llama 8B, Mistral 7B) |
|---|---|---|
| Complex reasoning | Best performance | Sufficient for simple tasks |
| Latency | 1-5 s per response | Under 500 ms, ideal for real-time |
| Cost per million tokens | $2-15 (input) | $0.10-0.50, or free if self-hosted |
| Data privacy | Data is sent to an external API | Self-hosted, data stays internal |
| Fine-tuning | Expensive, limited | Accessible with LoRA/QLoRA |

The current trend is using large models for complex tasks and specialized small models for high-volume repetitive tasks — a pattern that reduces costs without sacrificing quality where it matters.
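
A minimal sketch of that routing pattern; the heuristic, the model names, and the call_model helper are all placeholders rather than a prescribed implementation:

```python
# Route each request by rough task complexity: a cheap small model for routine
# queries, the large model only when the prompt looks hard. All names are placeholders.
COMPLEX_HINTS = ("explain why", "step by step", "write code", "compare", "prove")

def pick_model(prompt: str) -> str:
    looks_complex = len(prompt) > 500 or any(h in prompt.lower() for h in COMPLEX_HINTS)
    return "large-model" if looks_complex else "small-model"   # placeholder model ids

def call_model(model: str, prompt: str) -> str:
    # Placeholder: swap in the real SDK/API call for whichever provider you use.
    return f"[{model}] response to: {prompt[:40]}..."

def answer(prompt: str) -> str:
    return call_model(model=pick_model(prompt), prompt=prompt)

print(answer("What is 2 + 2?"))                         # routed to the small model
print(answer("Write code to merge two sorted lists."))  # routed to the large model
```

In practice the router can be a keyword heuristic like this one, a trained classifier, or a small LLM that decides when to escalate to the larger model.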

Limitations

  • Hallucinations: models generate plausible but incorrect information with high confidence
  • Static knowledge: their knowledge has a training cutoff date
  • Inference cost: larger models require specialized hardware
  • Finite context window: although growing, still a limitation for very long documents (a token-count check is sketched after this list)
  • Bias: models reflect biases present in their training data
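
Because the context window is a hard limit, a common safeguard is counting tokens before sending a request. A small sketch using the tiktoken library; the encoding name and the 128,000-token limit are illustrative assumptions that vary by model:

```python
# Check whether a document fits in a model's context window before sending it.
# Uses tiktoken; the encoding name and the 128,000-token limit are illustrative.
import tiktoken

CONTEXT_LIMIT = 128_000                      # varies by model; check your provider's docs
enc = tiktoken.get_encoding("cl100k_base")   # encoding used by several recent OpenAI models

def fits_in_context(document: str, reserved_for_output: int = 1_000) -> bool:
    n_tokens = len(enc.encode(document))
    return n_tokens + reserved_for_output <= CONTEXT_LIMIT

print(fits_in_context("hello " * 200_000))   # a ~200k-token string does not fit
```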

Why it matters

LLMs are the foundational technology behind the current artificial intelligence revolution. They're the engine powering AI agents, prompt engineering techniques, and semantic search systems. Understanding how they work — and their limitations — is essential for building effective AI applications.

References

  • Attention Is All You Need — Vaswani et al., 2017. The paper that introduced the Transformer architecture.
  • Scaling Laws for Neural Language Models — Kaplan et al., 2020. Scaling laws that predict LLM performance.
  • Sparks of Artificial General Intelligence — Microsoft Research, 2023. Analysis of emergent capabilities in GPT-4.
  • LLaMA: Open and Efficient Foundation Language Models — Touvron et al. (Meta), 2023. The paper that democratized open-source LLMs.
  • LLM Tutorial — Hugging Face, 2024. Practical guide for using LLMs with the Transformers library.

Related content

  • Neural Networks

    Computational models inspired by brain structure that learn patterns from data, forming the foundation of modern artificial intelligence systems.

  • Artificial Intelligence

    Field of computer science dedicated to creating systems capable of performing tasks that normally require human intelligence, from reasoning and perception to language generation.

  • Prompt Engineering

    The discipline of designing effective instructions for language models, combining clarity, structure, and examples to obtain consistent, high-quality responses.

  • Fine-Tuning

    Process of specializing a pre-trained model for a specific task or domain through additional training with curated data, adapting its behavior without starting from scratch.

  • Tokenization

    Process of splitting text into discrete units (tokens) that language models can process numerically, fundamental to how LLMs understand and generate text.

  • Hallucination Mitigation

    Techniques to reduce the tendency of LLMs to generate false but plausible information, from RAG to factual verification and prompt design.

  • AWS Bedrock

    AWS serverless service providing access to foundation models from multiple providers (Anthropic, Meta, Mistral, Amazon) via unified API, without managing ML infrastructure.

  • Context Windows

    The maximum number of tokens an LLM can process in a single interaction, determining how much information it can consider simultaneously to generate responses.

  • Synthetic Data

    Algorithmically generated data that replicates the statistical properties of real data, used to train, evaluate, and test AI systems when real data is scarce, expensive, or sensitive.

  • Prompt Caching

    Technique that stores the internal computation of reused prompt prefixes across LLM calls, reducing costs by up to 90% and latency by up to 85% in applications with repetitive context.

  • Inference Optimization

    Techniques to reduce cost, latency, and resources needed to run language models in production, from quantization to distributed serving.

  • Function Calling

    LLM capability to generate structured calls to external functions based on natural language, enabling integration with APIs, databases, and real-world tools.

  • Embeddings

    Dense vector representations that capture the semantic meaning of text, images, or other data in a numerical space where proximity reflects conceptual similarity.

  • Chain-of-Thought

    Prompting technique that improves LLM reasoning by asking them to decompose complex problems into explicit intermediate steps before reaching a conclusion.
