Massive neural networks based on the Transformer architecture, trained on enormous text corpora to understand and generate natural language with emergent capabilities like reasoning, translation, and code generation.
A large language model (LLM) is a neural network with billions of parameters, trained on massive amounts of text to predict the next word in a sequence. This seemingly simple task — predicting what word comes next — produces surprising emergent capabilities when scaled sufficiently.
Modern LLMs don't just complete text: they follow complex instructions, reason step by step, write code, translate between languages, and maintain coherent long-context conversations.
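To make the next-word objective concrete, here is a minimal sketch using the Hugging Face `transformers` library and the small GPT-2 checkpoint to inspect the model's most likely continuations of a prompt; the model and the prompt are arbitrary choices for illustration, not a recommendation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model used only to illustrate the objective; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # scores for every vocabulary token, at every position
probs = torch.softmax(logits[0, -1], dim=-1)  # probability distribution over the *next* token

top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx)):>10}  {p.item():.3f}")
```

Everything an LLM does, from answering questions to writing code, is built by repeatedly sampling from this next-token distribution.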
Introduced in the paper "Attention Is All You Need" (2017), the Transformer architecture replaced recurrent networks with an attention mechanism that allows the model to consider all words in a sequence simultaneously, capturing long-range relationships.
Key components:

- Self-attention: every token computes how relevant every other token is to it, so context from anywhere in the sequence can influence each position.
- Multi-head attention: several attention operations run in parallel, each free to capture a different kind of relationship (syntax, coreference, topic).
- Positional encodings: inject word-order information, since attention by itself is order-agnostic.
- Feed-forward layers: per-position transformations applied after attention in every block.
- Residual connections and layer normalization: keep training stable as many blocks are stacked.
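The core operation behind these components is scaled dot-product attention. Below is a minimal NumPy sketch of that computation; the toy dimensions and random inputs are illustrative assumptions, not values from any real model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to every key; outputs are weighted sums of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over each row
    return weights @ V                                   # mix values by attention weight

# Toy example: 4 tokens, 8-dimensional projections (illustrative sizes only)
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

Because every token attends to every other token in one step, the model captures long-range dependencies without the sequential bottleneck of recurrent networks.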
As models scale, capabilities emerge that weren't explicitly programmed:

- In-context learning: solving new tasks from a handful of examples supplied in the prompt, with no retraining.
- Step-by-step reasoning: working through multi-step problems, especially when prompted to show intermediate steps.
- Instruction following: generalizing to instructions that never appeared in the training data.
- Translation and code generation: producing competent output across the natural and programming languages seen during pretraining.

Several model families dominate the current landscape:
| Model | Organization | Characteristics |
|---|---|---|
| GPT-4o | OpenAI | Multimodal, advanced reasoning |
| Claude | Anthropic | Long context (200K tokens), safety |
| Gemini | Google DeepMind | Native multimodal, search integration |
| Llama | Meta | Open-source, active community |
| Mistral | Mistral AI | Efficient, competitive open models |
| Command R | Cohere | Optimized for RAG and enterprise |
Choosing the right model depends on the use case, not on benchmark scores alone:
| Criterion | Large model (GPT-4o, Claude Sonnet) | Small model (Llama 8B, Mistral 7B) |
|---|---|---|
| Complex reasoning | Best performance | Sufficient for simple tasks |
| Latency | 1-5s per response | Under 500ms, ideal for real-time |
| Cost per million tokens | $2-15 input | $0.10-0.50 or free (self-hosted) |
| Data privacy | Data sent to an external API | Self-hostable, data stays in-house |
| Fine-tuning | Expensive, limited | Accessible with LoRA/QLoRA |
The current trend is using large models for complex tasks and specialized small models for high-volume repetitive tasks — a pattern that reduces costs without sacrificing quality where it matters.
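A common way to implement this split is a simple router that sends each request to a small or large model based on estimated task complexity. The sketch below is hypothetical: the model identifiers, the `classify_complexity` heuristic, and the `call_model` helper are assumptions for illustration, not part of any provider's API.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your provider or local stack exposes.
SMALL_MODEL = "llama-3-8b-instruct"
LARGE_MODEL = "gpt-4o"

@dataclass
class Request:
    prompt: str
    requires_reasoning: bool = False   # set by the caller or an upstream classifier

def classify_complexity(req: Request) -> str:
    """Crude heuristic: explicit reasoning flags or very long prompts go to the large model."""
    if req.requires_reasoning or len(req.prompt) > 2000:
        return "complex"
    return "simple"

def route(req: Request) -> str:
    """Pick the cheapest model expected to handle the request well."""
    model = LARGE_MODEL if classify_complexity(req) == "complex" else SMALL_MODEL
    return call_model(model, req.prompt)

def call_model(model: str, prompt: str) -> str:
    # Placeholder: wire this to your actual inference API or local runtime.
    raise NotImplementedError

# Example: route(Request("Summarize this ticket in one sentence."))  -> goes to SMALL_MODEL
```

In practice the complexity signal can come from the product surface itself (for example, a "deep analysis" toggle) rather than from prompt length.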
LLMs are the foundational technology behind the current artificial intelligence revolution. They're the engine powering AI agents, prompt engineering techniques, and semantic search systems. Understanding how they work — and their limitations — is essential for building effective AI applications.
Related concepts:

**Neural networks:** Computational models inspired by brain structure that learn patterns from data, forming the foundation of modern artificial intelligence systems.

**Artificial intelligence:** Field of computer science dedicated to creating systems capable of performing tasks that normally require human intelligence, from reasoning and perception to language generation.

**Prompt engineering:** The discipline of designing effective instructions for language models, combining clarity, structure, and examples to obtain consistent, high-quality responses.

**Fine-tuning:** Process of specializing a pre-trained model for a specific task or domain through additional training with curated data, adapting its behavior without starting from scratch.

**Tokenization:** Process of splitting text into discrete units (tokens) that language models can process numerically, fundamental to how LLMs understand and generate text.
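As a quick illustration of tokenization, the sketch below uses OpenAI's `tiktoken` library to encode a sentence into token IDs and decode each one back; the choice of encoding is an assumption for demonstration, and other tokenizers behave analogously.

```python
import tiktoken

# cl100k_base is one widely used encoding; chosen here purely for illustration.
enc = tiktoken.get_encoding("cl100k_base")

text = "Large language models predict the next token."
token_ids = enc.encode(text)

print(token_ids)                              # a list of integers, one per token
print([enc.decode([t]) for t in token_ids])   # the text fragment each token maps to
print(len(token_ids), "tokens for", len(text), "characters")
```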
**Hallucination mitigation:** Techniques to reduce the generation of false but plausible information, from RAG to factual verification and prompt design.

**Amazon Bedrock:** AWS serverless service providing access to foundation models from multiple providers (Anthropic, Meta, Mistral, Amazon) via a unified API, without managing ML infrastructure.

**Context window:** The maximum number of tokens an LLM can process in a single interaction, determining how much information it can consider simultaneously when generating responses.

**Synthetic data:** Algorithmically generated data that replicates the statistical properties of real data, used to train, evaluate, and test AI systems when real data is scarce, expensive, or sensitive.

**Prompt caching:** Technique that stores the internal computation of reused prompt prefixes across LLM calls, reducing costs by up to 90% and latency by up to 85% in applications with repetitive context.

**Inference optimization:** Techniques to reduce the cost, latency, and resources needed to run language models in production, from quantization to distributed serving.

**Function calling:** LLM capability to generate structured calls to external functions based on natural language, enabling integration with APIs, databases, and real-world tools.
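A provider-neutral sketch of the function-calling flow: the application declares a tool schema, the model (not shown here) returns a structured call instead of free-form text, and the application dispatches it. The schema shape and the `get_weather` tool are illustrative assumptions, not any specific vendor's format.

```python
import json

# Tool schema shown to the model alongside the user's message (shape is illustrative).
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> str:
    # Placeholder implementation; a real tool would call a weather API.
    return f"Sunny, 22°C in {city}"

# A structured call like this is what the model emits when it decides to use a tool.
model_tool_call = {"name": "get_weather", "arguments": json.dumps({"city": "Paris"})}

dispatch = {"get_weather": get_weather}
args = json.loads(model_tool_call["arguments"])
result = dispatch[model_tool_call["name"]](**args)
print(result)   # fed back to the model so it can produce the final answer
```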
**Embeddings:** Dense vector representations that capture the semantic meaning of text, images, or other data in a numerical space where proximity reflects conceptual similarity.
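A minimal sketch of how proximity between embedding vectors is measured, using cosine similarity over toy NumPy vectors; the vectors here are made up for illustration rather than produced by a real embedding model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Close to 1.0 means very similar meaning; close to 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings"; real models produce hundreds or thousands of dimensions.
cat   = np.array([0.9, 0.1, 0.0, 0.2])
dog   = np.array([0.8, 0.2, 0.1, 0.3])
stock = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(cat, dog))    # high: related concepts
print(cosine_similarity(cat, stock))  # low: unrelated concepts
```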
**Chain-of-thought prompting:** Prompting technique that improves LLM reasoning by asking the model to decompose complex problems into explicit intermediate steps before reaching a conclusion.
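To show what the technique looks like in practice, here is a hypothetical prompt that contrasts a direct question with a chain-of-thought version; the wording is illustrative, not a prescribed template.

```python
question = "A bakery sells 120 rolls per day at $0.50 each. How much revenue does it make in a week?"

direct_prompt = question

cot_prompt = (
    f"{question}\n"
    "Think step by step: first compute the daily revenue, "
    "then multiply by the number of days in a week, and only then state the final answer."
)
# The step-by-step version makes the intermediate calculations explicit
# (120 * 0.50 = 60 per day; 60 * 7 = 420 per week), which tends to reduce arithmetic slips.
```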