
Large Language Models

Massive neural networks based on the Transformer architecture, trained on enormous text corpora to understand and generate natural language with emergent capabilities like reasoning, translation, and code generation.

evergreen · #llm #transformer #gpt #claude #foundation-models #deep-learning #nlp

What it is

A large language model (LLM) is a neural network with billions of parameters, trained on massive amounts of text to predict the next word in a sequence. This seemingly simple task — predicting what word comes next — produces surprising emergent capabilities when scaled sufficiently.
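
The prediction loop itself is simple. Here is a minimal sketch in Python using the Hugging Face Transformers library, with the small gpt2 checkpoint standing in for a much larger model (the checkpoint, prompt, and greedy decoding are illustrative choices, not a reference implementation):

```python
# Greedy next-token prediction loop: the task an LLM is trained on.
# Sketch only; "gpt2" is a small stand-in for a much larger model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(10):                           # generate 10 tokens, one at a time
    with torch.no_grad():
        logits = model(input_ids=ids).logits  # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()          # most likely next token (greedy)
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```

Everything an LLM produces comes out of this loop: one token at a time, each conditioned on all the tokens that came before it.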

Modern LLMs don't just complete text: they follow complex instructions, reason step by step, write code, translate between languages, and maintain coherent long-context conversations.

How they work

The Transformer architecture

Introduced in the paper "Attention Is All You Need" (2017), the Transformer architecture replaced recurrent networks with an attention mechanism that allows the model to consider all words in a sequence simultaneously, capturing long-range relationships.
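
As a rough sketch, the core operation is scaled dot-product attention; a single head without masking can be written in a few lines (the NumPy formulation and shapes are illustrative):

```python
# Scaled dot-product attention for one head, no masking:
#   Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how much each query matches each key
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mix of value vectors

# 4 tokens with an 8-dimensional head: every position attends to all 4 positions at once.
Q = K = V = np.random.randn(4, 8)
print(attention(Q, K, V).shape)  # (4, 8)
```

Because every query attends to every key, the model can relate a token to any other token in the window, regardless of distance.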

Key components:

  • Tokenization: text is split into tokens (subwords) that the model processes numerically; this step and the embedding step are sketched in code after this list
  • Embeddings: each token is converted into a dense vector capturing its semantic meaning
  • Attention layers: multiple layers that learn which parts of context are relevant for each prediction
  • Context window: the maximum number of tokens the model can process in a single inference
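
The first two components are easy to see directly. A minimal sketch with the Hugging Face Transformers library, again using gpt2 purely as an example tokenizer and embedding table:

```python
# Tokenization and embedding lookup, sketched with the "gpt2" tokenizer and model
# (chosen only for illustration; any causal language model works the same way).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Large language models predict the next token."
token_ids = tokenizer(text).input_ids
print(tokenizer.convert_ids_to_tokens(token_ids))   # the subword pieces

# Each token id is looked up in the embedding table before any attention layer runs.
embeddings = model.get_input_embeddings().weight[token_ids]
print(embeddings.shape)   # (number_of_tokens, hidden_size); hidden_size is 768 for gpt2
```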

Two-phase training

  1. Pre-training: the model learns general language patterns by processing trillions of text tokens. This phase is extremely compute-intensive; the objective itself is sketched in code after this list
  2. Fine-tuning: the model is specialized to follow instructions, align with human preferences (RLHF), or adapt to specific domains
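
The pre-training objective is compact: shift the token sequence by one position and minimize cross-entropy between the model's predictions and the actual next tokens. A toy sketch of a single training step, with a deliberately tiny stand-in model and random token ids in place of real text:

```python
# One step of the next-token pre-training objective, with a tiny stand-in model;
# real pre-training repeats this over trillions of tokens on large clusters.
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, hidden),  # token ids -> vectors
                      nn.Linear(hidden, vocab_size))     # stand-in for the Transformer stack
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

batch = torch.randint(0, vocab_size, (8, 128))   # 8 sequences of 128 random token ids
inputs, targets = batch[:, :-1], batch[:, 1:]    # shift by one: the target is the next token

logits = model(inputs)                           # (8, 127, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
loss.backward()
optimizer.step()
print(float(loss))                               # roughly log(vocab_size) before training
```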

Emergent capabilities

As models scale, capabilities emerge that weren't explicitly programmed:

  • Chain-of-Thought reasoning: ability to decompose complex problems into intermediate steps
  • In-Context Learning: learning from examples provided in the prompt without updating weights (see the prompt sketch after this list)
  • Tool use: invoking APIs, executing code, or querying databases when configured to do so
  • Instruction following: interpreting and executing complex natural language instructions
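
In-context learning is the easiest of these to see directly. A hedged sketch of a few-shot prompt; the task, reviews, and labels are invented for illustration:

```python
# Few-shot prompt for in-context learning: the model infers the pattern from the
# examples in the prompt itself, with no weight updates. Reviews and labels are invented.
prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day."          Sentiment: positive
Review: "It stopped working after a week."    Sentiment: negative
Review: "Setup took five minutes, flawless."  Sentiment:"""

# Sent to a capable base or instruction-tuned model, the expected completion is "positive".
print(prompt)
```

Chain-of-Thought prompting builds on the same idea: including worked, step-by-step examples in the prompt pushes the model to reason through intermediate steps before answering.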

Relevant models

| Model | Organization | Characteristics |
|---|---|---|
| GPT-4o | OpenAI | Multimodal, advanced reasoning |
| Claude | Anthropic | Long context (200K tokens), safety |
| Gemini | Google | Native multimodal, search integration |
| Llama | Meta | Open-source, active community |
| Mistral | Mistral AI | Efficient, competitive open models |
| Command R | Cohere | Optimized for RAG and enterprise |

Model selection

Choosing the right model depends on the use case, not on which model tops the benchmarks:

| Criterion | Large model (GPT-4o, Claude Sonnet) | Small model (Llama 8B, Mistral 7B) |
|---|---|---|
| Complex reasoning | Best performance | Sufficient for simple tasks |
| Latency | 1-5 s per response | Under 500 ms, ideal for real-time |
| Cost per million tokens | $2-15 (input) | $0.10-0.50, or free if self-hosted |
| Data privacy | Data is sent to an external API | Self-hosted, data stays internal |
| Fine-tuning | Expensive, limited | Accessible with LoRA/QLoRA |

The current trend is using large models for complex tasks and specialized small models for high-volume repetitive tasks — a pattern that reduces costs without sacrificing quality where it matters.
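
A minimal sketch of that routing pattern; the heuristic, the model names, and the call_model helper are all placeholders rather than a prescribed implementation:

```python
# Route each request by rough task complexity: a cheap small model for routine
# queries, the large model only when the prompt looks hard. All names are placeholders.
COMPLEX_HINTS = ("explain why", "step by step", "write code", "compare", "prove")

def pick_model(prompt: str) -> str:
    looks_complex = len(prompt) > 500 or any(h in prompt.lower() for h in COMPLEX_HINTS)
    return "large-model" if looks_complex else "small-model"   # placeholder model ids

def call_model(model: str, prompt: str) -> str:
    # Placeholder: swap in the real SDK/API call for whichever provider you use.
    return f"[{model}] response to: {prompt[:40]}..."

def answer(prompt: str) -> str:
    return call_model(model=pick_model(prompt), prompt=prompt)

print(answer("What is 2 + 2?"))                         # routed to the small model
print(answer("Write code to merge two sorted lists."))  # routed to the large model
```

In practice the router can be a keyword heuristic like this one, a trained classifier, or a small LLM that decides when to escalate to the larger model.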

Limitations

  • Hallucinations: models generate plausible but incorrect information with high confidence
  • Static knowledge: their knowledge has a training cutoff date
  • Inference cost: larger models require specialized hardware
  • Finite context window: although growing, still a limitation for very long documents (a token-count check is sketched after this list)
  • Bias: models reflect biases present in their training data
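
Because the context window is a hard limit, a common safeguard is counting tokens before sending a request. A small sketch using the tiktoken library; the encoding name and the 128,000-token limit are illustrative assumptions that vary by model:

```python
# Check whether a document fits in a model's context window before sending it.
# Uses tiktoken; the encoding name and the 128,000-token limit are illustrative.
import tiktoken

CONTEXT_LIMIT = 128_000                      # varies by model; check your provider's docs
enc = tiktoken.get_encoding("cl100k_base")   # encoding used by several recent OpenAI models

def fits_in_context(document: str, reserved_for_output: int = 1_000) -> bool:
    n_tokens = len(enc.encode(document))
    return n_tokens + reserved_for_output <= CONTEXT_LIMIT

print(fits_in_context("hello " * 200_000))   # a ~200k-token string does not fit
```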

Why it matters

LLMs are the foundational technology behind the current artificial intelligence revolution. They're the engine powering AI agents, prompt engineering techniques, and semantic search systems. Understanding how they work — and their limitations — is essential for building effective AI applications.

References

  • Attention Is All You Need — Vaswani et al., 2017. The paper that introduced the Transformer architecture.
  • Scaling Laws for Neural Language Models — Kaplan et al., 2020. Scaling laws that predict LLM performance.
  • Sparks of Artificial General Intelligence — Microsoft Research, 2023. Analysis of emergent capabilities in GPT-4.
  • LLaMA: Open and Efficient Foundation Language Models — Touvron et al. (Meta), 2023. The paper that democratized open-source LLMs.
  • LLM Tutorial — Hugging Face, 2024. Practical guide for using LLMs with the Transformers library.

Related content

  • Neural Networks

    Computational models inspired by brain structure that learn patterns from data, forming the foundation of modern artificial intelligence systems.

  • Artificial Intelligence

    Field of computer science dedicated to creating systems capable of performing tasks that normally require human intelligence, from reasoning and perception to language generation.

  • Prompt Engineering

    The discipline of designing effective instructions for language models, combining clarity, structure, and examples to obtain consistent, high-quality responses.

  • Fine-Tuning

    Process of specializing a pre-trained model for a specific task or domain through additional training with curated data, adapting its behavior without starting from scratch.

  • Tokenization

    Process of splitting text into discrete units (tokens) that language models can process numerically, fundamental to how LLMs understand and generate text.

  • Hallucination Mitigation

    Techniques to reduce the tendency of LLMs to generate false but plausible information, from RAG to factual verification and prompt design.

  • AWS Bedrock

    AWS serverless service providing access to foundation models from multiple providers (Anthropic, Meta, Mistral, Amazon) via unified API, without managing ML infrastructure.

  • Context Windows

    The maximum number of tokens an LLM can process in a single interaction, determining how much information it can consider simultaneously to generate responses.

  • Synthetic Data

    Algorithmically generated data that replicates the statistical properties of real data, used to train, evaluate, and test AI systems when real data is scarce, expensive, or sensitive.

  • Prompt Caching

    Technique that stores the internal computation of reused prompt prefixes across LLM calls, reducing costs by up to 90% and latency by up to 85% in applications with repetitive context.

  • Inference Optimization

    Techniques to reduce cost, latency, and resources needed to run language models in production, from quantization to distributed serving.

  • Function Calling

    LLM capability to generate structured calls to external functions based on natural language, enabling integration with APIs, databases, and real-world tools.

  • Embeddings

    Dense vector representations that capture the semantic meaning of text, images, or other data in a numerical space where proximity reflects conceptual similarity.

  • Chain-of-Thought

    Prompting technique that improves LLM reasoning by asking them to decompose complex problems into explicit intermediate steps before reaching a conclusion.
