A prompting technique that improves LLM reasoning by asking the model to decompose complex problems into explicit intermediate steps before reaching a conclusion.
Chain-of-Thought (CoT) is a prompt engineering technique that dramatically improves the reasoning capability of large language models by asking them to "think step by step." Instead of jumping directly to the answer, the model generates intermediate reasoning steps that guide it toward more precise and verifiable conclusions.
Introduced by Wei et al. in 2022, CoT demonstrated that large models can solve math, logic, and commonsense problems they had previously failed at consistently. The technique works because LLMs predict tokens sequentially: when the model generates intermediate steps, each step provides crucial context for the next, making the reasoning process visible and easier to audit and correct.
The impact is significant: on the GSM8K math benchmark, GPT-3 (175B) improved from 17.7% accuracy with standard prompting to 58.1% with CoT. On commonsense reasoning (CommonsenseQA), the improvement was from 76.0% to 78.7%.
Zero-shot CoT: The simplest implementation is adding "Let's think step by step" to the end of the prompt. It works surprisingly well without examples:
# Standard prompt
prompt = "If a store has 15 apples and sells 7, how many are left?"
# Typical response: "8" (without showing work)
# Zero-shot CoT
prompt = """If a store has 15 apples and sells 7, how many are left?
Let's think step by step."""
# Typical response:
# "We start with 15 apples.
# We sell 7 apples.
# 15 - 7 = 8
# There are 8 apples left."

Few-shot CoT: Provide examples with explicit reasoning before the target question:
prompt = """
Example: If I have 10 oranges and eat 3, how many are left?
Step 1: I start with 10 oranges
Step 2: I eat 3 oranges
Step 3: 10 - 3 = 7
Answer: 7 oranges are left.
Question: If a library has 45 books and lends 18, how many are left?
"""Generate multiple independent reasoning chains and choose the most frequent answer. Significantly improves accuracy:
import openai
from collections import Counter

def self_consistency_cot(prompt, n_samples=5):
    # extract_final_answer is a user-supplied helper that parses the
    # final answer out of a reasoning chain
    responses = []
    for _ in range(n_samples):
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": f"{prompt}\nLet's think step by step."}],
            temperature=0.7,  # Variability for different chains
        )
        responses.append(extract_final_answer(response.choices[0].message.content))
    # Majority voting
    return Counter(responses).most_common(1)[0][0]

On the GSM8K benchmark, self-consistency improved standard CoT accuracy from 58.1% to 74.4%.
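The self_consistency_cot sketch above leaves extract_final_answer undefined. A minimal sketch of one possible implementation (the regex parsing strategy and the sample chains below are illustrative assumptions, not part of the original), with the majority vote applied to three hypothetical chains:

```python
import re
from collections import Counter

def extract_final_answer(text):
    """Pull the last number in a reasoning chain as the final answer.

    Minimal sketch: production parsers usually look for an explicit
    "Answer:" marker and normalize units/formatting instead.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

# Three hypothetical chains sampled at temperature 0.7
chains = [
    "We start with 15 apples. We sell 7. 15 - 7 = 8. There are 8 left.",
    "15 apples minus 7 sold leaves 15 - 7 = 8 apples.",
    "The store sells 7 of 15 apples, so 15 - 7 = 9 remain.",  # faulty chain
]
answers = [extract_final_answer(c) for c in chains]
print(Counter(answers).most_common(1)[0][0])  # majority vote -> "8"
```

Note how the single faulty chain is outvoted: this is the entire mechanism behind self-consistency.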
Tree of Thoughts: Explore multiple reasoning branches in parallel, evaluating and pruning less promising paths:
def tree_of_thoughts(problem, max_depth=3):
    # llm.generate_thoughts, evaluate_thought, and evaluate_solution are
    # assumed external helpers (model calls and scoring functions)
    def generate_thoughts(state, depth):
        if depth >= max_depth:
            return [evaluate_solution(state)]
        # Generate multiple candidate thoughts
        thoughts = llm.generate_thoughts(state, n=3)
        # Evaluate each thought
        scored_thoughts = [(t, evaluate_thought(t)) for t in thoughts]
        # Select the best ones to continue
        best_thoughts = sorted(scored_thoughts, key=lambda x: x[1], reverse=True)[:2]
        results = []
        for thought, score in best_thoughts:
            new_state = state + [thought]
            results.extend(generate_thoughts(new_state, depth + 1))
        return results
    return generate_thoughts([problem], 0)

| Technique | GSM8K (math) | CommonsenseQA | Latency | Cost |
|---|---|---|---|---|
| Standard | 17.7% | 76.0% | 1x | 1x |
| Zero-shot CoT | 58.1% | 78.7% | 2-3x | 2-3x |
| Few-shot CoT | 65.2% | 82.1% | 2-3x | 2-3x |
| Self-Consistency | 74.4% | 85.3% | 10-15x | 10-15x |
| Tree of Thoughts | 78.9% | 87.2% | 20-50x | 20-50x |
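The tree_of_thoughts sketch above leaves the model call and the evaluators abstract. A toy, fully local instantiation of the same generate-score-prune loop (the arithmetic task and all names here are illustrative, not from the original): find sequences of operations that turn 4 into 10, keeping only the two best-scored branches at each depth.

```python
def toy_tree_of_thoughts(start, target, max_depth=3, beam=2):
    # Candidate "thoughts" are arithmetic operations instead of LLM outputs
    ops = [("+3", lambda x: x + 3), ("*2", lambda x: x * 2), ("-1", lambda x: x - 1)]

    def score(value):
        return -abs(target - value)  # closer to target = more promising

    def expand(path, value, depth):
        if value == target:
            return [path]  # a complete solution
        if depth >= max_depth:
            return []      # dead end
        # Generate and score one candidate thought per operation
        candidates = [(name, fn(value)) for name, fn in ops]
        # Prune: keep only the `beam` most promising branches
        best = sorted(candidates, key=lambda c: score(c[1]), reverse=True)[:beam]
        results = []
        for name, new_value in best:
            results.extend(expand(path + [name], new_value, depth + 1))
        return results

    return expand([], start, 0)

print(toy_tree_of_thoughts(4, 10))
# -> [['*2', '+3', '-1'], ['*2', '-1', '+3'], ['+3', '+3']]
```

In a real system the operations become model-generated reasoning steps and the score comes from a second model call, but the search structure is identical.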
When to use each technique:
Zero-shot CoT: Simple to medium problems, when cost is a primary concern.
Few-shot CoT: Specific domains where you have high-quality examples, problems requiring specific formatting.
Self-Consistency: Critical problems where accuracy is more important than cost, high-impact decisions.
Tree of Thoughts: Complex planning problems, creative search, when you need to explore multiple approaches.
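The guidance above can be condensed into a simple dispatcher. A hypothetical helper (the labels and criteria are illustrative, not a standard API):

```python
def choose_cot_variant(complexity, accuracy_critical=False, has_examples=False):
    """Map the rough selection criteria above to a CoT variant."""
    if complexity == "high" and accuracy_critical:
        return "tree_of_thoughts"   # planning / exploratory problems, 20-50x cost
    if accuracy_critical:
        return "self_consistency"   # pay 10-15x for reliability
    if has_examples:
        return "few_shot_cot"       # domain-specific reasoning or formatting
    return "zero_shot_cot"          # cheapest reasonable default

print(choose_cot_variant("low"))  # -> zero_shot_cot
```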
AI agents use CoT internally to plan actions and use tools. In function calling, CoT helps the model decide which function to call and with what parameters:
system_prompt = """
You are an assistant that can use tools. For each request:
1. Analyze what information you need
2. Determine which tools to use
3. Plan the sequence of calls
4. Execute step by step
Available tools: get_weather, send_email, search_web
"""
user_prompt = """
I need to send an email to my team about tomorrow's weather in Madrid.
Let's think step by step.
"""# Problem: The model can generate logical but incorrect steps
prompt = "How many days are in February 2023?"
# Incorrect but plausible response:
# "Step 1: 2023 is not divisible by 4
# Step 2: Therefore, 2023 is not a leap year
# Step 3: February in non-leap years has 28 days
# Answer: 28 days"
#
# ERROR: 2023 DOES have 28 days, but the divisibility reasoning is correctChain-of-thought represents a fundamental shift in how we interact with LLMs — from "question-answer" to "collaborative reasoning." For senior engineers building AI systems, mastering CoT is critical because:
Improves reliability: Intermediate steps make the decision process auditable, crucial for high-risk applications.
Reduces long-term costs: While CoT uses more tokens per query, it reduces debugging iterations and prompt refinement.
Enables complex use cases: Problems that previously required multiple model calls can now be solved in a single structured reasoning session.
The difference between a system that "works sometimes" and one that "works consistently" often lies in the correct implementation of CoT and its variants.
Prompt engineering: The discipline of designing effective instructions for language models, combining clarity, structure, and examples to obtain consistent, high-quality responses.
Large language models (LLMs): Massive neural networks based on the Transformer architecture, trained on enormous text corpora to understand and generate natural language, with emergent capabilities like reasoning, translation, and code generation.
AI agents: Autonomous systems that combine language models with reasoning, memory, and tool use to execute complex multi-step tasks with minimal human intervention.
Function calling: The LLM capability to generate structured calls to external functions from natural language, enabling integration with APIs, databases, and real-world tools.
Hallucination mitigation: Techniques to reduce LLMs' generation of false but plausible information, from RAG to factual verification and prompt design.