A prompting technique that improves LLM reasoning by asking the model to decompose complex problems into explicit intermediate steps before reaching a conclusion.
Chain-of-Thought (CoT) is a prompt engineering technique that dramatically improves the reasoning capability of large language models by asking them to "think step by step." Instead of jumping directly to the answer, the model generates intermediate reasoning steps that guide it toward more precise and verifiable conclusions.
Introduced by Wei et al. in 2022, CoT demonstrated that large models can solve math, logic, and commonsense problems they had previously failed at consistently. The technique works because LLMs predict tokens sequentially: when the model generates intermediate steps, each step provides crucial context for the next, making the reasoning process visible and easier to audit and correct.
The impact is significant: on the GSM8K math benchmark, GPT-3 (175B) improved from 17.7% accuracy with standard prompting to 58.1% with CoT. On commonsense reasoning (CommonsenseQA), the improvement was from 76.0% to 78.7%.
Zero-shot CoT: The simplest implementation is adding "Let's think step by step" to the end of the prompt. It works surprisingly well without examples:
# Standard prompt
prompt = "If a store has 15 apples and sells 7, how many are left?"
# Typical response: "8" (without showing work)
# Zero-shot CoT
prompt = """If a store has 15 apples and sells 7, how many are left?
Let's think step by step."""
# Typical response:
# "We start with 15 apples.
# We sell 7 apples.
# 15 - 7 = 8
# There are 8 apples left."

Few-shot CoT: Provide examples with explicit reasoning before the target question:
prompt = """
Example: If I have 10 oranges and eat 3, how many are left?
Step 1: I start with 10 oranges
Step 2: I eat 3 oranges
Step 3: 10 - 3 = 7
Answer: 7 oranges are left.
Question: If a library has 45 books and lends 18, how many are left?
"""Generate multiple independent reasoning chains and choose the most frequent answer. Significantly improves accuracy:
import openai
from collections import Counter

def self_consistency_cot(prompt, n_samples=5):
    # extract_final_answer is a user-supplied helper that parses the
    # final answer out of a reasoning chain
    responses = []
    for _ in range(n_samples):
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": f"{prompt}\nLet's think step by step."}],
            temperature=0.7,  # Variability for different chains
        )
        responses.append(extract_final_answer(response.choices[0].message.content))
    # Majority voting
    return Counter(responses).most_common(1)[0][0]

On the GSM8K benchmark, self-consistency improved standard CoT accuracy from 58.1% to 74.4%.
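The self_consistency_cot sketch above leaves extract_final_answer undefined. A minimal sketch of one possible implementation (the regex parsing strategy and the sample chains below are illustrative assumptions, not part of the original), with the majority vote applied to three hypothetical chains:

```python
import re
from collections import Counter

def extract_final_answer(text):
    """Pull the last number in a reasoning chain as the final answer.

    Minimal sketch: production parsers usually look for an explicit
    "Answer:" marker and normalize units/formatting instead.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

# Three hypothetical chains sampled at temperature 0.7
chains = [
    "We start with 15 apples. We sell 7. 15 - 7 = 8. There are 8 left.",
    "15 apples minus 7 sold leaves 15 - 7 = 8 apples.",
    "The store sells 7 of 15 apples, so 15 - 7 = 9 remain.",  # faulty chain
]
answers = [extract_final_answer(c) for c in chains]
print(Counter(answers).most_common(1)[0][0])  # majority vote -> "8"
```

Note how the single faulty chain is outvoted: this is the entire mechanism behind self-consistency.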
Tree of Thoughts: Explore multiple reasoning branches in parallel, evaluating and pruning less promising paths:
def tree_of_thoughts(problem, max_depth=3):
    # llm.generate_thoughts, evaluate_thought, and evaluate_solution are
    # assumed external helpers (model calls and scoring functions)
    def generate_thoughts(state, depth):
        if depth >= max_depth:
            return [evaluate_solution(state)]
        # Generate multiple candidate thoughts
        thoughts = llm.generate_thoughts(state, n=3)
        # Evaluate each thought
        scored_thoughts = [(t, evaluate_thought(t)) for t in thoughts]
        # Select the best ones to continue
        best_thoughts = sorted(scored_thoughts, key=lambda x: x[1], reverse=True)[:2]
        results = []
        for thought, score in best_thoughts:
            new_state = state + [thought]
            results.extend(generate_thoughts(new_state, depth + 1))
        return results
    return generate_thoughts([problem], 0)

| Technique | GSM8K (math) | CommonsenseQA | Latency | Cost |
|---|---|---|---|---|
| Standard | 17.7% | 76.0% | 1x | 1x |
| Zero-shot CoT | 58.1% | 78.7% | 2-3x | 2-3x |
| Few-shot CoT | 65.2% | 82.1% | 2-3x | 2-3x |
| Self-Consistency | 74.4% | 85.3% | 10-15x | 10-15x |
| Tree of Thoughts | 78.9% | 87.2% | 20-50x | 20-50x |
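The tree_of_thoughts sketch above leaves the model call and the evaluators abstract. A toy, fully local instantiation of the same generate-score-prune loop (the arithmetic task and all names here are illustrative, not from the original): find sequences of operations that turn 4 into 10, keeping only the two best-scored branches at each depth.

```python
def toy_tree_of_thoughts(start, target, max_depth=3, beam=2):
    # Candidate "thoughts" are arithmetic operations instead of LLM outputs
    ops = [("+3", lambda x: x + 3), ("*2", lambda x: x * 2), ("-1", lambda x: x - 1)]

    def score(value):
        return -abs(target - value)  # closer to target = more promising

    def expand(path, value, depth):
        if value == target:
            return [path]  # a complete solution
        if depth >= max_depth:
            return []      # dead end
        # Generate and score one candidate thought per operation
        candidates = [(name, fn(value)) for name, fn in ops]
        # Prune: keep only the `beam` most promising branches
        best = sorted(candidates, key=lambda c: score(c[1]), reverse=True)[:beam]
        results = []
        for name, new_value in best:
            results.extend(expand(path + [name], new_value, depth + 1))
        return results

    return expand([], start, 0)

print(toy_tree_of_thoughts(4, 10))
# -> [['*2', '+3', '-1'], ['*2', '-1', '+3'], ['+3', '+3']]
```

In a real system the operations become model-generated reasoning steps and the score comes from a second model call, but the search structure is identical.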
When to use each technique:
Zero-shot CoT: Simple to medium problems, when cost is a primary concern.
Few-shot CoT: Specific domains where you have high-quality examples, problems requiring specific formatting.
Self-Consistency: Critical problems where accuracy is more important than cost, high-impact decisions.
Tree of Thoughts: Complex planning problems, creative search, when you need to explore multiple approaches.
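The guidance above can be condensed into a simple dispatcher. A hypothetical helper (the labels and criteria are illustrative, not a standard API):

```python
def choose_cot_variant(complexity, accuracy_critical=False, has_examples=False):
    """Map the rough selection criteria above to a CoT variant."""
    if complexity == "high" and accuracy_critical:
        return "tree_of_thoughts"   # planning / exploratory problems, 20-50x cost
    if accuracy_critical:
        return "self_consistency"   # pay 10-15x for reliability
    if has_examples:
        return "few_shot_cot"       # domain-specific reasoning or formatting
    return "zero_shot_cot"          # cheapest reasonable default

print(choose_cot_variant("low"))  # -> zero_shot_cot
```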
AI agents use CoT internally to plan actions and use tools. In function calling, CoT helps the model decide which function to call and with what parameters:
system_prompt = """
You are an assistant that can use tools. For each request:
1. Analyze what information you need
2. Determine which tools to use
3. Plan the sequence of calls
4. Execute step by step
Available tools: get_weather, send_email, search_web
"""
user_prompt = """
I need to send an email to my team about tomorrow's weather in Madrid.
Let's think step by step.
"""# Problem: The model can generate logical but incorrect steps
prompt = "How many days are in February 2023?"
# Incorrect but plausible response:
# "Step 1: 2023 is not divisible by 4
# Step 2: Therefore, 2023 is not a leap year
# Step 3: February in non-leap years has 28 days
# Answer: 28 days"
#
# ERROR: 2023 DOES have 28 days, but the divisibility reasoning is correctChain-of-thought represents a fundamental shift in how we interact with LLMs — from "question-answer" to "collaborative reasoning." For senior engineers building AI systems, mastering CoT is critical because:
Improves reliability: Intermediate steps make the decision process auditable, crucial for high-risk applications.
Reduces long-term costs: While CoT uses more tokens per query, it reduces debugging iterations and prompt refinement.
Enables complex use cases: Problems that previously required multiple model calls can now be solved in a single structured reasoning session.
The difference between a system that "works sometimes" and one that "works consistently" often lies in the correct implementation of CoT and its variants.
Prompt engineering: The discipline of designing effective instructions for language models, combining clarity, structure, and examples to obtain consistent, high-quality responses.
Large language models (LLMs): Massive neural networks based on the Transformer architecture, trained on enormous text corpora to understand and generate natural language, with emergent capabilities like reasoning, translation, and code generation.
AI agents: Autonomous systems that combine language models with reasoning, memory, and tool use to execute complex multi-step tasks with minimal human intervention.
Function calling: The LLM capability to generate structured calls to external functions from natural language, enabling integration with APIs, databases, and real-world tools.
Hallucination mitigation: Techniques to reduce LLMs' generation of false but plausible information, from RAG to factual verification and prompt design.