
© 2026 Jonatan Mata · alpha · v0.1.0

Notes

Content Agent QA Review: PR #187

Findings from a manual review of the PR

Status: growing · Tags: #content-automation #qa #hallucination #agentic-workflows #lessons-learned

What happened

The content agent generated PR #187 to upgrade ai-coding-assistants from seed to evergreen. Manual review found three factual errors that the automated pipeline did not catch.

Findings

1. Hallucinated reference

The agent cited "The Programmer's Brain in the Era of AI" with URL research.google/pubs/pub52966/. That URL exists (HTTP 200), but points to a medical NLP paper titled "Structured Understanding of Assessment and Plans in Clinical Documentation" (Yaya-Stupp et al., medRxiv 2022). The title, year, and topic were fabricated.

Why it wasn't caught: the QA agent verifies that URLs return HTTP 200, but does not verify that the page content matches the cited title. A 200 does not mean the reference is correct.
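
A minimal sketch of the missing check, assuming the page HTML has already been fetched (the helper name and word-overlap heuristic are hypothetical, not part of the actual pipeline): require that most words of the cited title appear in the page text, not just a 200 status.

```python
import re

def title_matches(cited_title: str, page_html: str, threshold: float = 0.6) -> bool:
    """Heuristic check that a cited title plausibly appears in fetched HTML.

    Hypothetical helper: the current QA agent only checks the status code.
    """
    # Strip tags crudely; good enough for a presence check.
    text = re.sub(r"<[^>]+>", " ", page_html).lower()
    # Ignore very short words to reduce accidental matches.
    words = [w for w in re.findall(r"[a-z0-9]+", cited_title.lower()) if len(w) > 2]
    if not words:
        return False
    hits = sum(1 for w in words if w in text)
    return hits / len(words) >= threshold
```

In the PR #187 case, the fabricated title shares essentially no words with the medical NLP page actually served at that URL, so this check would have flagged the reference despite the HTTP 200.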

2. Unsourced statistic

The "Why it matters" section claimed "productivity increases of 20-40%" without citing a specific study. The GitHub paper (Peng et al., 2023) measured 55.8% on a specific task. Google's internal study measured ~21%. No cited source supports the "20-40%" range.

Why it wasn't caught: the QA agent's --deep mode looks for "unsourced claims" but does not cross-check numbers in the text against the actual figures in the cited references.
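
A first approximation of that cross-check could be purely mechanical (the function names and the literal-matching heuristic below are assumptions; a real version would need semantic matching): extract percentage claims from the prose and flag any that no cited reference contains.

```python
import re

def claimed_numbers(text: str) -> set[str]:
    """Extract percentage claims like '20-40%' or '55.8%' from prose."""
    return set(re.findall(r"\d+(?:\.\d+)?(?:\s*-\s*\d+(?:\.\d+)?)?%", text))

def unsupported_claims(text: str, reference_figures: set[str]) -> set[str]:
    """Return percentage claims not backed by any cited reference.

    Hypothetical heuristic: matches figures literally, so it would catch
    the fabricated '20-40%' range but not a paraphrased number.
    """
    return claimed_numbers(text) - reference_figures
```

Against the PR #187 text, the reference figures are 55.8% (Peng et al.) and ~21% (Google), so the "20-40%" range comes back as unsupported.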

3. Incorrect pricing

The comparison table listed Kiro at $25/month. The actual price is $20/month (Pro) per kiro.dev/pricing. Prices change — the agent likely used stale training data.

Why it wasn't caught: there is no pricing verification in the pipeline. Prices are volatile data that the LLM cannot verify without real-time web access.

What worked well

  • ES↔EN structure was correct and symmetric
  • Cross-references expanded from 2 to 5 concepts, all valid
  • Internal links use /concepts/slug paths correctly
  • Comparison table with real tools is substantive
  • Security considerations section adds genuine depth
  • All 4 legitimate URLs returned HTTP 200

What needs to improve

Short term

| Improvement | Effort | Impact |
| --- | --- | --- |
| Verify reference title appears on the page | Medium — requires fetch + text search | High — eliminates reference hallucinations |
| Cross-check text figures against cited references | High — requires semantic understanding | High — eliminates fabricated statistics |
| Add prompt warning about prices and volatile data | Low — prompt change | Medium — reduces stale data errors |

Medium term

  • Semantic reference verification: after verifying HTTP 200, fetch the page <title> and compare against the cited title. If they don't match, flag as suspicious.
  • Quantitative claim validation: extract numbers from text and verify that at least one reference supports them. This requires an additional LLM step or an extraction heuristic.
  • Volatile data: maintain a verified data file (prices, versions, release dates) that the agent consults instead of relying on training data.
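
A sketch of the verified-data idea: the Kiro entry mirrors the correction in this review, but the schema and helper below are hypothetical assumptions, not an existing pipeline feature.

```python
# Verified facts the agent consults instead of trusting training data.
# In practice this would live in a reviewed data file, not in code.
VERIFIED_DATA = {
    "kiro": {"price_usd_per_month": 20, "plan": "Pro", "source": "kiro.dev/pricing"},
}

def price_claim_ok(tool: str, claimed_price: float) -> bool:
    """True only if the claimed price matches the verified entry for the tool.

    Unknown tools fail closed: no verified entry means the claim is flagged.
    """
    entry = VERIFIED_DATA.get(tool)
    return entry is not None and entry["price_usd_per_month"] == claimed_price
```

With this in place, the agent's $25/month claim for Kiro would have been rejected at generation time rather than caught in manual review.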

Corrections applied

  • Kiro pricing: $25/month → $20/month (Pro)
  • Hallucinated reference replaced with Peng et al., 2023 (actual Copilot experiment paper)
  • Weak Stack Overflow reference replaced with Google Research ML code completion
  • "20-40%" figure replaced with 55.8% citing Peng et al.

References

  • PR #187: upgrade concepts/ai-coding-assistants to evergreen — jonmatum/jonmatum.com, 2026. The reviewed PR.
  • The Impact of AI on Developer Productivity: Evidence from GitHub Copilot — Peng et al., 2023. Controlled experiment measuring 55.8% improvement.
  • ML-Enhanced Code Completion Improves Developer Productivity — Google Research, 2022. Internal evaluation of ML-powered code completion.

Related content

  • AI Agents

    Autonomous systems that combine language models with reasoning, memory, and tool use to execute complex multi-step tasks with minimal human intervention.

  • Hallucination Mitigation

    Techniques to reduce LLMs generating false but plausible information, from RAG to factual verification and prompt design.

  • AI Coding Assistants

    Tools using LLMs to help developers write, understand, debug, and refactor code, from autocomplete to agents that implement complete features.
