Findings from manual review of PR #187
The content agent generated PR #187 to upgrade ai-coding-assistants from seed to evergreen. Manual review found three factual errors that the automated pipeline did not catch.
The agent cited "The Programmer's Brain in the Era of AI" with URL research.google/pubs/pub52966/. That URL exists (HTTP 200), but points to a medical NLP paper titled "Structured Understanding of Assessment and Plans in Clinical Documentation" (Yaya-Stupp et al., medRxiv 2022). The title, year, and topic were fabricated.
Why it wasn't caught: the QA agent verifies that URLs return HTTP 200, but does not verify that the page content matches the cited title. A 200 does not mean the reference is correct.
The "Why it matters" section claimed "productivity increases of 20-40%" without citing a specific study. The GitHub paper (Peng et al., 2023) measured 55.8% on a specific task. Google's internal study measured ~21%. No cited source supports the "20-40%" range.
Why it wasn't caught: the QA agent with --deep looks for "unsourced claims" but does not cross-check numbers in the text against actual figures in the cited references.
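Full semantic cross-checking is hard, but a crude mechanical pass could at least extract percentage figures from a claim and flag any figure that appears in none of the cited sources. A minimal sketch (function names and the matching heuristic are assumptions, not existing pipeline code):

```python
import re

# Matches "55.8%", "21%", and ranges like "20-40%".
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)(?:\s*[-\u2013\u2014]\s*(\d+(?:\.\d+)?))?\s*%")

def extract_figures(text):
    """Pull percentage figures out of text as a set of floats."""
    figures = set()
    for match in PERCENT_RE.finditer(text):
        figures.add(float(match.group(1)))
        if match.group(2):  # second endpoint of a range like "20-40%"
            figures.add(float(match.group(2)))
    return figures

def unsupported_figures(claim_text, source_texts):
    """Figures in the claim that appear verbatim in none of the cited sources."""
    supported = set()
    for src in source_texts:
        supported |= extract_figures(src)
    return extract_figures(claim_text) - supported
```

This would have flagged the "20-40%" claim, since neither 20 nor 40 appears in the cited sources' figures (55.8 and 21). It is only a literal-number check; paraphrased or unit-converted figures would still slip through.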
The comparison table listed Kiro at $25/month; the actual price is $20/month (Pro) per kiro.dev/pricing. Prices change frequently, so the agent likely relied on stale training data.
Why it wasn't caught: there is no pricing verification in the pipeline. Prices are volatile data that the LLM cannot verify without real-time web access.
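If the pipeline gains web access, a fetched pricing page can at least be checked for the claimed dollar amount. A minimal sketch (the functions operate on already-fetched page text; names and the extraction heuristic are assumptions, since the structure of kiro.dev/pricing is unknown):

```python
import re

def extract_prices(page_text):
    """Find dollar amounts like '$20/month' in fetched page text."""
    return {float(m) for m in re.findall(r"\$(\d+(?:\.\d+)?)", page_text)}

def price_is_listed(page_text, claimed_price):
    """True if the claimed price appears anywhere on the fetched page."""
    return claimed_price in extract_prices(page_text)
```

A check like `price_is_listed(fetched_html, 25.0)` would have failed for the Kiro row, surfacing the stale $25 figure for human review.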
Recommended pipeline improvements:

| Improvement | Effort | Impact |
|---|---|---|
| Verify reference title appears on the page | Medium — requires fetch + text search | High — eliminates reference hallucinations |
| Cross-check text figures against cited references | High — requires semantic understanding | High — eliminates fabricated statistics |
| Add prompt warning about prices and volatile data | Low — prompt change | Medium — reduces stale data errors |
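One low-effort version of the third improvement's prompt warning (wording is illustrative, not the pipeline's actual prompt):

```text
Pricing, version numbers, and release dates are volatile. Do not state a
specific price or version from memory. Either cite a primary source fetched
during this run, or direct the reader to the vendor's pricing page instead.
```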
For the first improvement, fetch each cited URL, extract the page <title>, and compare it against the cited title. If they don't match, flag the reference as suspicious.
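The title check from the table's first row can be sketched with the standard library. A minimal version (the word-overlap heuristic and 0.6 threshold are assumptions, not the pipeline's actual logic):

```python
import re
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text inside the first <title> element of a page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.title:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def title_matches(html, cited_title, threshold=0.6):
    """Heuristic: enough words from the cited title appear in the page <title>."""
    parser = TitleExtractor()
    parser.feed(html)
    page_words = set(re.findall(r"\w+", parser.title.lower()))
    cited_words = re.findall(r"\w+", cited_title.lower())
    if not cited_words:
        return False
    hits = sum(1 for w in cited_words if w in page_words)
    return hits / len(cited_words) >= threshold
```

Against the pub52966 page, whose real title is about clinical documentation, the fabricated "The Programmer's Brain in the Era of AI" would fail this check even though the URL returns HTTP 200.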