The More Sources an AI Research Agent Reads, the Less Accurately It Represents Yours

More research should mean more accuracy. That is the assumption. A deep research agent retrieving 150 sources is — intuitively — better informed than one retrieving two. The preprint "Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents" (arXiv:2605.06635, May 2026) tests that assumption directly. The finding runs the other way.

Factual accuracy degrades as search depth increases. Across 14 models, the average factual accuracy drop from 2 tool calls to 150 is approximately 42 percentage points. The study names the mechanism: information overload. When a deep research agent retrieves many sources, the volume of content exceeds the model's capacity to maintain fine-grained attribution. Facts that are accurately supported at low depth get blended, paraphrased, or attributed incorrectly as the source pool grows.

For any organisation that produces high-precision factual content — policy analysis, research summaries, data journalism — this finding describes a specific failure mode that currently has no name in most content strategy vocabularies. It does now. We think it is the most practically significant finding in the deep research literature this year.

Evidence tier: Preprint — not yet peer-reviewed. The figures cited here (42-percentage-point average degradation, 79% → 17% trajectory for GPT-5.4) are directional findings from a controlled study, not stable properties of any named platform. Models are updated continuously; measured behaviour may not reflect current production systems.

What the study measured

The paper evaluates source attribution quality in LLM deep research agents — systems that conduct multi-step research by executing tool calls (web searches, document retrievals) before generating a final report. The core design question: does the accuracy with which an agent represents a given source's content degrade as the number of sources retrieved increases?

The researchers tested 14 models across search depths ranging from 2 to 150 tool calls — the full range from shallow to maximum-depth research. For each model and each depth, they measured factual accuracy: the proportion of citation-claim pairs where the specific fact, number, date, or assertion in the generated text is accurately supported by the retrieved source.
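
As a minimal sketch of how a metric of this shape can be computed, the Python below treats factual accuracy as the share of citation-claim pairs judged as supported by their cited source. The `CitationClaimPair` class, its field names, and the four example pairs are hypothetical illustrations, not the paper's data or evaluation code; how the study actually judges whether a source supports a claim is not reproduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CitationClaimPair:
    """One claim in a generated report plus the source it cites (hypothetical schema)."""
    claim: str        # the fact, number, date, or assertion in the generated text
    source_id: str    # identifier of the cited source (made up for this example)
    supported: bool   # judged label: does the source accurately support the claim?


def factual_accuracy(pairs: Sequence[CitationClaimPair]) -> float:
    """Return the proportion of citation-claim pairs judged as supported."""
    if not pairs:
        raise ValueError("need at least one citation-claim pair")
    return sum(pair.supported for pair in pairs) / len(pairs)


# Illustrative labels only: three of four claims are supported by their sources.
example_pairs = [
    CitationClaimPair("Figure A is 12%", "src-1", True),
    CitationClaimPair("Policy B took effect in 2021", "src-2", True),
    CitationClaimPair("Survey C had 4,000 respondents", "src-2", False),
    CitationClaimPair("Rate D fell year on year", "src-3", True),
]

print(f"factual accuracy: {factual_accuracy(example_pairs):.0%}")
```

Under this definition a report can cite only sources that were genuinely retrieved and still score poorly, because the label attaches to the claim-source match rather than to the citation's existence.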

Baseline factual accuracy, measured at 2 tool calls, ranged from 39% to 77% across the 14 models. That range is itself a finding worth noting. Our read: even before information overload enters the picture, these systems are not reliably accurate. At the low end, a model at minimal search depth accurately represents fewer than 4 in 10 specific factual claims. At the top end, 77% accuracy at 2 tool calls, before any depth scaling, still means nearly 1 in 4 factual claims is inaccurate under the most favourable condition the study tested.

The information overload trajectory

The average 42-percentage-point factual accuracy degradation from 2 to 150 tool calls is the study's headline figure. The GPT-5.4 trajectory is the sharpest case in the data: 79% factual accuracy at 2 tool calls; 17% at 150.

That is not a gradual decline. It is a collapse.
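
The size of that drop can be checked with simple arithmetic. The sketch below uses only the two GPT-5.4 figures quoted above (79% and 17%) and separates the drop in percentage points from the relative drop, two measures that are easy to confuse. The function names are illustrative.

```python
def percentage_point_drop(start_pct: float, end_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return start_pct - end_pct


def relative_drop(start_pct: float, end_pct: float) -> float:
    """Fraction of the starting accuracy that was lost."""
    if start_pct <= 0:
        raise ValueError("starting percentage must be positive")
    return (start_pct - end_pct) / start_pct


# GPT-5.4 figures as quoted in this post: 2 tool calls vs 150 tool calls.
gpt_start, gpt_end = 79.0, 17.0

pp_drop = percentage_point_drop(gpt_start, gpt_end)
rel_drop = relative_drop(gpt_start, gpt_end)

print(f"{pp_drop:.0f} percentage points, {rel_drop:.0%} relative decline")
```

On these figures the GPT-5.4 drop is 62 percentage points, about 20 points larger than the 42-point average across models, and corresponds to a relative decline of roughly 78%.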

The mechanism the paper identifies is information overload. This is not a bug in a specific model. It is a structural property of how current deep research agents process large retrieved source pools. When the volume of retrieved content grows large enough, the model can no longer track which specific claim originated in which specific source. Claims get merged across sources. Qualifications get dropped. Numbers from one source get attached to claims from another. The agent cites a source — correctly, in the sense that the source was retrieved — but the claim attributed to it no longer matches what the source actually says.

The result: your content is cited. What is attributed to your content is not what your content says.

What this means for publishers whose content gets retrieved

The study's creator-side implication is about awareness, not optimisation. There is no direct lever here.

You cannot control how many sources a deep research agent retrieves when it processes a query. Search depth is set by the user or the platform — 10 tool calls, 50, 150. When a research agent is running at high depth and your content is one of the sources retrieved, the probability that the agent accurately represents what your content says is substantially lower than it would be if your content were one of two sources. That probability differential is approximately 42 percentage points on average across the 14 models tested. You have no input into which side of that differential you land on for any given query.

The implication is a shift in how high-precision publishers should think about AI citation at scale. Being retrieved is not the same as being accurately represented. Being cited is not the same as being accurately quoted. The E.05 post in this series (H5: Verified Misguidance) finds that 30.6% of citations across commercial search-augmented systems structurally distort their sources, a related but distinct finding about a different failure mode. The finding discussed in this post (H6) adds a second lens: even at the level of specific factual claims, deep research synthesis becomes substantially less accurate at scale.

Two independent studies. Two different measurement constructs. Both point in the same direction: citation does not equal accurate representation.

What this study does not tell us

The study establishes the information overload effect but does not identify which content properties help a source maintain accuracy under high-depth synthesis. The creator-side question — what should I do about this? — is not answered by this paper. There is no tested intervention here.

Structural precision in content — clear claims, dated facts, labelled assertions, explicit source attribution — is a reasonable mitigation. A claim that is precisely stated, with a clear attribution and a specific figure, is harder to blend with other sources than a claim that is approximate, unattributed, or expressed in general terms. But this is a reasonable inference from how information overload works, not a finding the paper tests. The study does not include a structural precision condition. Do not read this post as establishing that specific content changes will solve the information overload problem.

The study also does not identify a safe search depth threshold. Accuracy degrades across the full range tested, and the data do not yield a depth below which attribution is reliable. Publishers therefore cannot rely on "shallow" research behaviour as protection: they do not know the depth at which their content will be processed in any given query.

Finally: the 14 models tested may have been updated since the study's measurement window. Preprint findings represent model behaviour at time of measurement. The directional finding — that information overload degrades attribution accuracy — is theoretically grounded enough to treat as likely durable. The specific magnitudes (42 percentage points average, 79% → 17% for GPT-5.4) may shift with model updates.

The psytable tools this finding relates to

We do not currently have a tool that directly tests deep research accuracy, and it is right to say so at the outset. What the Evidence Density Score surfaces is whether your content has the statistical and evidential properties associated with citation selection and absorption: specific figures, dated claims, attributed sources, structural clarity. These are the properties that make individual claims within your content traceable and distinguishable, the same properties that, by reasonable inference, make it harder for an information-overloaded model to blend your claims with another source's.

The connection is indirect. The information overload mechanism identified in this paper is about what happens to attribution when source volume exceeds model capacity. The Evidence Density Score addresses what your content looks like before it enters that process — whether it has the precision properties that could, in principle, provide resistance. Whether those properties produce measurable resistance under deep research conditions is a research question the field has not yet answered.

If this post generates strong Segment B engagement — from content directors at organisations publishing high-precision factual content — that signal will inform whether to develop a dedicated tool for deep research accuracy evaluation. The current post is partly a test of that appetite.

References

"Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents" (2026). Preprint. arXiv:2605.06635

"Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs" (2026). Preprint. arXiv:2605.28565

Check your content's evidential precision.

The Evidence Density Score measures whether your content has the specific, attributed, structurally clear claims that distinguish what you say from what adjacent sources say — relevant whether a model retrieves 2 sources or 150. No signup. No gates.

Try Evidence Density Score →

Related: H5 (Verified Misguidance)