Zhang et al. 2026 Explained

A research team published a preprint in 2026 analyzing 21,143 citations across three major AI platforms: ChatGPT, Google AI Overviews, and Perplexity. They identified six measurable structural and linguistic dimensions that separated high-influence pages (pages that were visibly absorbed into AI-generated answers) from low-influence ones. The differences were large. In some cases, enormous.

This post explains what they found: what the six properties are, what the ratios mean, and how the preprint status should affect how confidently you act on them.

The study at a glance

Zhang et al. 2026 is a preprint: it has not yet completed peer review. The arXiv identifier was not confirmed in the materials available at the time of writing, so readers should search arXiv.org for the full citation and verify its current status. The dataset is large (21,143 citations across three platforms), but the methodology has not been independently validated.

This matters, and we'll address it directly. But the study is large enough, and the findings consistent enough with prior work in this area, that the directional signals are worth understanding.

We plan to check the preprint's peer-review status in August 2026.

The six properties — and the ratios

The Zhang et al. analysis compared high-influence pages to low-influence pages across six measurable dimensions. Here is what they found:

Word count. High-influence pages were on average 11.44 times longer than low-influence pages. One interpretation the study does not rule out is that longer pages simply contain more text and so are statistically more likely to match any given query. Taken at directional face value, length is associated with higher AI absorption probability in this dataset.

Heading density. High-influence pages had 12.50 times more headings than low-influence pages. Heading density was among the properties most strongly associated with high-influence pages.

Paragraph density. High-influence pages had 5.69 times more paragraphs. Content divided into bounded paragraphs correlated with higher influence in Zhang's corpus.

Definitional language. Pages containing explicit definitional sentences — sentences that state what something is — were associated with higher absorption probability.

Comparative language. Pages using comparative constructions showed approximately 55% higher absorption probability. The study does not report the absolute baseline absorption rate, so this figure is a relative increase: it shows direction and magnitude, but it cannot be converted into an absolute probability.

Statistics presence. Pages with statistics showed approximately 61% higher absorption probability. The same baseline caveat applies: the figure is a relative increase with no stated starting point.
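
The three structural counts (words, headings, paragraphs) can be approximated for your own pages. What follows is a minimal sketch, assuming the page is available as Markdown with a blank line between blocks. The function names and the heading rule are illustrative assumptions, not the measurement method used in the preprint.

```python
import re

# Assumption: a heading is a block that starts with 1-6 '#' characters and a space.
HEADING = re.compile(r"#{1,6}\s")

def structural_counts(markdown_text):
    """Count body words, headings, and paragraphs in a Markdown page."""
    # Assumption: blocks are separated by one or more blank lines.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", markdown_text) if b.strip()]
    headings = [b for b in blocks if HEADING.match(b)]
    paragraphs = [b for b in blocks if not HEADING.match(b)]
    words = sum(len(p.split()) for p in paragraphs)
    return {"words": words, "headings": len(headings), "paragraphs": len(paragraphs)}

def count_ratios(high, low):
    """Ratio high/low for each count; None where the low count is zero."""
    return {key: (high[key] / low[key] if low[key] else None) for key in high}

# Two tiny made-up pages, for illustration only.
long_page = "# Title\n\nFirst paragraph here.\n\n## Section\n\nSecond paragraph with more words.\n"
short_page = "Just one short paragraph."

print(structural_counts(long_page))
print(count_ratios(structural_counts(long_page), structural_counts(short_page)))
```

A ratio is undefined when the comparison page has no headings at all, which is why the sketch returns None rather than dividing by zero.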

Definitional and comparative language

Beyond structural properties, Zhang et al. found that two language patterns were associated with higher absorption probability.

A definitional sentence explicitly states what something is. "Evidence density is the concentration of verifiable, attributed claims per unit of text" is a definitional sentence. "This approach improves your content" is not.

A comparative sentence positions a concept relative to another concept. Constructions using "compared to," "unlike," "whereas," and "in contrast to" signal comparative structure. These are not rhetorical flourishes — they are the kind of sentences that give a clear positional claim the reader (and an AI system extracting from the text) can evaluate discretely.
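
The two sentence types can be flagged with simple pattern matching. This is a rough sketch: the comparative markers are the four constructions listed above, while the definitional pattern ("X is a/an/the Y", "X refers to Y") is our own simplifying assumption and will miss many real definitions. It is not the classifier used in the preprint.

```python
import re

# The four comparative constructions named in the text above.
COMPARATIVE = re.compile(r"\b(?:compared to|unlike|whereas|in contrast to)\b", re.IGNORECASE)

# Assumption: a crude stand-in for "states what something is".
DEFINITIONAL = re.compile(r"\b(?:is|are)\s+(?:a|an|the)\b|\brefers to\b", re.IGNORECASE)

def is_comparative(sentence):
    """True if the sentence contains one of the comparative markers."""
    return COMPARATIVE.search(sentence) is not None

def is_definitional(sentence):
    """True if the sentence matches the assumed definitional pattern."""
    return DEFINITIONAL.search(sentence) is not None

examples = [
    "Evidence density is the concentration of verifiable, attributed claims per unit of text",
    "This approach improves your content",
    "Unlike keyword density, evidence density counts attributed claims.",
]
for sentence in examples:
    print(is_definitional(sentence), is_comparative(sentence), sentence)
```

The word boundaries matter: without them, "unlikely" would be counted as a comparative marker.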

The approximately 55% higher absorption probability figure for comparative sentences is directional. It is drawn from the Zhang et al. analysis across the full 21,143-citation dataset.
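
To see why a relative figure without a baseline cannot be turned into an absolute probability, apply the same 55% uplift to a few different starting points. The baselines below are hypothetical values chosen for illustration; none of them comes from the study.

```python
def apply_relative_uplift(baseline, uplift):
    """Probability after a relative uplift, e.g. uplift=0.55 for '55% higher'."""
    return baseline * (1.0 + uplift)

# Hypothetical baseline absorption rates; the preprint does not report one.
for baseline in (0.02, 0.10, 0.30):
    lifted = apply_relative_uplift(baseline, 0.55)
    gain = lifted - baseline
    print(f"baseline {baseline:.2f} -> {lifted:.3f} (absolute gain {gain:.3f})")
```

The same "55% higher" would mean a gain of about one percentage point from a 2% baseline and more than sixteen points from a 30% baseline, which is why the figure is best read as a direction rather than a forecast.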

Statistics presence — the cross-study corroboration

Statistics presence showed approximately 61% higher absorption probability in the Zhang analysis. This is a directional signal from a preprint.

What makes this finding worth separate attention is the corroboration. The Aggarwal et al. 2024 study (peer-reviewed, published at KDD) found that statistics presence was associated with approximately 31% higher citation probability in AI systems. That study measured citation, not absorption; the methodologies differ. But both studies, independently, flag the same signal.

Statistics presence is the only absorption signal in the Zhang data that is corroborated by peer-reviewed research. Every other property in this post is sourced from Zhang's preprint alone. This one has external confirmation.
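
Neither study's definition of a "statistic" is given here, so any automated check has to pick its own. The sketch below assumes that a statistic is a number followed by a percent sign, "percent", "x", or "times". That rule is our assumption, and it deliberately ignores bare numbers such as years.

```python
import re

# Assumption: a number followed by %, "percent", "x", or "times" counts as a statistic.
STATISTIC = re.compile(r"\d[\d,]*(?:\.\d+)?\s?(?:%|percent\b|x\b|times\b)", re.IGNORECASE)

def find_statistics(text):
    """Return every substring that matches the assumed statistic pattern."""
    return STATISTIC.findall(text)

def has_statistics(text):
    """True if the text contains at least one assumed statistic."""
    return STATISTIC.search(text) is not None

print(find_statistics("Pages were 11.44 times longer and 61% more likely to be absorbed."))
print(has_statistics("This approach improves your content"))
```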

What "preprint" actually means for how you use this

A preprint is a research paper that has been shared publicly before completing peer review. Peer review is a formal process in which independent experts in the field evaluate the methodology, analysis, and conclusions. A preprint has not cleared that process.

This is normal in fast-moving research fields. Preprints are published because the findings are useful to practitioners before the review cycle completes. But it means the methodology has not been independently validated. Findings may change, be qualified, or in rare cases be retracted once peer review is complete.

For the Zhang et al. findings, this means: treat the ratios as directional signals, not established facts. The 11.44x length finding and the 12.50x heading finding are large enough to be worth acting on directionally. They are not settled algorithmic rules.

The one exception to that caution is statistics presence — which, as noted above, is corroborated by peer-reviewed evidence from Aggarwal et al. 2024. That signal you can hold with more confidence.

About the tools referenced in this post

The Absorption Analyser is a Psytable tool that scores a piece of content against the six structural and linguistic dimensions identified in the Zhang et al. analysis. It produces a single absorption probability signal per piece of content. The tool's focus panel puts statistics presence — corroborated by Aggarwal et al. 2024 — first, above the purely Zhang-sourced signals, reflecting the difference in evidential weight between a peer-reviewed finding and a preprint finding. References to the Absorption Analyser in the section below refer to this output.

What the findings mean practically

If you take the Zhang et al. findings at directional face value, the practical translation is straightforward. These directives are sourced from Zhang's preprint only, except where noted.

Write longer. Not padded — substantive. High-influence pages in Zhang's corpus were not long because they repeated themselves. The length finding is consistent with the argument that depth of coverage, not volume of words, is the underlying variable. Word count is the proxy Zhang measured.

Use more headings. Not decorative headings — structural ones. The Zhang study measured heading count, not HTML tag hierarchy. Apply heading structure wherever it creates a logical break in the content: when a new concept is introduced, when the argument pivots, when a new property is being described. More headings, meaningfully placed, is the research-grounded directive. The specific HTML hierarchy you use to implement that is an operational decision the research does not prescribe.

We ran our own corpus test: 36 published GEO (Generative Engine Optimisation) and AI-search articles, loaded into a controlled NotebookLM environment, queried with eight standardised templates. The headline result: heading density per 1,000 words — the most direct interpretation of a "one H2 per major idea" rule — showed essentially no correlation with citation rate in the corpus (Spearman rho = 0.08, a rank-order correlation coefficient). Raw heading count, not density, was the relevant variable.
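
For readers unfamiliar with Spearman's rho: it is the ordinary Pearson correlation computed on ranks rather than raw values, so it captures whether two measures rise and fall together regardless of scale. A minimal standard-library sketch follows. The data in it is made up for illustration and is not our corpus.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up numbers for illustration only.
headings_per_1000_words = [4.0, 6.5, 3.2, 8.1, 5.0, 7.3]
citations_per_query = [0.10, 0.05, 0.20, 0.12, 0.00, 0.15]
print(round(spearman_rho(headings_per_1000_words, citations_per_query), 2))
```

A value near 0, like the 0.08 reported above, means the rank order of one measure tells you almost nothing about the rank order of the other.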

Articles above the median H3 count (more than 15 H3 sub-headings) were cited in 83% of queries at an average rate of 0.20 citations per query.
Articles with 15 or fewer H3s were cited in 50% of queries at 0.08 citations per query.

That is a 141% relative difference, and it held after controlling for article length (partial correlation = 0.32). At this sample size (n=36), a partial correlation of 0.32 is directional but falls just below the conventional significance threshold, so treat the H3 finding as orientation, not a validated rule.

The practical implication: the rule is not that each idea gets one heading. It is that more granular sub-structure (a higher absolute number of H3 sections throughout the piece) is associated with more frequent citation. The H2/H3 hierarchy direction (sub-points within sections) is supported by a modest independent correlation (H3/H2 ratio vs citation rate: Spearman rho = 0.31). Both the Zhang findings and our corpus test point the same way: sub-structure at scale, not evenly spaced section breaks, is what appears to matter.
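
The significance caveat can be checked by hand. The sketch below assumes the usual t approximation for a partial correlation with one control variable, which leaves n - 3 degrees of freedom; the critical value is hardcoded from a standard t-table rather than computed. It also recomputes the relative differences from the rounded rates quoted above. Those give 150% for citations per query and 66% for share of queries cited, so the 141% figure presumably reflects unrounded rates (an assumption, since the unrounded values are not shown here).

```python
import math

def relative_difference(high, low):
    """Relative difference of high over low; 0.5 means 50% higher."""
    return (high - low) / low

def partial_correlation_t(r, n, controls=1):
    """t statistic and degrees of freedom for a partial correlation r
    from n observations with `controls` variables partialled out."""
    df = n - 2 - controls
    t = r * math.sqrt(df / (1.0 - r * r))
    return t, df

# Two-tailed 5% critical value of Student's t for 33 degrees of freedom,
# taken from a standard t-table (hardcoded to avoid extra dependencies).
T_CRIT_DF33 = 2.0345

t_stat, df = partial_correlation_t(r=0.32, n=36, controls=1)
print(f"t = {t_stat:.2f} on {df} df; critical value {T_CRIT_DF33}")
print("significant at the 5% level:", abs(t_stat) > T_CRIT_DF33)

print(f"citations per query: {relative_difference(0.20, 0.08):.0%} higher")
print(f"share of queries cited: {relative_difference(0.83, 0.50):.0%} higher")
```

The t statistic comes out at roughly 1.94 against a threshold of about 2.03, which is what "just below" means in practice.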

Define your terms. Write sentences that explicitly state what something is. This is not just an SEO question; it is an absorption question. Definitional sentences may help because they offer AI systems a discrete fact to extract, though the study measures correlation, not the mechanism behind it.

Make comparisons explicit. Use "compared to," "unlike," "whereas," and "in contrast to." Comparative structure is associated with higher absorption probability in Zhang's data.

Include statistics. This is the one directive here with cross-study corroboration. Statistics presence is associated with higher absorption probability in Zhang's preprint and higher citation probability in Aggarwal's peer-reviewed study. Both findings are directional, not causal. Both point in the same direction.

These are directional — sourced from a preprint only (except statistics presence). Act on them as signals, not as guaranteed levers.

Measure your absorption signals.

The Absorption Analyser scores all six Zhang et al. dimensions — with evidence tiers clearly labelled and a prioritised focus panel showing what to fix first.
