Why Statistics in Your Content Are Associated With Higher AI Citation Probability

The most important number in AI content optimisation comes from a 2024 KDD paper by Aggarwal et al.: 31%, the approximate increase in AI citation probability associated with including statistics in your content. The study establishes a correlation, not a causal mechanism, but the signal is strong enough to act on. It does not report the absolute baseline citation probability, so the figure is a relative change: it indicates direction and rough magnitude, and you cannot calculate an absolute citation rate from it. This essay is a close reading of that study and of what it means in practice for the content you're writing today.

What Aggarwal et al. 2024 studied

The Aggarwal et al. 2024 paper was published at KDD — the ACM SIGKDD Conference on Knowledge Discovery and Data Mining — one of the top venues for data science and machine learning research. The study set out to move beyond intuition about what AI systems "prefer" and establish quantifiable correlations between specific content properties and citation probability.

The researchers analysed a large corpus of content to measure which content characteristics were associated with being cited by AI systems, versus content that was retrieved and absorbed but not directly cited. The study produced a set of correlation findings across multiple content dimensions.

What "citation probability" means in this context

Citation probability — as the Aggarwal study uses the term — measures the likelihood that an AI system references a piece of content as a named source in its output. This is distinct from absorption, which is when an AI retrieves and uses information from a piece of content without naming it. A citation requires the AI to surface the source explicitly.

This metric is also distinct from user engagement metrics like time on page, conversion, or click-through rate. A piece of content can be cited by an AI system without any human reader ever visiting it. A citation is a machine selection event, not a human one. The size of the association reported in the study will vary by query type, content length, and AI platform.

The source attribution compounding effect

The Aggarwal study found a second, closely related signal. Adding an explicit source reference alongside a statistic was associated with an increase in citation probability of approximately 30% in their analysis. The same framing applies here as to the headline figure: this is a correlational finding, not an established mechanism. But the practical implication is directional and specific.

An anonymous statistic ("studies show that X") carries less citation signal than an attributed one ("A 2024 study by Aggarwal et al. found that X"). The attribution does not just add credibility for the reader; it appears to provide AI systems with an additional retrieval signal. This may compound with the statistical density finding, though the studies measure each predictor independently, not their interaction. Naming the source, venue, and year alongside each statistic is the operational form of this finding.
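The distinction between anonymous and attributed statistics can be checked roughly in code. The sketch below is a heuristic of our own, not a method from the Aggarwal study: it assumes an attributed sentence contains a four-digit year plus a source cue such as "et al." or "study by". The pattern names and the function `is_attributed` are hypothetical.

```python
import re

# Assumption: a four-digit year (19xx or 20xx) stands in for "dated source".
HAS_YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

# Assumption: these cues stand in for "named source"; real text needs more.
HAS_SOURCE_CUE = re.compile(
    r"\bet al\.|\b(?:study|report|survey|paper) by\b|\baccording to\b",
    re.IGNORECASE,
)


def is_attributed(sentence: str) -> bool:
    """Return True if the sentence carries both a year and a source cue."""
    return bool(HAS_YEAR.search(sentence) and HAS_SOURCE_CUE.search(sentence))


if __name__ == "__main__":
    for s in (
        "Studies show that X.",
        "A 2024 study by Aggarwal et al. found that X.",
    ):
        print(is_attributed(s), "|", s)
```

A check like this will miss attributions phrased in other ways and will accept sentences that mention a year for unrelated reasons, so treat its output as a prompt for manual review.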

The readability finding

Another major finding from Aggarwal et al. 2024 concerns readability. Content readable at Flesch-Kincaid grade 8–10 was associated with meaningfully higher AI extraction rates than content at grade 12 and above.

This finding is counterintuitive to writers trained on academic or professional publishing conventions, where complexity is often associated with authority. For AI extraction, the inverse appears to be true: readable, accessible content is extracted more reliably than dense, complex content at equivalent information quality. The "equivalent information quality" condition is the Aggarwal study's qualifier — practitioners applying grade 8–10 targets to technical content should monitor whether readability gains come at the cost of information quality.

The Readability Analyser surfaces your current grade level and flags content that sits significantly above the grade 10 threshold. If your content is sitting at grade 13 or 14, you can target specific passages for simplification rather than rewriting the whole piece. The passive voice rate and sentence length distribution tell you where to make targeted edits.
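For readers who want to see what a grade-level check involves, here is a minimal sketch of the standard Flesch-Kincaid grade formula: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. It is not the Readability Analyser's implementation. The syllable counter is a rough vowel-group heuristic, and the function names and the default `limit` of 10 are assumptions made for illustration.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level using the standard published coefficients."""
    # Split on terminal punctuation followed by whitespace or end of text,
    # so decimals such as "3.7" do not end a sentence.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


def passages_above(paragraphs: list[str], limit: float = 10.0) -> list[tuple[int, float]]:
    """Return (index, grade) for each paragraph whose grade exceeds the limit."""
    graded = [(i, flesch_kincaid_grade(p)) for i, p in enumerate(paragraphs)]
    return [(i, round(g, 1)) for i, g in graded if g > limit]
```

Calling `passages_above` on a list of paragraphs returns only those worth simplifying first, which mirrors the targeted-edit approach described above. Because the syllable count is approximate, scores may differ slightly from those of dedicated readability tools.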

What this means for content you're writing today

The Aggarwal findings translate into three concrete priorities for any piece of content you're producing.

First: include specific statistical claims. A statistical claim is any sentence that contains a specific number tied to a measurement — a percentage, a ratio, a count, a study size, a date-stamped finding. "Studies show it helps" is not a statistical claim. "A 2024 KDD study found a 31% association between statistics and citation probability" is.

The Aggarwal findings point toward higher statistical density as a positive signal — include specific figures, sourced data points, and referenced studies where your argument supports it. The research does not specify a count threshold.
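Statistical density can be estimated with a short script. The sketch below assumes a deliberately crude definition: a sentence counts as a statistical claim if it contains at least one digit. That is our simplification, not the study's definition, and the function names are hypothetical.

```python
import re

# Assumption: any digit marks a candidate statistical claim
# (percentage, count, ratio, study size, or year).
HAS_DIGIT = re.compile(r"\d")

# Split after sentence-ending punctuation that is followed by whitespace.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter; abbreviations like 'et al.' will over-split."""
    return [s for s in SENTENCE_BREAK.split(text.strip()) if s]


def claims_per_1000_words(text: str) -> float:
    """Number of digit-bearing sentences per 1,000 whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    claims = sum(1 for s in split_sentences(text) if HAS_DIGIT.search(s))
    return claims * 1000 / len(words)


if __name__ == "__main__":
    sample = "Studies show it helps. A 2024 KDD study found a 31% association."
    print(round(claims_per_1000_words(sample), 1))
```

A digit-based rule overcounts (numbered headings, version numbers) and undercounts (figures written out as words), so the result is a rough density estimate rather than a precise count.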

We tested our own corpus of 36 published GEO (Generative Engine Optimisation) and AI-search articles by loading all pieces into a controlled NotebookLM environment and running eight standardised queries. Our working floor is three statistical claims per 1,000 words. That floor is the author's operational starting point, not a figure from the research, and we could not isolate it directly: every article in the corpus already exceeded it, most by a wide margin (median: 22.9 statistical claims per 1,000 words). What the data did show is a positive association between statistical density and citation rate within that corpus (Spearman rho = 0.47; rho is a rank-order correlation coefficient, and a value of 0.47 indicates a moderate positive association). Articles in the lowest density quartile (3.7–12.5 claims per 1,000 words) were cited in 33% of queries, compared with 89% for the highest-density quartile (42.8 or more claims per 1,000 words). After controlling for covariates, the partial correlation is 0.32: directional, but at the boundary of conventional statistical significance at this sample size. The directional signal holds, but the data neither confirm nor refute the specific floor of three per 1,000 words.
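For readers unfamiliar with Spearman's rho, the following sketch shows how a rank-order correlation is computed: rank both variables (ties share an average rank), then take the Pearson correlation of the ranks. The input values are toy numbers chosen for illustration, not our corpus data, and the function names are hypothetical.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Assign 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Pearson correlation of the ranks of x and y.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy values only: two rankings that mostly, but not fully, agree.
    print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```

Because rho depends only on rank order, it is insensitive to how skewed the density values are, which suits a corpus where a few articles have very high claim counts.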

Second: name the source of your statistical claims. Anonymous statistics have lower citation value than attributed ones, so name the study, the venue, and the year. The studies measure each predictor independently, so any compounding with statistical density is plausible but unmeasured.

Third: write at grade 8–10 readability. Run your content through the Readability Analyser after drafting. If you are writing at grade 12 or above, target the longest, most complex sentences first.

About the tools referenced in this post

The Readability Analyser and the Evidence Density Score are features of Psytable — a tool built to measure the specific content properties associated with AI citation and absorption. The Readability Analyser calculates your content's Flesch-Kincaid grade level and identifies passages above the target threshold. The Evidence Density Score measures statistics presence, source attribution, readability, and structural properties in a single 0–100 score, weighted by the signal strengths identified in the Aggarwal and related studies. References to these tools in the sections above and below refer to these specific Psytable outputs.

One number to remember

A peer-reviewed study published at KDD 2024 found that including statistics in content was associated with approximately 31% higher AI citation probability. It is the strongest single signal in the Evidence Density Score and the most robust finding available in the GEO research landscape, because it is the only citation signal corroborated by multiple independent studies.

The limits of this research

The study measures correlation, not causation. The 31% figure is the association Aggarwal et al. found between statistics presence and citation probability in their dataset — it does not establish that adding statistics to your content will mechanically produce a 31% citation lift. The mechanism is not definitively established.

The 31% figure should be treated as a directional anchor — a reliable signal, not a fixed conversion rate. It tells you that statistics-dense, source-attributed, readable content performs better on this measure than content without those properties. It does not tell you by how much your specific content will improve, on which platforms, or for which query types.
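A small arithmetic sketch shows why a relative figure without a baseline cannot be converted into an expected citation rate. The baselines below are hypothetical, since the study does not report one, and the calculation treats the 31% as a relative increase purely for illustration; it does not imply that adding statistics causes such a change.

```python
RELATIVE_INCREASE = 0.31  # the association quoted in this article


def with_relative_increase(baseline: float, increase: float = RELATIVE_INCREASE) -> float:
    """Apply a relative increase to a baseline probability."""
    return baseline * (1 + increase)


if __name__ == "__main__":
    # Hypothetical baselines: the study does not report the real one.
    for baseline in (0.02, 0.10, 0.40):
        lifted = with_relative_increase(baseline)
        gain = lifted - baseline
        print(f"baseline {baseline:.2f} -> {lifted:.3f} (absolute gain {gain:.3f})")
```

The same relative figure corresponds to very different absolute gains depending on the starting point, which is why the 31% works as a directional anchor and not as a forecast.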

Citation probability also captures one specific stage of the AI content funnel: citation selection, not absorption, user engagement, or conversion. A piece of content optimised for AI citation may perform differently on traditional search traffic metrics.

AI citation behaviour evolves as models are updated. The Aggarwal findings reflect model behaviour as measured at the time of the study. Treat these signals as current best evidence, not permanent algorithm rules.

Measure your evidence density

The Evidence Density Score applies the Aggarwal findings directly — measuring statistics, quotations, readability, and structure in a single 0–100 score. The "Where to focus first" panel shows the highest-leverage changes for your specific piece of content.
