A peer-reviewed study published at ACM SIGIR-AP 2025 found that prepending a more recent publication date to a passage, without changing a single word of its content, was enough to reverse a model's preference between two equally relevant passages in up to 25% of cases. The effect appeared across all seven models tested, in a controlled experiment where relevance was held constant and only the date was varied.
If you have been optimising your content for AI citation — working on evidence density, structural richness, source attribution, readability — this finding introduces a variable those properties do not address. A piece with a lower structural score and a newer date can beat a piece with a higher structural score and an older one. The date is a metadata signal, and in this study it overrides content quality under controlled conditions.
This post is a close reading of Fang et al. 2025 [1]. It examines what the study tested, what the findings mean quantitatively, and what they mean for the practical decision most content practitioners are already weighing: update an existing post, or create something new.
What the Aggarwal and Zhang studies left open
The first two posts in this series translate findings from Aggarwal et al. 2024 and Zhang et al. 2026. Both studies focus on content properties — the measurable characteristics of a document's text and structure. Statistics presence, source attribution, heading density, readability, definitional language, paragraph density. These are things you put inside the document.
The Fang et al. study asks a different question: does a property outside the document — specifically, when the document appears to have been published — affect whether AI systems prefer it?
The answer is yes. And the effect is large enough to matter strategically.
What Fang et al. tested
The study was designed to isolate the date signal as a single variable. The researchers took passages from two standard retrieval benchmark collections — the TREC Deep Learning 2021 and 2022 test sets — and prepended artificial publication dates to those passages. The passages themselves were unchanged. The relevance scores assigned by human assessors for those passages were unchanged. The only difference between conditions was the date label.
They then ran two types of reranking experiments, listwise (ranking a list of passages in order of relevance) and pairwise (choosing the preferred passage between two), on seven models from three model families (GPT, LLaMA, and Qwen), with the open-weight models ranging from 7B to 72B parameters [2].
This design is worth understanding because it is what gives the findings their evidential weight. The study is not measuring correlation between content age and AI performance in production search. It is measuring the effect of injecting a date label into an otherwise unchanged passage, under controlled conditions where everything else is held constant. The causal question is narrowed: does the date label itself change ranking behaviour? The study's design allows a cleaner answer to that question than a naturalistic corpus study would.
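The date-injection design described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the `Published:` label format, the prompt wording, and the function names `inject_date` and `pairwise_prompt` are all assumptions made for the example.

```python
from datetime import date


def inject_date(passage: str, published: date) -> str:
    """Prepend a publication-date label; the passage text itself is untouched."""
    # The label format is an assumption, not the paper's exact wording.
    return f"Published: {published.isoformat()}\n{passage}"


def pairwise_prompt(query: str, passage_a: str, passage_b: str) -> str:
    """Build a hypothetical pairwise reranking prompt for an LLM."""
    return (
        f"Query: {query}\n\n"
        f"Passage A:\n{passage_a}\n\n"
        f"Passage B:\n{passage_b}\n\n"
        "Which passage is more relevant to the query? Answer A or B."
    )


# Same passage, two different date labels: the only variable is the date.
passage = "Solar panels convert sunlight into electricity using photovoltaic cells."
older = inject_date(passage, date(2015, 3, 1))
newer = inject_date(passage, date(2024, 3, 1))
prompt = pairwise_prompt("how do solar panels work", older, newer)
```

Because both candidates carry identical text, any systematic preference for passage B in a setup like this can only come from the date label.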
The findings
Preference reversal in pairwise experiments. In pairwise comparisons, where the model was asked to choose the more relevant of two passages, the study found that preference between two passages of identical relevance could be reversed in up to 25% of cases simply by injecting a more recent date into one of them. The content was the same. The assessed relevance was the same. Only the date label changed. The 25% figure, like the 95-position figure for listwise experiments, is a ceiling value: the study reports these as maximum effects, and the distribution of effects across experimental trials is not separately reported in the available findings.
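To make "reversed in a quarter of cases" concrete, here is one plausible way to compute a reversal rate from two runs over the same pairs. The function name and the toy choices below are illustrative assumptions, and the paper's exact metric may be defined differently.

```python
def reversal_rate(baseline: list[str], injected: list[str]) -> float:
    """Fraction of pairs whose preferred passage changed after date injection."""
    if len(baseline) != len(injected):
        raise ValueError("both runs must cover the same pairs")
    if not baseline:
        return 0.0
    flipped = sum(1 for before, after in zip(baseline, injected) if before != after)
    return flipped / len(baseline)


# Toy data (not from the study): the model's choice for eight pairs,
# first without date labels, then with a newer date injected into passage B.
baseline_choices = ["A", "A", "B", "A", "B", "A", "B", "B"]
injected_choices = ["A", "B", "B", "A", "B", "B", "B", "B"]

rate = reversal_rate(baseline_choices, injected_choices)
print(f"{rate:.0%} of preferences reversed")  # prints "25% of preferences reversed"
```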
Individual item displacement. In listwise experiments, individual passages moved by as many as 95 rank positions under date injection. A shift of that size is only possible in a candidate list of at least 96 passages: a passage that would have ranked near the bottom, given a recent date, could be promoted to near the top, and vice versa.
Mean year shift. The mean publication year of the top-10 results shifted forward by up to 4.78 years when date labels were injected. That is the average effect across the top-10 — not the movement of a single outlier.
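The two listwise measures, rank displacement and mean year shift, are simple arithmetic on a ranking before and after date injection. The sketch below uses made-up passage IDs and years, and the function names are hypothetical; it shows the kind of calculation involved, not the study's implementation.

```python
from statistics import mean


def max_displacement(before: list[str], after: list[str]) -> int:
    """Largest absolute change in rank position for any single passage."""
    position_after = {pid: rank for rank, pid in enumerate(after)}
    return max(abs(rank - position_after[pid]) for rank, pid in enumerate(before))


def top_k_year_shift(before: list[str], after: list[str],
                     years: dict[str, int], k: int = 10) -> float:
    """Change in the mean publication year of the top-k results."""
    return mean(years[p] for p in after[:k]) - mean(years[p] for p in before[:k])


# Toy data (not from the study): four passages with injected years.
years = {"p1": 2012, "p2": 2023, "p3": 2016, "p4": 2021}
before = ["p1", "p3", "p2", "p4"]  # ranking without date labels
after = ["p2", "p4", "p1", "p3"]   # ranking with date labels injected

print(max_displacement(before, after))            # 2
print(top_k_year_shift(before, after, years, 2))  # 8
```

In this toy case the top-2 mean year moves from 2014 to 2022. The study's 4.78-year figure is the same kind of quantity computed over the top 10.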
Consistent direction across all seven models. Fresh passages — those with more recent date labels — were consistently promoted across all seven models in the study. None of the seven models tested was immune to the bias.
Model size attenuates the effect but does not eliminate it. Larger models — LLaMA-3-70B and GPT-4o versus their smaller counterparts — showed a smaller recency bias. The effect is not uniform across the model landscape. But even the largest models in the study retained the bias. Attenuation, not elimination.
What these numbers mean for how you use them
These figures come from a controlled benchmark experiment, not from live AI search in production. The passages are from TREC retrieval collections; the dates are artificial labels prepended in the experiment; the queries are benchmark queries. Real-world content operates in a more complex environment — with mixed date signals, platform-specific retrieval logic, and variation in how models weight date information depending on query type.
The 25% preference reversal and the 95-rank shift are experimental effect sizes. They tell you the bias exists, that it is large enough to be consequential, and that it operates consistently across a diverse set of models. They do not tell you that your specific post will move by 95 positions if you update its date, or that a competitor with a newer timestamp will beat you 25% of the time. The mechanism the study describes — that LLMs have absorbed recency preferences from training data and apply them implicitly in retrieval contexts — is documented but not fully resolved. The study identifies and quantifies the bias; it does not establish the full causal chain at the model-architecture level.
Use the findings as directional signals: there is peer-reviewed evidence that LLM rerankers treat date as a positive quality signal, that this effect is consistent across major model families, and that it is large enough to override content-quality differences in controlled conditions. Those are the claims the evidence supports.
The temporal layer and the structural layer
The Aggarwal and Zhang research establishes that certain structural and linguistic content properties — statistics, source attribution, heading density, readability, definitional clarity — are associated with higher AI citation and absorption probability. That evidence is real and actionable.
Fang et al. adds a finding that sits in a different category. Publication date is not a content property. You do not write it into the document. It is a metadata signal — and the study finds it can override content-property advantages under controlled conditions.
This means the two layers are complementary, not competing. Structural optimisation remains relevant: a piece with high evidence density and a current date is better positioned than one with low evidence density and a current date. But a piece with strong structural properties and an outdated date may lose to a weaker piece with a newer one. Structural quality is still necessary, but on its own it is no longer the complete optimisation picture.
The practical implication is specific: there is now peer-reviewed support for treating recency as a factor in the update-versus-create decision.
The update-versus-create decision
Content practitioners making editorial calendar decisions have always had to weigh whether to refresh existing content or produce new pieces. The traditional calculus has involved factors like existing rankings, link equity, URL authority, content quality gaps, and keyword targeting. The AI citation dimension adds a variable most existing frameworks do not account for.
Fang et al.'s findings suggest that when existing content has strong structural properties — evidence density, source attribution, heading structure, readability — keeping its date signal current may preserve or extend its AI citation probability, even without substantive content changes. The date label alone was sufficient to shift preference in the controlled experiment; updating a post's publication date (with or without content changes) is a mechanism by which this signal could be refreshed in practice.
The study's evidence does not extend to saying that a date-only update with no content improvement is the right operational move — that is beyond what the controlled experiment can tell us. What it does support: recency is a real factor in AI reranking, the effect size is large enough to be worth treating seriously, and existing high-quality content may have more value than its apparent age suggests if its date signal is refreshed alongside the content.
For content that is already structurally well-optimised, the update calculus now includes a recency dimension. For content that is not structurally well-optimised, updating the date without improving the content addresses the temporal layer but leaves the structural layer unaddressed — and the structural layer remains the foundation.
About the tools referenced in this post. The Absorption Analyser and Evidence Density Score are Psytable tools that measure the structural and linguistic content properties associated with AI citation and absorption behaviour — statistics presence, source attribution, heading density, readability, paragraph density, definitional and comparative language. These tools measure what this post calls the structural layer: the content properties identified by the Aggarwal and Zhang research. They do not currently surface a recency signal or score publication date as a dimension. References to these tools in the section below refer to these specific Psytable outputs.
What to do differently
The Fang et al. findings support three concrete adjustments to how you treat existing content in the context of AI citation optimisation.
First: audit your high-quality existing content for date currency. Use the Absorption Analyser and Evidence Density Score to identify posts with strong structural scores — high evidence density, solid source attribution, good heading and paragraph structure. These are the posts where the Fang et al. finding is most directly relevant. A post that is performing well on structural grounds and has an outdated date stamp is the clearest candidate for a refresh that addresses the temporal layer. Note: this extrapolates from a controlled benchmark experiment — no live production test has been published confirming that a CMS date update produces the same directional effect as artificially prepended date strings.
Second: when you update, update substantively. The study's evidence is that the date signal matters; it does not tell you that a cosmetic date change without content improvement is sufficient or strategically appropriate. A date refresh on content that was already strong is a different move than a date refresh on content with unresolved structural deficiencies. The structural layer is still the foundation. Update the date as part of a genuine refresh — add new evidence, update figures, add source attribution where it is missing — not as a standalone tactic. This is also consistent with major search engine guidelines — cosmetic date refreshes on unchanged content may be treated as deceptive by Google and Bing, and the study itself does not support that approach.
Third: factor recency into the update-versus-create calculus explicitly. If an existing post scores well on structural dimensions but was published more than a year or two ago, the recency signal may be working against it in AI reranking contexts. That is now a documented, peer-reviewed consideration. It belongs in the editorial decision, alongside the traditional factors.
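The audit described in the three steps above amounts to a filter over your content inventory: keep posts that are structurally strong but carry an old date. The sketch below assumes a hypothetical `Post` record with a 0 to 100 structural score and a last-updated date. The thresholds (`min_score=70`, `max_age_days=540`) are placeholders chosen for illustration, not values from the study or from any Psytable tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Post:
    url: str
    structural_score: int  # hypothetical 0-100 structural score
    last_updated: date


def refresh_candidates(posts: list[Post], today: date,
                       min_score: int = 70, max_age_days: int = 540) -> list[Post]:
    """Structurally strong posts whose date signal is stale, oldest first."""
    stale = [
        p for p in posts
        if p.structural_score >= min_score
        and (today - p.last_updated).days > max_age_days
    ]
    return sorted(stale, key=lambda p: p.last_updated)


# Toy inventory (illustrative values only).
posts = [
    Post("/a", 85, date(2022, 5, 1)),
    Post("/b", 40, date(2021, 1, 1)),   # old but structurally weak: fix content first
    Post("/c", 90, date(2025, 1, 15)),  # strong and already current
    Post("/d", 75, date(2023, 2, 10)),
]
candidates = refresh_candidates(posts, today=date(2025, 6, 1))
print([p.url for p in candidates])  # ['/a', '/d']
```

Posts like `/b`, which fail the structural threshold, are left out on purpose: for them the structural layer needs work before the date signal is worth addressing.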
The limits of this research
The study uses artificial date injection on controlled benchmark collections — not live AI search queries against real-world content at scale. The passages are from TREC test sets, not from production web content in the domains you are writing in. The dates are labels prepended to text; they may be processed differently from publication date metadata embedded in real web content at the platform or retrieval level.
The mechanism is identified but not fully resolved. The study's hypothesis — that LLMs absorb recency bias from training data that skews toward recent, frequently updated information — is documented and supported by the experimental findings. But the study acknowledges the mechanism is not comprehensively characterised. Why the bias exists in the training process, and whether it can be fully eliminated through instruction tuning or prompting, are open questions.
The seven models tested are a meaningful sample of the current model landscape, covering OpenAI, Meta, and Alibaba model families, each at more than one scale. They are not an exhaustive census. Models not tested may behave differently. And model behaviour evolves: findings from a study published in late 2025 reflect model behaviour as it was measurable at that time.
Apply the same methodological discipline here that the Aggarwal and Zhang posts apply: these are peer-reviewed, controlled experimental findings from a named ACM venue. The directional claims — that LLM rerankers exhibit consistent recency bias, that the effect is large enough to reverse preference, that larger models attenuate but do not eliminate it — are well-supported. The production generalisation requires the same care you would apply to any controlled finding.
Measure the layer this research is about
The structural layer (evidence density, source attribution, heading structure, readability) is what the Psytable tools measure, and it remains the foundation. Fang et al. adds the temporal layer on top. The more complete picture of AI citation optimisation that the research now supports is a post that scores well on structural dimensions and also maintains a current date signal.
Run your existing content through the Evidence Density Score. Identify the posts that are already performing well structurally. Those are the candidates where the Fang et al. finding is most directly actionable — where refreshing the date, alongside any available content improvements, addresses both layers at once.
Measure the structural layer first.
The Evidence Density Score applies the Aggarwal et al. findings — measuring statistics, source attribution, readability, and structure in a single 0–100 score. Identify which posts are already structurally strong before acting on the recency signal.
References
1. Fang, Tao, Chen, Chang, and Sakai (2025), "Do Large Language Models Favor Recent Content? A Study on Recency Bias in LLM-Based Reranking," published in the Proceedings of the 2025 Annual International ACM SIGIR-AP Conference on Research and Development in Information Retrieval. arXiv: 2509.11353
2. Full model list: GPT-3.5-turbo, GPT-4o, GPT-4, LLaMA-3-8B, LLaMA-3-70B, Qwen-2.5-7B, and Qwen-2.5-72B.