How We Built the Tools — The Research Behind Psytable's GEO Diagnostics

The tools came first. We built the Absorption Analyser, Evidence Density Score, and Platform Variance on the best evidence available at the time — the Aggarwal et al. 2024 peer-reviewed study, early platform observation data, and the selection/retrieval framing that dominated early GEO thinking. Then we ran the research series.

Six papers later, the picture is more complicated. And more honest.

This post is a direct account of what that research confirmed, what it forced us to change, and what it revealed we're not yet measuring. The tools are calibrated against the best available evidence. That's not the same as calibrated against settled science.

What the research confirmed

The Absorption Analyser's structural tiers — validated by H1

The Absorption Analyser scores six content properties: word count, heading count, paragraph count, definitional language, comparative language, and statistics presence. When we designed those tiers, the primary source was the Aggarwal et al. 2024 peer-reviewed study — which found that statistics presence, in particular, was among the strongest citation predictors in a controlled experiment across AI systems.
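To make those six properties concrete, here is a minimal Python sketch of what extracting them could look like. The function name, the cue patterns, and the assumption that headings are Markdown-style lines starting with # are all ours for illustration. They are not the detection rules the Absorption Analyser ships with.

```python
import re

# Hypothetical cue patterns, chosen only for illustration.
DEFINITIONAL = re.compile(r"\b(?:is defined as|refers to|means)\b", re.IGNORECASE)
COMPARATIVE = re.compile(r"\b(?:versus|compared (?:to|with)|more than|less than|rather than)\b", re.IGNORECASE)
STATISTIC = re.compile(r"\d+(?:\.\d+)?\s?%|\b\d{1,3}(?:,\d{3})+\b")


def extract_signals(text: str) -> dict:
    """Return the six content properties for a Markdown-style document."""
    lines = text.splitlines()
    # Assumption: a heading is any line whose first non-space character is '#'.
    headings = [ln for ln in lines if ln.lstrip().startswith("#")]
    body = "\n".join(ln for ln in lines if not ln.lstrip().startswith("#"))
    # Paragraphs are blocks of body text separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", body) if p.strip()]
    return {
        "word_count": len(body.split()),
        "heading_count": len(headings),
        "paragraph_count": len(paragraphs),
        "definitional_hits": len(DEFINITIONAL.findall(body)),
        "comparative_hits": len(COMPARATIVE.findall(body)),
        "has_statistics": bool(STATISTIC.search(body)),
    }
```

Calling extract_signals on a page's text returns one value per property, which a scorer could then map onto tiers. Keyword matching like this is crude. It is here to show the shape of the measurement, not its quality.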

The H1 paper (Yu et al. 2026, arXiv: 2603.29979) arrived and confirmed the structural tiers from a different angle. Yu et al. ran a controlled experiment holding semantic content constant and varying only document structure — macro-architecture, information chunking, and visual emphasis. The result: a mean 17.3% improvement in citation rates across six generative engines from structural changes alone.

The tiers we score are the tiers Yu et al. found to matter. That's directional validation. Yu et al. is a preprint — not yet peer-reviewed — so we're treating it as confirmation that we're measuring in the right direction, not proof that the weights are exactly right.

The selection/absorption split — validated by H2

The H2 paper (Zhang et al. 2026, arXiv: 2604.25707) introduced the distinction between citation selection and citation absorption. Selection is the platform retrieving and choosing your page. Absorption is your page's language and evidence actually appearing in the answer.

A page can be selected without being absorbed. That's an important distinction — and it's the one the Absorption Analyser v2.0 now surfaces explicitly.

The v2.0 update, shipped as part of WS2b, replaced the single detection panel with a two-panel layout. Selection Signals (word count, heading count, paragraph count) are grouped separately from Absorption Signals (definitional content, comparative content, statistics presence). Each panel carries its own source label and evidence tier. The selection panel draws primarily from Aggarwal et al. 2024 — peer-reviewed, measures citation probability. The absorption panel draws primarily from Zhang et al. 2026 — preprint, measures absorption uplift. The distinction matters, and the interface now makes it visible.
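One way to picture the two-panel grouping is as a small data structure. The sketch below is an assumption about shape only: the class name, field names, and signal identifiers are hypothetical, while the groupings, sources, and tiers are the ones described above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Panel:
    name: str
    signals: tuple[str, ...]
    primary_source: str
    evidence_tier: str  # "peer-reviewed" or "preprint"
    measures: str


SELECTION = Panel(
    name="Selection Signals",
    signals=("word_count", "heading_count", "paragraph_count"),
    primary_source="Aggarwal et al. 2024",
    evidence_tier="peer-reviewed",
    measures="citation probability",
)
ABSORPTION = Panel(
    name="Absorption Signals",
    signals=("definitional_content", "comparative_content", "statistics_presence"),
    primary_source="Zhang et al. 2026",
    evidence_tier="preprint",
    measures="absorption uplift",
)


def panel_for(signal: str) -> Panel:
    """Look up which panel a signal is reported under."""
    for panel in (SELECTION, ABSORPTION):
        if signal in panel.signals:
            return panel
    raise KeyError(signal)
```

Keeping the source and tier on the panel, rather than on the overall score, means every signal carries its provenance with it.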

The scoring weights were also recalibrated to reflect Zhang's uplift hierarchy. Statistics and numeric evidence moved from 10/70 to 22/100 (roughly 14% of the old total to 22% of the new one), making it the highest-weighted dimension. That reflects the 76.88% code/numeric and 61.55% statistics uplift figures Zhang et al. reported. Definitional content moved to 18/100; comparative content to 16/100. The overall score is now out of 100.
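The arithmetic behind those weights can be sketched in a few lines of Python. Two things here are assumptions, not statements about the shipped tool. First, that the six dimensions account for the full 100 points, so the three selection signals share the 44 points left after the absorption weights. Their individual split is not given in this post, so the sketch treats them as one block. Second, that each dimension produces a sub-score between 0 and 1.

```python
# Absorption weights stated in this post (points out of 100).
ABSORPTION_WEIGHTS = {
    "statistics": 22,
    "definitional": 18,
    "comparative": 16,
}
# Assumption: the remaining points belong to the three selection signals.
# Their individual split is not stated, so they are kept as a single block.
SELECTION_BLOCK = 100 - sum(ABSORPTION_WEIGHTS.values())  # 44


def total_score(absorption: dict, selection: float) -> float:
    """Combine sub-scores in [0, 1] into a 0-100 total.

    `absorption` maps each absorption dimension to a sub-score in [0, 1];
    a missing dimension counts as 0. `selection` is one combined sub-score
    in [0, 1] for the selection signals.
    """
    for value in (*absorption.values(), selection):
        if not 0.0 <= value <= 1.0:
            raise ValueError("sub-scores must lie in [0, 1]")
    points = sum(
        weight * absorption.get(name, 0.0)
        for name, weight in ABSORPTION_WEIGHTS.items()
    )
    return points + SELECTION_BLOCK * selection
```

Under these assumptions, a page that maxes out statistics and nothing else scores 22, which is exactly the weight quoted above.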

One line appears in the results panel on every analysis run: "Scoring weights reflect this tool's interpretation of H2 feature uplift data — not a direct port of the research model." That's the honest disclosure. We translated the research. We didn't replicate it.

Evidence Density Score — validated by Aggarwal

The Evidence Density Score measures statistics, definitions, and source citations per unit of text. The Aggarwal et al. 2024 study — peer-reviewed, published at KDD — found those content properties to be among the strongest citation probability predictors in a controlled experiment. Statistics presence alone was associated with approximately +31% higher citation probability.

That's the peer-reviewed foundation the Evidence Density Score sits on. The Aggarwal finding is the strongest evidential support we have for any tool in the suite. Interventional design, not observational. That's a meaningful distinction in this field.
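For readers who want to see what "per unit of text" can mean in practice, here is a minimal Python sketch. The unit (100 words), the function name, and the three regex detectors are illustrative assumptions. The Evidence Density Score's real detectors and normalisation are not reproduced here.

```python
import re

# Hypothetical detectors; real rules would need to be far more careful.
PATTERNS = {
    "statistics": re.compile(r"\d+(?:\.\d+)?\s?%"),
    "definitions": re.compile(r"\b(?:is defined as|refers to)\b", re.IGNORECASE),
    "source_citations": re.compile(r"\bet al\.|\(\d{4}\)|https?://\S+"),
}


def evidence_density(text: str, per_words: int = 100) -> dict:
    """Count evidence markers and normalise them per `per_words` words."""
    words = len(text.split())
    if words == 0:
        # Avoid dividing by zero on empty input.
        return {name: 0.0 for name in PATTERNS}
    return {
        name: len(pattern.findall(text)) * per_words / words
        for name, pattern in PATTERNS.items()
    }
```

Normalising by length is the point of the exercise: a long page with three statistics is less dense than a short page with two, and a raw count would hide that.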

What the research complicated

H4 — the tools don't know what kind of content you wrote

The H4 paper (FeatGEO, arXiv: 2604.19113) found something we hadn't built for: token-level GEO heuristics that improve citation rates on human-written content actively degrade citation rates on LLM-generated content.

The Evidence Density Score carries a caveat added during the WS2a update: "LLM-generated content: the relationship between evidence density and citation probability may differ from the pattern identified in human-written content research. The scoring framework is calibrated against human-written content studies." That caveat is now in the RL-012 scoring note.

But the tool doesn't detect content type. It can't tell whether you're scoring a blog post you wrote or one that was generated. That's a real limitation. If you're running LLM-generated content through the Evidence Density Score, the score is based on research that may not apply to your situation.

We don't have a content-type detection layer. That's an honest gap in the current suite.

H3 — the tools score against a universal framework

The H3 paper (AgentGEO, arXiv: 2603.09296) found that generic optimisation can harm niche long-tail content. The finding is specific: some content faces challenges that broad structural optimisation cannot address, and applying to niche content the same ruleset that works on general content can actively lower citation probability.

The Absorption Analyser, Evidence Density Score, and Platform Variance all score content against the same framework regardless of your niche, audience, or query type. A niche operator producing technical content for a narrow specialist audience is scored the same way as a general-interest publisher. The tools don't adapt to context.

We know the AgentGEO finding. We're telling you about it directly. The diagnostic approach we're building toward is closer to what AgentGEO describes — fault-mode identification, not blanket optimisation — but the current tools don't yet implement it. Score with that in mind.

What the research exposed — the gaps we don't have tools for

H5 — being cited doesn't mean being cited accurately

The H5 paper (arXiv: 2605.28565 — "Verified Misguidance") measured what happens when you are cited. The finding: in 30.6% of cases, the cited page's content does not support, or actively distorts, the claim it was cited to support. The AI attributed the source. The source contradicted, misrepresented, or failed to back the claim.

The Absorption Analyser measures absorption readiness. The Evidence Density Score measures citation probability predictors. Neither tool measures what happens to your content after it's cited. We don't have a fidelity audit tool — one that checks whether a page's content actually supports the claims being attributed to it, or flags the gap between what a page says and how an AI system is likely to represent it.

That's a tool that doesn't exist anywhere in the current suite. Building it is a different problem from what we're solving today. But the H5 paper made it visible, and we're not going to pretend the gap isn't there.

H6 — deep research accuracy degrades at scale

The H6 paper (arXiv: 2605.06635 — "Cited but Not Verified") examined citation accuracy in LLM deep research agents — systems that synthesise findings across many sources simultaneously. The finding: accuracy degrades as the number of sources processed increases. More sources, more errors.

This is not a content quality problem. It's a structural problem in how deep research agents process large citation volumes. We don't have a tool that assesses how a page will perform in deep research contexts — where the agent is pulling from dozens of sources at once and the accuracy pressure is highest. That gap is structural, not something a content-side tool can straightforwardly close.

What we can do is be honest that our tools are calibrated for standard citation contexts. Deep research accuracy at scale is an open problem in the field — not one we've addressed, and not one the research has solved either.

Platform Variance — what WS2a changed

Platform Variance surfaces citation behaviour differences across ChatGPT, Google AI Overviews, and Perplexity. The original framing described this as a retrieval difference — why one platform picks up your content and another doesn't.

The Zhang et al. H2 paper gave us more precise language. The difference between platforms is not just retrieval. It's selection versus absorption — and the two phenomena behave differently across platforms. ChatGPT cites fewer pages but shows higher per-page absorption influence. Perplexity cites more broadly. Google AI Overviews sits between them.

The WS2a update added the selection/absorption mechanism explanation to the Platform Variance tool. The tool now describes cross-platform differences in both selection probability and absorption depth terms, rather than framing everything as a retrieval question.

One peer-reviewed paper. Six preprints. Our interpretation of both.

The evidence behind the tools spans one peer-reviewed paper and six preprints. Aggarwal et al. 2024 (KDD) is the peer-reviewed anchor. The H1–H6 series (Yu et al., Zhang et al., AgentGEO, FeatGEO, Verified Misguidance, Cited but Not Verified) consists entirely of preprints, not yet through independent review.

The calibrations in the Absorption Analyser, Evidence Density Score, and Platform Variance reflect our interpretation of that evidence. Not a direct port of research models. Not a claim that the weights are exactly right. The scoring rationale note in the Absorption Analyser says this explicitly because it's true of the whole suite.

The tools are as calibrated as the available evidence allows. That's a meaningful standard. It's also not the same as settled.

What we're building toward

Three gaps are visible from the research. We know what we don't have.

No fidelity audit tool. H5 showed that citation and accurate citation are different outcomes. A tool that surfaces the gap between what a page says and how AI systems represent it would address a problem that, as far as we know, no current GEO tool on the market touches. We don't have it. It's on the roadmap.

No content-type detection. H4 showed that human-written and LLM-generated content behave differently under the same optimisation rules. A detection layer that routes each content type to the appropriate scoring framework would close that gap. We don't have it yet.

No deep research accuracy assessment. H6 describes a structural problem in how deep research agents handle high source volumes. The tools we have are calibrated for standard citation contexts. Deep research accuracy at scale requires a different kind of measurement, one the field is still working out.

We're saying this because the research says it. The tools exist to help you improve citation probability and absorption readiness within what the current evidence supports. They don't solve problems the research hasn't solved. They don't claim to.

References

Aggarwal, Pranjal et al. (2024), "GEO: Generative Engine Optimization," ACM SIGKDD 2024. arXiv: 2311.09735

Yu, Junwei, Yang MuFeng, Yepeng Ding, and Hiroyuki Sato (2026), "Structural Feature Engineering for Generative Engine Optimization: How Content Structure Shapes Citation Behavior," preprint. arXiv: 2603.29979

Zhang, Kai, Xian He, and Jiaxin Yao (2026), "From Citation Selection to Citation Absorption: A Measurement Framework for Generative Engine Optimization Across AI Search Platforms," preprint. arXiv: 2604.25707

Tian, Zhihua et al. (2026), "Diagnosing and Repairing Citation Failures in Generative Engine Optimization," preprint. arXiv: 2603.09296

Liu, Zikang and Peilan Xu (2026), "Think Before Writing: Feature-Level Multi-Objective Optimization for Generative Citation Visibility," preprint. arXiv: 2604.19113

"Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs," preprint. arXiv: 2605.28565

"Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents," preprint. arXiv: 2605.06635

Run the tools against your content.

The Absorption Analyser v2.0 now surfaces selection and absorption readiness separately — with evidence tiers labelled for each scoring dimension. No signup. No gates.

Try Absorption Analyser →

Try Evidence Density Score →