AI SEO in 2026: Structured Data as Identity Layer

In 2026, structured data stopped being a rich-result optimisation and became an identity layer. What the MLAD prompt corpus and a production SEO spec reveal about the gap between being found and being cited.

Greg Ruthenbeck

Apr 17, 2026 · 8 min

Every site ships content. Few sites ship identity. In 2026, AI systems can find your pages. The question is whether they can figure out who wrote them, whether to trust them, and how to refer to you when they cite what they found.

Google's March 2026 core update narrowed rich result eligibility for FAQ, How-To, and Review schema that had been applied to pages where it didn't match the primary content. At the same time, a less visible change: AI Mode began using structured data for entity resolution and source credibility during answer synthesis. Schema that accurately describes content now increases the probability of AI citation even when no visual rich result is displayed.

The shift reframes what structured data is for. It is no longer a display optimisation. It is an identity layer.

This article draws on two sources. The first is a practitioner-built SEO reference specification, compiled during the launch of a production site (mlad.ai), that documents the implementation surface for Article JSON-LD, BreadcrumbList, Open Graph, and LLM-SO infrastructure against Google's April 2026 guidance. The second is the MLAD prompt corpus: 5,399 prompts from 34 open-source collections, of which 78 address AI discoverability directly and another 170 address the adjacent problem of structuring content for machine consumption.

What the spec covers

The reference specification was written to answer a question that arises on every site launch: what goes in the <head>?

The answer in 2026 is longer than it was in 2023. Article JSON-LD now carries recommended fields for author.url (closing the entity loop to an on-site profile page), author.sameAs (external authority profiles that disambiguate the person in Google's Knowledge Graph), dateModified (which must propagate identically across JSON-LD, Open Graph, and sitemap <lastmod> to avoid conflicting freshness signals), and publisher.logo as an ImageObject rather than a bare URL.
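Rendered as JSON-LD, that set of recommendations looks like the sketch below. The URLs, dates, and profile links are illustrative, not the production values; the point to notice is that dateModified here must match the Open Graph tag and the sitemap <lastmod> exactly, and that publisher.logo is a nested ImageObject.

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "AI SEO in 2026: Structured Data as Identity Layer",
  "datePublished": "2026-04-17",
  "dateModified": "2026-04-17",
  "author": {
    "@type": "Person",
    "name": "Greg Ruthenbeck",
    "url": "https://mlad.ai/about",
    "sameAs": [
      "https://www.linkedin.com/in/example",
      "https://scholar.google.com/citations?user=example"
    ]
  },
  "publisher": {
    "@type": "Organization",
    "name": "MLAD.ai",
    "logo": {
      "@type": "ImageObject",
      "url": "https://mlad.ai/logo.png"
    }
  }
}
```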

BreadcrumbList schema, omitted from many developer-built sites, communicates content hierarchy to crawlers. A codex entry at mlad.ai/codex/reddit-discovery-and-analysis/pipeline-and-first-result/the-scoring-fix sits four levels deep. Without breadcrumb markup, a retrieval system sees a flat page. With it, the system can infer that this page belongs to a chapter, inside a walkthrough, inside a curriculum. The structural context changes what the page means.
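Breadcrumb markup for that page is a single ordered list. The chapter and walkthrough names below are inferred from the URL slugs and may differ from the site's actual labels; per Google's guidance, the final ListItem can omit its item URL because it is the current page.

```json
{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    { "@type": "ListItem", "position": 1, "name": "Codex",
      "item": "https://mlad.ai/codex" },
    { "@type": "ListItem", "position": 2, "name": "Reddit Discovery and Analysis",
      "item": "https://mlad.ai/codex/reddit-discovery-and-analysis" },
    { "@type": "ListItem", "position": 3, "name": "Pipeline and First Result",
      "item": "https://mlad.ai/codex/reddit-discovery-and-analysis/pipeline-and-first-result" },
    { "@type": "ListItem", "position": 4, "name": "The Scoring Fix" }
  ]
}
```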

The spec also documents decisions about speakable schema (beta; marks passages for voice-assistant extraction), isPartOf (connecting articles to a parent WebSite entity), and knowsAbout on the author Person (declaring topic expertise for AI source selection). None of these are required. All of them reduce the ambiguity a retrieval system faces when deciding whether to cite a page.

The full specification runs to nine sections. This article focuses on the parts where implementation exposed problems the spec alone did not predict.

What broke during implementation

A gap table in the spec audits 30+ fields. Three findings illustrate how structured data breaks in practice, not in theory.

The author.name field was set to "Greg Ruthenbeck PhD." Google's Article structured data guidance says not to include credentials in the name. The rationale is entity resolution: a name with embedded titles creates a different string from the same name on LinkedIn, Google Scholar, or IEEE Xplore. The Knowledge Graph has to guess whether "Greg Ruthenbeck" and "Greg Ruthenbeck PhD" refer to the same person. The fix is mechanical: move "PhD" to honorificSuffix and let four sameAs links do the disambiguation work that a name suffix cannot.
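The corrected Person shape is small enough to show inline. The four profile URLs below are placeholders for the real sameAs targets, which the spec excerpt does not enumerate:

```json
{
  "@type": "Person",
  "name": "Greg Ruthenbeck",
  "honorificSuffix": "PhD",
  "sameAs": [
    "https://www.linkedin.com/in/example",
    "https://scholar.google.com/citations?user=example",
    "https://ieeexplore.ieee.org/author/example",
    "https://orcid.org/0000-0000-0000-0000"
  ]
}
```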

The second finding is a class the reference specification calls out explicitly: contradictory author assertions across tag systems. Article JSON-LD, Open Graph's article:author, and Twitter's twitter:creator each carry an author claim, and a page that names the person in one and the brand in another gives a retrieval system two answers to the same question. One tag says the organisation wrote it; another says a human did. The mismatch weakens the signal that sameAs was meant to strengthen: an LLM encountering two author entities has to choose one, and either choice discards information. The corpus offers no guard against this. The seo-content-auditor prompt scores E-E-A-T and flags missing author bios, but its audit surface stops at presence. Whether the author in JSON-LD agrees with the one in Open Graph is not a question the prompt asks. mlad.ai shipped with exactly that mismatch; neither an absence audit nor a length audit would have flagged it.
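A consistency audit of the kind the corpus lacks is mechanical to write. The sketch below is a hypothetical check, not the spec's tooling: it pulls the author claim from each of the three tag systems in a page's head and compares the two that carry names (twitter:creator carries a handle, so it is extracted but excluded from the equality test). The HTML and the handle are illustrative.

```python
import json
import re

def author_claims(html: str) -> dict:
    """Pull the author claim from each tag system in a page's <head>."""
    claims = {}
    # Article JSON-LD: author.name inside the ld+json script block
    m = re.search(r'<script type="application/ld\+json">(.*?)</script>',
                  html, re.S)
    if m:
        claims["json-ld"] = json.loads(m.group(1)).get("author", {}).get("name")
    # Open Graph: article:author meta tag
    m = re.search(r'<meta property="article:author" content="([^"]+)"', html)
    if m:
        claims["open-graph"] = m.group(1)
    # Twitter card: twitter:creator (a handle, not a name)
    m = re.search(r'<meta name="twitter:creator" content="([^"]+)"', html)
    if m:
        claims["twitter"] = m.group(1)
    return claims

def names_agree(claims: dict) -> bool:
    """Compare only the two name-bearing fields; handles are apples to oranges."""
    return claims.get("json-ld") == claims.get("open-graph")

# A head with exactly the mismatch described above: person vs. brand.
head = """
<script type="application/ld+json">
{"@type": "Article", "author": {"@type": "Person", "name": "Greg Ruthenbeck"}}
</script>
<meta property="article:author" content="MLAD.ai">
<meta name="twitter:creator" content="@example">
"""
print(names_agree(author_claims(head)))  # False: two answers to one question
```

The check is presence-blind by design: a page missing one of the tags passes, which is exactly the blind spot an absence audit already covers.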

The production sitemap had articles at priority 0.6, barely above the privacy policy at 0.5. The /articles index page, which lists every article with titles and summaries and is the most useful single page for a retrieval system scanning the site, was marked changefreq: monthly at priority 0.5, with no lastmod. After tuning: articles at 0.8, section indexes at 0.7, changefreq: weekly, lastmod reflecting the most recent publication date. The hierarchy went from flat to readable in an afternoon. Nobody had deliberately chosen to flatten it. The defaults simply had not been revisited since they were first set.
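The tuned entries reduce to a few lines of sitemap XML. URLs here are illustrative, and the lastmod value stands in for the most recent publication date:

```xml
<url>
  <loc>https://mlad.ai/articles/</loc>
  <changefreq>weekly</changefreq>
  <priority>0.7</priority>
  <lastmod>2026-04-17</lastmod>
</url>
<url>
  <loc>https://mlad.ai/articles/ai-seo-in-2026/</loc>
  <changefreq>weekly</changefreq>
  <priority>0.8</priority>
  <lastmod>2026-04-17</lastmod>
</url>
```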

What the corpus says about how practitioners think

Seventy-eight of 5,399 prompts in the MLAD corpus address AI discoverability. Search "seo" in the Prompt Explorer and they surface.

The taxonomy profile across those 78 prompts: 59% Guided, 38% Bounded, 3% Scripted, zero Open. No discoverability prompt in the corpus grants the model unconstrained autonomy. An earlier analysis of marketing prompts in the same corpus found the same constrained posture, with 71.8% of prompts Bounded. Practitioners working on AI discoverability and practitioners working on marketing AI converge on the same stance. Structure is not optional.

The Activity axis skews toward Understand (58%) over Create (21%). The prompts are built to audit, score, and diagnose. They are not built to generate schema or write meta tags. The ecosystem frames discoverability as an inspection problem.

Two prompts in the set open with a question most tutorials skip: should you do this at all?

The schema-markup skill scores every implementation candidate on a six-category index before writing any JSON-LD. Content-Schema Alignment carries the heaviest weight. If the score falls below 55, the output is "Do Not Implement" with an explanation. The phrase in the prompt: "You do not guarantee rich results. You do not add schema that misrepresents content."

The programmatic-seo skill runs a parallel gate. Its Feasibility Index weights "Unique Value per Page" at 25%, with a kill switch: high impressions combined with low engagement triggers a halt to indexing. Below 50, the prompt refuses. Its one-line policy: "100 excellent pages > 10,000 weak ones."

These quality gates do not appear elsewhere in the corpus. The prompts that score and refuse before implementing are also the most precisely written. The correlation is not accidental. A prompt that has to make a stop/go decision needs tighter language than one that assumes the decision is already made.

What the corpus does not cover

The 78 prompts handle audits, rankings, and citations. No prompt in the corpus is dedicated to:

- Canonical URLs and duplicate resolution
- sameAs and Knowledge Graph entity linking
- Article schema with author byline binding
- Open Graph metadata design
- Twitter/X card configuration
- Social card image dimensions and safe zones
- BreadcrumbList schema

These are the identity surfaces. They determine whether an LLM can correctly attribute what it finds, not just find it. The corpus is optimised for the question "is this site visible?" It has not yet reached the question "when this site is cited, will the citation be correct?"

The gap maps onto a distinction from the ai-seo prompt, the largest in the corpus at 17,195 characters: "Traditional SEO gets you ranked. AI SEO gets you cited." The corpus has built tooling for the first half of that sentence. The second half, where identity and attribution live, is largely empty.

Where llms.txt sits

The corpus contains two dedicated llms.txt prompts, both from copilot-instructions. Both follow the llmstxt.org specification. Neither scores. Neither gates. Neither questions whether the file is worth creating.

The external evidence is now available. A Rankability study across nearly 300,000 domains found 10% adoption and no measurable effect on AI citations. An OtterlyAI experiment over 90 days found 84 AI-bot visits to /llms.txt out of 62,100 total AI-bot hits. No major LLM provider has committed to parsing the file.

The search-ai-optimization-expert prompt, written before the 300,000-domain studies were published, includes a note: "llms.txt currently experimental and not yet adopted by major AI providers." That is honest meta-awareness inside a prompt built to do the work the file describes.

mlad.ai serves six llms.txt files. The sitemap registers the root file at priority 0.9. The rationale is not citation impact but cost-benefit: the files double as machine-readable site summaries regardless of whether any crawler specifically seeks the filename. Serving is cheap. Removing them saves nothing. But the evidence does not support treating llms.txt as a strategy. It is, at most, an inexpensive hedge.
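The llmstxt.org format itself is plain markdown: an H1 title, a blockquote summary, then sections of annotated links. A minimal sketch, with section names and URLs that are illustrative rather than the production files:

```markdown
# MLAD.ai

> Prompt engineering reference for developers working with AI: a classified
> corpus of 5,399 prompts, plus articles and walkthroughs.

## Articles

- [AI SEO in 2026](https://mlad.ai/articles/ai-seo-in-2026): structured
  data as identity layer
```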

Robots.txt as an active decision

One area the spec documents in detail and the corpus barely touches: crawler access policy.

mlad.ai's robots.txt separates search bots from training crawlers by user agent. OAI-SearchBot, ChatGPT-User, and PerplexityBot get full site access. They power live retrieval and are the primary channel through which AI systems discover content at query time. GPTBot, ClaudeBot, Google-Extended, anthropic-ai, and six other training-focused crawlers are restricted to llms.txt files and the prompt taxonomy landing page. Individual prompt pages, codex entries, and course content are blocked.
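The two-tier policy reduces to user-agent groups in robots.txt. The sketch below shows the shape only; the paths in the training-tier group are assumptions, and the production file lists more crawlers and six llms.txt variants. Note that Google-style parsers resolve Allow/Disallow conflicts by longest matching rule, which is what lets the narrow Allow lines override the blanket Disallow.

```text
# Retrieval-tier crawlers: full access (they power live answers and citations)
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
Allow: /

# Training-tier crawlers: summaries and the taxonomy landing page only
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: anthropic-ai
Allow: /llms.txt
Allow: /prompts/$
Disallow: /
```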

The tradeoff is deliberate. Product content stays out of training data. Summaries remain available for entity-level recognition. Articles, which are the primary citation surface, are accessible to both tiers.

Cloudflare's March 2026 crawler data gives context to this decision. GPTBot's crawl-to-refer ratio is 1,276 to 1. ClaudeBot's is 23,951 to 1. Training crawlers visit at scale and refer almost never. The distinction between "crawled for training" and "crawled for retrieval at query time" is not hypothetical. It is the difference between content consumed and content attributed.

Two prompts in the corpus mention robots.txt, both inside broader audit skills. Neither treats crawler-tier policy as a design decision. The spec does, because the implementation required it.

Where to start

The Prompt Explorer carries every prompt discussed here, classified and browsable.

Search "seo" for the discoverability cluster: 78 prompts, weighted toward audit and scoring, with the constraint profile that matches the marketing findings.

Search "schema.org" for the structured data subset, including the prompt that gates implementation with a Do-Not-Implement verdict.

Search "geo" for the citation-specific prompts, where passage-level citability and the SEO-vs-GEO comparison table live.

The SEO / Social / LLM-SO Reference specification is published alongside this article. It covers the full implementation surface for Article JSON-LD, BreadcrumbList, Open Graph, and LLM-SO infrastructure, with the gap table, the image specs, and the testing workflow.

Glossary

AI SEO / LLM-SO / GEO
Synonyms for the discipline of making content citable by AI systems. "AI SEO" emphasises continuity with search-engine work; "LLM-SO" foregrounds the model as the audience; "GEO" (Generative Engine Optimization) is the term that emerged in 2024 research literature.
BreadcrumbList
A schema.org type declaring a page's position in a site hierarchy. Crawlers use it to infer structural context: that an article belongs to a chapter, inside a curriculum.
E-E-A-T
Google's content-quality rubric, an acronym for Experience, Expertise, Authoritativeness, Trustworthiness. Surfaced in structured data through author credentials, sameAs links, and publisher identity.
Entity resolution
The process by which a retrieval system decides which real-world entity a text reference refers to. Structured data reduces ambiguity by asserting identifiers that disambiguate, for example, one "Greg Ruthenbeck" from every other string that matches the name.
JSON-LD
JavaScript Object Notation for Linked Data. The structured-data format Google recommends, embedded in pages via <script type="application/ld+json">. Carries schema.org entities such as Article, Person, and BreadcrumbList.
llms.txt
A proposed convention (llmstxt.org) for a root-level file that summarises a site for LLMs. Adoption across 300,000 domains studied sits near 10%; measurable citation impact has not been shown.
Open Graph
Meta-tag protocol originating at Facebook, now the dominant format for link previews across social platforms and AI systems. Includes tags such as og:title, og:description, and article:author.
sameAs
A schema.org property linking an entity to external URLs representing the same entity (LinkedIn, ORCID, IEEE Xplore, Google Scholar). The primary mechanism for cross-platform identity disambiguation.
schema.org
A shared vocabulary for structured data, founded by Google, Microsoft, Yahoo, and Yandex and now maintained as a community project. Defines the type hierarchy (Article, Person, Organization) that JSON-LD instances populate.
Greg Ruthenbeck

PhD in computing, 13 years teaching at university. Building MLAD.ai for developers working with AI. Retired from custom drones and RC sailplanes in favour of saunas, ice swims, planted aquariums, and the open question of whether ginger beer is the next homebrew.

Tagged: #SEO #LLM-SO #Structured Data