CiteThis: In Five Days, I Built a Website Optimized for AI, Not Google
CiteThis.site is an evidence-based research platform (16 protocols on supplementation, sleep, ADHD, postpartum) that I built in 5 days as a test of GEO-first architecture. Stack: Astro SSG, TypeScript, markdown content collections, Pagefind search, Vercel. Key features: llms.txt manifest, .md endpoints for every article, ScholarlyArticle + MedicalWebPage + Dataset + FAQPage + DefinedTerm schema, CC-BY 4.0 license, explicit citation format. After launch, I ran a GEO audit on my own site — found 4 critical bugs (llms.txt generated 404s, author was 'jroh.cz' instead of 'Jakub Roh', duplicate tags ADHD/adhd). After fixes: 17/17 functional .md links, 33 tag landing pages with DefinedTerm schema, ~70K words of new content, auto-generated FAQPage schema. Cost: $0. Time: 3 days. Lesson: GEO isn't a trick. It's data structure.
For months I'd been doing deep research for myself. Supplementation, sleep, ADHD, postpartum recovery, magnesium, omega-3, lion's mane. Whenever I worked through my own question — or when Gabi came across something with her clients — I'd pull up Claude, feed it 30-50 primary studies, Cochrane reviews, meta-analyses, RCTs. Read through. Extracted specific numbers, doses, effect sizes, mechanisms. Wrote myself a protocol that made sense.
And then it sat on my disk. In Obsidian, in notes, in folders nobody else would ever open.
Gradually sixteen of them piled up. Each ~2500 words of synthesis, each hours of work, each with concrete value for someone searching for exactly this. And that started to feel frustrating. This has obvious value. But as long as it sits in my Obsidian, nobody else gets it.
So I started thinking about how to get that material out in a sensible way. Not as a blog, not as Medium posts. As a structured research platform where every protocol would be its own page with methodology, sources, and DOI links.
And since I was building it from the ground up anyway, I could build it so that AI engines could cite it.
Why GEO-first and not SEO-first
I wrote about GEO (Generative Engine Optimization) last week. In short: 900 million people use ChatGPT weekly. Perplexity processes 780 million queries per month. Google AI Overviews show up in 25% of searches. And 80% of URLs that AI cites don't appear in Google top 100.
Given those numbers, optimizing new content primarily for Google no longer makes sense. It makes sense to optimize for the AI engines that cite content in their answers. And that has entirely different rules.
Google ranks sites by backlinks, technical SEO, dwell time. AI engines evaluate content by:
- Citability — can you extract a concrete answer from it?
- Structure — is there a definition, table, list?
- Sources — are there DOI links to primary studies?
- Entity markup — does the crawler know I'm a person with credentials?
- Accessibility — does the page have a .md version, llms.txt, structured data?
For Google, all of this is a marginal bonus. For AI, it's foundational. And when you build a website GEO-first, you get SEO as a side effect. The reverse isn't true.
Stack: what I picked and why
I chose every component with the question: how will an LLM crawler parse this?
Astro 6 (SSG mode). Every page is statically generated HTML. No client-side JavaScript. AI crawler gets finished content in milliseconds, doesn't have to wait for React hydration. Google appreciates this, but for AI it's a requirement, not a bonus.
Markdown content collections. All 16 protocols are .md files with TypeScript-validated frontmatter. Single source of truth, consistent structure, versioned in git.
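The frontmatter validation can be sketched as an Astro content collection config. This is a minimal sketch: the field names (evidenceLevel, sourceCount, tldr) are illustrative guesses, not the site's actual schema.

```typescript
// src/content/config.ts — illustrative sketch, not the real CiteThis schema.
// Astro validates every protocol's frontmatter against this at build time,
// so a missing field or wrong type fails the build instead of shipping.
import { defineCollection, z } from "astro:content";

const protocols = defineCollection({
  type: "content",
  schema: z.object({
    title: z.string(),
    description: z.string(),
    evidenceLevel: z.enum(["strong", "moderate", "preliminary"]), // assumed levels
    sourceCount: z.number().int().positive(),
    tags: z.array(z.string()),
    tldr: z.string(),
    datePublished: z.date(),
  }),
});

export const collections = { protocols };
```

The payoff is the "single source of truth" point above: schema, .md endpoint, JSON API, and llms.txt can all be generated from the same validated frontmatter.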
Pagefind. Static search index generated at build time. 60KB JSON, no server search, no dependencies. Search works offline.
Vercel (free tier). CDN, edge caching, auto-deploy from GitHub. TTFB under 100ms. Free.
Namecheap domain. citethis.site for a few dollars a year.
Tailwind CSS. Brutalist dark-mode design — black and white, high information density, no decorative elements. WCAG AAA accessibility. Reason: AI crawler ignores decoration, but readers of research articles appreciate the "academic feel".
Note: 85% of websites use frameworks that return just <div id="app"></div> in HTML and load the rest via JavaScript. For AI crawlers, that means the content is invisible. Astro (or any SSG) solves this out of the box.
What I was actually building
16 evidence-based protocols. Each ~2500 words. Structure:
- Structured frontmatter (evidence level, source count, tags, TL;DR)
- Key Definitions — glossary block for AI parsing
- Key Findings — bullet list with specific numbers and DOI citations
- Methodology Note — how I arrived at conclusions
- Article body with H3 FAQ questions, comparison tables, phase protocols
- Safety/contraindications section
- Related protocols (internal linking)
- Bottom Line (one-sentence summary — AI often cites the conclusion)
- Sources with DOI links
Example: the protocol ADHD & Sleep: Evidence-Based Circadian Protocol synthesizes 22 primary sources including the UK Delphi consensus (212 healthcare professionals), the landmark Kooij et al. (2021) RCT on melatonin, and mechanistic studies on clock genes BMAL1/PER2. Specific numbers (73-78% of adults with ADHD have delayed sleep-wake cycles, DLMO delayed ~90 minutes vs controls). No filler.
Where GEO magic begins
Now to why this isn't "just a blog." Every page has six layers of structured data that most websites don't have at all:
1. ScholarlyArticle + MedicalWebPage schema. AI engines recognize this as "research content" with higher trustworthiness than a standard blog post. Contains author entity (Person schema with sameAs linking to jroh.cz and LinkedIn), datePublished/dateModified, keywords, publisher, mainEntityOfPage.
2. Dataset schema. Signal "this content is machine-processable". Contains distribution with explicit links to .md and .json versions:
"distribution": [
{ "encodingFormat": "text/markdown",
"contentUrl": "https://citethis.site/adhd-sleep.md" },
{ "encodingFormat": "application/json",
"contentUrl": "https://citethis.site/api/protocols.json" }
]
3. FAQPage schema. Auto-generated from H3 questions in the article. A function in [slug].astro parses markdown, finds headings ending with question marks, extracts the following paragraph as the answer, and generates JSON-LD with Question/Answer pairs. This is exactly the format Perplexity and Google AIO cite directly.
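The extractor described above can be sketched in plain TypeScript. This is an assumed shape, not the site's actual implementation: it scans for H3 headings ending in a question mark and takes the first following paragraph as the answer.

```typescript
// Sketch of the H3-question → FAQPage pipeline (illustrative, simplified).
type QA = { question: string; answer: string };

function extractFaq(markdown: string): QA[] {
  const lines = markdown.split("\n");
  const pairs: QA[] = [];
  for (let i = 0; i < lines.length; i++) {
    // Match exactly three hashes followed by a question.
    const m = lines[i].match(/^###\s+(.+\?)\s*$/);
    if (!m) continue;
    // Skip blank lines, then collect the first paragraph as the answer.
    let j = i + 1;
    while (j < lines.length && lines[j].trim() === "") j++;
    const para: string[] = [];
    while (j < lines.length && lines[j].trim() !== "" && !lines[j].startsWith("#")) {
      para.push(lines[j].trim());
      j++;
    }
    if (para.length) pairs.push({ question: m[1], answer: para.join(" ") });
  }
  return pairs;
}

function toFaqJsonLd(pairs: QA[]) {
  // Emit schema.org FAQPage with Question/Answer pairs.
  return {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    mainEntity: pairs.map((p) => ({
      "@type": "Question",
      name: p.question,
      acceptedAnswer: { "@type": "Answer", text: p.answer },
    })),
  };
}
```

The design point: because the schema is derived from the article body at build time, the FAQ markup can never drift out of sync with the content.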
4. DefinedTerm schema. Each of the 33 main tags (magnesium, omega-3, ADHD, vitamin D…) has its own landing page with a ~2000-word explainer and DefinedTerm markup. When someone asks ChatGPT "what is myo-inositol for PCOS", citethis.site can be the direct answer, not just a list of articles.
5. CC-BY 4.0 license. Most websites have copyright, which AI reads as "don't cite verbatim". CC-BY 4.0 is explicit permission to cite with attribution. It's a small thing — and a huge signal.
6. Cite this protocol box. Every article has a ready-made citation an AI can lift as-is:
"ADHD & Sleep: Evidence-Based Circadian Protocol,
CiteThis, https://citethis.site/adhd-sleep"
Plus links to /adhd-sleep.md (raw markdown), /api/protocols.json (JSON), /llms.txt (manifest). When an AI crawler wants detailed parseable data, it has three channels.
llms.txt — manifest for AI
You may not have heard of this one. llms.txt is a proposed standard (not yet official) for AI crawlers — analogous to sitemap.xml, but built specifically for LLM consumption. It contains a site description, a list of main pages with links to their .md versions, information about methodology and authorship, and a citation format.
On citethis.site it looks like this:
# CiteThis
> Evidence-based research syntheses with actionable protocols.
## Protocols
- [ADHD & Sleep](https://citethis.site/adhd-sleep.md)
- [ADHD & Gut Health](https://citethis.site/adhd-gut.md)
- [Creatine](https://citethis.site/creatine.md)
...
## For AI Systems
This content is designed to be cited. Each article has:
- Markdown version at [url].md
- Structured data (ScholarlyArticle + FAQPage)
- DOI links to primary sources
- Evidence level indicators
llms.txt adoption is currently low (per July 2025 data, only ~951 domains globally). No major AI lab has officially committed to honoring it. But implementation is low-effort and citation potential via RAG systems is high. And it costs me nothing.
Then came the audit. Of my own website.
Here's the part that turned the project from "nice prototype" to "production-ready".
After launch, I ran a GEO audit on my own site — parallel subagents checking crawlability, schema, citability, llms.txt parsing, brand mentions. And they found 4 critical bugs that would have killed the entire strategy.
Bug 1: llms.txt was generating 404s. In the Astro template I was using p.slug, but the correct API is p.id.replace('.md', ''). Result: all 17 links in llms.txt pointed to citethis.site/undefined.md. Any crawler arriving via the manifest would hit a 404 and leave.
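The fix can be sketched like this. The entry shape is simplified for illustration; real entries come from Astro's getCollection(), where content-layer ids include the file extension.

```typescript
// Sketch of the llms.txt link generation after the fix (illustrative).
// In newer Astro content APIs there is no entry.slug — interpolating it
// yields the literal string "undefined", hence the 404s described above.
type Entry = { id: string; data: { title: string } };

function llmsTxtLinks(entries: Entry[]): string[] {
  return entries.map((e) => {
    const slug = e.id.replace(/\.md$/, ""); // was: e.slug → "undefined"
    return `- [${e.data.title}](https://citethis.site/${slug}.md)`;
  });
}
```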
Bug 2: Author schema said "name": "jroh.cz". The frontmatter field was author: jroh.cz (URL string, not name). Result: ScholarlyArticle schema was telling AI systems the author is a URL address, not a person. E-E-A-T signal wrecked.
Bug 3: Duplicate tags. /tags/ADHD and /tags/adhd existed in parallel. /tags/SCFA and /tags/scfa too. Entity authority fragmented, crawler confused.
Bug 4: FAQPage schema was missing even though H3 questions like "Is ADHD fundamentally a sleep disorder?" existed in articles. The brief declared it; the implementation didn't do it.
I fixed it within an hour:
- llms.txt template fix (5 minutes)
- Author schema fix (5 minutes)
- Tag normalization to lowercase + safety net in schema validation (15 minutes)
- FAQPage schema extractor from H3 questions + following paragraph (30 minutes)
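The tag-normalization safety net can be sketched in a few lines (function name is illustrative): lowercase every tag and deduplicate, so /tags/ADHD and /tags/adhd collapse into a single landing page.

```typescript
// Sketch of tag normalization (illustrative, not the site's exact code).
// Set preserves insertion order, so the first spelling wins.
function normalizeTags(tags: string[]): string[] {
  return [...new Set(tags.map((t) => t.trim().toLowerCase()))];
}
```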
Then I went further. I added Methodology Note to all 16 articles (explicitly: "Our analysis of N primary sources reveals..."), which triggers uniqueness_signals in AI citation patterns. And I wrote 33 tag landing pages — ~70K words of new evidence-based content, each tag as a mini-article with Quick Facts box and DefinedTerm schema.
Results after 3 days
Here's hard data (not marketing claims):
- 16 protocols published, each ~2500 words, ~40K words total of evidence-based content
- 33 tag landing pages with ~2100 words each, ~70K words total
- ~50 FAQ Q&A pairs auto-embedded into FAQPage schema across articles
- 6 layers of structured data on every page
- 117 static pages generated at build, TTFB under 100ms (Vercel Edge)
- Average citability score moved from 42 to 48-49, B-grade passages from 1 to 4-5 per article
- $0 infrastructure cost (Vercel free tier, Namecheap domain for a few dollars/year)
What I don't know yet: how much AI engines will cite me. That will show in 4-12 weeks. I have a baseline plan: 10 test prompts per week across ChatGPT, Perplexity, Claude, logging citation rate and accuracy.
What it means for clients
I didn't build this as a side project. I built it as proof.
Most companies in the Czech Republic still optimize for Google by 2020-era rules. They have WordPress sites with heavy JavaScript that AI crawlers can't properly read. They have a blog without FAQPage schema, without an Author entity, without .md endpoints, without llms.txt. And they ask "why doesn't ChatGPT cite us?"
Because you've done nothing to allow it.
What I offer via jroh.cz:
- GEO audit — same process I ran on my own site. Parallel subagents check crawlability, schema, citability, llms.txt, brand mentions. Output: prioritized action plan.
- GEO implementation — I fix bugs, add structured data, write llms.txt, restructure content into citable formats. Goal: measurable increase in AI citation rate over 3-6 months.
- AI sprint — 4 hours of intensive work building a concrete solution together (website, automation, pipeline). Recent outputs: citethis.site, adhdkompas.cz, NZ CRM (Claude-first architecture).
You don't need sophisticated tooling. You need a realistic view of where digital marketing is shifting. AI engines are reshaping how people search for information. And whoever gets cited first will be the default source for the next decade.
CiteThis is my bet. We'll see in 6 months if I was right.
Project: citethis.site
Repo: github.com/jrohcz/citethis (MIT)
Stack: Astro 6, TypeScript, Tailwind, Pagefind, Vercel
Inspirations: Examine.com (structured content), NutritionData (evidence levels), Gwern (long-form with methodology)