SEO Fundamentals
- Description: Search-engine-agnostic SEO, covering the discovery → crawl → index → rank pipeline, the universal failure modes (no sitemap, blocked robots.txt, accidental `noindex`, thin home page, indexed preview deployments), and the Next.js App Router + Vercel implementation: `app/sitemap.ts`, `app/robots.ts`, per-page `metadata` API, canonical URLs, JSON-LD, mobile/Core Web Vitals.
- My Notion Note ID: K2B-8-2
- Created: 2026-05-23
- Updated: 2026-05-23
- License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io
Table of Contents
- 1. The Pipeline
- 2. Why a Fresh Site Isn't Searchable
- 3. Sitemap in Next.js App Router
- 4. robots.txt in Next.js App Router
- 5. Per-Page Metadata
- 6. Canonical URLs and hreflang
- 7. Internal Linking and Content
- 8. Vercel-Specific Pitfalls
- 9. Mobile, Speed, Structured Data
- 10. References
1. The Pipeline
Four sequential stages — a page can fail at any one.
| Stage | What happens | Failure mode |
|---|---|---|
| Discovery | Engine learns the URL exists (inbound link, sitemap, manual submit). | URL exists but no inbound signal → never crawled. |
| Crawl | Bot fetches the HTML. | Blocked by robots.txt, 4xx/5xx, slow response, JS-only render. |
| Index | Engine parses + stores the page. | noindex meta tag, duplicate without canonical, thin content. |
| Rank | Page appears in SERPs for queries. | Indexed but outranked, off-topic, no backlinks. |
- `site:yourdomain.com` returning nothing → not indexed. Some pages but not others → discovery or crawl is partial.
- The pipeline is the same across Google, Bing, Baidu, Yandex; engine-specific quirks live in how each stage is implemented, not in the stages themselves.
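The failure-mode table can be read as a triage order: check the earliest stage first. The sketch below is only an illustration of that ordering; the `PageObservation` shape and the `diagnoseStage` function are hypothetical names, and the checks are far simpler than what a real engine does.

```typescript
// Hypothetical observation record; field names are assumptions for illustration.
interface PageObservation {
  hasInboundSignal: boolean;   // linked from somewhere, listed in a sitemap, or submitted
  allowedByRobotsTxt: boolean; // robots.txt does not disallow the path
  httpStatus: number;          // status code returned to the crawler
  hasNoindex: boolean;         // noindex meta tag or X-Robots-Tag header present
  appearsInSiteQuery: boolean; // shows up for a site:yourdomain.com query
}

type Stage = 'discovery' | 'crawl' | 'index' | 'rank';

// Returns the earliest pipeline stage at which the page is likely failing.
function diagnoseStage(page: PageObservation): Stage {
  if (!page.hasInboundSignal) return 'discovery';
  if (!page.allowedByRobotsTxt || page.httpStatus >= 400) return 'crawl';
  if (page.hasNoindex || !page.appearsInSiteQuery) return 'index';
  return 'rank'; // indexed; what remains is relevance and authority
}
```

A page that passes every check lands in `'rank'`, which matches the last table row: indexed, but possibly outranked.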
2. Why a Fresh Site Isn't Searchable
Default causes, roughly in frequency order:
- Never submitted to a webmaster console (Google Search Console, Bing Webmaster Tools, Baidu 搜索资源平台) → no fast discovery path. A new domain with zero inbound links is invisible until you announce it.
- No sitemap → crawler can only discover via internal links; orphan pages stay invisible.
- `robots.txt` accidentally disallowing `/` → blocks all crawling. The Next.js default (no file) is "allow everything"; custom configs sometimes go wrong.
- Per-page `<meta name="robots" content="noindex">` → page is crawlable but not indexable. Some templates ship this for dev/staging and forget to gate it.
- Preview deployments indexed → Vercel's `*.vercel.app` previews can leak into search results, hurting canonical authority. See § 8.
- Thin content on home page → engines downrank or skip. A near-empty landing page tells crawlers "low value."
- Time → new pages take days to weeks for Google and Bing, often weeks for Baidu. Patience is part of the process even after everything is correct.
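Two of the causes above (a stray `noindex` meta tag and a `noindex` response header) can be checked mechanically. A minimal sketch, assuming the caller has already fetched the HTML and collected the response headers into a plain record with lower-cased names; `findNoindexSignals` is a hypothetical helper, and the regex scan is a rough approximation rather than a real HTML parser.

```typescript
// Returns human-readable reasons why a response would be kept out of the index.
// `headers` is assumed to use lower-cased header names.
function findNoindexSignals(html: string, headers: Record<string, string>): string[] {
  const reasons: string[] = [];

  const headerValue = headers['x-robots-tag'] ?? '';
  if (/\bnoindex\b/i.test(headerValue)) {
    reasons.push('X-Robots-Tag header contains noindex');
  }

  // Naive scan of <meta> tags; a real audit should use an HTML parser.
  const metaTags = html.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of metaTags) {
    const isRobotsMeta = /name\s*=\s*["']robots["']/i.test(tag);
    if (isRobotsMeta && /\bnoindex\b/i.test(tag)) {
      reasons.push('robots meta tag contains noindex');
    }
  }
  return reasons;
}
```

An empty result does not prove the page is indexable; it only rules out these two signals.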
3. Sitemap in Next.js App Router
Next.js 13+ supports a file-based sitemap convention. Put app/sitemap.ts:
import type { MetadataRoute } from 'next';
const BASE = 'https://example.com';
export default function sitemap(): MetadataRoute.Sitemap {
return [
{ url: `${BASE}/`, lastModified: new Date(), changeFrequency: 'weekly', priority: 1.0 },
{ url: `${BASE}/about`, lastModified: new Date(), changeFrequency: 'monthly', priority: 0.8 },
{ url: `${BASE}/thoughts`, lastModified: new Date(), changeFrequency: 'weekly', priority: 0.8 },
{ url: `${BASE}/notes`, lastModified: new Date(), changeFrequency: 'daily', priority: 0.9 },
];
}
Build emits /sitemap.xml automatically.
3.1 Generating from filesystem
For a notes site with many markdown files, enumerate dynamically:
import type { MetadataRoute } from 'next';
import { getAllNoteSlugs } from '@/lib/notes';
const BASE = process.env.NEXT_PUBLIC_SITE_URL ?? 'https://example.com';
export default function sitemap(): MetadataRoute.Sitemap {
const noteEntries: MetadataRoute.Sitemap = getAllNoteSlugs().map((slug) => ({
url: `${BASE}/notes/${slug}`,
lastModified: new Date(),
changeFrequency: 'monthly',
priority: 0.6,
}));
return [
{ url: BASE, priority: 1.0, changeFrequency: 'weekly' },
{ url: `${BASE}/about`, priority: 0.8, changeFrequency: 'monthly' },
...noteEntries,
];
}
- Use an env var for `BASE`: hard-coding `vercel.app` URLs in sitemaps is the #1 source of "search engine indexed my staging site" issues.
- 50,000-URL limit per sitemap: for larger sites, split the output into multiple sitemap files by exporting a `generateSitemaps` function from `app/sitemap.ts`.
- `lastModified` matters: engines use it as a hint for when to re-crawl a page. Pull it from git or content metadata if possible; `new Date()` stamps every page as changed on every build, which carries no information.
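The last two points can be sketched without any framework code. The example below assumes each note carries an `updatedAt` date (from front matter or git history) and uses a local `SitemapEntry` type as a stand-in for the Next.js entry shape; `NoteMeta`, `toSitemapEntries`, and `chunkEntries` are hypothetical names.

```typescript
// Local stand-in for the subset of a Next.js sitemap entry used here.
interface SitemapEntry {
  url: string;
  lastModified?: Date;
}

// Hypothetical note record; `updatedAt` would come from front matter or git history.
interface NoteMeta {
  slug: string;
  updatedAt: string; // ISO 8601 date
}

const MAX_URLS_PER_SITEMAP = 50000;

// Build entries whose lastModified reflects the content, not the build time.
function toSitemapEntries(base: string, notes: NoteMeta[]): SitemapEntry[] {
  return notes.map((note) => ({
    url: `${base}/notes/${note.slug}`,
    lastModified: new Date(note.updatedAt),
  }));
}

// Split a long entry list into chunks that each respect the per-file URL limit.
function chunkEntries(entries: SitemapEntry[], size: number = MAX_URLS_PER_SITEMAP): SitemapEntry[][] {
  const chunks: SitemapEntry[][] = [];
  for (let i = 0; i < entries.length; i += size) {
    chunks.push(entries.slice(i, i + size));
  }
  return chunks;
}
```

In a Next.js project, each chunk could back one of the ids returned by `generateSitemaps`; how the ids map to chunks is left open here.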
4. robots.txt in Next.js App Router
Put app/robots.ts:
import type { MetadataRoute } from 'next';
const BASE = 'https://example.com';
export default function robots(): MetadataRoute.Robots {
return {
rules: [
{ userAgent: '*', allow: '/', disallow: ['/api/', '/admin/'] },
],
sitemap: `${BASE}/sitemap.xml`,
host: BASE,
};
}
Build emits /robots.txt.
4.1 Conditional rules — block preview deployments
The single biggest Vercel SEO trap: every Git push generates a unique *.vercel.app URL. If an engine discovers one through a link that points to it, the preview gets indexed and competes with the production domain as duplicate content.
export default function robots(): MetadataRoute.Robots {
const isProd = process.env.VERCEL_ENV === 'production';
if (!isProd) {
return {
rules: [{ userAgent: '*', disallow: '/' }],
};
}
return {
rules: [{ userAgent: '*', allow: '/', disallow: ['/api/'] }],
sitemap: 'https://example.com/sitemap.xml',
};
}
- `VERCEL_ENV` is `'production'` only for deployments from the production branch (usually `main`). Preview and development environments get `'preview'` and `'development'` → blanket disallow.
- Belt-and-suspenders: set an `X-Robots-Tag: noindex` header on non-prod via `vercel.json` or middleware.
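One way to express the belt-and-suspenders header is a small pure function whose result has the shape that the `headers()` option in `next.config.js` expects. This is a sketch using `next.config.js` rather than the `vercel.json` or middleware routes named above; `robotsHeaderRules` and `HeaderRule` are hypothetical names.

```typescript
// Shape of one entry returned by the async headers() option in next.config.js.
interface HeaderRule {
  source: string;
  headers: { key: string; value: string }[];
}

// Returns rules that mark every path as noindex unless the deployment is production.
// `vercelEnv` is expected to be the value of process.env.VERCEL_ENV; an undefined
// value (for example a local run) is treated as non-production.
function robotsHeaderRules(vercelEnv: string | undefined): HeaderRule[] {
  if (vercelEnv === 'production') return [];
  return [
    {
      source: '/:path*',
      headers: [{ key: 'X-Robots-Tag', value: 'noindex, nofollow' }],
    },
  ];
}
```

The config would then return `robotsHeaderRules(process.env.VERCEL_ENV)` from its `headers()` function. Keeping the decision in one function makes it easy to confirm that production never receives the header.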
5. Per-Page Metadata
Next.js metadata API:
// app/layout.tsx — site-wide defaults
import type { Metadata } from 'next';
export const metadata: Metadata = {
metadataBase: new URL('https://example.com'),
title: {
default: 'Yu Zhang — Engineer, Writer, Builder',
template: '%s — Yu Zhang',
},
description: 'Tech notes, career reflections, and side projects by Yu Zhang.',
openGraph: {
siteName: 'example.com',
locale: 'en_US',
type: 'website',
},
twitter: {
card: 'summary_large_image',
},
};
// app/notes/[...slug]/page.tsx — per-page overrides
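// `params.slug` is a string[] because the route is a catch-all ([...slug]).
// In Next.js 15+, `params` is a Promise and must be awaited before use.
export async function generateMetadata({ params }: { params: { slug: string[] } }): Promise<Metadata> {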
const note = await getNote(params.slug);
return {
title: note.title,
description: note.description,
alternates: { canonical: `/notes/${params.slug.join('/')}` },
};
}
- Title length: 50-60 characters renders fully in Google SERPs. Longer titles get truncated.
- Description length: 150-160 characters for desktop, 120 for mobile. Anything longer ellipsizes mid-sentence.
- `metadataBase`: set this at the root so relative URLs in `openGraph.images`, `alternates.canonical`, etc. resolve correctly.
- `title.template`: child pages automatically get the site brand appended (`%s` is replaced by the page title) without each page repeating it.
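The length guidance and the template behaviour can be checked before publishing. A minimal sketch, assuming the upper bounds quoted above (60 and 160 characters) as hard limits; `applyTitleTemplate` only imitates the `%s` substitution and is not the Next.js implementation.

```typescript
const TITLE_MAX = 60;
const DESCRIPTION_MAX = 160;

// Imitates how a '%s ...' title template is applied to a child page title.
function applyTitleTemplate(template: string, pageTitle: string): string {
  // A function replacer avoids special handling of "$" sequences in pageTitle.
  return template.replace('%s', () => pageTitle);
}

// Returns warnings for metadata that is likely to be truncated in search results.
function checkMetadataLengths(title: string, description: string): string[] {
  const warnings: string[] = [];
  if (title.length > TITLE_MAX) {
    warnings.push(`title is ${title.length} chars (max ${TITLE_MAX})`);
  }
  if (description.length > DESCRIPTION_MAX) {
    warnings.push(`description is ${description.length} chars (max ${DESCRIPTION_MAX})`);
  }
  return warnings;
}
```

Run the check on the templated title, not the raw page title, since the brand suffix counts toward the limit.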
6. Canonical URLs and hreflang
Tell engines the preferred URL when multiple paths return the same content.
export const metadata: Metadata = {
alternates: {
canonical: '/about',
},
};
- Without `metadataBase`, the canonical needs to be absolute (`https://example.com/about`).
- For bilingual notes with the same URL + language switcher → `alternates.languages` produces `<link rel="alternate" hreflang="...">`:
alternates: {
canonical: '/notes/cpp/templates',
languages: {
'en-US': '/notes/cpp/templates?lang=en',
'zh-CN': '/notes/cpp/templates?lang=zh',
},
}
- `hreflang` matters most for bilingual sites targeting both English and Chinese audiences. Without it, Google might serve the English version to a Chinese-speaking user, hurting bounce rate and rank. See companion notes for Google + Baidu specifics.
- For truly duplicate URLs (trailing slash, `www` vs apex, http vs https) → fix at the redirect layer (Vercel redirects or `next.config.js` `redirects()`), not canonical alone. Canonical is a hint; redirects are enforced.
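Whichever redirect mechanism is used, it helps to define the preferred URL form in one place. The sketch below assumes a policy of https, apex host, and no trailing slash; that policy and the `toPreferredUrl` name are illustrative assumptions, not something the note prescribes. The query string is left untouched so that parameters such as `?lang=` survive.

```typescript
// Normalizes a URL to one preferred form: https, apex host, no trailing slash, no fragment.
// The policy itself is an assumption; swap in whatever convention the site has chosen.
function toPreferredUrl(input: string): string {
  const url = new URL(input);
  url.protocol = 'https:';
  url.hostname = url.hostname.replace(/^www\./, '');
  url.hash = '';
  // Keep the root path as "/" but strip a trailing slash from deeper paths.
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}
```

A redirect layer can compare the incoming URL with `toPreferredUrl(incoming)` and issue a permanent redirect when they differ; the canonical tag then only needs to cover the cases redirects cannot.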
7. Internal Linking and Content
- Crawlers discover via links. A page with zero inbound internal links is orphaned even if it's in the sitemap.
- Home page should link to every top-level section. Top-level sections should link to their key sub-pages.
- A breadcrumb component (already common in notes navigation) doubles as internal-link infrastructure and as a SERP enrichment via `BreadcrumbList` JSON-LD.
- Thin home page is the most common cause of poor first-impression ranking. Engines want substantive content above the fold. At minimum: site purpose in 1 sentence, latest content links, brief about.
- Anchor text matters: descriptive link text (`**K2B-8-1 Personal Site Infrastructure**`) carries more weight than `[click here](...)`.
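Orphaned pages can be found by comparing the sitemap with the internal link graph. A minimal sketch, assuming the link graph has already been extracted into a record that maps each page path to the paths it links to; `findOrphans` is a hypothetical helper.

```typescript
// `links` maps each page path to the internal paths it links to.
// Returns sitemap paths that no other page links to (the home page is exempt).
function findOrphans(
  sitemapPaths: string[],
  links: Record<string, string[]>,
  home: string = '/',
): string[] {
  const linked = new Set<string>();
  for (const [from, targets] of Object.entries(links)) {
    for (const target of targets) {
      if (target !== from) linked.add(target); // a self-link does not count as inbound
    }
  }
  return sitemapPaths.filter((path) => path !== home && !linked.has(path));
}
```

Running this as part of a build check catches pages that exist in the sitemap but are unreachable by following links from the home page's sections.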
8. Vercel-Specific Pitfalls
- Preview deployments indexed: see § 4.1. Use a `VERCEL_ENV` guard in `robots.ts`.
- Hard-coded `*.vercel.app` in sitemap: happens when `metadataBase` or sitemap `BASE` reads from `VERCEL_URL` instead of a custom env var. Fix: `NEXT_PUBLIC_SITE_URL` set in project settings, fall back to `VERCEL_URL` only for previews.
- Missing custom domain binding: production deploys to `<project>.vercel.app`, custom domain not attached. Result: canonical URLs in sitemap point one place, real domain points another. Check Project → Domains in Vercel.
- `X-Robots-Tag: noindex` added by some templates "for safety": survives production deploys and silently blocks indexing. Search for this string in middleware, `vercel.json`, and response headers.
- Trailing slash inconsistency: `next.config.js` has `trailingSlash: false` (default) but Vercel may serve both. Pick one, enforce via redirect.
- Build-time env vars not in scope: the sitemap is generated at build time, so `process.env.NEXT_PUBLIC_SITE_URL` must be set in Vercel project settings (not just `.env.local`).
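The base-URL rule from the second bullet can live in a single helper shared by `sitemap.ts`, `robots.ts`, and `metadataBase`. A minimal sketch: `resolveSiteUrl` and `SiteEnv` are hypothetical names, and the `http://localhost:3000` fallback for local runs is an assumption. It relies on `VERCEL_URL` being provided without a protocol.

```typescript
// Subset of environment variables this sketch reads.
interface SiteEnv {
  NEXT_PUBLIC_SITE_URL?: string;
  VERCEL_ENV?: string;
  VERCEL_URL?: string; // provided by Vercel without a protocol
}

// Picks the base URL for sitemap, robots, and metadataBase.
function resolveSiteUrl(env: SiteEnv): string {
  // An explicitly configured site URL always wins.
  if (env.NEXT_PUBLIC_SITE_URL) {
    return env.NEXT_PUBLIC_SITE_URL.replace(/\/+$/, '');
  }
  // Production must never fall back to a *.vercel.app URL.
  if (env.VERCEL_ENV === 'production') {
    throw new Error('NEXT_PUBLIC_SITE_URL must be set for production builds');
  }
  // Previews may use the deployment URL so absolute links still resolve.
  if (env.VERCEL_ENV === 'preview' && env.VERCEL_URL) {
    return `https://${env.VERCEL_URL}`;
  }
  return 'http://localhost:3000'; // assumed local development default
}
```

Calling it as `resolveSiteUrl(process.env)` at build time makes a missing production variable fail the build instead of silently publishing the wrong host.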
9. Mobile, Speed, Structured Data
- PageSpeed Insights (`pagespeed.web.dev`) → Lighthouse score. Largest Contentful Paint (LCP) under 2.5 s, the "good" threshold in Core Web Vitals, is the target to aim for across Google and Baidu.
- Mobile-first indexing: both Google and Baidu primarily evaluate the mobile version. Test that the responsive layout actually works, not just shrinks.
- Structured data: JSON-LD embedded in the page (`<head>` or `<body>`) enriches SERP rendering:
// app/page.tsx
export default function HomePage() {
const jsonLd = {
'@context': 'https://schema.org',
'@type': 'Person',
name: 'Yu Zhang',
url: 'https://example.com',
sameAs: ['https://github.com/...', 'https://twitter.com/...'],
};
return (
<>
<script
type="application/ld+json"
dangerouslySetInnerHTML={{ __html: JSON.stringify(jsonLd) }}
/>
...
</>
);
}
- Common types worth adding: `Person` (homepage), `Article` or `BlogPosting` (notes), `BreadcrumbList` (navigation), `WebSite` with `SearchAction` (sitelinks search box).
- Validate at `validator.schema.org` and Google's Rich Results Test.
- Schema.org is recognized by Google, Bing, Yandex; Baidu has its own `data.baidu.com` schema but also reads Schema.org partially.
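As an illustration of one of the other types, the sketch below builds a `BreadcrumbList` object from a breadcrumb trail and serializes it for a `<script type="application/ld+json">` tag. `Crumb`, `buildBreadcrumbJsonLd`, and `serializeJsonLd` are hypothetical names; escaping `<` during serialization is a common precaution so that content containing `</script>` cannot close the tag early.

```typescript
interface Crumb {
  name: string;
  path: string; // site-relative path such as /notes/cpp
}

// Builds a schema.org BreadcrumbList object from an ordered trail of crumbs.
function buildBreadcrumbJsonLd(base: string, trail: Crumb[]) {
  return {
    '@context': 'https://schema.org',
    '@type': 'BreadcrumbList',
    itemListElement: trail.map((crumb, index) => ({
      '@type': 'ListItem',
      position: index + 1, // positions are 1-based
      name: crumb.name,
      item: `${base}${crumb.path}`,
    })),
  };
}

// Serializes JSON-LD for embedding in a <script> tag, escaping "<" so that
// a string such as "</script>" inside the data cannot terminate the tag.
function serializeJsonLd(data: unknown): string {
  return JSON.stringify(data).replace(/</g, '\\u003c');
}
```

The serialized string can be passed to `dangerouslySetInnerHTML` in the same way as the `Person` example; the escaped form still parses to the original values.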
10. References
- Google Search Central, How Google Search Works: https://developers.google.com/search/docs/fundamentals/how-search-works
- Next.js docs, Metadata files (`sitemap`, `robots`): https://nextjs.org/docs/app/api-reference/file-conventions/metadata
- Next.js docs, Metadata and OG images: https://nextjs.org/docs/app/getting-started/metadata-and-og-images
- Vercel docs, System environment variables: https://vercel.com/docs/projects/environment-variables/system-environment-variables
- Schema.org: https://schema.org
- web.dev, Core Web Vitals: https://web.dev/articles/vitals