SEO Fundamentals


  • Description: Search-engine-agnostic SEO — the discovery → crawl → index → rank pipeline, the universal failure modes (no sitemap, blocked robots.txt, accidental noindex, thin home page, indexed preview deployments), and the Next.js App Router + Vercel implementation: app/sitemap.ts, app/robots.ts, per-page metadata API, canonical URLs, JSON-LD, mobile/Core Web Vitals.
  • My Notion Note ID: K2B-8-2
  • Created: 2026-05-23
  • Updated: 2026-05-23
  • License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io


1. The Pipeline

Four sequential stages — a page can fail at any one.

| Stage | What happens | Failure mode |
| --- | --- | --- |
| Discovery | Engine learns the URL exists (inbound link, sitemap, manual submit). | URL exists but no inbound signal → never crawled. |
| Crawl | Bot fetches the HTML. | Blocked by robots.txt, 4xx/5xx, slow response, JS-only render. |
| Index | Engine parses + stores the page. | noindex meta tag, duplicate without canonical, thin content. |
| Rank | Page appears in SERPs for queries. | Indexed but outranked, off-topic, no backlinks. |
  • site:yourdomain.com returning nothing → not indexed. Some pages but not others → discovery or crawl is partial.
  • The pipeline is the same across Google, Bing, Baidu, Yandex — engine-specific quirks live in how each stage is implemented, not the stages themselves.

2. Why a Fresh Site Isn't Searchable

Default causes, roughly in frequency order:

  • Never submitted to a webmaster console (Google Search Console, Bing Webmaster Tools, Baidu Search Resource Platform / 搜索资源平台) → no fast discovery path. A new domain with zero inbound links is invisible until you announce it.
  • No sitemap → crawler can only discover via internal links; orphan pages stay invisible.
  • robots.txt accidentally disallowing / → blocks all crawling. The Next.js default (no file) is "allow everything"; custom configs sometimes go wrong.
  • Per-page <meta name="robots" content="noindex"> → page is crawlable but not indexable. Some templates ship this for dev/staging and forget to gate it.
  • Preview deployments indexed → Vercel's *.vercel.app previews can leak into search results, hurting canonical authority. See § 8.
  • Thin content on home page → engines downrank or skip. A near-empty landing page tells crawlers "low value."
  • Time → new pages take days to weeks for Google and Bing, often weeks for Baidu. Patience is part of the process even after everything is correct.
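Two of the causes above (a blanket robots.txt disallow and a stray noindex meta tag) can be checked mechanically. The sketch below is one rough way to do that on text you have already fetched; the function names are hypothetical, and the parsing is deliberately simplified (it ignores Allow overrides and engine-specific user-agent groups), so treat a positive result as a prompt to look closer rather than a verdict.

```typescript
// Hypothetical helpers for illustration; not part of Next.js or any library.

/** True if the HTML contains a robots meta tag whose content includes "noindex". */
function hasNoindexMeta(html: string): boolean {
  const metaTags = html.match(/<meta\b[^>]*>/gi) ?? [];
  return metaTags.some(
    (tag) =>
      /\bname\s*=\s*["']?robots["']?/i.test(tag) &&
      /\bcontent\s*=\s*["'][^"']*\bnoindex\b/i.test(tag),
  );
}

/** True if robots.txt has a "User-agent: *" group containing a bare "Disallow: /". */
function blocksAllCrawlers(robotsTxt: string): boolean {
  let inWildcardGroup = false;
  let lastWasUserAgent = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (line === '' || sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines belong to the same group.
      inWildcardGroup = lastWasUserAgent ? inWildcardGroup || value === '*' : value === '*';
      lastWasUserAgent = true;
    } else {
      lastWasUserAgent = false;
      if (field === 'disallow' && inWildcardGroup && value === '/') return true;
    }
  }
  return false;
}
```

Running both against the production home page and /robots.txt after each deploy would catch the two silent blockers before a search console does.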

3. Sitemap in Next.js App Router

Next.js 13.3+ supports a file-based sitemap convention. Put this in app/sitemap.ts:

import type { MetadataRoute } from 'next';

const BASE = 'https://example.com';

export default function sitemap(): MetadataRoute.Sitemap {
  return [
    { url: `${BASE}/`,         lastModified: new Date(), changeFrequency: 'weekly',  priority: 1.0 },
    { url: `${BASE}/about`,    lastModified: new Date(), changeFrequency: 'monthly', priority: 0.8 },
    { url: `${BASE}/thoughts`, lastModified: new Date(), changeFrequency: 'weekly',  priority: 0.8 },
    { url: `${BASE}/notes`,    lastModified: new Date(), changeFrequency: 'daily',   priority: 0.9 },
  ];
}

Build emits /sitemap.xml automatically.

3.1 Generating from filesystem

For a notes site with many markdown files, enumerate dynamically:

import type { MetadataRoute } from 'next';
import { getAllNoteSlugs } from '@/lib/notes';

const BASE = process.env.NEXT_PUBLIC_SITE_URL ?? 'https://example.com';

export default function sitemap(): MetadataRoute.Sitemap {
  const noteEntries: MetadataRoute.Sitemap = getAllNoteSlugs().map((slug) => ({
    url: `${BASE}/notes/${slug}`,
    lastModified: new Date(),
    changeFrequency: 'monthly',
    priority: 0.6,
  }));

  return [
    { url: BASE,              priority: 1.0, changeFrequency: 'weekly'  },
    { url: `${BASE}/about`,   priority: 0.8, changeFrequency: 'monthly' },
    ...noteEntries,
  ];
}
  • Use an env var for BASE — hard-coding vercel.app URLs in sitemaps is the #1 source of "search engine indexed my staging site" issues.
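  • 50,000-URL limit per sitemap: for larger sites, split the output into multiple files by exporting a generateSitemaps function from app/sitemap.ts alongside the default export; each id it returns produces its own sitemap file.
  • lastModified matters: engines use it as a hint for when to re-crawl. new Date() stamps every URL with the build time on every deploy, which makes the signal meaningless; pull the date from git or content metadata if possible.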
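The last two points can be sketched as plain functions. This is a minimal illustration, assuming each note carries an updatedAt date string from front matter or git; NoteMeta, noteEntries, and chunkEntries are hypothetical names, and SitemapEntry is a local stand-in for one MetadataRoute.Sitemap item so the sketch runs without Next.js installed.

```typescript
// Local stand-in for one MetadataRoute.Sitemap entry.
type SitemapEntry = { url: string; lastModified?: Date };

// Hypothetical note record; updatedAt is assumed to come from front matter or git history.
type NoteMeta = { slug: string; updatedAt: string };

const MAX_URLS_PER_SITEMAP = 50000;

/** Build entries whose lastModified reflects the content, not the build time. */
function noteEntries(base: string, notes: NoteMeta[]): SitemapEntry[] {
  return notes.map((note) => ({
    url: `${base}/notes/${note.slug}`,
    lastModified: new Date(note.updatedAt),
  }));
}

/** Split entries into chunks that each fit within one sitemap file. */
function chunkEntries(
  entries: SitemapEntry[],
  size: number = MAX_URLS_PER_SITEMAP,
): SitemapEntry[][] {
  const chunks: SitemapEntry[][] = [];
  for (let i = 0; i < entries.length; i += size) {
    chunks.push(entries.slice(i, i + size));
  }
  return chunks;
}
```

In app/sitemap.ts, an exported generateSitemaps function would return one { id } per chunk, and the default export would return the chunk matching the id it receives. How the id is passed differs slightly across Next.js versions, so check the docs for yours.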

4. robots.txt in Next.js App Router

Put app/robots.ts:

import type { MetadataRoute } from 'next';

const BASE = 'https://example.com';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: '*', allow: '/', disallow: ['/api/', '/admin/'] },
    ],
    sitemap: `${BASE}/sitemap.xml`,
    host: BASE,
  };
}

Build emits /robots.txt.

4.1 Conditional rules — block preview deployments

The single biggest Vercel SEO trap: every Git push generates a unique *.vercel.app URL. If a crawler finds one (for example via a link posted publicly), the preview competes with the production domain as duplicate content and dilutes its ranking signals.

export default function robots(): MetadataRoute.Robots {
  const isProd = process.env.VERCEL_ENV === 'production';

  if (!isProd) {
    return {
      rules: [{ userAgent: '*', disallow: '/' }],
    };
  }

  return {
    rules: [{ userAgent: '*', allow: '/', disallow: ['/api/'] }],
    sitemap: 'https://example.com/sitemap.xml',
  };
}
  • VERCEL_ENV is 'production' only on the production branch (usually main). Preview and development deployments get 'preview' and 'development' → blanket disallow.
  • Belt-and-suspenders: set X-Robots-Tag: noindex header on non-prod via vercel.json or middleware.
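The header approach could look roughly like this. robotsHeaders is a hypothetical helper, kept free of Next.js imports so it can be tested in isolation; the commented wiring uses NextResponse.next() from next/server and assumes a standard middleware.ts.

```typescript
/**
 * Headers to attach to every response.
 * Anything other than 'production' (including undefined, as in local dev)
 * is treated as non-indexable.
 */
function robotsHeaders(vercelEnv: string | undefined): Record<string, string> {
  return vercelEnv === 'production' ? {} : { 'X-Robots-Tag': 'noindex, nofollow' };
}

// Possible wiring in middleware.ts (sketch, not tested here):
//
//   import { NextResponse } from 'next/server';
//
//   export function middleware() {
//     const res = NextResponse.next();
//     for (const [name, value] of Object.entries(robotsHeaders(process.env.VERCEL_ENV))) {
//       res.headers.set(name, value);
//     }
//     return res;
//   }
```

One design caveat: a crawler that obeys a blanket robots.txt disallow never fetches the page, so it never sees this header. The header mainly covers crawlers that fetch anyway and cases where the robots.txt guard is removed by mistake.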

5. Per-Page Metadata

Next.js metadata API:

// app/layout.tsx — site-wide defaults
import type { Metadata } from 'next';

export const metadata: Metadata = {
  metadataBase: new URL('https://example.com'),
  title: {
    default: 'Yu Zhang — Engineer, Writer, Builder',
    template: '%s — Yu Zhang',
  },
  description: 'Tech notes, career reflections, and side projects by Yu Zhang.',
  openGraph: {
    siteName: 'example.com',
    locale: 'en_US',
    type: 'website',
  },
  twitter: {
    card: 'summary_large_image',
  },
};
// app/notes/[...slug]/page.tsx — per-page overrides
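import type { Metadata } from 'next';
import { getNote } from '@/lib/notes'; // assumed location of the note loader

// Catch-all route: slug is an array of path segments.
// In Next.js 15+, params is a Promise and must be awaited before use.
type Props = { params: { slug: string[] } };

export async function generateMetadata({ params }: Props): Promise<Metadata> {
  const note = await getNote(params.slug);
  return {
    title: note.title,
    description: note.description,
    // Relative path; resolved against metadataBase from the root layout.
    alternates: { canonical: `/notes/${params.slug.join('/')}` },
  };
}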
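  • Title length: roughly 50-60 characters renders fully in Google SERPs. Truncation is by pixel width, so treat the range as a guideline; longer titles get cut off.
  • Description length: about 150-160 characters on desktop, about 120 on mobile. Anything longer is cut off with an ellipsis, often mid-sentence.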
  • metadataBase — set this at the root so relative URLs in openGraph.images, alternates.canonical, etc. resolve correctly.
  • title.template: the %s placeholder is replaced by the page title, so child pages get the site brand appended automatically without each page repeating it.
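The length guidance above is easy to enforce in a build step or test. Here is a minimal sketch, assuming the 60 and 160 character ceilings from the bullets; applyTitleTemplate and lintMetadata are hypothetical helpers, and the template substitution only mimics the %s behavior described above rather than calling into Next.js.

```typescript
// Rules of thumb from the guidance above, not limits enforced by any engine.
const TITLE_MAX = 60;
const DESCRIPTION_MAX = 160;

/** Mimic how a '%s' title template combines with a page title. */
function applyTitleTemplate(template: string, pageTitle: string): string {
  // A replacer function avoids special "$" patterns in pageTitle being interpreted.
  return template.replace('%s', () => pageTitle);
}

/** Return human-readable warnings for metadata likely to be truncated. */
function lintMetadata(title: string, description: string): string[] {
  const warnings: string[] = [];
  if (title.length > TITLE_MAX) {
    warnings.push(`title is ${title.length} chars (max ${TITLE_MAX})`);
  }
  if (description.length > DESCRIPTION_MAX) {
    warnings.push(`description is ${description.length} chars (max ${DESCRIPTION_MAX})`);
  }
  return warnings;
}
```

Linting the title after the template is applied matters, because the brand suffix counts toward the visible length.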

6. Canonical URLs and hreflang

Tell engines the preferred URL when multiple paths return the same content.

export const metadata: Metadata = {
  alternates: {
    canonical: '/about',
  },
};
  • Without metadataBase, the canonical needs to be absolute (https://example.com/about).
  • For bilingual notes with same URL + language switcher → alternates.languages produces <link rel="alternate" hreflang="...">:
alternates: {
  canonical: '/notes/cpp/templates',
  languages: {
    'en-US': '/notes/cpp/templates?lang=en',
    'zh-CN': '/notes/cpp/templates?lang=zh',
  },
}
  • hreflang matters most for bilingual sites targeting both English and Chinese audiences — without it, Google might serve the English version to a Chinese-speaking user, hurting bounce rate and rank. See companion notes for Google + Baidu specifics.
  • For truly duplicate URLs (trailing slash, www vs apex, http vs https) → fix at the redirect layer (Vercel redirects or next.config.js redirects()), not canonical alone. Canonical is a hint; redirects are enforced.
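Redirects do the enforcing, but the canonical values themselves still need to be generated consistently. The sketch below shows one possible normalization, assuming https, a single canonical hostname, no trailing slash, and removal of utm_* tracking parameters only (so a ?lang= switch like the one above survives). canonicalize is a hypothetical helper built on the standard URL class.

```typescript
/** Normalize a URL to one canonical form: https, fixed host, no trailing slash, no tracking params. */
function canonicalize(rawUrl: string, canonicalHost: string): string {
  const url = new URL(rawUrl);
  url.protocol = 'https:';
  url.hostname = canonicalHost; // e.g. collapse www.example.com onto example.com
  url.hash = '';

  // Collect keys first, then delete, to avoid mutating while iterating.
  const trackingKeys: string[] = [];
  url.searchParams.forEach((_value, key) => {
    if (key.toLowerCase().startsWith('utm_')) trackingKeys.push(key);
  });
  for (const key of trackingKeys) url.searchParams.delete(key);

  // Strip a trailing slash everywhere except the root path.
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}
```

Whatever rules are chosen here should match the redirect rules exactly; a canonical that points at a URL which itself redirects sends engines a mixed signal.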

7. Internal Linking and Content

  • Crawlers discover via links. A page with zero inbound internal links is orphaned even if it's in the sitemap.
  • Home page should link to every top-level section. Top-level sections should link to their key sub-pages.
  • A breadcrumb component (already common in notes navigation) doubles as internal-link infrastructure and as a SERP enrichment via BreadcrumbList JSON-LD.
  • Thin home page is the most common cause of poor first-impression ranking. Engines want substantive content above the fold — at minimum: site purpose in 1 sentence, latest content links, brief about.
  • Anchor text matters: descriptive link text ("K2B-8-1 Personal Site Infrastructure") carries more weight than a generic "click here".
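Orphaned pages can be found with a breadth-first walk over the internal link graph. This sketch assumes you can already produce a map from each page path to the internal paths it links to (for example by scanning rendered HTML or markdown); LinkGraph and findOrphans are hypothetical names.

```typescript
/** Map from a page path to the internal paths it links to. */
type LinkGraph = Record<string, string[]>;

/** Sitemap paths that cannot be reached from `start` by following internal links. */
function findOrphans(graph: LinkGraph, sitemapPaths: string[], start: string = '/'): string[] {
  const visited = new Set<string>([start]);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const page = queue.shift() as string;
    for (const target of graph[page] ?? []) {
      if (!visited.has(target)) {
        visited.add(target);
        queue.push(target);
      }
    }
  }
  return sitemapPaths.filter((path) => !visited.has(path));
}
```

A page that links out but has no inbound path from the home page still counts as an orphan here, which matches how a crawler starting at the home page would experience the site.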

8. Vercel-Specific Pitfalls

  • Preview deployments indexed — see § 4.1. Use VERCEL_ENV guard in robots.ts.
  • Hard-coded *.vercel.app in sitemap — happens when metadataBase or sitemap BASE reads from VERCEL_URL instead of a custom env var. Fix: NEXT_PUBLIC_SITE_URL set in project settings, fall back to VERCEL_URL only for previews.
  • Missing custom domain binding — production deploys to <project>.vercel.app, custom domain not attached. Result: canonical URLs in sitemap point one place, real domain points another. Check Project → Domains in Vercel.
  • X-Robots-Tag: noindex added by some templates "for safety" — survives production deploys and silently blocks indexing. Search for this string in middleware, vercel.json, and response headers.
  • Trailing slash inconsistency: next.config.js has trailingSlash: false (default), but Vercel may still serve both forms. Pick one and enforce it via redirect.
  • Build-time env vars not in scope — sitemap generated at build time. process.env.NEXT_PUBLIC_SITE_URL must be set in Vercel project settings (not just .env.local).
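The env-var fallback described above could be centralized in one helper so that sitemap.ts, robots.ts, and metadataBase all agree. A minimal sketch: resolveSiteUrl is a hypothetical name, the env object is passed in (you would pass process.env) so the logic is testable, and the localhost fallback is an assumption for local development. Note that VERCEL_URL carries no protocol, so one has to be prepended.

```typescript
type Env = Record<string, string | undefined>;

/** Pick the base URL for sitemap, robots, and metadataBase. */
function resolveSiteUrl(env: Env): string {
  const explicit = env['NEXT_PUBLIC_SITE_URL'];
  if (explicit) return explicit.replace(/\/+$/, ''); // drop trailing slashes

  // Fail loudly instead of silently publishing *.vercel.app URLs from production.
  if (env['VERCEL_ENV'] === 'production') {
    throw new Error('NEXT_PUBLIC_SITE_URL must be set for production builds');
  }

  // Preview deployments: VERCEL_URL is a bare hostname without "https://".
  const vercelUrl = env['VERCEL_URL'];
  if (vercelUrl) return `https://${vercelUrl}`;

  return 'http://localhost:3000'; // assumed local dev default
}
```

Throwing in production is a deliberate choice in this sketch: a failed build is easier to notice than a sitemap that quietly points at the wrong host.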

9. Mobile, Speed, Structured Data

  • PageSpeed Insights (pagespeed.web.dev) reports Lighthouse scores and Core Web Vitals. Largest Contentful Paint (LCP) under 2.5 s is Google's "good" threshold and a sensible target for Baidu as well.
  • Mobile-first indexing — both Google and Baidu primarily evaluate the mobile version. Test responsive layout actually works, not just shrinks.
  • Structured data: a JSON-LD <script> tag enriches SERP rendering. It is valid in either <head> or <body>; the example here renders it from the page component:
// app/page.tsx
export default function HomePage() {
  const jsonLd = {
    '@context': 'https://schema.org',
    '@type': 'Person',
    name: 'Yu Zhang',
    url: 'https://example.com',
    sameAs: ['https://github.com/...', 'https://twitter.com/...'],
  };
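  // Escape "<" so a string value can never close the script tag early.
  const jsonLdHtml = JSON.stringify(jsonLd).replace(/</g, '\\u003c');
  return (
    <>
      <script
        type="application/ld+json"
        dangerouslySetInnerHTML={{ __html: jsonLdHtml }}
      />
      {/* ...rest of the page... */}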
    </>
  );
}
  • Common types worth adding: Person (homepage), Article or BlogPosting (notes), BreadcrumbList (navigation), WebSite with SearchAction (sitelinks search box).
  • Validate at validator.schema.org and Google's Rich Results Test.
  • Schema.org is recognized by Google, Bing, Yandex; Baidu has its own data.baidu.com schema but also reads Schema.org partially.
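As an illustration of one of the types listed above, here is a rough sketch of building BreadcrumbList JSON-LD from a breadcrumb trail. Crumb, breadcrumbJsonLd, and serializeJsonLd are hypothetical names; the output shape follows the Schema.org BreadcrumbList and ListItem vocabulary, and the result should still be run through the validators mentioned above.

```typescript
type Crumb = { name: string; path: string };

/** Build a Schema.org BreadcrumbList object from an ordered trail of crumbs. */
function breadcrumbJsonLd(base: string, crumbs: Crumb[]) {
  return {
    '@context': 'https://schema.org',
    '@type': 'BreadcrumbList',
    itemListElement: crumbs.map((crumb, index) => ({
      '@type': 'ListItem',
      position: index + 1, // positions are 1-based
      name: crumb.name,
      item: `${base}${crumb.path}`,
    })),
  };
}

/** Serialize for a <script type="application/ld+json"> tag, escaping "<" so content cannot close the tag. */
function serializeJsonLd(data: unknown): string {
  return JSON.stringify(data).replace(/</g, '\\u003c');
}
```

The same trail that drives the visible breadcrumb component can feed this function, which keeps the markup and the structured data from drifting apart.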

10. References