PRDC — Links scan, classification, and memo design

Overview

PRDC is a non-interactive tool that scans the wiki’s Markdown, classifies every link destination, persists a mergeable ledger to disk, and applies one content mutation: localhost links are replaced with italic text. Prefer a standalone Python implementation, consistent with other repo maintenance scripts under docs/superpowers/scripts/.

This document is the authoritative description of memo behavior. It refines the plan note that url is “the same string the parser extracted”: in PRDC, url is the canonical memo key after the normalization rules below (parser output is an input, not the final key).

Goals

  • Scan all in-scope wiki Markdown, extract inline and reference link destinations (any scheme or relative path the Markdown layer exposes).
  • Classify each extracted link per Appendix A (non-URL rules and url classify table).
  • Maintain /_memoize/LINKS.json across runs with stable merge semantics and no prompts.
  • Rewrite localhost:[PORT] links in source Markdown to italics on every run, with no dry-run mode and no separate CLI flag for that behavior (operator intent is “run PRDC = scan + classify + memo + localhost fix”).

Non-goals

  • Building or serving the Jekyll site as part of PRDC.
  • Tracking links in category NONE (they never appear in JSON).
  • Interactive confirmation, TTY prompts, or per-file approval gates.

Scan scope

Include

  • Markdown files that are wiki content: topic trees such as dotnet/, architecture/, data/, and analogous content directories defined at implementation time (mirror CLAUDE.md / STRUCTURE.md reality).

Exclude

  • GitHub / community root files at repository root: at minimum README.md, CONTRIBUTING.md, CHANGELOG.md, CONTRIBUTORS.md, LICENSE, STRUCTURE.md, and plan*.md scratch plans unless explicitly included later.
  • Agent / Claude meta (e.g. .claude/, CLAUDE.md if treated as meta).
  • Superpowers docs under docs/superpowers/ (specs, plans, reports, scripts) unless explicitly re-scoped later.
  • Hub index.md files anywhere under wiki content trees (including repository-root index.md): exclude the entire file from scanning.
  • Backward / forward navigation between Markdown content (the plan’s “between md content” nav patterns — define precise patterns during implementation, e.g. same-folder [<](./index.md) / depth-adjusted root index links, so PRDC does not ledger or classify them).

Extraction

  • A link is any Markdown link destination: https, http, mailto, tel, relative paths, other schemes, fragments, autolinks if present, including reference-style link destinations.
  • Parser choice is implementation detail; it must resolve reference definitions to a concrete href string before normalization.

Classification

Apply the type → category table in Appendix A (images → review, mermaid/ascii → diagram, Google/Drive → resource, clear extensions → extension, CI/badge links → NONE, programming-language sample extensions → sample, else review unless url classify matches).

url classify uses the rules in Appendix A (npm, nuget, blog, pdf/resource, GitHub repo shape, Mongo docs hosts, Wikipedia, Docker Hub, Microsoft / Google doc hosts, O’Reilly, YouTube, gist, SlideShare, Stack Exchange family → mapped categories; anything else under web classification path → review unless another rule fires first).

Ordering: Define a deterministic rule order in implementation (e.g. NONE and special cases before generic http rules) so two runs on the same tree produce the same category for the same canonical url.

Memo file

  • Path: /_memoize/LINKS.json (create /_memoize/ on first run if missing).
  • Shape: JSON array of objects: { "url": "<string>", "category": "<string>" } only (no file field).
  • Serialize every in-scope extracted link except category NONE (those never appear).
  • Uniqueness: at most one object per canonical url after normalization.
  • Determinism: write JSON with stable sorting (e.g. sort by url, then category) so diffs are repeatable.

Canonical url (memo key)

http / https — aggressive normalization

Input: parser-extracted href for http/https.

  • Lowercase scheme and host; apply IDNA / punycode for the host when applicable.
  • Strip default ports (:443 for https, :80 for http).
  • Percent-decode path segments where decoding is idempotent and safe (no double-decoding loops). If decoding would change semantics ambiguously, keep the encoded form for that segment (document edge cases in implementation tests).
  • Resolve dot segments in the path (/./, /../) when purely hierarchical.
  • Path empty vs root: normalize bare https://host to use path / consistently.
  • Trailing slash: remove a single trailing / from the path when the path has more than one character and ends with / (e.g. /wiki/foo//wiki/foo; / remains /).
  • Fragment: retain #fragment as part of the canonical web URL (different fragments ⇒ different memo rows).
  • Query: retain query parameters, but remove query keys that match a built-in only blocklist (T1). Strip entire duplicate keys consistently; sort remaining keys for stable serialization. Initial built-in keys should include common trackers, including at minimum: utm_source, utm_medium, utm_campaign, utm_term, utm_content, gclid, fbclid, msclkid, mc_eid, yclid, _ga, spm (extend in code with comments; spec lists intent, code lists exact names).

mailto / tel — light normalization (W2)

  • mailto:: lowercase scheme; trim ASCII whitespace around the full href; for mailto:user@host, lowercase host part; preserve ?subject= / &body= query semantics with the same tracking-key strip as http where applicable.
  • tel: strip ASCII whitespace, parentheses, and hyphens from the tel-national portion while preserving leading + and digits; scheme lowercased.

These remain mailto:... / tel:... strings — not repo-relative paths.

Path-like / in-repo keys

For destinations that are not http, https, mailto, or tel (relative paths, root-anchored site paths, other schemes classified into the path pipeline):

  • Resolve relative to the source Markdown file’s directory to a path under the repository root.
  • Emit repo-relative POSIX paths (folder/file.md), even on Windows (P1).
  • Strip #fragment from identity for these keys (F2): canonical url is the file path only.
  • If a path cannot be resolved to an existing repo file: emit a best-effort repo-relative POSIX path (after syntactic normalization) and classify as review unless a higher-priority rule applies.

localhost:[PORT] mutation

  • When a link destination matches localhost with an explicit port (pattern fixed in implementation, aligned with the plan’s intent), remove the Markdown link and replace the link text with italic using the same visible text (or the literal localhost:port if link text is empty — edge case defined in implementation).
  • Apply whenever PRDC runs, in the same pass as scanning, without a dry-run default and without a dedicated “fix localhost” flag.

Merge semantics (LINKS.json)

On startup:

  1. If LINKS.json is missing, start from an empty array.
  2. Otherwise load existing rows.

During the run:

  • Compute canonical url → category for every extracted in-scope link (skipping NONE for persistence).
  • Insert rows for new canonical url values.
  • Update category when the newly computed category differs from the stored one, except:
    • If an existing row has category != "review", do not re-classify; keep stored category and row as-is for that url.
  • Rows with category == "review" are re-evaluated every run until the computed category becomes something other than review, at which point the row updates and then participates in the frozen non-review rule above.

Write the merged array back to LINKS.json with stable sorting.

CLI / execution

  • Non-interactive: never prompt; failures print a clear message and exit non-zero.
  • No dry-run mode and no separate flag for localhost rewriting (approved operator model: running PRDC applies fixes).

Exact invocation (python -m …, script name, arguments such as repo root defaulting to cwd) is left to the implementation plan phase.

Testing (minimum)

  • Golden tests for normalization: representative http/https, mailto/tel, and POSIX path keys (including fragment strip for paths, fragment keep for https).
  • Merge tests: frozen non-review, review reassessed, NONE omitted.
  • Fixture Markdown snippets: inline + reference links, at least one localhost:PORT before/after rewrite.

Open items for implementation plan only

  • Exact directory glob list for “wiki content” vs excluded roots (derived from STRUCTURE.md / current tree).
  • Precise regex or AST patterns for nav exclusion (hub and sibling nav).
  • Parser library choice and version pinning.

Non-URL / general type → category

type category
urls/hyperlinks or web pages see url classify below
pictures, images review
mermaid diagrams, ascii art diagram
links to google/one drive resource
pdf, docx, file extension is clear/recognizable extension
status badges, git builds, azure build pipeline links NONE (omit from JSON)
contains localhost:[PORT] rewrite in source (italic); classify/link handling per implementation plan
ends with recognized programming language code extension (e.g. .cs, .py) sample
any other links review

url classify (apply to web / http / https URLs after scheme detection; use first matching rule in implementation’s fixed order; “anything else” → review):

url description category name
starts with https://www.npmjs.com/package/ npm package
starts with https://www.nuget.org/packages nuget package
host contains blogspot. (any TLD) blog
starts with http://blogs. blog
path ends with .pdf (case-insensitive) resource
matches GitHub repo root URL shape https://github.com/{author}/{repo} or http://github.com/{author}/{repo} with no extra path segments beyond optional trailing / github source repository
host contains mongodb.github.io technical documentation
starts with http://en.wikipedia.org/wiki knowledge article
Docker Hub image URL shape https://hub.docker.com/r/ (image path) docker image
starts with https://learn.microsoft.com knowledge article
starts with https://social.technet.microsoft.com knowledge article
starts with https://docs.microsoft.com technical documentation
host contains msdn.microsoft.com technical documentation
host contains technet.microsoft.com technical documentation
host contains developers.google.com technical documentation
starts with https://learning.oreilly.com book
starts with https://www.youtube.com or https://youtu.be YT
starts with https://gist.github.com gist
starts with https://www.slideshare.net lecture
host contains codereview.stackexchange.com SO
host contains stackoverflow.com SO
host contains experts-exchange SO
anything else (http/https) review