PRDC — Links scan, classification, and memo design
Overview
PRDC is a non-interactive tool that scans the wiki’s Markdown, classifies every link destination, persists a mergeable ledger to disk, and applies one content mutation: localhost links are replaced with italic text. Prefer a standalone Python implementation, consistent with other repo maintenance scripts under docs/superpowers/scripts/.
This document is the authoritative description of memo behavior. It refines the plan note that url is “the same string the parser extracted”: in PRDC, url is the canonical memo key after the normalization rules below (parser output is an input, not the final key).
Goals
- Scan all in-scope wiki Markdown, extract inline and reference link destinations (any scheme or relative path the Markdown layer exposes).
- Classify each extracted link per Appendix A (non-URL rules and url classify table).
- Maintain /_memoize/LINKS.json across runs with stable merge semantics and no prompts.
- Rewrite localhost:[PORT] links in source Markdown to italics on every run, with no dry-run mode and no separate CLI flag for that behavior (operator intent is “run PRDC = scan + classify + memo + localhost fix”).
Non-goals
- Building or serving the Jekyll site as part of PRDC.
- Tracking links in category NONE (they never appear in JSON).
- Interactive confirmation, TTY prompts, or per-file approval gates.
Scan scope
Include
- Markdown files that are wiki content: topic trees such as dotnet/, architecture/, data/, and analogous content directories defined at implementation time (mirror CLAUDE.md / STRUCTURE.md reality).
Exclude
- GitHub / community root files at repository root: at minimum README.md, CONTRIBUTING.md, CHANGELOG.md, CONTRIBUTORS.md, LICENSE, STRUCTURE.md, and plan*.md scratch plans unless explicitly included later.
- Agent / Claude meta (e.g. .claude/, CLAUDE.md if treated as meta).
- Superpowers docs under docs/superpowers/ (specs, plans, reports, scripts) unless explicitly re-scoped later.
- Hub index.md files anywhere under wiki content trees (including repository-root index.md): exclude the entire file from scanning.
- Backward / forward navigation between Markdown content (the plan’s “between md content” nav patterns; define precise patterns during implementation, e.g. same-folder [<](./index.md) / depth-adjusted root index links, so PRDC does not ledger or classify them).
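The exclusion rules above could be checked with a small predicate along these lines. This is a minimal sketch, not the agreed implementation: the function name is_in_scope is hypothetical, CLAUDE.md is assumed to be treated as meta, and every Markdown file that is not excluded is assumed to be wiki content (the real include list is still an open item). Nav-link exclusion is per link rather than per file, so it is not covered here.

```python
import fnmatch
from pathlib import PurePosixPath

# Root-level files excluded from scanning (CLAUDE.md assumed to be meta).
EXCLUDED_ROOT_FILES = {
    "README.md", "CONTRIBUTING.md", "CHANGELOG.md", "CONTRIBUTORS.md",
    "LICENSE", "STRUCTURE.md", "CLAUDE.md",
}
# Directory prefixes excluded wholesale.
EXCLUDED_DIR_PREFIXES = (".claude/", "docs/superpowers/")


def is_in_scope(rel_path):
    """Return True if a repo-relative POSIX path should be scanned."""
    p = PurePosixPath(rel_path)
    if p.suffix.lower() != ".md":
        return False  # only Markdown files are scanned
    if p.name == "index.md":
        return False  # hub files are excluded at any depth
    if rel_path.startswith(EXCLUDED_DIR_PREFIXES):
        return False
    at_root = len(p.parts) == 1
    if at_root and (p.name in EXCLUDED_ROOT_FILES
                    or fnmatch.fnmatchcase(p.name, "plan*.md")):
        return False
    return True
```

fnmatchcase is used instead of fnmatch so the plan*.md match does not depend on the host operating system's case rules.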
Extraction
- A link is any Markdown link destination: https, http, mailto, tel, relative paths, other schemes, fragments, autolinks if present, including reference-style link destinations.
- Parser choice is implementation detail; it must resolve reference definitions to a concrete href string before normalization.
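To illustrate what “resolve reference definitions to a concrete href” means, the following regex-based sketch extracts inline destinations and resolves full and collapsed reference links against their definitions. It is only an illustration under simplifying assumptions: the name extract_hrefs is hypothetical, and a real Markdown parser would also handle shortcut references, angle-bracket destinations, destinations containing parentheses, autolinks, and code spans, none of which are covered here. Image destinations are matched as well, since images are classified too.

```python
import re

# [label]: destination   (optional title after the destination is ignored)
REF_DEF = re.compile(r"^ {0,3}\[([^\]]+)\]:\s*(\S+)", re.MULTILINE)
# [text](destination "optional title")
INLINE = re.compile(r"\[[^\]]*\]\(([^)\s]+)[^)]*\)")
# [text][label] or collapsed [label][]
REF_USE = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")


def extract_hrefs(markdown):
    """Return inline hrefs followed by resolved reference-style hrefs."""
    # Reference labels are matched case-insensitively.
    defs = {label.lower(): href for label, href in REF_DEF.findall(markdown)}
    hrefs = [m.group(1) for m in INLINE.finditer(markdown)]
    for text, label in REF_USE.findall(markdown):
        key = (label or text).lower()  # collapsed form reuses the link text
        if key in defs:
            hrefs.append(defs[key])
    return hrefs
```

Note that this sketch does not preserve document order between inline and reference links; the memo is sorted on write, so order does not affect the ledger.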
Classification
Apply the type → category table in Appendix A (images → review, mermaid/ascii → diagram, Google/Drive → resource, clear extensions → extension, CI/badge links → NONE, programming-language sample extensions → sample, else review unless url classify matches).
url classify uses the rules in Appendix A (npm, nuget, blog, pdf/resource, GitHub repo shape, Mongo docs hosts, Wikipedia, Docker Hub, Microsoft / Google doc hosts, O’Reilly, YouTube, gist, SlideShare, Stack Exchange family → mapped categories; anything else under web classification path → review unless another rule fires first).
Ordering: Define a deterministic rule order in implementation (e.g. NONE and special cases before generic http rules) so two runs on the same tree produce the same category for the same canonical url.
Memo file
- Path: /_memoize/LINKS.json (create /_memoize/ on first run if missing).
- Shape: JSON array of objects: { "url": "<string>", "category": "<string>" } only (no file field).
- Serialize every in-scope extracted link except category NONE (those never appear).
- Uniqueness: at most one object per canonical url after normalization.
- Determinism: write JSON with stable sorting (e.g. sort by url, then category) so diffs are repeatable.
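A minimal sketch of the write step, assuming the leading / in the memo path means the repository root. The function name write_memo and the two-space indent are illustrative choices, not part of the spec.

```python
import json
from pathlib import Path


def write_memo(rows, repo_root):
    """Write rows to <repo_root>/_memoize/LINKS.json in a stable order."""
    memo_dir = Path(repo_root) / "_memoize"
    memo_dir.mkdir(parents=True, exist_ok=True)  # created on first run
    # Keep only the two allowed fields and drop NONE rows.
    kept = [
        {"url": row["url"], "category": row["category"]}
        for row in rows
        if row["category"] != "NONE"
    ]
    kept.sort(key=lambda row: (row["url"], row["category"]))
    path = memo_dir / "LINKS.json"
    text = json.dumps(kept, indent=2, ensure_ascii=False) + "\n"
    path.write_text(text, encoding="utf-8")
    return path
```

Sorting before serialization means two runs over the same tree produce byte-identical files, which keeps diffs limited to real changes.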
Canonical url (memo key)
http / https — aggressive normalization
Input: parser-extracted href for http/https.
- Lowercase scheme and host; apply IDNA / punycode for the host when applicable.
- Strip default ports (:443 for https, :80 for http).
- Percent-decode path segments where decoding is idempotent and safe (no double-decoding loops). If decoding would change semantics ambiguously, keep the encoded form for that segment (document edge cases in implementation tests).
- Resolve dot segments in the path (/./, /../) when purely hierarchical.
- Path empty vs root: normalize bare https://host to use path / consistently.
- Trailing slash: remove a single trailing / from the path when the path has more than one character and ends with / (e.g. /wiki/foo/ → /wiki/foo; / remains /).
- Fragment: retain #fragment as part of the canonical web URL (different fragments ⇒ different memo rows).
- Query: retain query parameters, but remove query keys that match a built-in-only blocklist (T1). Strip entire duplicate keys consistently; sort remaining keys for stable serialization. Initial built-in keys should include common trackers, including at minimum: utm_source, utm_medium, utm_campaign, utm_term, utm_content, gclid, fbclid, msclkid, mc_eid, yclid, _ga, spm (extend in code with comments; spec lists intent, code lists exact names).
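The rules above might be combined roughly as follows, using only the standard library. This is a sketch under stated assumptions, not the final normalizer: the names canonical_http and remove_dot_segments are hypothetical, percent-decoding is left out because its edge cases are deferred to implementation tests, userinfo is dropped, and “duplicate keys” is read here as exact duplicate key/value pairs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Blocklist copied from the spec text; real code would extend it.
TRACKING_KEYS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "gclid", "fbclid", "msclkid", "mc_eid", "yclid", "_ga", "spm",
}
DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path):
    """Drop "." segments and resolve ".." against the preceding segment."""
    out = []
    for segment in path.split("/"):
        if segment == ".":
            continue
        if segment == "..":
            if len(out) > 1:  # never pop the leading empty segment (root)
                out.pop()
            continue
        out.append(segment)
    return "/".join(out) or "/"


def canonical_http(href):
    parts = urlsplit(href.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    try:
        host = host.encode("idna").decode("ascii")  # punycode if non-ASCII
    except UnicodeError:
        pass  # keep the host unchanged if IDNA encoding fails
    port = parts.port
    is_default = port in (None, DEFAULT_PORTS.get(scheme))
    netloc = host if is_default else f"{host}:{port}"
    path = remove_dot_segments(parts.path or "/")
    if len(path) > 1 and path.endswith("/"):
        path = path[:-1]  # strip a single trailing slash
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_KEYS
    ]
    query = urlencode(sorted(set(pairs)))  # dedupe exact pairs, sort by key
    return urlunsplit((scheme, netloc, path, query, parts.fragment))
```

For example, canonical_http("HTTPS://Example.COM:443/wiki/./a/../foo/?utm_source=x&b=2&a=1#Top") yields https://example.com/wiki/foo?a=1&b=2#Top under these assumptions.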
mailto / tel — light normalization (W2)
- mailto: lowercase scheme; trim ASCII whitespace around the full href; for mailto:user@host, lowercase host part; preserve ?subject= / &body= query semantics with the same tracking-key strip as http where applicable.
- tel: strip ASCII whitespace, parentheses, and hyphens from the tel-national portion while preserving leading + and digits; scheme lowercased.
These remain mailto:... / tel:... strings — not repo-relative paths.
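A sketch of the light normalization for these two schemes. The function names are hypothetical, and the mailto query string is passed through unchanged here; the tracking-key strip would reuse whatever the http normalizer uses.

```python
import re

ASCII_WHITESPACE = " \t\r\n\f\v"


def canonical_mailto(href):
    href = href.strip(ASCII_WHITESPACE)
    scheme, _, rest = href.partition(":")
    address, sep, query = rest.partition("?")
    local, at, host = address.rpartition("@")
    if at:
        # Only the host part is lowercased; the local part keeps its case.
        address = f"{local}@{host.lower()}"
    return f"{scheme.lower()}:{address}{sep}{query}"


def canonical_tel(href):
    scheme, _, number = href.partition(":")
    # Remove ASCII whitespace, parentheses, and hyphens; "+" and digits stay.
    number = re.sub(r"[ \t\r\n\f\v()\-]", "", number)
    return f"{scheme.lower()}:{number}"
```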
Path-like / in-repo keys
For destinations that are not http, https, mailto, or tel (relative paths, root-anchored site paths, other schemes classified into the path pipeline):
- Resolve relative to the source Markdown file’s directory to a path under the repository root.
- Emit repo-relative POSIX paths (folder/file.md), even on Windows (P1).
- Strip #fragment from identity for these keys (F2): canonical url is the file path only.
- If a path cannot be resolved to an existing repo file: emit a best-effort repo-relative POSIX path (after syntactic normalization) and classify as review unless a higher-priority rule applies.
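The syntactic part of this resolution can be sketched with posixpath, which yields forward-slash paths on every platform. The name canonical_path_key is hypothetical, root-anchored site paths are assumed to map directly onto the repository root, and a fragment-only link is assumed to refer to the source file itself. The existence check and the fallback to review are not shown.

```python
import posixpath


def canonical_path_key(href, source_md):
    """Return the repo-relative POSIX key for a path-like destination.

    source_md is the repo-relative POSIX path of the Markdown file that
    contains the link.
    """
    target = href.split("#", 1)[0]  # F2: fragments are not part of the key
    if not target:
        return source_md  # "#section" points back at the same file
    if target.startswith("/"):
        joined = target.lstrip("/")  # root-anchored site path (assumption)
    else:
        joined = posixpath.join(posixpath.dirname(source_md), target)
    return posixpath.normpath(joined)  # resolves "." and ".." segments
```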
localhost:[PORT] mutation
- When a link destination matches localhost with an explicit port (pattern fixed in implementation, aligned with the plan’s intent), remove the Markdown link and replace the link text with italic using the same visible text (or the literal localhost:port if link text is empty; edge case defined in implementation).
- Apply whenever PRDC runs, in the same pass as scanning, without a dry-run default and without a dedicated “fix localhost” flag.
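One possible shape for the rewrite, as a regex sketch over inline links only. The pattern and the name rewrite_localhost_links are assumptions, since the real pattern is fixed in implementation; reference-style links, links with titles, and images are deliberately left untouched here.

```python
import re

# Inline link whose destination is localhost with an explicit port,
# with or without an http/https scheme. Images (![...]) are skipped.
LOCALHOST_LINK = re.compile(
    r"(?<!!)\[(?P<text>[^\]]*)\]"
    r"\(\s*(?:https?://)?(?P<hostport>localhost:\d+)[^)\s]*\s*\)",
    re.IGNORECASE,
)


def rewrite_localhost_links(markdown):
    """Replace localhost:PORT links with their visible text in italics."""
    def to_italic(match):
        # Fall back to the literal host:port when the link text is empty.
        text = match.group("text").strip() or match.group("hostport")
        return f"*{text}*"

    return LOCALHOST_LINK.sub(to_italic, markdown)
```

Because the replacement contains no link, running the function a second time on its own output changes nothing, which suits the “apply on every run” model.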
Merge semantics (LINKS.json)
On startup:
- If LINKS.json is missing, start from an empty array.
- Otherwise load existing rows.
During the run:
- Compute canonical url → category for every extracted in-scope link (skipping NONE for persistence).
- Insert rows for new canonical url values.
- Update category when the newly computed category differs from the stored one, except:
  - If an existing row has category != "review", do not re-classify; keep stored category and row as-is for that url.
- Rows with category == "review" are re-evaluated every run until the computed category becomes something other than review, at which point the row updates and then participates in the frozen non-review rule above.
Write the merged array back to LINKS.json with stable sorting.
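The merge rules can be sketched as a pure function over the loaded rows and this run's computed categories. The name merge_memo is hypothetical, and the sketch assumes that rows not seen in the current run are kept, since the rules above do not describe deletion.

```python
def merge_memo(existing_rows, computed):
    """Merge this run's results into the stored rows.

    existing_rows: list of {"url": ..., "category": ...} loaded from disk.
    computed: dict mapping canonical url -> category for this run.
    """
    merged = {row["url"]: row["category"] for row in existing_rows}
    for url, category in computed.items():
        if category == "NONE":
            continue  # NONE links are never persisted
        stored = merged.get(url)
        if stored is None or stored == "review":
            merged[url] = category  # new row, or a review row re-evaluated
        # Any other stored category is frozen and left untouched.
    return [{"url": u, "category": c} for u, c in sorted(merged.items())]
```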
CLI / execution
- Non-interactive: never prompt; failures print a clear message and exit non-zero.
- No dry-run mode and no separate flag for localhost rewriting (approved operator model: running PRDC applies fixes).
Exact invocation (python -m …, script name, arguments such as repo root defaulting to cwd) is left to the implementation plan phase.
Testing (minimum)
- Golden tests for normalization: representative http/https, mailto/tel, and POSIX path keys (including fragment strip for paths, fragment keep for https).
- Merge tests: frozen non-review, review reassessed, NONE omitted.
- Fixture Markdown snippets: inline + reference links, at least one localhost:PORT before/after rewrite.
Open items for implementation plan only
- Exact directory glob list for “wiki content” vs excluded roots (derived from STRUCTURE.md / current tree).
- Precise regex or AST patterns for nav exclusion (hub and sibling nav).
- Parser library choice and version pinning.
Appendix A: Link classification rules
Non-URL / general type → category
| type | category |
|---|---|
| urls/hyperlinks or web pages | see url classify below |
| pictures, images | review |
| mermaid diagrams, ascii art | diagram |
| links to google/one drive | resource |
| pdf, docx, file extension is clear/recognizable | extension |
| status badges, git builds, azure build pipeline links | NONE (omit from JSON) |
| contains localhost:[PORT] | rewrite in source (italic); classify/link handling per implementation plan |
| ends with recognized programming language code extension (e.g. .cs, .py) | sample |
| any other links | review |
url classify (apply to web / http / https URLs after scheme detection; use first matching rule in implementation’s fixed order; “anything else” → review):
| url description | category name |
|---|---|
| starts with https://www.npmjs.com/package/ | npm package |
| starts with https://www.nuget.org/packages | nuget package |
| host contains blogspot. (any TLD) | blog |
| starts with http://blogs. | blog |
| path ends with .pdf (case-insensitive) | resource |
| matches GitHub repo root URL shape https://github.com/{author}/{repo} or http://github.com/{author}/{repo} with no extra path segments beyond optional trailing / | github source repository |
| host contains mongodb.github.io | technical documentation |
| starts with http://en.wikipedia.org/wiki | knowledge article |
| Docker Hub image URL shape https://hub.docker.com/r/ (image path) | docker image |
| starts with https://learn.microsoft.com | knowledge article |
| starts with https://social.technet.microsoft.com | knowledge article |
| starts with https://docs.microsoft.com | technical documentation |
| host contains msdn.microsoft.com | technical documentation |
| host contains technet.microsoft.com | technical documentation |
| host contains developers.google.com | technical documentation |
| starts with https://learning.oreilly.com | book |
| starts with https://www.youtube.com or https://youtu.be | YT |
| starts with https://gist.github.com | gist |
| starts with https://www.slideshare.net | lecture |
| host contains codereview.stackexchange.com | SO |
| host contains stackoverflow.com | SO |
| host contains experts-exchange | SO |
| anything else (http/https) | review |
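A first-match classifier over the table could be structured as an ordered list of (predicate, category) pairs. The sketch below covers only a handful of rows, and both the ordering and the names RULES and classify_url are assumptions for illustration; the spec requires only that the order be fixed and deterministic.

```python
import re
from urllib.parse import urlsplit

# Repo root shape: exactly {author}/{repo} with an optional trailing slash.
GITHUB_REPO = re.compile(r"^https?://github\.com/[^/]+/[^/]+/?$")

# Ordered rules: the first predicate that matches decides the category.
RULES = [
    (lambda url, host, path: url.startswith("https://www.npmjs.com/package/"),
     "npm package"),
    (lambda url, host, path: url.startswith("https://www.nuget.org/packages"),
     "nuget package"),
    (lambda url, host, path: "blogspot." in host, "blog"),
    (lambda url, host, path: path.lower().endswith(".pdf"), "resource"),
    (lambda url, host, path: GITHUB_REPO.match(url) is not None,
     "github source repository"),
    (lambda url, host, path: url.startswith("https://gist.github.com"), "gist"),
    (lambda url, host, path: "stackoverflow.com" in host, "SO"),
]


def classify_url(url):
    """Return the category of the first matching rule, else "review"."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    for matches, category in RULES:
        if matches(url, host, parts.path):
            return category
    return "review"
```

Keeping the rules in a single list makes the evaluation order explicit and reviewable, which is what gives two runs on the same tree the same category for the same canonical url.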