gsd-skill-creator · knowledge ingestion

sc:learn
Pipeline Deep Dive

How Skill Creator transforms documents, repos, and URLs into mathematical primitives — then deduplicates, wires dependencies, and generates skills, agents, and teams. A seven-stage pipeline with built-in security, provenance tracking, and human-in-the-loop safety gates.

What sc:learn does

sc:learn is the knowledge ingestion command for Skill Creator. You point it at a source — a markdown file, a PDF, a GitHub repo, a zip archive, a URL — and it runs a seven-stage pipeline that extracts structured mathematical primitives from the content, deduplicates them against your existing registry, and generates skills, agents, and teams from what it learns.

Every piece of knowledge that enters the system goes through sanitization, human approval, structural analysis, semantic deduplication, and provenance tracking. Nothing is silently overwritten. Everything is reversible via sc:unlearn.

Core Guarantee

The pipeline is designed around three safety invariants: STRANGER content never auto-approves, merge never silently overwrites, and every change is recorded in a reversible changeset.

Function Signature

async function scLearn(
  source: string,
  options: ScLearnOptions = {}
): Promise<ScLearnResult>

The function returns a ScLearnResult containing: a success flag, a unique sessionId (for unlearn/revert), a report with full provenance, a changeset (null in dry-run mode), and an errors array.
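The result shape can be sketched in TypeScript. The field list comes from the description above; the concrete types of the report and changeset are placeholders for illustration, not the real definitions.

```typescript
// Sketch of the result shape described above. LearningReport and
// Changeset are stand-in types, not the library's actual definitions.
type LearningReport = Record<string, unknown>;
type Changeset = Record<string, unknown>;

interface ScLearnResult {
  success: boolean;            // false on early exits (acquire/sanitize failures)
  sessionId: string;           // handle for sc:unlearn / revert
  report: LearningReport;      // full provenance report
  changeset: Changeset | null; // null in dry-run mode
  errors: string[];            // fatal and non-fatal errors collected en route
}
```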

The Seven Stages

Every invocation flows through these stages in order. Early exits at stages 1 and 3 prevent bad data from reaching extraction. The pipeline broadcasts progress events at each stage via the onProgress callback.

1. ACQUIRE: Detect source type, download/extract, normalize to plain text. Tag familiarity: HOME (local files, archives) or STRANGER (GitHub, URLs).
2. SANITIZE: Run tiered hygiene checks — prompt injection, hidden characters, embedded code, external resources, path traversal, content-type mismatch. STRANGER tier gets the full scan.
3. HITL GATE: Human-in-the-loop approval. HOME + clean → auto-approve. HOME + findings → user review. STRANGER → always requires explicit user approval, even if clean.
4. ANALYZE: Parse document structure into a section tree. Classify content type (textbook, reference, tutorial, spec, paper). Detect domain and map to Complex Plane position.
5. EXTRACT + WIRE: Run content-type-specific heuristics to extract candidate primitives. Wire dependency edges using section ordering, cross-references, and type-based inference. Cycle detection prevents graph corruption.
6. DEDUP + MERGE: Pre-filter by plane proximity and keyword overlap, then deep semantic comparison (Jaccard similarity). The merge engine skips duplicates, updates generalizations, adds specializations, and presents conflicts for user decision.
7. GENERATE + REPORT: Group primitives by domain. Generate skill files (≥30 primitives), agent definitions, and team configurations. Produce a provenance report with full audit trail.

Acquire

The acquirer is the entry point. It detects what you gave it, downloads or extracts it, normalizes everything to plain text, and stages the results with provenance metadata.

Source Detection

The source string is analyzed to determine both the type and the familiarity — the familiarity tag drives how aggressively the sanitizer checks the content downstream.

| Input Pattern | Source Type | Familiarity | What Happens |
|---|---|---|---|
| notes.md | local-file | HOME | Read directly, normalize to text |
| docs.zip | archive | HOME | Extract, filter by supported extensions, normalize each file |
| https://github.com/... | github-url | STRANGER | Shallow clone, scope-filter (default: docs/ + README.md), normalize |
| https://example.com/paper.pdf | url | STRANGER | Download via curl, normalize |
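The detection rules above can be sketched as a small classifier. The patterns below are simplified from the table; the real acquirer's heuristics and function signatures may differ.

```typescript
type Familiarity = 'HOME' | 'STRANGER';
type SourceType = 'local-file' | 'archive' | 'github-url' | 'url';

// Simplified sketch of source detection: type plus familiarity tag.
function detectSource(source: string): { type: SourceType; familiarity: Familiarity } {
  if (source.startsWith('https://github.com/')) {
    return { type: 'github-url', familiarity: 'STRANGER' };
  }
  if (/^https?:\/\//.test(source)) {
    return { type: 'url', familiarity: 'STRANGER' };
  }
  if (/\.(zip|tgz|tar\.gz)$/.test(source)) {
    return { type: 'archive', familiarity: 'HOME' };
  }
  return { type: 'local-file', familiarity: 'HOME' };
}
```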

Supported Formats

Documents

.md .markdown .txt — read as UTF-8

.pdf — BT/ET text operator extraction

.docx — unzip, strip XML from word/document.xml

.epub — unzip, concatenate all XHTML/HTML files

Archives

.zip — unzip, filter by supported extensions

.tgz .tar.gz — extract, filter, normalize

Limits: 50 MB per file, 200 MB total for archives

Staging

Every file is normalized to plain text and written to .learn-staging/ as a .staged.txt file. The staging record carries the original filename, byte size, encoding, and source path — this metadata follows the content through every subsequent stage for provenance tracking.

Early Exit

If acquisition throws (file not found, network error) or returns fatal errors (binary content in a text file), the pipeline returns success: false immediately. Non-fatal errors (unsupported file type in an archive) are collected but don't stop the pipeline.

Sanitize

The sanitizer enforces hygiene rules tiered by familiarity. STRANGER content gets the full battery of six check categories. HOME content gets a lighter scan — it still checks for prompt injection and embedded code, but trusts local files enough to skip hidden character and external resource checks.

Check Categories

| Category | Severity | STRANGER | HOME | What It Catches |
|---|---|---|---|---|
| prompt-injection | critical | ✓ full | ✓ critical only | XML system tags, chat template markers (`<\|im_start\|>`, [INST]), instruction overrides, role injections |
| hidden-characters | warning/critical | ✓ | — | Zero-width spaces, directional overrides (critical), homoglyph attacks (Cyrillic mixed with Latin) |
| embedded-code | critical | ✓ | ✓ | Script tags, data URIs with base64, javascript: scheme, inline event handlers, suspicious base64 blocks |
| external-resources | critical/warning | ✓ | — | Iframes, remote src attributes, object/embed tags, CSS url() with remote resources |
| path-traversal | critical | ✓ | — | ../ sequences, absolute paths, null bytes in filenames |
| content-type-mismatch | warning | ✓ | — | Binary content masquerading as text (>5% null bytes) |

The sanitizer outputs a HygieneReport with all findings grouped by severity. Any critical finding means the report is marked as "not passed" — the HITL gate will require explicit user approval to continue.

The autoApproved flag is only set when the source is HOME familiarity and there are zero findings of any severity. This is the one path that skips user interaction entirely.

HITL Gate

The Human-In-The-Loop gate implements three approval rules — no exceptions. The gate uses dependency injection (PromptFn) for testability, falling back to @clack/prompts for CLI interaction.

The Three Laws

HOME + Clean

Auto-approve silently. No user prompt needed. This is the fast path for trusted local content.

HOME + Findings

Present findings to user. They choose: approve, approve with warnings, or reject.

STRANGER (any)

Always requires explicit user approval, even if the hygiene report is perfectly clean.

Early Exit

If the user rejects, the pipeline returns success: true (it's not an error — it's a deliberate decision) with "Content rejected at HITL gate" in the errors array and changeset: null. No analysis, extraction, or modification occurs.

Both approved and approved-with-warnings allow the pipeline to continue. The full decision audit trail — status, rationale, who decided, timestamp — is recorded in the HITL result for downstream provenance.
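The three rules reduce to a small routing function. This is a sketch of the decision logic only; the real gate also prompts the user and records the audit trail.

```typescript
type Familiarity = 'HOME' | 'STRANGER';
type GateRoute = 'auto-approve' | 'user-review' | 'user-approval-required';

// The three laws as a pure decision function (sketch).
function routeThroughGate(familiarity: Familiarity, findingCount: number): GateRoute {
  if (familiarity === 'STRANGER') return 'user-approval-required'; // always, even if clean
  return findingCount === 0 ? 'auto-approve' : 'user-review';      // HOME fast path vs review
}
```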

Analyze

The analyzer takes each staged file's plain text and produces a structural map plus domain classification. This drives the extractor's heuristic selection in the next stage.

Structure Extraction

Markdown headings (# through ######) and numbered sections (1.1, 2.3.1) are parsed into a hierarchical tree of DocumentSection objects using a stack-based nesting algorithm. Each section carries its title, level, content, children, and character offset in the original document.

If no headings are found, the entire content is wrapped in a single root section titled "Document."
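A minimal version of the stack-based nesting looks like this. It handles only pre-parsed heading records; the real analyzer also carries section content and character offsets.

```typescript
interface DocumentSection {
  title: string;
  level: number;
  children: DocumentSection[];
}

// Sketch: nest a flat heading list into a tree using a stack of open sections.
function buildSectionTree(headings: { title: string; level: number }[]): DocumentSection[] {
  const roots: DocumentSection[] = [];
  const stack: DocumentSection[] = [];
  for (const h of headings) {
    const node: DocumentSection = { title: h.title, level: h.level, children: [] };
    // Pop until the top of the stack is a strict ancestor (smaller level).
    while (stack.length && stack[stack.length - 1].level >= node.level) stack.pop();
    if (stack.length) stack[stack.length - 1].children.push(node);
    else roots.push(node);
    stack.push(node);
  }
  return roots;
}
```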

Content Type Classification

The analyzer scores content against five pattern families using weighted regex match counts:

| Type | Weight | Trigger Patterns |
|---|---|---|
| textbook | ×3 | Definition, Theorem, Proof, Lemma, Corollary, Example, Exercise |
| reference | ×2 | API, endpoint, function, parameter, return, method, class, interface, code fences |
| tutorial | ×2 | Step N, try this, exercise, hands-on, build a, create a, practice, learn |
| spec | ×3 | MUST, SHALL, SHOULD, requirement, constraint, specification |
| paper | ×2 | Abstract, Introduction, Methodology, Results, Conclusion, References, Discussion |

Below a cumulative score of 2, the content is classified as unknown — the extractor will use all strategies simultaneously and cap confidence at 0.4.
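A toy version of the weighted scoring, with two abbreviated pattern families (the real lists match the table above). Whether the score-of-2 threshold applies to the winning family or the cumulative total is an implementation detail; this sketch applies it to the winning score.

```typescript
// Two of the five families, abbreviated for illustration.
const FAMILIES = [
  { type: 'textbook', weight: 3, patterns: [/\bDefinition\b/g, /\bTheorem\b/g, /\bProof\b/g] },
  { type: 'spec', weight: 3, patterns: [/\bMUST\b/g, /\bSHALL\b/g, /\brequirement\b/gi] },
];

// Score = (regex match count) x (family weight); a low score means unknown.
function classifyContent(text: string): string {
  let best = { type: 'unknown', score: 0 };
  for (const family of FAMILIES) {
    let score = 0;
    for (const pattern of family.patterns) {
      score += (text.match(pattern)?.length ?? 0) * family.weight;
    }
    if (score > best.score) best = { type: family.type, score };
  }
  return best.score < 2 ? 'unknown' : best.type;
}
```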

Domain Detection & Complex Plane

When domain definitions are provided, keywords are scored against activation patterns using the formula:

score = min(1.0, sqrt(matchCount / totalPatterns) × (1 + 0.15 × matchCount))

Domains exceeding the activation threshold (0.1) are used to compute a weighted centroid position on the Complex Plane. This position follows the primitive through deduplication — it's how the prefilter quickly identifies potential duplicates by proximity.
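Transcribed directly into TypeScript, the activation formula is:

```typescript
// Direct transcription of the activation-score formula above.
function domainScore(matchCount: number, totalPatterns: number): number {
  return Math.min(1.0, Math.sqrt(matchCount / totalPatterns) * (1 + 0.15 * matchCount));
}
```

For example, 4 matches out of 16 patterns gives sqrt(0.25) × 1.6 = 0.8, and heavy matching saturates at the 1.0 cap.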

Extract + Wire

This is where raw text becomes structured mathematical primitives. The extractor uses content-type-specific heuristics to find definitions, theorems, algorithms, techniques, and axioms. Then the dependency wirer connects them.

Content-Type Heuristics

Textbook

Definitions, Theorems, Algorithms, Identities

Regex patterns for "Definition 1.1:", "Theorem (Name):", "Algorithm:", "Identity:" — each mapped to the corresponding primitive type at 0.9 confidence.

Tutorial

Steps, Key Concepts, Exercises

"Step N:" → technique, "Key Concept:" → definition (0.8), "Exercise:" → technique (0.7). Lower confidence reflects the informal structure.

Spec

Requirements, Interfaces, Invariants

"MUST/SHALL" → axiom (0.9), "Interface/Service:" → definition (0.8), "Invariant/Constraint:" → axiom (0.85). RFC-style documents map naturally.

Paper

Algorithms, Findings, Methods, Hypotheses

"Algorithm N:" → algorithm, "Finding/Result:" → theorem (0.8), "Method:" → technique (0.8), "Hypothesis:" → theorem (0.7).

Reference

Functions, Classes, Patterns, Rules

Function/method signatures → definition (0.85), class/interface/type → definition (0.8), "Pattern:" → technique, "Rule:" → axiom.

Unknown

All Strategies at Once

Runs every heuristic, deduplicates by offset proximity (10 chars), keeps highest confidence per location, then caps all confidence at 0.4.

Each Candidate Primitive Gets

A unique ID (domain + slugified name), a formal statement, a computational form (derived per type — e.g., definitions get "Defines: ...", theorems get "Given preconditions, then: ..."), the source section and offset, extraction confidence, applicability patterns (top section keywords), and keywords from the statement text. Dependency arrays start empty.
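For instance, the ID construction might look like the sketch below; the exact slug normalization and the separator are assumptions for illustration, not confirmed by the source.

```typescript
// Hypothetical sketch of "domain + slugified name" ID construction.
function primitiveId(domain: string, name: string): string {
  const slug = name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse non-alphanumeric runs to dashes
    .replace(/^-+|-+$/g, '');    // trim leading/trailing dashes
  return `${domain}.${slug}`;   // separator is an assumption
}
```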

Dependency Wiring — Four Passes

After extraction, the wirer fills in dependencies, enables, and prerequisites arrays by analyzing relationships between candidates:

Pass 1 — Section Ordering

Candidates sorted by document offset. Each gets a motivates edge (strength 0.3) to the preceding primitive. Axioms are excluded — they're foundational and shouldn't depend on what came before.

Pass 2 — Cross-References

Scan each primitive's formal statement for names of other primitives. If a relationship keyword is found (requires, generalizes, specializes, applies), the matching edge type is used. Otherwise defaults to requires at strength 0.5.

Pass 3 — Type-Based Inference

Algorithms and techniques that textually reference definitions or theorems get an applies edge (strength 0.6). This captures the common pattern of a procedure applying a theoretical result.

Pass 4 — Enables Reverse-Index

For every dependency edge A → B, adds A's ID to B's enables array. This is the reverse lookup: "what does this primitive enable?"

Cycle Prevention

Before adding any edge, a BFS from the target checks if it can reach the source via existing edges. If so, the edge is rejected. Self-references and duplicate edges are also rejected. This guarantees the dependency graph is a DAG.
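The check can be sketched as plain BFS reachability over an adjacency map: adding source → target is safe only if target cannot already reach source.

```typescript
// Returns true if adding the edge source → target would close a cycle.
function wouldCreateCycle(
  edges: Map<string, string[]>, // existing adjacency: node → dependencies
  source: string,
  target: string
): boolean {
  if (source === target) return true; // self-reference is rejected outright
  const queue = [target];
  const seen = new Set([target]);
  while (queue.length > 0) {
    const node = queue.shift()!;
    if (node === source) return true; // target reaches source: cycle
    for (const next of edges.get(node) ?? []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return false;
}
```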

Dedup + Merge

The deduplication pipeline runs each candidate against the existing registry in two tiers: a fast prefilter that cheaply screens out obviously new candidates, followed by deep semantic comparison only for realistic duplicate candidates.

Tier 1 — Pre-filter

Two conditions must both be true for a candidate to be flagged (AND logic):

// Both must pass for a candidate to be flagged as potential duplicate
planeDistance(candidate, existing) <= 0.2   // Euclidean distance on Complex Plane
sharedKeywords(candidate, existing) >= 2     // Case-insensitive keyword intersection

Candidates that don't match anything skip straight to the merge engine as genuinely-new.
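The two-condition check can be sketched as below; the position and keyword shapes are assumptions for illustration.

```typescript
// Assumed candidate shape: a Complex Plane position plus keyword list.
interface PlaneItem {
  position: [number, number];
  keywords: string[];
}

function planeDistance(a: [number, number], b: [number, number]): number {
  return Math.hypot(a[0] - b[0], a[1] - b[1]); // Euclidean distance
}

// AND logic: both proximity and keyword overlap must hold.
function isPotentialDuplicate(candidate: PlaneItem, existing: PlaneItem): boolean {
  const existingLower = new Set(existing.keywords.map(k => k.toLowerCase()));
  const shared = candidate.keywords.filter(k => existingLower.has(k.toLowerCase()));
  return planeDistance(candidate.position, existing.position) <= 0.2 && shared.length >= 2;
}
```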

Tier 2 — Semantic Comparison

Flagged candidates get deep comparison using Jaccard similarity on tokenized text. Each candidate-existing pair is classified into one of five classes:

| Classification | Criteria | Merge Action |
|---|---|---|
| exact-duplicate | Formal statement ≥ 85% AND computational form ≥ 85% | skip — already have it |
| generalization | Candidate has ≤ prerequisites AND broader coverage (superset patterns or more keywords) | update — take candidate's statement, union patterns/keywords, intersect prerequisites |
| specialization | Candidate has more prerequisites AND narrower coverage (subset patterns) | add — with a specializes dependency edge to existing |
| overlapping-distinct | Some overlap (formal ≥ 15% or keywords ≥ 20%) but doesn't meet other thresholds | conflict — queued for user decision |
| genuinely-new | No significant overlap | add — as new primitive |
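Jaccard similarity on tokenized text can be sketched as follows; the real tokenizer may normalize more aggressively (stemming, stopword removal).

```typescript
// Jaccard similarity: |intersection| / |union| over lowercase word tokens.
function jaccard(a: string, b: string): number {
  const tokenize = (s: string) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const setA = tokenize(a);
  const setB = tokenize(b);
  const intersection = [...setA].filter(t => setB.has(t)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 0 : intersection / union;
}
```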

Merge Engine Safety

Critical Safety Invariant

The merge() function never produces a 'replace' action. Only resolveConflict() can, and only after explicit user decision (keep-existing, keep-candidate, or keep-both). This enforces the "never silently overwrites" guarantee.

Generalization Merge Strategy

When a candidate generalizes an existing primitive, the merge is mathematically precise:

// The generalized primitive takes the broader scope
formalStatement    ← candidate.formalStatement       // Candidate's broader statement
computationalForm  ← candidate.computationalForm     // Candidate's broader form
applicabilityPatterns ← union(existing, candidate)    // Both sets of patterns
keywords           ← union(existing, candidate)       // Both sets of keywords
prerequisites      ← intersect(existing, candidate)  // Only shared prerequisites (relaxed)

This preserves the existing primitive's ID and metadata while broadening its scope — the union of patterns widens applicability, and the intersection of prerequisites relaxes conditions.
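The same rules as a runnable sketch, with an assumed shape for the merged fields (ID and metadata preservation are omitted):

```typescript
interface PrimitiveFields {
  formalStatement: string;
  computationalForm: string;
  applicabilityPatterns: string[];
  keywords: string[];
  prerequisites: string[];
}

// Generalization merge: unions widen scope, intersection relaxes conditions.
function mergeGeneralization(existing: PrimitiveFields, candidate: PrimitiveFields): PrimitiveFields {
  const union = (a: string[], b: string[]) => [...new Set([...a, ...b])];
  const intersect = (a: string[], b: string[]) => a.filter(x => b.includes(x));
  return {
    formalStatement: candidate.formalStatement,
    computationalForm: candidate.computationalForm,
    applicabilityPatterns: union(existing.applicabilityPatterns, candidate.applicabilityPatterns),
    keywords: union(existing.keywords, candidate.keywords),
    prerequisites: intersect(existing.prerequisites, candidate.prerequisites),
  };
}
```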

Changeset Recording

Unless dryRun is true, every modification (add, update, remove) is recorded in the changeset manager with before/after states. This enables sc:unlearn to cleanly revert all changes from a session. Revert operations run in reverse order and validate graph integrity before executing — if removing a primitive would create dangling references, the revert is blocked unless force=true.
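The record-then-reverse idea can be sketched minimally; the real manager also persists before/after states and validates graph integrity before executing a revert.

```typescript
interface ChangeOp {
  kind: 'add' | 'update' | 'remove';
  id: string;
  before: unknown; // state prior to the change (null for adds)
  after: unknown;  // state after the change (null for removes)
}

// Minimal sketch: record operations, replay them in reverse on revert.
class ChangesetRecorder {
  private ops: ChangeOp[] = [];
  record(op: ChangeOp): void {
    this.ops.push(op);
  }
  revertOrder(): ChangeOp[] {
    return [...this.ops].reverse();
  }
}
```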

Generate + Report

The final stage groups merged primitives by domain and generates artifacts when the domain crosses a threshold. Only primitives from add or update actions are included — skipped and conflicted primitives don't generate artifacts.

Generated Artifacts

Skill (≥ 30 primitives)

YAML frontmatter, summary, top 10 key primitives ranked by importance (enables + composition rules), composition patterns, activation keywords. Output: skills/learn/learn-{domain}/SKILL.md

Agent (≥ 30 primitives)

An agent definition combining the domain's primitives into a specialized role, centered on the appropriate Complex Plane position.

Team (≥ 50 primitives)

Multi-agent team configuration. The higher threshold reflects the richer primitive set needed for meaningful multi-agent coordination.

Importance Scoring

Key primitives are ranked by importance score: the count of downstream enablements plus the count of composition rules. This is a proxy for centrality — primitives that enable many others and participate in compositions are the most valuable teaching material for the skill.

importanceScore(p) = p.enables.length + p.compositionRules.length
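As a sketch, selecting the key primitives is a descending sort on that score (the field names follow the formula above):

```typescript
interface ScoredPrimitive {
  id: string;
  enables: string[];
  compositionRules: string[];
}

// importanceScore = downstream enablements + composition rules.
function rankByImportance(primitives: ScoredPrimitive[]): ScoredPrimitive[] {
  const score = (p: ScoredPrimitive) => p.enables.length + p.compositionRules.length;
  return [...primitives].sort((a, b) => score(b) - score(a));
}
```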

Learning Report

Every session produces a full provenance report containing: session ID, source path, timing (started/completed), the entire provenance chain (every merge decision with rationale), counts of primitives added/updated/skipped/conflicted, and lists of generated skills, agents, and teams.

Safety Architecture

Every layer of the pipeline is designed around explicit safety invariants. Here's the complete picture:

Invariant 1

STRANGER Content Never Auto-Approves

Internet-sourced content (GitHub repos, URLs) always gets the full sanitization scan and always requires explicit user approval at the HITL gate — even if zero issues are found. Trust is earned, not assumed.

Invariant 2

Merge Never Silently Overwrites

The merge() function cannot produce a 'replace' action. Only resolveConflict() can, and only after the user explicitly chooses "keep-candidate." Overlapping-distinct content is always presented as a conflict for human judgment.

Invariant 3

Every Change is Reversible

The changeset manager records before/after states for every add, update, and remove. Reverting plays back operations in reverse order. Graph integrity is validated before execution — dangling references block revert unless force-overridden.

Invariant 4

Dependency Graph Stays Acyclic

BFS cycle detection runs before every edge addition. Self-references and duplicate edges are rejected. The wirer deep-copies candidates to avoid mutating the original extraction results.

Invariant 5

Full Provenance Chain

Every merge decision — skip, update, add, conflict — is recorded with a timestamp, session ID, candidate/existing IDs, action description, rationale, and (for generalizations) the original vs new formal statements. Nothing happens without a trace.

Options Reference

| Option | Type | Default | Effect |
|---|---|---|---|
| domain | string | auto-detected | Override the detected domain. Passed through to the extractor — all candidates get this domain ID. |
| depth | 'shallow' \| 'standard' \| 'deep' | 'standard' | shallow skips dependency wiring entirely; standard and deep run all four wiring passes. |
| dryRun | boolean | false | Run the full pipeline without recording modifications. Changeset returns null; the report is still generated. |
| scope | string[] | ['docs/', 'README.md'] | For GitHub sources: which directories/files to include. Paths ending with / match directory prefixes. |
| existingPrimitives | MathematicalPrimitive[] | [] | Your current registry. Used by the dedup prefilter and semantic comparator to find duplicates. |
| existingDomainCenters | PlanePosition[] | [] | Current domain center positions on the Complex Plane. Used by the agent generator. |
| promptFn | PromptFn | @clack/prompts | Inject a custom approval function for the HITL gate. Signature: (message, choices) → Promise<string> |
| onProgress | (stage, detail) → void | — | Callback fired at the start of each pipeline stage: acquire, sanitize, hitl, analyze, extract, dedup, generate, report. |

Typical Usage

// Learn from a local textbook
const result = await scLearn('linear-algebra.pdf', {
  domain: 'linear-algebra',
  depth: 'deep',
});

// Dry-run against a GitHub repo
const preview = await scLearn('https://github.com/user/docs', {
  scope: ['docs/', 'guides/', 'README.md'],
  dryRun: true,
});

// Learn with progress reporting
const zipResult = await scLearn('notes.zip', {
  onProgress: (stage, detail) => console.log(`[${stage}]`, detail),
});

File Map

src/commands/
  sc-learn.ts              ← Orchestrator (this guide)
  sc-learn.test.ts         ← 20+ test cases covering all paths
  sc-unlearn.ts            ← Revert command using changeset manager

src/learn/
  acquirer.ts              ← Stage 1: source detection, download, normalize
  sanitizer.ts             ← Stage 2: tiered hygiene checks
  hitl-gate.ts             ← Stage 3: human-in-the-loop approval
  analyzer.ts              ← Stage 4: structure + content type + domain
  extractor.ts             ← Stage 5a: heuristic primitive extraction
  dependency-wirer.ts      ← Stage 5b: four-pass dependency wiring
  dedup-prefilter.ts       ← Stage 6a: fast plane+keyword screening
  semantic-comparator.ts   ← Stage 6b: deep Jaccard classification
  merge-engine.ts          ← Stage 6c: merge strategies + conflict mgmt
  changeset-manager.ts     ← Reversible changeset recording
  report-generator.ts      ← Stage 7: provenance report generation
  generators/
    skill-generator.ts     ← Learned skill file output
    agent-generator.ts     ← Agent definition output
    team-generator.ts      ← Team configuration output