How Skill Creator transforms documents, repos, and URLs into mathematical primitives — then deduplicates, wires dependencies, and generates skills, agents, and teams. A seven-stage pipeline with built-in security, provenance tracking, and human-in-the-loop safety gates.
sc:learn is the knowledge ingestion command for Skill Creator. You point it at a source — a markdown file, a PDF, a GitHub repo, a zip archive, a URL — and it runs a seven-stage pipeline that extracts structured mathematical primitives from the content, deduplicates them against your existing registry, and generates skills, agents, and teams from what it learns.
Every piece of knowledge that enters the system goes through sanitization, human approval, structural analysis, semantic deduplication, and provenance tracking. Nothing is silently overwritten. Everything is reversible via sc:unlearn.
The pipeline is designed around three safety invariants: STRANGER content never auto-approves, merge never silently overwrites, and every change is recorded in a reversible changeset.
```typescript
async function scLearn(
  source: string,
  options: ScLearnOptions = {}
): Promise<ScLearnResult>
```
The function returns a ScLearnResult containing: a success flag, a unique sessionId (for unlearn/revert), a report with full provenance, a changeset (null in dry-run mode), and an errors array.
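Based on the fields listed above, the result shape looks roughly like this — a sketch, not the actual interface from `sc-learn.ts`, whose naming may differ in detail:

```typescript
// Sketch of ScLearnResult inferred from the description above.
interface ScLearnResult {
  success: boolean;
  sessionId: string;         // pass to sc:unlearn to revert this session
  report: unknown;           // full provenance report (stage 7)
  changeset: unknown | null; // null in dry-run mode
  errors: string[];
}
```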
Every invocation flows through these stages in order. Early exits at stages 1 and 3 prevent bad data from reaching extraction. The pipeline broadcasts progress events at each stage via the onProgress callback.
The acquirer is the entry point. It detects what you gave it, downloads or extracts it, normalizes everything to plain text, and stages the results with provenance metadata.
The source string is analyzed to determine both the type and the familiarity — the familiarity tag drives how aggressively the sanitizer checks the content downstream.
| Input Pattern | Source Type | Familiarity | What Happens |
|---|---|---|---|
| `notes.md` | local-file | HOME | Read directly, normalize to text |
| `docs.zip` | archive | HOME | Extract, filter by supported extensions, normalize each file |
| `https://github.com/...` | github-url | STRANGER | Shallow clone, scope-filter (default: docs/ + README.md), normalize |
| `https://example.com/paper.pdf` | url | STRANGER | Download via curl, normalize |
- `.md` `.markdown` `.txt` — read as UTF-8
- `.pdf` — BT/ET text operator extraction
- `.docx` — unzip, strip XML from word/document.xml
- `.epub` — unzip, concatenate all XHTML/HTML files
- `.zip` — unzip, filter by supported extensions
- `.tgz` `.tar.gz` — extract, filter, normalize

Limits: 50 MB per file, 200 MB total for archives.
Every file is normalized to plain text and written to .learn-staging/ as a .staged.txt file. The staging record carries the original filename, byte size, encoding, and source path — this metadata follows the content through every subsequent stage for provenance tracking.
If acquisition throws (file not found, network error) or returns fatal errors (binary content in a text file), the pipeline returns success: false immediately. Non-fatal errors (unsupported file type in an archive) are collected but don't stop the pipeline.
The sanitizer enforces hygiene rules tiered by familiarity. STRANGER content gets the full battery of six check categories. HOME content gets a lighter scan — it still checks for prompt injection and embedded code, but trusts local files enough to skip hidden character and external resource checks.
| Category | Severity | STRANGER | HOME | What It Catches |
|---|---|---|---|---|
| prompt-injection | critical | ✓ full | ✓ critical only | XML system tags, chat template markers (`<\|im_start\|>`, `[INST]`), instruction overrides, role injections |
| hidden-characters | warning/critical | ✓ | — | Zero-width spaces, directional overrides (critical), homoglyph attacks (Cyrillic mixed with Latin) |
| embedded-code | critical | ✓ | ✓ | Script tags, data URIs with base64, javascript: scheme, inline event handlers, suspicious base64 blocks |
| external-resources | critical/warning | ✓ | — | Iframes, remote src attributes, object/embed tags, CSS url() with remote resources |
| path-traversal | critical | ✓ | ✓ | ../ sequences, absolute paths, null bytes in filenames |
| content-type-mismatch | warning | ✓ | ✓ | Binary content masquerading as text (>5% null bytes) |
The sanitizer outputs a HygieneReport with all findings grouped by severity. Any critical finding means the report is marked as "not passed" — the HITL gate will require explicit user approval to continue.
The autoApproved flag is only set when the source is HOME familiarity and there are zero findings of any severity. This is the one path that skips user interaction entirely.
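The rule is narrow enough to state in a couple of lines. This is a sketch of the condition, assuming a findings array on the hygiene report; the real field names in `sanitizer.ts` may differ:

```typescript
interface HygieneFinding { severity: 'warning' | 'critical'; message: string; }

// The only fully silent path: HOME familiarity and zero findings
// of any severity. Everything else goes through the HITL gate.
function isAutoApproved(familiarity: 'HOME' | 'STRANGER', findings: HygieneFinding[]): boolean {
  return familiarity === 'HOME' && findings.length === 0;
}
```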
The Human-In-The-Loop gate implements three approval rules — no exceptions. The gate uses dependency injection (PromptFn) for testability, falling back to @clack/prompts for CLI interaction.
1. HOME source with a clean report: auto-approve silently. No user prompt needed. This is the fast path for trusted local content.
2. Findings present: present them to the user, who chooses to approve, approve with warnings, or reject.
3. STRANGER source: always requires explicit user approval, even if the hygiene report is perfectly clean.
If the user rejects, the pipeline returns success: true (it's not an error — it's a deliberate decision) with "Content rejected at HITL gate" in the errors array and changeset: null. No analysis, extraction, or modification occurs.
Both approved and approved-with-warnings allow the pipeline to continue. The full decision audit trail — status, rationale, who decided, timestamp — is recorded in the HITL result for downstream provenance.
The analyzer takes each staged file's plain text and produces a structural map plus domain classification. This drives the extractor's heuristic selection in the next stage.
Markdown headings (# through ######) and numbered sections (1.1, 2.3.1) are parsed into a hierarchical tree of DocumentSection objects using a stack-based nesting algorithm. Each section carries its title, level, content, children, and character offset in the original document.
If no headings are found, the entire content is wrapped in a single root section titled "Document."
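The stack-based nesting described above can be sketched as follows, for markdown headings only (numbered-section parsing and the content/offset fields are omitted here; the real analyzer tracks both):

```typescript
interface DocumentSection {
  title: string;
  level: number;
  children: DocumentSection[];
}

// Stack-based nesting: pop until the top of the stack is a
// shallower heading, then attach the new section as its child.
function parseSections(text: string): DocumentSection {
  const root: DocumentSection = { title: 'Document', level: 0, children: [] };
  const stack: DocumentSection[] = [root];
  for (const line of text.split('\n')) {
    const m = /^(#{1,6})\s+(.*)$/.exec(line);
    if (!m) continue;
    const section: DocumentSection = { title: m[2], level: m[1].length, children: [] };
    while (stack[stack.length - 1].level >= section.level) stack.pop();
    stack[stack.length - 1].children.push(section);
    stack.push(section);
  }
  return root;
}
```

A heading-free document falls out naturally: no lines match, and the root "Document" section is returned alone.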
The analyzer scores content against five pattern families using weighted regex match counts:
| Type | Weight | Trigger Patterns |
|---|---|---|
| textbook | ×3 | Definition, Theorem, Proof, Lemma, Corollary, Example, Exercise |
| reference | ×2 | API, endpoint, function, parameter, return, method, class, interface, code fences |
| tutorial | ×2 | Step N, try this, exercise, hands-on, build a, create a, practice, learn |
| spec | ×3 | MUST, SHALL, SHOULD, requirement, constraint, specification |
| paper | ×2 | Abstract, Introduction, Methodology, Results, Conclusion, References, Discussion |
Below a cumulative score of 2, the content is classified as unknown — the extractor will use all strategies simultaneously and cap confidence at 0.4.
When domain definitions are provided, keywords are scored against activation patterns using the formula:
```
score = min(1.0, sqrt(matchCount / totalPatterns) × (1 + 0.15 × matchCount))
```
Domains exceeding the activation threshold (0.1) are used to compute a weighted centroid position on the Complex Plane. This position follows the primitive through deduplication — it's how the prefilter quickly identifies potential duplicates by proximity.
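The activation formula transcribes directly. The square root rewards any match at all, while the `1 + 0.15 × matchCount` factor boosts domains with many hits before the cap kicks in:

```typescript
// Direct transcription of the activation-score formula above.
function activationScore(matchCount: number, totalPatterns: number): number {
  return Math.min(1.0, Math.sqrt(matchCount / totalPatterns) * (1 + 0.15 * matchCount));
}
```

A single match against 100 patterns already scores ~0.115, just above the 0.1 activation threshold, so even thin evidence can pull a domain into the centroid computation.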
This is where raw text becomes structured mathematical primitives. The extractor uses content-type-specific heuristics to find definitions, theorems, algorithms, techniques, and axioms. Then the dependency wirer connects them.
- **Textbook**: regex patterns for "Definition 1.1:", "Theorem (Name):", "Algorithm:", "Identity:" — each mapped to the corresponding primitive type at 0.9 confidence.
- **Tutorial**: "Step N:" → technique, "Key Concept:" → definition (0.8), "Exercise:" → technique (0.7). Lower confidence reflects the informal structure.
- **Spec**: "MUST/SHALL" → axiom (0.9), "Interface/Service:" → definition (0.8), "Invariant/Constraint:" → axiom (0.85). RFC-style documents map naturally.
- **Paper**: "Algorithm N:" → algorithm, "Finding/Result:" → theorem (0.8), "Method:" → technique (0.8), "Hypothesis:" → theorem (0.7).
- **Reference**: function/method signatures → definition (0.85), class/interface/type → definition (0.8), "Pattern:" → technique, "Rule:" → axiom.
- **Unknown**: runs every heuristic, deduplicates by offset proximity (10 chars), keeps the highest confidence per location, then caps all confidence at 0.4.
A unique ID (domain + slugified name), a formal statement, a computational form (derived per type — e.g., definitions get "Defines: ...", theorems get "Given preconditions, then: ..."), the source section and offset, extraction confidence, applicability patterns (top section keywords), and keywords from the statement text. Dependency arrays start empty.
After extraction, the wirer fills in dependencies, enables, and prerequisites arrays by analyzing relationships between candidates:
Pass 1 (sequential proximity): candidates are sorted by document offset, and each gets a motivates edge (strength 0.3) to the preceding primitive. Axioms are excluded — they're foundational and shouldn't depend on what came before.

Pass 2 (name references): each primitive's formal statement is scanned for names of other primitives. If a relationship keyword is found (requires, generalizes, specializes, applies), the matching edge type is used; otherwise it defaults to requires at strength 0.5.

Pass 3 (application edges): algorithms and techniques that textually reference definitions or theorems get an applies edge (strength 0.6). This captures the common pattern of a procedure applying a theoretical result.

Pass 4 (reverse lookup): for every dependency edge A → B, A's ID is added to B's enables array. This is the reverse lookup: "what does this primitive enable?"

Cycle guard: before adding any edge, a BFS from the target checks whether it can reach the source via existing edges. If so, the edge is rejected. Self-references and duplicate edges are also rejected. This guarantees the dependency graph stays a DAG.
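The BFS check is small enough to sketch in full. Graph representation and naming here are assumptions; only the reachability logic matches the description above:

```typescript
type Graph = Map<string, string[]>; // node id → outgoing edge targets

// Before adding source → target: if target can already reach source,
// the new edge would close a cycle, so reject it.
function wouldCreateCycle(graph: Graph, source: string, target: string): boolean {
  if (source === target) return true; // self-reference, also rejected
  const queue = [target];
  const seen = new Set<string>([target]);
  while (queue.length > 0) {
    const node = queue.shift()!;
    if (node === source) return true;
    for (const next of graph.get(node) ?? []) {
      if (!seen.has(next)) { seen.add(next); queue.push(next); }
    }
  }
  return false;
}
```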
The deduplication pipeline runs each candidate against the existing registry in two tiers — a fast prefilter to eliminate obviously-new content cheaply, then deep semantic comparison only for realistic duplicate candidates.
Two conditions must both be true for a candidate to be flagged (AND logic):
```typescript
// Both must pass for a candidate to be flagged as potential duplicate
planeDistance(candidate, existing) <= 0.2  // Euclidean distance on Complex Plane
sharedKeywords(candidate, existing) >= 2   // case-insensitive keyword intersection
```
Candidates that don't match anything skip straight to the merge engine as genuinely-new.
Flagged candidates get deep comparison using Jaccard similarity on tokenized text. Each candidate-existing pair is classified into one of five classes:
| Classification | Criteria | Merge Action |
|---|---|---|
| exact-duplicate | Formal statement ≥ 85% AND computational form ≥ 85% | skip — already have it |
| generalization | Candidate has ≤ prerequisites AND broader coverage (superset patterns or more keywords) | update — take candidate's statement, union patterns/keywords, intersect prerequisites |
| specialization | Candidate has more prerequisites AND narrower coverage (subset patterns) | add — add with a specializes dependency edge to existing |
| overlapping-distinct | Some overlap (formal ≥ 15% or keywords ≥ 20%) but doesn't meet other thresholds | conflict — queued for user decision |
| genuinely-new | No significant overlap | add — add as new primitive |
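The similarity measure underneath these thresholds can be sketched as token-set Jaccard. The tokenizer here (lowercase, split on non-word characters) is an assumption; the real comparator in `semantic-comparator.ts` may tokenize differently:

```typescript
// Token-set Jaccard similarity: |intersection| / |union|.
function jaccard(a: string, b: string): number {
  const tokenize = (s: string) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const ta = tokenize(a);
  const tb = tokenize(b);
  const shared = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : shared / union;
}
```

Against the table: two formal statements scoring ≥ 0.85 on this measure (along with computational forms ≥ 0.85) would land in exact-duplicate; a score of 0.15–0.85 puts the pair in overlapping-distinct territory.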
The merge() function never produces a 'replace' action. Only resolveConflict() can, and only after explicit user decision (keep-existing, keep-candidate, or keep-both). This enforces the "never silently overwrites" guarantee.
When a candidate generalizes an existing primitive, the merge is mathematically precise:
```
// The generalized primitive takes the broader scope
formalStatement       ← candidate.formalStatement    // candidate's broader statement
computationalForm     ← candidate.computationalForm  // candidate's broader form
applicabilityPatterns ← union(existing, candidate)   // both sets of patterns
keywords              ← union(existing, candidate)   // both sets of keywords
prerequisites         ← intersect(existing, candidate) // only shared prerequisites (relaxed)
```
This preserves the existing primitive's ID and metadata while broadening its scope — the union of patterns widens applicability, and the intersection of prerequisites relaxes conditions.
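The two set operations behind the merge are simple enough to sketch directly; these helper names are illustrative, not the merge engine's actual exports:

```typescript
// union: widens applicability (patterns, keywords).
const union = (a: string[], b: string[]): string[] => [...new Set([...a, ...b])];

// intersect: relaxes conditions by keeping only shared prerequisites.
const intersect = (a: string[], b: string[]): string[] => {
  const setB = new Set(b);
  return a.filter((x) => setB.has(x));
};
```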
Unless dryRun is true, every modification (add, update, remove) is recorded in the changeset manager with before/after states. This enables sc:unlearn to cleanly revert all changes from a session. Revert operations run in reverse order and validate graph integrity before executing — if removing a primitive would create dangling references, the revert is blocked unless force=true.
The final stage groups merged primitives by domain and generates artifacts when the domain crosses a threshold. Only primitives from add or update actions are included — skipped and conflicted primitives don't generate artifacts.
YAML frontmatter, summary, top 10 key primitives ranked by importance (enables + composition rules), composition patterns, activation keywords.
`skills/learn/learn-{domain}/SKILL.md`
An agent definition combining the domain's primitives into a specialized role with the appropriate Complex Plane position center.
Multi-agent team configuration. Higher threshold reflects the richer primitive set needed for meaningful multi-agent coordination.
Key primitives are ranked by importance score: the count of downstream enablements plus the count of composition rules. This is a proxy for centrality — primitives that enable many others and participate in compositions are the most valuable teaching material for the skill.
```
importanceScore(p) = p.enables.length + p.compositionRules.length
```
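Selecting the top-10 key primitives is then a sort by that score. A minimal sketch, assuming only the two fields the formula uses:

```typescript
interface RankedPrimitive { id: string; enables: string[]; compositionRules: string[]; }

// Rank by importance score (descending) and take the top n.
function topPrimitives(primitives: RankedPrimitive[], n: number): RankedPrimitive[] {
  return [...primitives]
    .sort((a, b) =>
      (b.enables.length + b.compositionRules.length) -
      (a.enables.length + a.compositionRules.length))
    .slice(0, n);
}
```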
Every session produces a full provenance report containing: session ID, source path, timing (started/completed), the entire provenance chain (every merge decision with rationale), counts of primitives added/updated/skipped/conflicted, and lists of generated skills, agents, and teams.
Every layer of the pipeline is designed around explicit safety invariants. Here's the complete picture:
Internet-sourced content (GitHub repos, URLs) always gets the full sanitization scan and always requires explicit user approval at the HITL gate — even if zero issues are found. Trust is earned, not assumed.
The merge() function cannot produce a 'replace' action. Only resolveConflict() can, and only after the user explicitly chooses "keep-candidate." Overlapping-distinct content is always presented as a conflict for human judgment.
The changeset manager records before/after states for every add, update, and remove. Reverting plays back operations in reverse order. Graph integrity is validated before execution — dangling references block revert unless force-overridden.
BFS cycle detection runs before every edge addition. Self-references and duplicate edges are rejected. The wirer deep-copies candidates to avoid mutating the original extraction results.
Every merge decision — skip, update, add, conflict — is recorded with a timestamp, session ID, candidate/existing IDs, action description, rationale, and (for generalizations) the original vs new formal statements. Nothing happens without a trace.
| Option | Type | Default | Effect |
|---|---|---|---|
| `domain` | `string` | auto-detected | Override detected domain. Passed through to the extractor — all candidates get this domain ID. |
| `depth` | `'shallow' \| 'standard' \| 'deep'` | `'standard'` | shallow skips dependency wiring entirely. standard and deep run all four wiring passes. |
| `dryRun` | `boolean` | `false` | Run the full pipeline without recording modifications. Changeset returns null. Report is still generated. |
| `scope` | `string[]` | `['docs/', 'README.md']` | For GitHub sources: which directories/files to include. Paths ending with / match directory prefixes. |
| `existingPrimitives` | `MathematicalPrimitive[]` | `[]` | Your current registry. Used by the dedup prefilter and semantic comparator to find duplicates. |
| `existingDomainCenters` | `PlanePosition[]` | `[]` | Current domain center positions on the Complex Plane. Used by the agent generator. |
| `promptFn` | `PromptFn` | @clack/prompts | Inject a custom approval function for the HITL gate. Signature: `(message, choices) → Promise<string>` |
| `onProgress` | `(stage, detail) → void` | — | Callback fired at the start of each pipeline stage. Stages: acquire, sanitize, hitl, analyze, extract, dedup, generate, report. |
```typescript
// Learn from a local textbook
const result = await scLearn('linear-algebra.pdf', {
  domain: 'linear-algebra',
  depth: 'deep',
});
```

```typescript
// Dry-run against a GitHub repo
const preview = await scLearn('https://github.com/user/docs', {
  scope: ['docs/', 'guides/', 'README.md'],
  dryRun: true,
});
```

```typescript
// Learn with progress reporting
const result = await scLearn('notes.zip', {
  onProgress: (stage, detail) => console.log(`[${stage}]`, detail),
});
```
```
src/commands/
  sc-learn.ts              ← Orchestrator (this guide)
  sc-learn.test.ts         ← 20+ test cases covering all paths
  sc-unlearn.ts            ← Revert command using changeset manager
src/learn/
  acquirer.ts              ← Stage 1: source detection, download, normalize
  sanitizer.ts             ← Stage 2: tiered hygiene checks
  hitl-gate.ts             ← Stage 3: human-in-the-loop approval
  analyzer.ts              ← Stage 4: structure + content type + domain
  extractor.ts             ← Stage 5a: heuristic primitive extraction
  dependency-wirer.ts      ← Stage 5b: four-pass dependency wiring
  dedup-prefilter.ts       ← Stage 6a: fast plane+keyword screening
  semantic-comparator.ts   ← Stage 6b: deep Jaccard classification
  merge-engine.ts          ← Stage 6c: merge strategies + conflict mgmt
  changeset-manager.ts     ← Reversible changeset recording
  report-generator.ts      ← Stage 7: provenance report generation
  generators/
    skill-generator.ts     ← Learned skill file output
    agent-generator.ts     ← Agent definition output
    team-generator.ts      ← Team configuration output
```