| Guide ID | T-4 |
|---|---|
| Audience | Developers, SysAdmins |
| Prerequisites | T-1: Creating Your First Skill |
| Time | 10 minutes |
| Difficulty | Intermediate |
# Calibrating Thresholds
Threshold calibration optimizes the activation threshold that determines when skills trigger. The default threshold is 0.75, but your actual usage patterns may benefit from a different value. Calibration analyzes collected activation events to find the threshold that maximizes the F1 score, balancing precision (avoiding false positives) and recall (catching true activations).
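The F1-maximizing sweep described above can be sketched in a few lines. The event representation below, a list of `(score, should_fire)` pairs, is an illustrative assumption, not skill-creator's actual data format:

```python
# Illustrative threshold sweep that maximizes F1.
# The (score, should_fire) event shape is an assumption, not the
# tool's real calibration-event format.

def f1_at_threshold(events, threshold):
    """Compute F1 treating scores >= threshold as activations."""
    tp = sum(1 for score, should_fire in events if score >= threshold and should_fire)
    fp = sum(1 for score, should_fire in events if score >= threshold and not should_fire)
    fn = sum(1 for score, should_fire in events if score < threshold and should_fire)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)   # avoid false positives
    recall = tp / (tp + fn)      # catch true activations
    return 2 * precision * recall / (precision + recall)

def optimal_threshold(events):
    """Try every observed score as a candidate threshold; keep the best F1."""
    candidates = sorted({score for score, _ in events})
    return max(candidates, key=lambda t: f1_at_threshold(events, t))
```

This illustrates why a lower threshold can win: if correct activations cluster just below 0.75, dropping the threshold recovers them without admitting false positives.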
## When to Calibrate

Calibration is worth running when you notice any of the following:
- Test accuracy is below 80% across your skill library
- Skills activate too aggressively (false positives — wrong skill fires for a prompt)
- Skills do not activate when expected (false negatives — correct skill stays silent)
- You have accumulated at least 75 calibration events from normal usage
## Step 1: Accumulate Calibration Data
Calibration events are recorded automatically during normal skill usage. The system tracks which skill activated, whether you continued working (correct activation) or corrected the output (wrong activation), and the similarity scores for all skills. You do not need to do anything special — just use your skills normally.
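Conceptually, each recorded event might look something like the following. The field names here are hypothetical; they only illustrate the kinds of data the paragraph above says the system tracks:

```python
# Hypothetical shape of a recorded calibration event; the real
# on-disk format used by skill-creator may differ.
from dataclasses import dataclass, field

@dataclass
class CalibrationEvent:
    prompt_id: str
    activated_skill: str           # which skill fired ("" if none)
    outcome_known: bool            # did usage reveal whether it was right?
    correct: bool                  # True if you continued working, False if you corrected it
    scores: dict[str, float] = field(default_factory=dict)  # similarity score per skill

def usable_events(events: list[CalibrationEvent]) -> list[CalibrationEvent]:
    """Only events with a known outcome count toward the 75-event minimum."""
    return [e for e in events if e.outcome_known]
```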
Check how many events have been collected:
```shell
skill-creator benchmark
```
You need at least 75 events with known outcomes before calibration can run. If you see a message indicating insufficient data, continue using your skills and check back later.
## Step 2: Preview Calibration
Before applying any changes, preview what calibration would do. The preview shows the current threshold, the proposed optimal threshold, and the expected improvement in F1 score:
```shell
skill-creator calibrate --preview
```
Example output:
```text
Calibration Analysis
Current threshold: 0.75 (F1: 82.3%)
Optimal threshold: 0.72 (F1: 87.1%)
Improvement: +4.8%
Based on 156 calibration events
```
### Checkpoint 1

Verify: The preview output shows both the current and optimal thresholds with their F1 scores. The improvement percentage should be positive. If the improvement is zero or negative, calibration is not needed: your current threshold is already optimal.
## Step 3: Apply Calibration
If the preview looks good, apply the optimized threshold:
```shell
skill-creator calibrate
```
The command shows the same analysis as the preview and asks for confirmation before applying. Type `Y` to confirm or `n` to cancel. You can also pass `--force` to skip the confirmation prompt.
The new threshold is stored in `~/.gsd-skill/calibration/threshold.json` and takes effect immediately for all subsequent activation decisions.
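If another tool needs to read the active threshold, a minimal sketch looks like this. Only the file path comes from this guide; the `{"threshold": ...}` schema is an assumption:

```python
# Read the calibrated threshold from the path the guide documents.
# The JSON schema (a top-level "threshold" key) is an assumption;
# only the file location comes from the guide.
import json
from pathlib import Path

def load_threshold(path=Path.home() / ".gsd-skill" / "calibration" / "threshold.json",
                   default=0.75):
    """Return the calibrated threshold, falling back to the default of 0.75."""
    try:
        data = json.loads(Path(path).read_text())
        return float(data["threshold"])
    except (FileNotFoundError, KeyError, ValueError):
        return default
```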
## Step 4: Verify Improvement
After applying calibration, measure the real-world impact with the benchmark command:
```shell
skill-creator benchmark --verbose
```
The verbose benchmark shows correlation (MCC), agreement rate, confusion matrix, and a per-skill breakdown. Look for improvement in the correlation and agreement metrics compared to before calibration.
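For reference, MCC can be computed directly from the confusion matrix the verbose benchmark prints. This is the standard formula, shown to explain the metric, not skill-creator's internal code:

```python
# Matthews correlation coefficient from a 2x2 confusion matrix.
# Standard formula; multiply by 100 to compare against the 85% target.
import math

def mcc(tp, tn, fp, fn):
    """Return MCC in [-1, 1]; 0 means no better than chance."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike raw accuracy, MCC stays honest when activations are rare, which is why it is a better headline metric for skill triggering than agreement rate alone.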
### Checkpoint 2

Verify: The benchmark shows a correlation (MCC) of 85% or higher. If correlation is below 85%, the benchmark exits with code 1. Review the per-skill results in verbose mode to identify which skills are dragging down accuracy.
## Rollback
If calibration makes activation behavior worse, you can immediately undo the change:
```shell
skill-creator calibrate rollback
```
This reverts to the previous threshold value. The system keeps a complete history of all threshold changes, so you can always go back.
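One way such a history-backed rollback can work is sketched below with an in-memory snapshot list. The tool's actual storage format is not documented here; this only illustrates why rollback never loses data:

```python
# Illustrative snapshot history with non-destructive rollback.
# skill-creator's real history storage may differ.
class ThresholdHistory:
    def __init__(self, initial=0.75):
        self.snapshots = [initial]   # complete history, oldest first
        self.active = 0              # index of the active snapshot

    def apply(self, threshold):
        """Record a new threshold and make it active."""
        self.snapshots.append(threshold)
        self.active = len(self.snapshots) - 1

    def rollback(self):
        """Revert to the previous snapshot; history is kept, so this is safe."""
        if self.active > 0:
            self.active -= 1
        return self.snapshots[self.active]

    @property
    def current(self):
        return self.snapshots[self.active]
```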
## Viewing History
View all threshold snapshots with timestamps, F1 scores, and reasons for each change:
```shell
skill-creator calibrate history
```
Example output:
```text
Threshold History

  Timestamp            Threshold   F1      Reason
  ------------------------------------------------------------
> 2/5/2026, 10:30 AM   0.72        87.1%   calibration
  2/4/2026, 3:45 PM    0.75        82.3%   calibration
  2/1/2026, 9:00 AM    0.75        78.5%   manual

Total: 3 snapshot(s)
```
The `>` marker indicates the currently active threshold. Each entry shows when the change was made, the threshold value, the F1 score at that time, and whether the change came from calibration, rollback, or manual adjustment.
## Recommended Calibration Schedule
For most users, the following schedule works well:
- Initial calibration: After accumulating 75+ events (typically 2-4 weeks of normal usage)
- Recalibration: After adding 5+ new skills, or every 3-4 weeks if actively creating skills
- Ad-hoc: Whenever you notice a significant change in activation accuracy
## What’s Next
- Detecting Conflicts — Find and resolve semantic overlap that may affect activation accuracy
- Command Reference — Full CLI reference for calibrate, benchmark, and related commands

