Calibrating Thresholds


Guide ID: T-4
Audience: Developers, SysAdmins
Prerequisites: T-1: Creating Your First Skill
Time: 10 minutes
Difficulty: Intermediate


Threshold calibration optimizes the activation threshold that determines when skills trigger. The default threshold is 0.75, but your actual usage patterns may benefit from a different value. Calibration analyzes collected activation events to find the threshold that maximizes the F1 score, balancing precision (avoiding false positives) and recall (catching true activations).
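To make the optimization concrete, here is a minimal sketch of how an F1-maximizing threshold search works in principle. This is an illustration of the technique, not the tool's actual implementation; the event data and candidate thresholds are made up.

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def best_threshold(events, candidates):
    """events: (similarity, should_activate) pairs; returns (threshold, f1)."""
    best = None
    for t in candidates:
        tp = sum(1 for s, y in events if s >= t and y)       # correct activations
        fp = sum(1 for s, y in events if s >= t and not y)   # wrong skill fired
        fn = sum(1 for s, y in events if s < t and y)        # skill stayed silent
        f1 = f1_score(tp, fp, fn)
        if best is None or f1 > best[1]:
            best = (t, f1)
    return best

# Made-up sample events: similarity score and whether activation was correct.
events = [(0.81, True), (0.74, True), (0.78, False), (0.69, False), (0.77, True)]
print(best_threshold(events, [0.70, 0.72, 0.75, 0.80]))
```

Lowering the threshold trades precision for recall and vice versa; the search simply picks the candidate where the F1 trade-off peaks.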

When to Calibrate

Calibration is worth running when you notice suboptimal activation behavior:

  • Test accuracy is below 80% across your skill library
  • Skills activate too aggressively (false positives — wrong skill fires for a prompt)
  • Skills do not activate when expected (false negatives — correct skill stays silent)
  • You have accumulated at least 75 calibration events from normal usage

Step 1: Accumulate Calibration Data

Calibration events are recorded automatically during normal skill usage. The system tracks which skill activated, whether you continued working (correct activation) or corrected the output (wrong activation), and the similarity scores for all skills. You do not need to do anything special — just use your skills normally.

Check how many events have been collected:

skill-creator benchmark

You need at least 75 events with known outcomes before calibration can run. If you see a message indicating insufficient data, continue using your skills and check back later.
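Conceptually, each recorded event pairs the skill that fired and its similarity scores with the observed outcome. A minimal sketch of such a record follows; the class name, field names, and skill names are illustrative assumptions, not the tool's actual schema.

```python
from dataclasses import dataclass, field

MIN_EVENTS = 75  # minimum labeled events before calibration can run

# Hypothetical event record (field names are assumptions for illustration).
@dataclass
class CalibrationEvent:
    activated_skill: str   # skill that fired for the prompt
    correct: bool          # True if you kept the output, False if you corrected it
    scores: dict = field(default_factory=dict)  # similarity score per skill

events = [
    CalibrationEvent("git-helper", True, {"git-helper": 0.81, "sql-helper": 0.42}),
    CalibrationEvent("sql-helper", False, {"git-helper": 0.55, "sql-helper": 0.76}),
]
print(f"{len(events)} of {MIN_EVENTS} events collected; "
      f"ready: {len(events) >= MIN_EVENTS}")
```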

Step 2: Preview Calibration

Before applying any changes, preview what calibration would do. The preview shows the current threshold, the proposed optimal threshold, and the expected improvement in F1 score:

skill-creator calibrate --preview

Example output:

Calibration Analysis

Current threshold: 0.75 (F1: 82.3%)
Optimal threshold: 0.72 (F1: 87.1%)
Improvement: +4.8%

Based on 156 calibration events

Checkpoint 1

Verify: The preview output shows both the current and optimal thresholds with their F1 scores. The improvement percentage should be positive. If the improvement is zero or negative, calibration is not needed — your current threshold is already optimal.

Step 3: Apply Calibration

If the preview looks good, apply the optimized threshold:

skill-creator calibrate

The command shows the same analysis as the preview and asks for confirmation before applying. Type Y to confirm or n to cancel. You can also use --force to skip the confirmation prompt.

The new threshold is stored in ~/.gsd-skill/calibration/threshold.json and takes effect immediately for all subsequent activation decisions.
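If you want to inspect the stored value directly, a sketch like the following works. Note that the JSON field name ("threshold") is an assumption for illustration; only the file path is documented above.

```python
import json
from pathlib import Path

# Documented location of the calibrated threshold.
path = Path.home() / ".gsd-skill" / "calibration" / "threshold.json"

if path.exists():
    data = json.loads(path.read_text())
    # "threshold" is an assumed field name, not a documented schema.
    print(f"Active threshold: {data.get('threshold')}")
else:
    print("No calibrated threshold yet; the default 0.75 applies.")
```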

Step 4: Verify Improvement

After applying calibration, measure the real-world impact with the benchmark command:

skill-creator benchmark --verbose

The verbose benchmark shows correlation (MCC), agreement rate, confusion matrix, and a per-skill breakdown. Look for improvement in the correlation and agreement metrics compared to before calibration.

Checkpoint 2

Verify: The benchmark shows correlation (MCC) of 85% or higher. If correlation is below 85%, the benchmark exits with code 1. Review per-skill results in verbose mode to identify which skills are dragging down accuracy.
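For reference, the correlation metric (MCC) is a standard function of the confusion matrix the verbose benchmark prints. The sketch below shows the computation; the counts are made up for illustration.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up counts: mostly correct decisions with a few errors.
score = mcc(tp=140, tn=130, fp=8, fn=10)
print(f"MCC: {score:.1%}")  # the 85% checkpoint corresponds to MCC >= 0.85
```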

Rollback

If calibration makes activation behavior worse, you can immediately undo the change:

skill-creator calibrate rollback

This reverts to the previous threshold value. The system keeps a complete history of all threshold changes, so you can always go back.

Viewing History

View all threshold snapshots with timestamps, F1 scores, and reasons for each change:

skill-creator calibrate history

Example output:

Threshold History

  Timestamp            Threshold    F1       Reason
  ------------------------------------------------------------
> 2/5/2026, 10:30 AM   0.72         87.1%    calibration
  2/4/2026, 3:45 PM    0.75         82.3%    calibration
  2/1/2026, 9:00 AM    0.75         78.5%    manual

Total: 3 snapshot(s)

The > marker indicates the currently active threshold. Each entry shows when the change was made, the threshold value, the F1 score at that time, and whether the change was from calibration, rollback, or manual adjustment.

Recommended Calibration Schedule

For most users, the following schedule works well:

  • Initial calibration: After accumulating 75+ events (typically 2-4 weeks of normal usage)
  • Recalibration: After adding 5+ new skills, or every 3-4 weeks if actively creating skills
  • Ad-hoc: Whenever you notice a significant change in activation accuracy

What’s Next

  • Detecting Conflicts — Find and resolve semantic overlap that may affect activation accuracy
  • Command Reference — Full CLI reference for calibrate, benchmark, and related commands