calibration
Your own track record, by domain. For each prediction domain you have worked in, returns accuracy, the confidence gap, the Brier score, and mean surprise. This is your epistemic mirror.
Signature
    calibration({ since_days?: number }) → {
      total_resolved: number,
      window_days: number | null,
      domains: Array<{
        domain: string,
        n: number,
        accuracy: number,
        correct: number,
        incorrect: number,
        partial: number,
        mean_confidence_when_correct: number | null,
        mean_confidence_when_incorrect: number | null,
        confidence_gap: number | null,
        brier_score: number | null,
        mean_surprise: number
      }>
    }
| Param | Type | Required | Description |
|---|---|---|---|
| since_days | int | no | Only count predictions resolved in the last N days. Default: all-time. |
Domains are returned sorted by n descending — the domains you predict in most appear first.
Example
    curl "https://mneva.dev/v1/calibration" \
      -H "x-mneva-key: $MNEVA_KEY"
Response:
    {
      "total_resolved": 12,
      "window_days": null,
      "domains": [
        {
          "domain": "schema",
          "n": 7,
          "accuracy": 0.86,
          "correct": 6,
          "incorrect": 1,
          "partial": 0,
          "mean_confidence_when_correct": 0.62,
          "mean_confidence_when_incorrect": 0.40,
          "confidence_gap": 0.22,
          "brier_score": 0.12,
          "mean_surprise": 0.18
        },
        {
          "domain": "deploy",
          "n": 5,
          "accuracy": 0.4,
          "correct": 2,
          "incorrect": 3,
          "partial": 0,
          "mean_confidence_when_correct": 0.65,
          "mean_confidence_when_incorrect": 0.88,
          "confidence_gap": -0.23,
          "brier_score": 0.31,
          "mean_surprise": 0.45
        }
      ]
    }
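If you consume this response from typed code, the signature can be transcribed into TypeScript interfaces. The sketch below does that and adds a hypothetical `checkResponse` helper, which is not part of the API. The helper verifies the documented sort order. It also checks two things that hold in the example above but are assumptions rather than documented guarantees: that `n` equals `correct + incorrect + partial`, and that the per-domain `n` values sum to `total_resolved`.

```typescript
// Types transcribed from the signature above.
interface DomainCalibration {
  domain: string;
  n: number;
  accuracy: number;
  correct: number;
  incorrect: number;
  partial: number;
  mean_confidence_when_correct: number | null;
  mean_confidence_when_incorrect: number | null;
  confidence_gap: number | null;
  brier_score: number | null;
  mean_surprise: number;
}

interface CalibrationResponse {
  total_resolved: number;
  window_days: number | null;
  domains: DomainCalibration[];
}

// Hypothetical helper: returns a list of problems, empty when none are found.
function checkResponse(res: CalibrationResponse): string[] {
  const problems: string[] = [];
  let sumN = 0;
  let prevN: number | null = null;
  for (const d of res.domains) {
    sumN += d.n;
    // ASSUMPTION: the three outcome counts partition n (true in the example).
    if (d.correct + d.incorrect + d.partial !== d.n) {
      problems.push(`${d.domain}: outcome counts do not add up to n`);
    }
    // Documented: domains are sorted by n descending.
    if (prevN !== null && d.n > prevN) {
      problems.push(`${d.domain}: not sorted by n descending`);
    }
    prevN = d.n;
  }
  // ASSUMPTION: per-domain n values sum to total_resolved (true in the example).
  if (sumN !== res.total_resolved) {
    problems.push("domain n values do not sum to total_resolved");
  }
  return problems;
}
```

For the example response above, this check would return an empty list (7 + 5 = 12, and each domain's counts add up to its `n`).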
Reading confidence_gap
The load-bearing column. It's mean_confidence_when_correct minus mean_confidence_when_incorrect: the difference between your mean confidence when right and your mean confidence when wrong.
- Positive (e.g. +0.22): your confidence is informative. You were more confident on the predictions you got right than on the ones you got wrong. Trust your gut in this domain.
- Near zero (within ±0.05): your confidence is uncorrelated with being right. It carries essentially no signal here; treat your gut as a coin flip until enough data accumulates to move the gap.
- Negative (e.g. −0.23): your confidence is anti-correlated with being right. You were more sure when you were wrong. Discount your confidence in this domain or, better, slow down and verify before declaring confidence.
The example above is the classic shape: schema (well-calibrated) vs deploy (over-confident on the misses). Believe yourself less when you are about to commit to a prediction in the deploy domain.
null values mean there isn't enough data in that bucket — e.g., mean_confidence_when_correct: null if no predictions in this domain have been resolved correct yet.
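The reading rules above can be sketched as a small classifier. The function names and the returned labels are hypothetical, not part of the API. The 0.05 band comes from the "near zero" bullet. Treating the band edges as inclusive, and returning null for the gap when either mean is null, are assumptions.

```typescript
type GapReading =
  | "informative"
  | "uninformative"
  | "anti-correlated"
  | "insufficient-data";

// Mean confidence when right minus mean confidence when wrong.
// ASSUMPTION: the gap is undefined (null) when either bucket has no data.
function confidenceGap(
  whenCorrect: number | null,
  whenIncorrect: number | null
): number | null {
  if (whenCorrect === null || whenIncorrect === null) return null;
  return whenCorrect - whenIncorrect;
}

// Maps a gap to one of the three readings, or "insufficient-data" for null.
// ASSUMPTION: |gap| <= nearZero counts as "near zero" (inclusive boundary).
function readGap(gap: number | null, nearZero: number = 0.05): GapReading {
  if (gap === null) return "insufficient-data";
  if (Math.abs(gap) <= nearZero) return "uninformative";
  return gap > 0 ? "informative" : "anti-correlated";
}
```

With the example response, `readGap(0.22)` gives `"informative"` for schema and `readGap(-0.23)` gives `"anti-correlated"` for deploy.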
See also
- assess: pulls a track_record block sourced from this same data, scoped to a topic
- predict, resolve: feed this
- Instinct & Calibration: concept
Reading brier_score
The textbook calibration metric: mean((confidence - outcome)^2), where outcome is 1 / 0 / 0.5 for correct / incorrect / partial. Range 0..1, lower is better.
- 0.0: perfect. Every prediction made at confidence 1.0 was right, and every prediction made at confidence 0.0 was wrong.
- 0.25: the constant-0.5 forecaster baseline on correct/incorrect outcomes. You're providing no information beyond the prior.
- >0.25: you're worse than guessing 0.5 every time. Usually means you're confidently wrong.
- 1.0: catastrophic miscalibration. Every prediction was made at full confidence and turned out the opposite way.
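To make the formula concrete, here is a minimal sketch of the computation over a list of resolved predictions. The `ResolvedPrediction` shape and the `brierScore` function are hypothetical illustrations, not part of the API. The outcome values (1 / 0 / 0.5) follow the definition above, and returning null for an empty list is an assumption that mirrors the nullable `brier_score` field.

```typescript
type Outcome = "correct" | "incorrect" | "partial";

// Hypothetical input shape: one resolved prediction.
interface ResolvedPrediction {
  confidence: number; // stated confidence, in the range 0..1
  outcome: Outcome;
}

// Numeric outcome values as defined above.
const OUTCOME_VALUE: Record<Outcome, number> = {
  correct: 1,
  incorrect: 0,
  partial: 0.5,
};

// mean((confidence - outcome)^2); null when there is nothing to score.
function brierScore(predictions: ResolvedPrediction[]): number | null {
  if (predictions.length === 0) return null;
  let sumSquaredError = 0;
  for (const p of predictions) {
    const error = p.confidence - OUTCOME_VALUE[p.outcome];
    sumSquaredError += error * error;
  }
  return sumSquaredError / predictions.length;
}
```

A forecaster who always says 0.5 scores 0.25 on any mix of correct and incorrect outcomes, because each squared error is (0.5)^2. That is where the baseline in the list above comes from.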
Brier complements confidence_gap. The gap tells you whether your confidence is informative in direction; the Brier score tells you how far your calibration is from ideal in magnitude. A well-calibrated domain has a positive gap and a low Brier score; an over-confident, anti-correlated domain has a negative gap and a high Brier score.
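One way to act on the two signals together is sketched below. The `verdict` function and its labels are hypothetical. The cutoffs are assumptions borrowed from this page: the ±0.05 near-zero band for the gap, and the 0.25 constant-0.5 baseline as the line between a low and a high Brier score.

```typescript
// Only the two fields this sketch needs from a domain entry.
interface DomainSignals {
  domain: string;
  confidence_gap: number | null;
  brier_score: number | null;
}

type Verdict = "trust" | "verify-first" | "mixed" | "unknown";

// ASSUMPTION: 0.05 and 0.25 are used as cutoffs; the API does not define them as such.
function verdict(d: DomainSignals): Verdict {
  if (d.confidence_gap === null || d.brier_score === null) return "unknown";
  const gapPositive = d.confidence_gap > 0.05;
  const gapNegative = d.confidence_gap < -0.05;
  const brierLow = d.brier_score < 0.25;
  if (gapPositive && brierLow) return "trust"; // positive gap, low Brier
  if (gapNegative && !brierLow) return "verify-first"; // negative gap, high Brier
  return "mixed"; // the two signals disagree, or the gap is near zero
}
```

Applied to the example response, schema (gap 0.22, Brier 0.12) comes out as `"trust"` and deploy (gap -0.23, Brier 0.31) as `"verify-first"`.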