calibration

Your own track record, by domain. For each prediction domain you have worked in, this returns accuracy, the confidence gap, the Brier score, and mean surprise. This is your epistemic mirror.

Signature

calibration({ since_days?: number }) → {
  total_resolved: number,
  window_days: number | null,
  domains: Array<{
    domain: string,
    n: number,
    accuracy: number,
    correct: number,
    incorrect: number,
    partial: number,
    mean_confidence_when_correct: number | null,
    mean_confidence_when_incorrect: number | null,
    confidence_gap: number | null,
    brier_score: number | null,
    mean_surprise: number
  }>
}

Param | Type | Required | Description
since_days | int | no | Only count predictions resolved in the last N days. Default: all-time.

Domains are returned sorted by n descending — the domains you predict in most appear first.

Example

curl "https://mneva.dev/v1/calibration" \
  -H "x-mneva-key: $MNEVA_KEY"

Response:

{
  "total_resolved": 12,
  "window_days": null,
  "domains": [
    {
      "domain": "schema",
      "n": 7,
      "accuracy": 0.86,
      "correct": 6,
      "incorrect": 1,
      "partial": 0,
      "mean_confidence_when_correct": 0.62,
      "mean_confidence_when_incorrect": 0.40,
      "confidence_gap": 0.22,
      "brier_score": 0.12,
      "mean_surprise": 0.18
    },
    {
      "domain": "deploy",
      "n": 5,
      "accuracy": 0.4,
      "correct": 2,
      "incorrect": 3,
      "partial": 0,
      "mean_confidence_when_correct": 0.65,
      "mean_confidence_when_incorrect": 0.88,
      "confidence_gap": -0.23,
      "brier_score": 0.31,
      "mean_surprise": 0.45
    }
  ]
}

Reading confidence_gap

The load-bearing column. It is mean_confidence_when_correct minus mean_confidence_when_incorrect: how much more confident you were on the predictions you got right than on the ones you got wrong.

  • Positive (e.g. +0.22) — your confidence is informative. When you were sure, you were right; when you hedged, you were less right. Trust your gut in this domain.
  • Near zero (±0.05) — your confidence is uncorrelated with being right. It carries essentially no signal here; treat your gut as a coin flip until enough data accumulates to move the gap away from zero.
  • Negative (e.g. −0.23) — your confidence is anti-correlated. You were more sure when you were wrong. Discount your confidence in this domain or, better, slow down and verify before declaring confidence.
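
The three bands above can be turned into a small lookup. This is a minimal sketch, not part of the API: the names GapReading and readGap are hypothetical, the ±0.05 band is taken from the list above, and treating exactly ±0.05 as "near zero" is an assumption.

```typescript
// Hypothetical helper: maps a confidence_gap value to the reading described above.
type GapReading =
  | "informative"
  | "uninformative"
  | "anti-correlated"
  | "insufficient-data";

// Half-width of the "near zero" band (±0.05 in the text).
const NEAR_ZERO = 0.05;

function readGap(gap: number | null): GapReading {
  // confidence_gap is typed number | null in the signature.
  if (gap === null) return "insufficient-data";
  // Assumption: the band edges themselves count as "near zero".
  if (Math.abs(gap) <= NEAR_ZERO) return "uninformative";
  return gap > 0 ? "informative" : "anti-correlated";
}

console.log(readGap(0.22));  // schema in the example response
console.log(readGap(-0.23)); // deploy in the example response
```

With the example response, schema falls in the informative band and deploy in the anti-correlated band.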

The example above is the classic shape: schema (well-calibrated) vs deploy (over-confident on the misses). Believe yourself less when you are about to commit on deploy.

null values mean there isn't enough data in that bucket. For example, mean_confidence_when_correct is null if no predictions in this domain have been resolved as correct yet.
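
To make the definition and the null behaviour concrete, here is a sketch of how the gap could be derived from resolved predictions. The Resolved shape and the function names are hypothetical, and two behaviours are assumptions rather than documented facts: partial outcomes are left out of both buckets, and the gap is null whenever either bucket is empty.

```typescript
// Hypothetical shape for one resolved prediction; not the API's storage format.
type Outcome = "correct" | "incorrect" | "partial";
interface Resolved {
  confidence: number; // stated confidence, 0..1
  outcome: Outcome;
}

// Mean of a list, or null when the bucket is empty.
function mean(xs: number[]): number | null {
  if (xs.length === 0) return null;
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function confidenceGap(preds: Resolved[]): number | null {
  // Assumption: "partial" predictions belong to neither bucket.
  const whenCorrect = mean(
    preds.filter((p) => p.outcome === "correct").map((p) => p.confidence)
  );
  const whenIncorrect = mean(
    preds.filter((p) => p.outcome === "incorrect").map((p) => p.confidence)
  );
  // Assumption: an empty bucket makes the gap undefined.
  if (whenCorrect === null || whenIncorrect === null) return null;
  return whenCorrect - whenIncorrect;
}

const sample: Resolved[] = [
  { confidence: 0.75, outcome: "correct" },
  { confidence: 0.25, outcome: "correct" },
  { confidence: 0.25, outcome: "incorrect" },
];
console.log(confidenceGap(sample)); // mean 0.5 when correct, 0.25 when incorrect
```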

Reading brier_score

The textbook calibration metric: mean((confidence - outcome)^2), where outcome is 1 / 0 / 0.5 for correct / incorrect / partial. Range 0..1, lower is better.

  • 0.0 — perfect. Every confident-1.0 prediction was right, every confident-0.0 was wrong.
  • 0.25 — the constant-0.5 forecaster baseline. Your confidence adds no information beyond the prior.
  • >0.25 — you're worse than guessing 0.5 every time. Usually means you're confidently wrong.
  • 1.0 — the worst possible score: every prediction was made at confidence 1.0 and turned out wrong (or at 0.0 and turned out right).
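
The formula and its reference points can be checked with a few lines of code. This is a sketch under stated assumptions: the Resolved shape and the brierScore name are hypothetical, and returning null for an empty list is an assumption suggested by the number | null type in the signature.

```typescript
// Hypothetical shape for one resolved prediction.
type Outcome = "correct" | "incorrect" | "partial";
interface Resolved {
  confidence: number; // stated confidence, 0..1
  outcome: Outcome;
}

// Outcome encoding from the text: 1 / 0 / 0.5 for correct / incorrect / partial.
const OUTCOME_VALUE: Record<Outcome, number> = {
  correct: 1,
  incorrect: 0,
  partial: 0.5,
};

// mean((confidence - outcome)^2) over all resolved predictions.
function brierScore(preds: Resolved[]): number | null {
  if (preds.length === 0) return null; // assumption: no data means no score
  let total = 0;
  for (const p of preds) {
    const diff = p.confidence - OUTCOME_VALUE[p.outcome];
    total += diff * diff;
  }
  return total / preds.length;
}

// Reference points from the list above.
console.log(brierScore([
  { confidence: 1, outcome: "correct" },
  { confidence: 0, outcome: "incorrect" },
])); // perfect forecaster
console.log(brierScore([
  { confidence: 0.5, outcome: "correct" },
  { confidence: 0.5, outcome: "incorrect" },
])); // constant-0.5 baseline
console.log(brierScore([{ confidence: 1, outcome: "incorrect" }])); // fully confident and wrong
```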

Brier complements confidence_gap: the gap tells you whether your confidence points in the right direction, while the Brier score tells you how far your stated confidence sits from the actual outcomes. A well-calibrated domain has a positive gap and a low Brier score; an over-confident, anti-correlated domain has a negative gap and a high Brier score.
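
One way to use the two readings together is to flag domains where either signal looks bad. The sketch below is illustrative only: domainsToVerify is a hypothetical helper, the thresholds are the ±0.05 band and the 0.25 baseline mentioned on this page, and flagging on either signal (rather than requiring both) is an assumption about how cautious you want to be.

```typescript
// Subset of the per-domain fields from the response that this sketch needs.
interface DomainStats {
  domain: string;
  confidence_gap: number | null;
  brier_score: number | null;
}

// Hypothetical helper: names of domains where confidence should not be trusted.
function domainsToVerify(domains: DomainStats[]): string[] {
  return domains
    .filter(
      (d) =>
        // Gap clearly below the near-zero band: confidence is anti-correlated.
        (d.confidence_gap !== null && d.confidence_gap < -0.05) ||
        // Worse than the constant-0.5 baseline.
        (d.brier_score !== null && d.brier_score > 0.25)
    )
    .map((d) => d.domain);
}

// Values copied from the example response.
const example: DomainStats[] = [
  { domain: "schema", confidence_gap: 0.22, brier_score: 0.12 },
  { domain: "deploy", confidence_gap: -0.23, brier_score: 0.31 },
];
console.log(domainsToVerify(example)); // deploy is flagged, schema is not
```

Domains with null values are never flagged here, since there is not yet enough data to judge them.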
