Instinct & Calibration

assess(topic) is the gut-check call — should I be careful here? Mneva's answer is not vibes. It comes from two sources, returned together.

What assess does

Given a topic the agent is about to work on, assess does two things at once:

  1. Risk weighting from memories. Semantically search the tenant's memories, find the highest-caution one related to the topic, weight risk = caution × effective_relevance. Bands: risk >= 0.34danger, risk >= 0.12caution, else clear. The verdict carries the specific memory that raised it.

  2. Track record from predictions. Find resolved predictions in the tenant brain whose embedding is semantically close to the topic (cosine ≥ 0.35), falling back to substring word-overlap for any prediction the brain has not yet embedded. Compute accuracy, mean confidence, mean surprise, and the most-surprising specific misses. Return that as track_record.

The first is "what do we know about this area?" The second is "how often have we been wrong about this kind of thing — and specifically how?"

Calibration is the receipt

Other memory products give you a verdict. Mneva gives you the receipt.

When assess says danger, it tells you which memory raised the flag and which past prediction you were confidently wrong about. The verdict is auditable end-to-end. You can argue with it.

The bands are deliberately wide. Mneva is calibrated to fire on real danger and stay quiet on noise. False alarms train an agent to ignore the signal; the rule is a missed danger is the worse failure.

A real assess

Tenant brain has one flagged memory and one resolved-incorrect prediction in the deploy area. Agent asks:

assess("deploy the new env variables to production")

Response:

{
  "topic": "deploy the new env variables to production",
  "verdict": "danger",
  "risk": 0.8,
  "why": "deploy failure: missing env var caused the last incident",
  "based_on": {
    "memory_id": 1,
    "caution": 0.8,
    "relevance": 0.6958,
    "effective_relevance": 1
  },
  "track_record": {
    "adjacent_predictions": 1,
    "accuracy": 0,
    "mean_confidence": 0.9,
    "mean_surprise": 0.9,
    "most_surprising": [{
      "id": 1,
      "prediction": "deploy will succeed on first try",
      "domain": "deploy",
      "outcome": "incorrect",
      "surprise": 0.9
    }]
  }
}

The agent was 0.9 confident the last deploy would succeed; it did not. That fact is now part of every future assess in the deploy domain. The instinct earned its line.

If there are no adjacent resolved predictions, track_record is null and the assess stands on memories alone. Mneva does not invent a track record.

Calibration as its own tool

calibration returns the full per-domain track record, separately from any specific topic:

{
  "total_resolved": 12,
  "domains": [
    {
      "domain": "deploy",
      "n": 5,
      "accuracy": 0.4,
      "mean_confidence_when_correct": 0.65,
      "mean_confidence_when_incorrect": 0.88,
      "confidence_gap": -0.23,
      "brier_score": 0.31,
      "mean_surprise": 0.45
    },
    {
      "domain": "schema",
      "n": 7,
      "accuracy": 0.86,
      "mean_confidence_when_correct": 0.62,
      "mean_confidence_when_incorrect": 0.40,
      "confidence_gap": 0.22,
      "brier_score": 0.09,
      "mean_surprise": 0.18
    }
  ]
}

The confidence_gap is the load-bearing number. Positive means your confidence is informative — it correlates with being right. Negative means your confidence is anti-correlated; discount it in this domain.

This deploy-vs-schema example is the classic shape: the agent is well-calibrated on schema (modest confidence, mostly right) but over-confident on deploy (high confidence when wrong). The calibration tool exposes the gap.

See also

Was this page helpful?