Why your risk scores are inconsistent and what to do about it

Risk Companion

June 16, 2026
9 min read

Key Takeaways

  • When two people score the same risk differently, the problem is almost never the matrix itself. It is that the team has never agreed on what each level actually means in practice, and that gap produces a register that reflects individual interpretation rather than shared analysis.
  • A single calibration exercise, where the whole team scores the same three or four risks independently and then compares results, will surface more disagreement than most risk managers expect, and the conversation that follows is where the real calibration happens.
  • Frameworks that define impact in concrete, organisation-specific terms, with euro figures for financial impact and day ranges for schedule impact rather than vague labels like "significant" or "moderate", are the single biggest driver of scoring consistency across a team.
  • Risk scoring inconsistency compounds over time. A register built on misaligned inputs produces misleading priorities, and decisions made from those priorities can send resources toward the wrong risks while genuinely serious ones sit quietly in amber.
  • When two experienced people score the same risk very differently, it occasionally means one of them knows something the other does not. Calibration works best when that disagreement is treated as information to surface rather than a problem to average away.

Why consistent risk scoring is harder than it looks

Give two experienced risk managers the same matrix, the same framework, and the same risk description, and you will frequently get meaningfully different scores. One scores probability 3, impact 4. The other scores probability 2, impact 2. On a 5x5 matrix, that is a score of 12 against a score of 4, which, depending on where the bands are drawn, can be the difference between a risk sitting in the red band and one sitting comfortably in amber.
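To make the arithmetic concrete, here is a minimal Python sketch of those two assessments. It assumes the score is probability multiplied by impact, and the band cut-offs (amber from 4, red from 10) are hypothetical values chosen for illustration; your own matrix may draw the lines elsewhere.

```python
# Hypothetical band cut-offs for a 5x5 matrix; substitute your framework's values.
AMBER_FROM = 4   # assumption: scores of 4 and above are at least amber
RED_FROM = 10    # assumption: scores of 10 and above are red


def risk_score(probability: int, impact: int) -> int:
    """Multiply probability by impact, both on a 1-5 scale."""
    for value in (probability, impact):
        if not 1 <= value <= 5:
            raise ValueError("probability and impact must be between 1 and 5")
    return probability * impact


def band(score: int) -> str:
    """Map a score to a colour band using the assumed cut-offs above."""
    if score >= RED_FROM:
        return "red"
    if score >= AMBER_FROM:
        return "amber"
    return "green"


first = risk_score(3, 4)
second = risk_score(2, 2)
print(first, band(first))    # 12 red
print(second, band(second))  # 4 amber
```

Same risk, same matrix, two different bands: the only thing that changed is how each assessor read the scale.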

That gap is a design problem rather than a human failing. Risk scoring is supposed to be systematic, but the inputs it depends on, what counts as 'likely' and what qualifies as 'significant', are almost always underspecified. Teams assume agreement that was never actually reached.

The result is a register that reflects who filled in which row more than it reflects the actual risk landscape. When that register is used to prioritise, you end up prioritising by whoever held the most confident interpretation of a vague scale rather than by the risks themselves.

The real reason scores diverge

The matrix is rarely the problem. Almost every organisation has one. What sits behind the labels is where the inconsistency lives.

Take a five-level probability scale where level three is labelled something like 'possible' or 'moderate likelihood'. What does that actually mean? Once every two years, once every six months, a 30% chance, a 50% chance? If your framework does not define it, everyone fills in the gap with their own mental model, and those mental models differ because professional backgrounds differ, risk appetite differs, and experience of what has actually gone wrong in the past differs.

The problem is even more pronounced with impact. 'High financial impact' means different things to a team running a €500,000 project and one running a €50 million programme. If the framework does not anchor 'high' to a number, say any financial loss above €250,000, then every person scoring financial impact is essentially making up their own threshold and calling it the matrix.

It also runs across every impact perspective: schedule, quality, reputational, and HSE (health, safety and environment). Each one needs a concrete definition at each level, or every score is a guess dressed up as analysis.
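Anchoring a level to a number can be as simple as a lookup against agreed bounds. The sketch below is illustrative only: the euro and day figures are hypothetical, apart from the €250,000 line above which a financial loss counts as 'high' (taken here to be level 4), and the function and constant names are invented for this example.

```python
from bisect import bisect_left

# Hypothetical upper bounds for impact levels 1-4; anything above the last bound is level 5.
FINANCIAL_BOUNDS_EUR = [10_000, 50_000, 250_000, 1_000_000]
SCHEDULE_BOUNDS_DAYS = [5, 15, 30, 90]


def impact_level(value: float, upper_bounds: list[int]) -> int:
    """Return the 1-5 impact level for a measured loss or delay.

    A value exactly on a bound stays in the lower level; only values
    strictly above the bound move up to the next level.
    """
    if value < 0:
        raise ValueError("value must not be negative")
    return bisect_left(upper_bounds, value) + 1


print(impact_level(300_000, FINANCIAL_BOUNDS_EUR))  # 4
print(impact_level(250_000, FINANCIAL_BOUNDS_EUR))  # 3
print(impact_level(20, SCHEDULE_BOUNDS_DAYS))       # 3
```

With bounds like these written down, two assessors looking at the same estimated loss land on the same level, and any remaining disagreement is about the estimate rather than the scale.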

The criticism here is not of the people doing the scoring but of the process that sends them into that scoring without enough shared reference points.

What calibration actually involves

Calibration is the practice of getting a team to the same starting point before they score independently. It sounds straightforward, yet many teams skip it entirely and discover the problem only later, when a board member asks why a risk that appeared green in the register just caused a significant incident.

A practical calibration exercise looks like this:

Step one: Pick three or four real risks. Choose risks the team knows well, things that have come up in recent projects rather than abstract hypotheticals. Include at least one where you suspect scoring will differ.

Step two: Have everyone score them independently. Each person applies the matrix to each risk and records their probability and impact scores privately, without any discussion. Five minutes per risk is generally sufficient.

Step three: Put the scores on the table and compare. When one person scores a risk 3x4 and another scores it 2x2, the conversation that follows, specifically 'what were you thinking about when you chose that level?', surfaces the assumptions that were never made explicit. One person was thinking about financial exposure while another was thinking about operational disruption. Both readings are legitimate, but they are not comparable without a shared definition to anchor them.

Step four: Work backward to definitions. Once the divergences are visible, the team can agree on what each level should mean in terms of the specific organisation, project, and risk types in the room rather than in abstract terms. Write those definitions down and attach them to the framework.

Step five: Rescore and compare again. After definitions are agreed, run the same exercise a second time. Divergences should narrow significantly, and where they do not, the definition needs more work before the team starts scoring the full register.

The whole exercise takes about an hour. It is uncomfortable in a useful way, because it forces people who have been scoring in private to defend their reasoning in public. Teams find, often with some surprise, that their register has been quietly inconsistent for months.
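If the scores from steps two and five are recorded, the comparison can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the risk names and scores are made up, the spread is measured as the gap between the highest and lowest probability-times-impact product, and the tolerance of 4 is an arbitrary placeholder for whatever your team considers close enough.

```python
# Risk name -> one (probability, impact) pair per assessor.
Scores = dict[str, list[tuple[int, int]]]


def score_spread(assessments: list[tuple[int, int]]) -> int:
    """Gap between the highest and lowest probability x impact product."""
    products = [probability * impact for probability, impact in assessments]
    return max(products) - min(products)


def still_divergent(rescored: Scores, tolerance: int = 4) -> list[str]:
    """Risks whose spread exceeds the tolerance after rescoring."""
    return [risk for risk, a in rescored.items() if score_spread(a) > tolerance]


# Made-up example data for three assessors.
round_one: Scores = {
    "Key supplier delay": [(3, 4), (2, 2), (3, 3)],
    "Permit rejected": [(2, 5), (4, 5), (2, 3)],
}
round_two: Scores = {
    "Key supplier delay": [(3, 4), (3, 3), (3, 3)],
    "Permit rejected": [(2, 5), (4, 5), (3, 5)],
}

for risk in round_one:
    print(risk, score_spread(round_one[risk]), "->", score_spread(round_two[risk]))

print(still_divergent(round_two))  # ['Permit rejected']
```

In this made-up case the first risk converges after the definitions are agreed, while the second is still split on probability, which points to the probability definitions as the ones needing more work.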

In Risk Companion, you can run this kind of structured session with your whole team, scoring together in real time. The interactive risk sessions feature is designed for exactly this kind of workshop: everyone in the room, inputs captured directly, no notes to transcribe afterward. You leave with a calibrated team and a populated register rather than a page of disagreements to sort out later.

Fixing the framework so calibration holds over time

A calibration workshop solves the immediate problem but leaves the structural one untouched. If the framework underlying your risk matrix still has vague level definitions, the next person who joins the team will start scoring from their own assumptions, and the inconsistency creeps back in.

The fix is to treat your framework settings as a living document rather than a one-time setup task. That means:

Defining impact levels in concrete, organisation-specific terms. For financial impact, attach euro figures to each level; for schedule impact, attach day ranges; and for reputational impact, describe the type of exposure at each level, whether that is internal only, external with media coverage, or regulatory attention. 'Moderate reputational damage' is a placeholder rather than a definition, and placeholders produce inconsistent scores.

Defining probability levels with time horizons or frequencies. If level three means 'this has happened once in the last three years across similar projects', write that down. A percentage is useful, a frequency reference is better, and combining both gives assessors the clearest possible anchor.

Reviewing definitions when the context changes. A framework built for a €2 million construction project may not be appropriate for a €20 million one. The labels stay the same; what they anchor to should shift.
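As a rough illustration of what written-down definitions might look like, the Python sketch below pairs each probability level with both a percentage and a frequency reference, and derives financial bounds from the project budget so the anchors shift when the context does. Scaling by budget share is one possible approach assumed for this example, not something prescribed here, and every figure and name is hypothetical except the level-three frequency wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbabilityLevel:
    level: int
    label: str
    max_annual_probability: float  # upper bound of the percentage anchor
    frequency_anchor: str          # plain-language frequency reference


# Hypothetical definitions; only the level-three frequency wording comes from the text above.
PROBABILITY_LEVELS = [
    ProbabilityLevel(1, "rare", 0.05, "not seen in the last ten years on similar projects"),
    ProbabilityLevel(2, "unlikely", 0.20, "seen once in the last five to ten years"),
    ProbabilityLevel(3, "possible", 0.50, "happened once in the last three years across similar projects"),
    ProbabilityLevel(4, "likely", 0.80, "happens on most similar projects"),
    ProbabilityLevel(5, "almost certain", 1.00, "expected on every similar project"),
]

# Assumed financial bounds for impact levels 1-4, in basis points of project budget.
IMPACT_SHARES_BP = [50, 250, 1250, 5000]


def probability_level(annual_probability: float) -> ProbabilityLevel:
    """Return the first level whose percentage anchor covers the estimate."""
    if not 0 <= annual_probability <= 1:
        raise ValueError("annual_probability must be between 0 and 1")
    return next(
        d for d in PROBABILITY_LEVELS
        if annual_probability <= d.max_annual_probability
    )


def financial_bounds(budget_eur: int) -> list[int]:
    """Scale the euro bounds to the budget using integer arithmetic."""
    return [budget_eur * bp // 10_000 for bp in IMPACT_SHARES_BP]


print(probability_level(0.3).label)   # possible
print(financial_bounds(2_000_000))    # [10000, 50000, 250000, 1000000]
print(financial_bounds(20_000_000))   # [100000, 500000, 2500000, 10000000]
```

The labels are identical for both projects; only the euro figures behind them change.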

Risk Companion's framework system is built on exactly this principle. You configure probability and impact levels to match your method, your thresholds, and your organisation's language, and the matrix, scoring, and reporting all follow from that configuration so what the framework says and what the tool shows are always in sync.

This matters because consistency is what makes a register trustworthy enough to drive decisions rather than just document them. A risk score only means something if it means the same thing to the person who created it as to the person reading it, and without that shared ground the register becomes a record of individual interpretations rather than a shared picture of actual risk.

When scoring disagreement is worth paying attention to

Not all scoring divergence points to a calibration failure. Sometimes it points to something more useful: a genuine difference in what two people know about the same risk.

When two experienced people look at the same risk and score it very differently, that occasionally means one of them knows something the other has not seen yet. The calibration conversation surfaces that information as much as it aligns definitions. The person who scored financial impact higher might have seen a similar failure mode on a previous project. The person who scored probability lower might know of controls the other person was not aware of.

Calibration works best when it is a genuine conversation instead of a process of pushing everyone toward the average. The goal is shared understanding, which is different from artificial uniformity. A team that has talked through why they disagree will produce better scores than one that has been drilled to use the same number.
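A small Python sketch shows why averaging is the wrong reflex. The assessor names, scores, and the gap of 6 used to trigger a discussion are all hypothetical; the point is only that the mean can look unremarkable while the gap behind it is large.

```python
from statistics import mean


def summarise(scores: dict[str, tuple[int, int]], max_gap: int = 6) -> dict:
    """Summarise one risk; scores maps assessor -> (probability, impact)."""
    products = {name: p * i for name, (p, i) in scores.items()}
    gap = max(products.values()) - min(products.values())
    return {
        "mean": mean(products.values()),
        "gap": gap,
        "discuss": gap > max_gap,  # flag for conversation instead of averaging
        "lowest": min(products, key=products.get),
        "highest": max(products, key=products.get),
    }


# Made-up example: two assessors, same risk.
summary = summarise({"Ana": (4, 4), "Ben": (2, 2)})
print(summary["mean"], summary["gap"], summary["discuss"])  # 10 12 True
print(summary["highest"], summary["lowest"])                # Ana Ben
```

Recording a 10 would hide the fact that one assessor sees a 16 and the other a 4. Flagging the pair for a conversation is what gives the higher scorer the chance to explain what they have seen.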

The register should reflect what the team actually believes, with enough structure to make those beliefs comparable. That is the balance worth aiming for.

If you want to see what a calibrated, consistently scored register looks like in practice, Risk Companion's free 14-day trial lets you configure your own framework definitions, run a real calibration session with your team, and see how the risk matrix and scoring hold up once everyone is working from the same starting point.

Frequently Asked Questions

The most common cause is that the risk framework's level definitions are too vague to anchor scores consistently. When 'high impact' or 'likely' are not defined in concrete, organisation-specific terms, each person substitutes their own mental model. Professional background, previous experience, and individual risk appetite all push scores in different directions without anyone realising it.