Why do different people score the same risk differently?

The most common cause is that the risk framework's level definitions are too vague to anchor scores consistently. When 'high impact' or 'likely' are not defined in concrete, organisation-specific terms, each person substitutes their own mental model. Professional background, previous experience, and individual risk appetite all push scores in different directions without anyone realising it.

What is risk scoring calibration and how does it work?

Calibration is the process of aligning a team's interpretation of a risk matrix before they score independently. In practice, it means having everyone score the same set of known risks privately, then comparing results and discussing the assumptions behind each score. The differences that emerge point directly to the definitions that need to be made more precise in your framework.

How do I make my risk matrix scoring more consistent across a team?

Start by attaching concrete, organisation-specific definitions to every probability and impact level in your framework. For financial impact, that means euro thresholds. For probability, it means frequency ranges or time horizons. Then run a calibration exercise where the team scores real risks independently and compares results. Inconsistencies narrow significantly once definitions are specific rather than descriptive.

How often should a team run a risk scoring calibration exercise?

Run one at the start of any new project and whenever new team members join. A calibration exercise is also worth running after a significant incident or near-miss, since those events often reveal that the scoring at the time did not reflect the actual level of risk. Annual calibration as part of a framework review is a reasonable minimum for ongoing programmes.

Does inconsistent risk scoring really affect decisions?

Yes, and more than most teams realise. A register built on misaligned inputs produces a distorted priority order. The risks at the top of the list may not be the ones that actually pose the greatest exposure — they may just be the ones scored by the most risk-averse person on the team. Decisions about where to focus measures, where to report, and what to escalate all flow from those priority rankings.

What is the difference between a risk framework and a risk matrix?

The risk matrix is the visual grid that plots probability against impact. The framework is everything that defines how the matrix works: what each probability level means, how each impact perspective is measured, what thresholds apply at each level, and how the resulting scores are calculated. A matrix without a well-defined framework behind it is just a grid — it cannot produce consistent scores on its own.

Why Your Risk Scores Are Inconsistent and How to Fix It

Why consistent risk scoring is harder than it looks

Ask two experienced risk managers to score the same risk, giving them the same matrix, the same framework, and the same description. You will frequently get meaningfully different answers. One scores probability 3, impact 4. The other scores probability 2, impact 2. On a 5x5 matrix where the score is probability multiplied by impact, that is 12 against 4, which can be the difference between a risk sitting in the red band and one sitting comfortably in amber.
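To make the arithmetic concrete, here is a minimal sketch that assumes the common convention of scoring a risk as probability multiplied by impact. The band thresholds (red from 12, amber from 4) are hypothetical values chosen only to match the example above; your own framework will set its own.

```python
# Hypothetical band thresholds for a 5x5 matrix; real frameworks define their own.
RED_FROM = 12
AMBER_FROM = 4


def risk_score(probability: int, impact: int) -> int:
    """Score a risk as probability x impact, each on a 1-5 scale."""
    for name, value in (("probability", probability), ("impact", impact)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return probability * impact


def band(score: int) -> str:
    """Map a score to a colour band using the assumed thresholds above."""
    if score >= RED_FROM:
        return "red"
    if score >= AMBER_FROM:
        return "amber"
    return "green"


# Same risk, same matrix, two assessors.
assessor_a = risk_score(3, 4)
assessor_b = risk_score(2, 2)
print(assessor_a, band(assessor_a))
print(assessor_b, band(assessor_b))
```

A one-level difference on each axis is enough to move the risk across a band boundary, which is why small disagreements about what a level means matter so much.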

That gap is a design problem rather than a human failing. Risk scoring is supposed to be systematic, but the inputs it depends on, what counts as 'likely' and what qualifies as 'significant', are almost always underspecified. Teams assume agreement that was never actually reached.

The result is a register that reflects who filled in which row more than it reflects the actual risk landscape. When that register is used to prioritise, you are effectively managing whoever had the most confident interpretation of a vague scale rather than managing the risks themselves.

The real reason scores diverge

The matrix is rarely the problem. Almost every organisation has one. What sits behind the labels is where the inconsistency lives.

Take a five-level probability scale where level three is labelled something like 'possible' or 'moderate likelihood'. What does that actually mean? Once every two years, once every six months, a 30% chance, a 50% chance? If your framework does not define it, everyone fills in the gap with their own mental model, and those mental models differ because professional backgrounds differ, risk appetite differs, and experience of what has actually gone wrong in the past differs.

The problem is even more pronounced with impact. 'High financial impact' means different things to a team running a €500,000 project and one running a €50 million programme. If the framework does not anchor 'high' to a number, say any financial loss above €250,000, then every person scoring financial impact is essentially making up their own threshold and calling it the matrix.

It also applies to every other impact perspective: schedule, quality, reputation, and health, safety and environment (HSE). Each one needs a concrete definition at each level, or every score is a guess dressed up as analysis.
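A minimal sketch of what anchoring financial impact to numbers can look like. The euro bounds are hypothetical; only the €250,000 boundary for 'high' comes from the example above, and the assumption here is that 'high' is level 4 on a five-level scale.

```python
from bisect import bisect_left

# Hypothetical upper bounds in euros for impact levels 1 to 4.
# A loss strictly above a bound moves up to the next level;
# anything above the last bound is level 5.
FINANCIAL_UPPER_BOUNDS = [10_000, 50_000, 250_000, 1_000_000]


def financial_impact_level(loss_eur: float) -> int:
    """Return the impact level (1-5) for an estimated financial loss."""
    if loss_eur < 0:
        raise ValueError("loss_eur must not be negative")
    return bisect_left(FINANCIAL_UPPER_BOUNDS, loss_eur) + 1


print(financial_impact_level(300_000))  # above the 250,000 bound, so 'high'
print(financial_impact_level(40_000))
```

With a lookup like this, two assessors who agree on the estimated loss cannot disagree on the level. Any remaining disagreement is about the estimate itself, which is the conversation worth having.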

The criticism here is not of the people doing the scoring but of the process that sends them into that scoring without enough shared reference points.

What calibration actually involves

Calibration is the practice of getting a team to the same starting point before they score independently. It sounds straightforward, yet in practice many teams skip it entirely and discover the problem only later, when a board member asks why a risk that appeared green in the register has just caused a significant incident.

A practical calibration exercise looks like this:

Step one: Pick three or four real risks. Choose risks the team knows well, things that have come up in recent projects rather than abstract hypotheticals. Include at least one where you suspect scoring will differ.

Step two: Have everyone score them independently. Each person applies the matrix to each risk and records their probability and impact scores privately, without any discussion. Five minutes per risk is generally sufficient.

Step three: Put the scores on the table and compare. When one person scores a risk 3x4 and another scores it 2x2, the conversation that follows, specifically 'what were you thinking about when you chose that level?', surfaces the assumptions that were never made explicit. One person was thinking about financial exposure while another was thinking about operational disruption. Both readings are legitimate, but they are not comparable without a shared definition to anchor them.

Step four: Work backward to definitions. Once the divergences are visible, the team can agree on what each level should mean in terms of the specific organisation, project, and risk types in the room rather than in abstract terms. Write those definitions down and attach them to the framework.

Step five: Rescore and compare again. After definitions are agreed, run the same exercise a second time. Divergences should narrow significantly, and where they do not, the definition needs more work before the team starts scoring the full register.

The whole exercise takes about an hour. It is uncomfortable in a useful way, because it forces people who have been scoring in private to defend their reasoning in public. Teams find, often with some surprise, that their register has been quietly inconsistent for months.
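The comparison in step three can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed method: the risk names, assessor names, scores, and the tolerance of one level are all hypothetical.

```python
# Hypothetical calibration results: risk -> assessor -> (probability, impact)
scores = {
    "Supplier delay": {"Ana": (3, 4), "Ben": (2, 2), "Chloe": (3, 3)},
    "Permit refusal": {"Ana": (2, 4), "Ben": (2, 4), "Chloe": (2, 5)},
}

MAX_SPREAD = 1  # assumed tolerance: levels may differ by at most one


def spread(values):
    """Distance between the highest and lowest level given."""
    return max(values) - min(values)


def divergence_report(scores, max_spread=MAX_SPREAD):
    """Flag risks where assessors differ by more than max_spread levels."""
    report = {}
    for risk, by_assessor in scores.items():
        probabilities = [p for p, _ in by_assessor.values()]
        impacts = [i for _, i in by_assessor.values()]
        p_spread = spread(probabilities)
        i_spread = spread(impacts)
        report[risk] = {
            "probability_spread": p_spread,
            "impact_spread": i_spread,
            "discuss": p_spread > max_spread or i_spread > max_spread,
        }
    return report


for risk, result in divergence_report(scores).items():
    print(risk, result)
```

The report only flags where to talk; it deliberately does not average the scores, because the point of the exercise is the conversation about assumptions. Running it again after step five shows whether the agreed definitions narrowed the spread.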

In Risk Companion, you can run this kind of structured session with your whole team, scoring together in real time. The interactive risk sessions feature is designed for exactly this kind of workshop: everyone in the room, inputs captured directly, no notes to transcribe afterward. You leave with a calibrated team and a populated register rather than a page of disagreements to sort out later.

Fixing the framework so calibration holds over time

A calibration workshop solves the immediate problem but leaves the structural one untouched. If the framework underlying your risk matrix still has vague level definitions, the next person who joins the team will start scoring from their own assumptions, and the inconsistency creeps back in.

The fix is to treat your framework settings as a living document rather than a one-time setup task. That means:

Defining impact levels in concrete, organisation-specific terms. For financial impact, attach euro figures to each level, for schedule impact attach day ranges, and for reputational impact describe the type of exposure at each level, whether that is internal only, external with media coverage, or regulatory attention. 'Moderate reputational damage' is a placeholder rather than a definition, and placeholders produce inconsistent scores.

Defining probability levels with time horizons or frequencies. If level three means 'this has happened once in the last three years across similar projects', write that down. A percentage is useful, a frequency reference is better, and combining both gives assessors the clearest possible anchor.

Reviewing definitions when the context changes. A framework built for a €2 million construction project may not be appropriate for a €20 million one. The labels stay the same; what they anchor to should shift.
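One way to keep the labels stable while letting the anchors move with the context is to define each level relative to the project budget. The sketch below assumes that approach; the level labels and budget fractions are hypothetical and illustrative, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactLevel:
    level: int
    label: str
    budget_share_from: float  # lower bound as a fraction of project budget


# Hypothetical definitions; adjust the fractions to your own organisation.
FINANCIAL_LEVELS = [
    ImpactLevel(1, "very low", 0.0),
    ImpactLevel(2, "low", 0.005),
    ImpactLevel(3, "moderate", 0.02),
    ImpactLevel(4, "high", 0.05),
    ImpactLevel(5, "very high", 0.10),
]


def euro_thresholds(budget_eur: float) -> dict[str, int]:
    """Translate the budget-relative definitions into euro lower bounds."""
    return {
        lvl.label: round(lvl.budget_share_from * budget_eur)
        for lvl in FINANCIAL_LEVELS
    }


print(euro_thresholds(2_000_000))   # thresholds for a EUR 2 million project
print(euro_thresholds(20_000_000))  # same labels, ten times the anchors
```

Written this way, 'high' starts at €100,000 on the smaller project and at €1,000,000 on the larger one, so the definition is reviewed automatically whenever the budget changes. Fixed euro figures per project work just as well, as long as they are written down and revisited.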

Risk Companion's framework system is built on exactly this principle. You configure probability and impact levels to match your method, your thresholds, and your organisation's language, and the matrix, scoring, and reporting all follow from that configuration so what the framework says and what the tool shows are always in sync.

This matters because consistency is what makes a register trustworthy enough to drive decisions rather than just document them. A risk score only means something if it means the same thing to the person who created it as to the person reading it, and without that shared ground the register becomes a record of individual interpretations rather than a shared picture of actual risk.

When scoring disagreement is worth paying attention to

Not all scoring divergence points to a calibration failure. Sometimes it points to something more useful: a genuine difference in what two people know about the same risk.

When two experienced people look at the same risk and score it very differently, that occasionally means one of them knows something the other has not seen yet. The calibration conversation surfaces that information as much as it aligns definitions. The person who scored financial impact higher might have seen a similar failure mode on a previous project. The person who scored probability lower might know about controls the other person was not aware of.

Calibration works best when it is a genuine conversation instead of a process of pushing everyone toward the average. The goal is shared understanding, which is different from artificial uniformity. A team that has talked through why they disagree will produce better scores than one that has been drilled to use the same number.

The register should reflect what the team actually believes, with enough structure to make those beliefs comparable. That is the balance worth aiming for.

If you want to see what a calibrated, consistently scored register looks like in practice, Risk Companion's free 14-day trial lets you configure your own framework definitions, run a real calibration session with your team, and see how the risk matrix and scoring hold up once everyone is working from the same starting point.
