10 min read · overwatch / data / community

What Makes a Good Overwatch Reviewer? Patterns from 50,000 Verdicts

Analysis of CSWatch's reviewer corpus: which reviewers consistently match the consensus, which over-convict, and the heuristics that distinguish a careful judge from a witch-hunter.

By CSWatch Team

We have 50,000 closed verdicts in the CSWatch database. Each came with a reviewer identity, a vote, a confidence-modifier, and an outcome (whether the case was eventually convicted by 3+ guilty consensus). That gives us an unusual window into what separates a careful judge from someone who shouldn't be reviewing.

The accuracy-rate distribution

We score each reviewer on alignment-with-consensus: across every closed case they voted on, how often did their vote match the eventual verdict? The distribution is bimodal. The top 25% of reviewers run accuracy rates of 92-98%. The bottom 25% run 50-65%, which is barely better than coin flipping.
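
In code, the score is simple to state. Here's a minimal sketch of alignment-with-consensus, assuming each closed vote record carries a reviewer id, the vote cast, and the consensus verdict; the field names are illustrative, not CSWatch's actual schema.

    from collections import defaultdict

    def alignment_with_consensus(closed_votes):
        """Per-reviewer fraction of closed cases where the reviewer's vote
        matched the eventual consensus verdict.

        closed_votes: iterable of dicts like
            {"reviewer": "r123", "vote": "guilty", "verdict": "guilty"}
        (illustrative field names, not the real schema)
        """
        matched = defaultdict(int)
        total = defaultdict(int)
        for v in closed_votes:
            total[v["reviewer"]] += 1
            if v["vote"] == v["verdict"]:
                matched[v["reviewer"]] += 1
        return {r: matched[r] / total[r] for r in total}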

The middle 50% is the interesting band. Most active reviewers fall here — accuracy in the 75-90% range — and the difference between an 80%-accurate reviewer and a 95%-accurate reviewer is mostly behavioural, not skill-based.

What top reviewers do

They take their time

The top reviewers spend an average of 9.3 minutes per case. The bottom group spends 2.1 minutes. The single best predictor of verdict accuracy is how long the reviewer watched the demo before voting.

This sounds obvious, but it has a non-obvious implication: the reviewer pool divides cleanly between people watching demos and people skimming summary stats. We've added forced-watch checkpoints (random rounds where the reviewer must answer a content question to advance) to nudge the pool toward actually watching.
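
The checkpoint mechanic itself is simple. Here's a rough sketch of the round-selection part, purely illustrative; the real gating lives in the demo viewer and draws its questions from parsed demo events.

    import random

    def pick_checkpoint_rounds(total_rounds, n_checkpoints=2, seed=None):
        """Choose random rounds at which the reviewer must answer a content
        question before the demo will advance. Illustrative only."""
        rng = random.Random(seed)
        n = min(n_checkpoints, total_rounds)
        return sorted(rng.sample(range(1, total_rounds + 1), n))

    # e.g. gate two random rounds of a 24-round demo
    print(pick_checkpoint_rounds(24, seed=7))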

They use insufficient-evidence aggressively

Top reviewers vote "insufficient evidence" on 18% of cases. Bottom-tier reviewers vote "insufficient" on 4%. The willingness to admit uncertainty is itself a skill.

Cases that should be insufficient are typically ones where the demo doesn't contain enough rounds to differentiate a good player from a wallhacker. A 2-round clip of someone hitting tight angles is not a basis for conviction. Top reviewers recognise that and vote accordingly.

They calibrate to base rates

Most accounts under review are not cheating. We see this in the consensus distribution: roughly 38% of submitted cases result in conviction; 62% are eventually cleared. A reviewer voting guilty 80% of the time is poorly calibrated — they have a prior set higher than the true base rate.

Top reviewers vote guilty at roughly 35-45%, matching the true population rate. Bottom reviewers fall into two error patterns: over-convictors who vote guilty 70%+, and under-convictors who almost never vote guilty. Both are bad signals.
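
As a sketch, the calibration check is just a comparison of a reviewer's guilty-vote rate against the base rate above. The cut-offs below mirror the numbers in this post but are illustrative, not our production thresholds.

    def calibration_label(guilty_votes, total_votes):
        """Classify a reviewer's guilty-vote rate against the ~38% conviction
        base rate. Cut-offs are illustrative, not production values."""
        if total_votes == 0:
            return "no data"
        rate = guilty_votes / total_votes
        if rate >= 0.70:
            return "over-convictor"
        if rate <= 0.10:
            return "under-convictor"
        if 0.35 <= rate <= 0.45:
            return "well calibrated"
        return "drifting"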

They notice friction in their own confidence

We have an optional confidence-modifier on each vote (0-100 slider). Top reviewers use this slider extensively, lowering their confidence on edge cases. Their confidence-weighted accuracy is much higher than their raw accuracy because their low-confidence votes correctly correlate with cases that go to insufficient.

The bottom group either ignores the slider entirely (leaving it at 100) or uses it randomly. The information content of their confidence number is essentially zero.
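
A minimal sketch of what confidence-weighted accuracy means here, assuming each vote record carries the slider value; again, the structure is illustrative.

    def confidence_weighted_accuracy(votes):
        """Accuracy where each vote counts in proportion to the reviewer's
        self-reported confidence (the 0-100 slider).

        votes: iterable of dicts like
            {"vote": "guilty", "verdict": "guilty", "confidence": 85}
        (illustrative structure)
        """
        weight_sum = 0.0
        correct_sum = 0.0
        for v in votes:
            w = v["confidence"] / 100.0
            weight_sum += w
            if v["vote"] == v["verdict"]:
                correct_sum += w
        return correct_sum / weight_sum if weight_sum else 0.0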

What bad reviewers look like

Three failure patterns dominate.

Witch-hunters

High guilty-vote rate, low time-per-case, low insufficient-evidence rate. They're mostly reviewing to convict. Witch-hunters are the most damaging failure mode because their false positives stick: when 2 witch-hunters and 1 careful reviewer all vote guilty, the case converts to a conviction even if the careful reviewer expressed low confidence.

Our defence: we down-weight the votes of reviewers whose accuracy drops below 70% for 30+ days, and we eject reviewers whose accuracy drops below 60% for any sustained period. About 4% of the active pool ends up rotated out per quarter.
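
In rough pseudocode, the rule looks like this; treating "sustained" as a 30-day window for the ejection tier is a simplification for illustration, not the exact production logic.

    def reviewer_status(accuracy, days_below_70, days_below_60):
        """Probation/ejection rule sketch: down-weight reviewers below 70%
        accuracy for 30+ days, eject on a sustained drop below 60%.
        The 30-day window on the ejection tier is an illustrative choice."""
        if accuracy < 0.60 and days_below_60 >= 30:
            return "ejected"
        if accuracy < 0.70 and days_below_70 >= 30:
            return "down-weighted"
        return "active"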

Lazy clears

High not-guilty rate, low time-per-case. They're voting clear by default. Low risk for false positives but high risk for letting actual cheaters back into the pool. Less harmful than witch-hunters but corrosive to community trust.

Pattern-blind reviewers

These reviewers' accuracy is decent on clear-cut cases but drops sharply on edge cases. They're missing nuance. Often they're newer reviewers who haven't built up the demo-pattern library yet. Coaching helps; experience helps.

What we've changed based on this data

Three concrete changes followed from analysing the reviewer corpus:

  1. Reviewer accuracy now visible on profile. Reviewers see their own accuracy rate, time-per-case, and confidence-calibration score. Visibility alone improved median accuracy by 3 percentage points within a quarter.
  2. Vote-weight adjusted by accuracy. Reviewers in the top tier have their votes weighted at 1.4x. Bottom-tier reviewers (still active but on probation) have their votes weighted at 0.6x. Convictions now require 3.0 weighted guilty votes rather than a flat count of 3 (a weighted-tally sketch follows this list).
  3. Slow-mode for new reviewers. Anyone with under 50 closed cases is rate-limited to 5 cases per day, with a forced 30-second minimum per case. This alone cut new-reviewer error rates by 40%.
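
To make the weighted-threshold idea in change 2 concrete, here is a minimal sketch of the tally; it illustrates the arithmetic, not the production vote-counting code.

    def conviction_reached(votes, weights, threshold=3.0):
        """Sum weighted guilty votes and compare against the conviction
        threshold. weights maps reviewer id -> vote weight (1.4 top tier,
        1.0 default, 0.6 probation). Illustrative only."""
        guilty_weight = sum(weights.get(reviewer, 1.0)
                            for reviewer, vote in votes
                            if vote == "guilty")
        return guilty_weight >= threshold

    # Two top-tier reviewers plus one probationary reviewer:
    # 1.4 + 1.4 + 0.6 = 3.4 >= 3.0, so the case converts.
    weights = {"a": 1.4, "b": 1.4, "c": 0.6}
    votes = [("a", "guilty"), ("b", "guilty"), ("c", "guilty")]
    print(conviction_reached(votes, weights))  # True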

The broader point

Community moderation is not a free resource. It's a system that needs governance as deliberate as the technical anti-cheat layers. CSWatch's reviewer pool is treated as a calibrated instrument: measured, given feedback, and continuously improved. Without that, you don't get crowd wisdom; you get a Twitter mob.

If you're a CSWatch reviewer reading this and wondering where you fall in the distribution: check your dashboard. The numbers are real, the feedback is real, and most reviewers improve substantially after seeing them. The system rewards careful judgment because that's what it was designed to do.

Spotted a cheater you want investigated?

Submit a report with a demo. Community Overwatch reviewers will judge it and the result becomes part of the public record.
