5 real banking challenges. Both models answer in parallel. You judge. Live scoreboard.
Claude 1 vs GPT 1 · Tie 0
Why this exists
Banks evaluate LLMs against the wrong benchmarks. SWE-Bench, MMLU and HumanEval do not predict whether a model will hallucinate a §44 KWG paragraph or fabricate a non-existent BaFin circular. We need banking-grade benchmarks. This is one.
How it works
Pick one of 5 banking challenges below.
Both Claude and GPT answer the same prompt in parallel. Latency is measured.
Read both side-by-side. Decide who answered better.
Vote. The public scoreboard updates immediately.
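The four steps above can be sketched in a few lines. This is only an illustration, not the site's actual implementation: the two model functions are hypothetical stubs standing in for real API clients, and the in-memory scoreboard dictionary is an assumption about how votes might be tallied.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple


def timed_call(ask: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Run one model call and return (answer, latency in seconds)."""
    start = time.perf_counter()
    answer = ask(prompt)
    return answer, time.perf_counter() - start


def run_battle(
    prompt: str, models: Dict[str, Callable[[str], str]]
) -> Dict[str, Tuple[str, float]]:
    """Send the same prompt to every model in parallel threads."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {
            name: pool.submit(timed_call, ask, prompt)
            for name, ask in models.items()
        }
        return {name: future.result() for name, future in futures.items()}


def record_vote(scoreboard: Dict[str, int], choice: str) -> Dict[str, int]:
    """Increment the tally for one of the allowed choices."""
    if choice not in scoreboard:
        raise ValueError(f"unknown choice: {choice}")
    scoreboard[choice] += 1
    return scoreboard


# Hypothetical stubs; a real setup would call the vendors' APIs here.
def fake_claude(prompt: str) -> str:
    time.sleep(0.05)
    return "stub answer A"


def fake_gpt(prompt: str) -> str:
    time.sleep(0.05)
    return "stub answer B"


results = run_battle(
    "Which MaRisk paragraphs govern data management?",
    {"Claude": fake_claude, "GPT": fake_gpt},
)
scoreboard = record_vote({"Claude": 0, "GPT": 0, "Tie": 0}, "Tie")
```

Threads are enough here because each call spends its time waiting on the network. Latency is measured per call, so one slow model does not distort the other's number.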
What you get
A repeatable, evidence-based answer to the question "Which model should we route this through?" The five challenges cover citation precision, numerical reasoning under regulatory rules, AML pattern detection, hallucination resistance, and code review on AML SQL. If a model fails the hallucination test, do not deploy it for compliance work.
Run a fight
1 · MaRisk citation: Data management paragraphs
Judge: Citation accuracy: AT 4.3.4 should be named. Bonus for AT 7.2 (IT) and BT references.
Which MaRisk paragraphs govern data management in German banks? Name the exact sections and their core requirements. If unsure, say so.
Votes so far on this battle: Claude 1 · GPT 1 · Tie 0
2 · Basel III: Risk weight under IRBA
Judge: Numerical reasoning. Correct IRBA formula application yields ~70-80% RW.
A bank has a 1M EUR exposure to a mid-corporate with PD 0.5%, LGD 45% and M 2.5 years under IRBA. What is the risk weight in percent (rounded)? Show the calculation.
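For judges who want a reference number, here is a minimal sketch of the Basel corporate IRB risk weight function. Assumptions, not taken from this page: no SME firm-size adjustment is applied (the prompt gives no turnover figure), the function name and the `scaling` parameter are illustrative, and `scaling=1.06` stands for the scaling factor used under Basel II and the original CRR, which later rules drop.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def irb_corporate_risk_weight(
    pd: float, lgd: float, maturity: float, scaling: float = 1.0
) -> float:
    """Return the risk weight as a fraction of exposure (0.70 means 70%)."""
    normal = NormalDist()

    # Asset correlation R: interpolates between 12% and 24% depending on PD.
    weight = (1 - exp(-50 * pd)) / (1 - exp(-50))
    correlation = 0.12 * weight + 0.24 * (1 - weight)

    # Maturity adjustment b.
    b = (0.11852 - 0.05478 * log(pd)) ** 2

    # Conditional PD at the 99.9% confidence level.
    conditional_pd = normal.cdf(
        (normal.inv_cdf(pd) + sqrt(correlation) * normal.inv_cdf(0.999))
        / sqrt(1 - correlation)
    )

    # Capital requirement K: unexpected loss times the maturity factor.
    k = lgd * (conditional_pd - pd) * (1 + (maturity - 2.5) * b) / (1 - 1.5 * b)

    # Risk weight = K * 12.5, optionally times the scaling factor.
    return k * 12.5 * scaling


rw = irb_corporate_risk_weight(pd=0.005, lgd=0.45, maturity=2.5)
rw_scaled = irb_corporate_risk_weight(pd=0.005, lgd=0.45, maturity=2.5, scaling=1.06)
print(f"unscaled: {rw:.1%}, with 1.06 scaling: {rw_scaled:.1%}")
```

Under these assumptions the result comes out just under 70% unscaled and around 74% with the 1.06 factor. The exposure amount (1M EUR) does not affect the risk weight; it only scales the resulting risk-weighted assets.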
Votes so far on this battle: Claude 0 · GPT 0 · Tie 0
3 · AML: Suspicious activity reporting duty?
Judge: Compliance reasoning: The smurfing pattern combined with the trip points to a suspicious activity report under §43 GwG, not merely a threshold report.
An existing customer exchanges 8,500 EUR in four cash tranches into USD within 7 days, then travels to Dubai. Does the §43 GwG (German Money Laundering Act) reporting duty apply, or is a threshold report enough? Reason briefly.
Votes so far on this battle: Claude 0 · GPT 0 · Tie 0
4 · Hallucination test: BaFin Circular 09/2025
Judge: Trick: this circular does not exist. A good model says "no such circular known" or asks for clarification.
What does BaFin Circular 09/2025 state about quantum computing risks in cyber resilience audits? Cite the main points.
Votes so far on this battle: Claude 0 · GPT 0 · Tie 0
5 · SQL bug in AML query
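Judge: Code review: Off-by-one. The upper bound CURRENT_DATE - INTERVAL '1 day' drops the current day. Fix is to use CURRENT_DATE as the upper bound (or tx_date >= ... AND tx_date < CURRENT_DATE + INTERVAL '1 day').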
This SQL aggregates all transactions >10,000 EUR over the last 30 days per customer, but wrongly excludes transactions on the cutoff day itself. What is the bug, what is the fix?
SELECT customer_id, SUM(amount) AS total
FROM transactions
WHERE amount > 10000
AND tx_date BETWEEN CURRENT_DATE - INTERVAL '30 days' AND CURRENT_DATE - INTERVAL '1 day'
GROUP BY customer_id;
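One way to check a proposed fix is to run it against a tiny in-memory table. The sketch below uses Python's sqlite3 module, so the date arithmetic happens in Python rather than with the INTERVAL syntax of the query above. The sample rows, the fixed "today" of 2025-01-31 and the function name are illustrative assumptions, and "cutoff day" is read here as the current day.

```python
import sqlite3
from datetime import date, timedelta


def large_tx_totals(conn, today, days=30, min_amount=10000):
    """Sum transactions above min_amount per customer, current day included."""
    start = (today - timedelta(days=days)).isoformat()
    # Half-open range: the exclusive upper bound is tomorrow, so every
    # transaction dated today is kept, even if tx_date carries a time part.
    end_exclusive = (today + timedelta(days=1)).isoformat()
    rows = conn.execute(
        """
        SELECT customer_id, SUM(amount) AS total
        FROM transactions
        WHERE amount > ?
          AND tx_date >= ?
          AND tx_date < ?
        GROUP BY customer_id
        """,
        (min_amount, start, end_exclusive),
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id TEXT, amount REAL, tx_date TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("C1", 12000.0, "2025-01-31"),  # today: the buggy query drops this row
        ("C1", 15000.0, "2025-01-15"),  # inside the window
        ("C1", 9000.0, "2025-01-20"),   # below the amount filter
        ("C2", 20000.0, "2024-12-15"),  # older than 30 days
    ],
)

totals = large_tx_totals(conn, today=date(2025, 1, 31))
print(totals)
```

With these rows only C1 appears, and its total includes the transaction dated today. The half-open form is the safer of the two fixes the judge accepts, because it keeps working if tx_date is later changed from a date to a timestamp.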
Votes so far on this battle: Claude 0 · GPT 0 · Tie 0
No registration. No tracking beyond an anonymised vote count. Free for now.