Knowledge Hub · Battle Arena

Claude vs GPT.
Banking compliance, live.

5 real banking challenges. Both models answer in parallel. You judge. Live scoreboard.

Claude 1 vs GPT 1 · Tie 0

Why this exists

Banks evaluate LLMs against the wrong benchmarks. SWE-Bench, MMLU and HumanEval do not predict whether a model will hallucinate a paragraph of §44 KWG or cite a BaFin circular that does not exist. We need banking-grade benchmarks. This is one.

How it works

  1. Pick one of 5 banking challenges below.
  2. Both Claude and GPT answer the same prompt in parallel. Latency is measured.
  3. Read both side-by-side. Decide who answered better.
  4. Vote. The public scoreboard updates immediately.
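
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the arena's actual implementation: `run_fight`, `timed_call`, `record_vote` and the two stub functions are hypothetical names, and the stubs stand in for real API clients.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

ModelFn = Callable[[str], str]


def timed_call(ask: ModelFn, prompt: str) -> Tuple[str, float]:
    """Call one model and return (answer, latency in seconds)."""
    start = time.perf_counter()
    answer = ask(prompt)
    return answer, time.perf_counter() - start


def run_fight(prompt: str, models: Dict[str, ModelFn]) -> Dict[str, Tuple[str, float]]:
    """Send the same prompt to every model in parallel threads."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {name: pool.submit(timed_call, ask, prompt)
                   for name, ask in models.items()}
        return {name: fut.result() for name, fut in futures.items()}


def record_vote(scoreboard: Dict[str, int], winner: str) -> Dict[str, int]:
    """Increment the tally for one of the scoreboard's existing keys."""
    if winner not in scoreboard:
        raise ValueError(f"unknown vote: {winner!r}")
    scoreboard[winner] += 1
    return scoreboard


# Hypothetical stand-ins for the real model clients (assumption).
def fake_claude(prompt: str) -> str:
    return "Claude stub answer to: " + prompt


def fake_gpt(prompt: str) -> str:
    return "GPT stub answer to: " + prompt


if __name__ == "__main__":
    results = run_fight("Which MaRisk paragraphs govern data management?",
                        {"Claude": fake_claude, "GPT": fake_gpt})
    for name, (answer, latency) in results.items():
        print(f"{name} ({latency:.4f}s): {answer}")
    print(record_vote({"Claude": 0, "GPT": 0, "Tie": 0}, "Tie"))
```

Threads are used here because calls to a hosted model are I/O-bound; each latency is measured around the individual call, so one slow model does not distort the other's timing.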

What you get

A repeatable, evidence-based answer to "which model should we route this through?" The five challenges cover citation precision, numerical reasoning under regulatory rules, AML pattern detection, hallucination resistance, and code review on AML SQL. If a model fails the hallucination test, do not deploy it for compliance work.

Run a fight

1 · MaRisk citation: Data management paragraphs

Judge: Citation accuracy: AT 4.3.4 should be named. Bonus for AT 7.2 (IT) and BT references.
Which MaRisk paragraphs govern data management in German banks? Name the exact sections and their core requirements. If unsure, say so.

2 · Basel III: Risk weight under IRBA

Judge: Numerical reasoning: correct application of the IRBA formula yields a risk weight of roughly 70-80%.
A bank has a 1M EUR exposure to a mid-corporate with PD 0.5%, LGD 45% and M 2.5 years under IRBA. What is the risk weight in percent (rounded)? Show the calculation.
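
For judges who want to check an answer, the calculation can be reproduced with the Basel corporate IRB risk-weight function. The sketch below makes two assumptions: no SME firm-size adjustment is applied (the prompt gives no turnover figure), and the 1.06 scaling factor is left as an optional parameter because its use depends on the rule set applied. The function name `irb_corporate_risk_weight` is illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def irb_corporate_risk_weight(pd: float, lgd: float, maturity: float,
                              scaling: float = 1.0) -> float:
    """Risk weight in percent for a corporate exposure under the IRB formula.

    pd and lgd are fractions (0.005 = 0.5%), maturity is in years.
    Assumption: no SME firm-size adjustment to the correlation.
    """
    n = NormalDist()  # standard normal distribution

    # Asset correlation R, interpolated between 12% and 24% depending on PD.
    w = (1 - exp(-50 * pd)) / (1 - exp(-50))
    r = 0.12 * w + 0.24 * (1 - w)

    # Maturity adjustment coefficient b.
    b = (0.11852 - 0.05478 * log(pd)) ** 2

    # Conditional PD at the 99.9% confidence level.
    cond_pd = n.cdf((n.inv_cdf(pd) + sqrt(r) * n.inv_cdf(0.999)) / sqrt(1 - r))

    # Capital requirement K, including the maturity adjustment.
    k = lgd * (cond_pd - pd) * (1 + (maturity - 2.5) * b) / (1 - 1.5 * b)

    return k * 12.5 * scaling * 100


if __name__ == "__main__":
    rw = irb_corporate_risk_weight(pd=0.005, lgd=0.45, maturity=2.5)
    rw_scaled = irb_corporate_risk_weight(pd=0.005, lgd=0.45, maturity=2.5,
                                          scaling=1.06)
    print(f"RW without scaling factor: {rw:.1f}%")
    print(f"RW with 1.06 scaling factor: {rw_scaled:.1f}%")
```

With these inputs the result is roughly 70% without the scaling factor and roughly 74% with it. Because M is exactly 2.5 years, the `(maturity - 2.5)` term vanishes and only the `1 / (1 - 1.5 * b)` part of the maturity adjustment remains. Multiplying the risk weight by the 1M EUR exposure gives the risk-weighted assets.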

3 · AML: Suspicious activity reporting duty?

Judge: Compliance reasoning: the smurfing pattern plus the travel makes this suspicious activity under §43 GwG, not just a threshold case.
An existing customer exchanges 8,500 EUR in four cash tranches into USD within 7 days, then travels to Dubai. Does the §43 GwG reporting duty apply, or is a threshold report enough? Reason briefly.

4 · Hallucination test: BaFin Circular 09/2025

Judge: Trick: this circular does not exist. A good model says "no such circular known" or asks for clarification.
What does BaFin Circular 09/2025 state about quantum computing risks in cyber resilience audits? Cite the main points.

5 · SQL bug in AML query

Judge: Code review: off-by-one. The fix is to use CURRENT_DATE as the upper bound (or tx_date >= ... AND tx_date < CURRENT_DATE + 1).
This SQL aggregates all transactions >10,000 EUR over the last 30 days per customer, but wrongly excludes transactions on the cutoff day itself. What is the bug, and what is the fix?

SELECT customer_id, SUM(amount) AS total
FROM transactions
WHERE amount > 10000
  AND tx_date BETWEEN CURRENT_DATE - INTERVAL '30 days' AND CURRENT_DATE - INTERVAL '1 day'
GROUP BY customer_id;
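
To see the off-by-one in action, the query can be run against a tiny in-memory table. The sketch below is an assumption-laden reproduction, not the original environment: it uses SQLite, whose `date(:today, '-30 days')` syntax stands in for `CURRENT_DATE - INTERVAL '30 days'`, passes a fixed `:today` so the result is reproducible, and uses made-up sample rows. The helper name `totals` is hypothetical.

```python
import sqlite3

# Same logic as the challenge query: the upper bound stops at yesterday.
BUGGY = """
SELECT customer_id, SUM(amount) AS total
FROM transactions
WHERE amount > 10000
  AND tx_date BETWEEN date(:today, '-30 days') AND date(:today, '-1 day')
GROUP BY customer_id
"""

# Half-open interval: everything before tomorrow, so today is included.
FIXED = """
SELECT customer_id, SUM(amount) AS total
FROM transactions
WHERE amount > 10000
  AND tx_date >= date(:today, '-30 days')
  AND tx_date < date(:today, '+1 day')
GROUP BY customer_id
"""


def totals(query: str, today: str, rows: list) -> dict:
    """Run a query against a throwaway in-memory table and return {customer_id: total}."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions (customer_id INTEGER, amount INTEGER, tx_date TEXT)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)
    result = dict(conn.execute(query, {"today": today}).fetchall())
    conn.close()
    return result


# Illustrative sample data (assumption): ISO dates stored as text.
ROWS = [
    (1, 12000, "2025-03-30"),  # yesterday
    (1, 15000, "2025-03-31"),  # today
    (2, 20000, "2025-03-31"),  # today only
    (3, 50000, "2025-02-01"),  # older than 30 days
]

if __name__ == "__main__":
    print("buggy:", totals(BUGGY, "2025-03-31", ROWS))
    print("fixed:", totals(FIXED, "2025-03-31", ROWS))
```

The buggy query misses customer 2 entirely and under-reports customer 1, because every transaction dated today falls outside the range. The half-open form also stays correct if `tx_date` is a timestamp rather than a date, which a `BETWEEN ... AND CURRENT_DATE` upper bound would not.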

No registration. No tracking beyond an anonymised vote count. Free for now.