
Model: Claude Sonnet 4.6 (chosen because it is the default model I use for OpenClaw)
Tester: Tom Pounders
Date: March 23, 2026
Total problems: 41
Overall accuracy: 20/41 = 49%


Executive Summary

This test evaluated whether a large language model (LLM) can reliably perform arithmetic by inference — without code execution or a calculator. The results reveal two distinct failure modes:

  1. Large numbers (3+ digits): Accuracy collapses even on 2-3 step problems. The model can approximate order of magnitude but cannot reliably compute exact values.

  2. Many steps (4+ operands), even with small numbers: Errors compound multiplicatively through the chain. A model that correctly computes 8 × 6 × 5 = 240 will fail on 23 × 7 × 35 × 8 × 7, even though every operand is ≤2 digits.

The most operationally dangerous finding: wrong answers arrive with the same apparent confidence as correct ones. There is no internal signal to distinguish a reliable result from a plausible-sounding error. This means an LLM cannot self-audit its own arithmetic.

Practical implication: LLMs must never be trusted to compute arithmetic by inference for any purpose where correctness matters. Code execution (Python, calculator) is mandatory.
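In practice the rule is simple: any number that appears in prose is first produced by code. A minimal sketch, using the second Round 1 problem from the appendix:

```python
# Compute first, write second: the reported number comes from execution,
# never from model inference. Operands are from Round 1 of this test.
subtotal = 18_365 * 92_568     # exact integer arithmetic
print(f"Total: {subtotal:,}")  # Total: 1,700,011,320
```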


Key Findings

Finding 1: Number Size vs. Step Count

The initial hypothesis — that LLMs fail only on large numbers — is partially correct but incomplete.

| Condition | Observed Accuracy |
|---|---|
| Single-digit operands, ≤5 steps | ~85-100% |
| 2-digit operands, ≤3 steps | ~75% |
| 2-digit operands, 4-6 steps | ~50% |
| 3-digit operands, any steps | ~25% |
| 4-5 digit operands, any steps | ~0% |

Step count is an independent failure axis from number size. Both degrade accuracy; together they make inference arithmetic essentially unreliable.
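The two axes can be exercised independently. A hypothetical generator (illustrative only; `make_problem` is not part of the actual test harness) that varies operand size and step count as separate knobs:

```python
import math
import random

# Hypothetical generator showing the two failure axes as independent
# knobs: digits per operand (number size) and operand count (step count).
def make_problem(n_digits: int, n_steps: int, seed: int = 0):
    rng = random.Random(seed)
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    operands = [rng.randint(lo, hi) for _ in range(n_steps)]
    # Return the problem text and the exact answer computed by code.
    return " × ".join(f"{x:,}" for x in operands), math.prod(operands)

expr, answer = make_problem(n_digits=2, n_steps=4, seed=1)
print(expr, "=", answer)
```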

Finding 2: Errors Compound Multiplicatively

Each intermediate multiplication step can introduce a small rounding or carry error. In a 2-step chain, a 0.1% error in step 1 carries through as a 0.1% error in the result. In a 6-step chain, the relative errors compound: a 1% error per step yields (1.01)⁶ ≈ 1.06, a ~6% cumulative error, and in practice the observed errors were larger and irregular.
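The compounding claim is easy to check numerically: a relative error of p per step grows as (1 + p)ⁿ − 1 over n multiplicative steps.

```python
# Cumulative relative error after n multiplicative steps,
# each step carrying a `per_step` relative error.
per_step = 0.01  # 1% per step
for n in (1, 2, 6):
    cumulative = (1 + per_step) ** n - 1
    print(f"{n} steps: {cumulative:.2%}")  # 6 steps -> 6.15%
```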

This was demonstrated clearly: the chain 23 × 7 × 35 × 8 × 7 × 9 produced an answer off by a factor of 10 (28,282,200 vs. the actual 2,840,040) — not a small rounding error, but a completely wrong magnitude caused by a misplaced digit mid-chain.
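Two lines of Python expose the failure:

```python
import math

inferred = 28_282_200                     # the model's inference answer
actual = math.prod([23, 7, 35, 8, 7, 9])  # exact chain product
print(actual)                    # 2840040
print(round(inferred / actual))  # 10 -- an order of magnitude off
```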

Finding 3: No Reliable Self-Awareness of Error

Across all rounds, the model expressed similar confidence in wrong answers and correct answers. It did not hedge more on 6-operand chains than on 2-operand chains. It did not flag intermediate uncertainty. This is the critical failure: the model does not know when it is wrong.

This is structurally different from human arithmetic errors. A human doing mental math on a 6-step chain knows they might have made a mistake and will often double-check. The LLM presents its result as complete and final regardless of reliability.

Finding 4: Division Is Relatively Stable at Small Scales

Problems involving division followed by a single multiplication (e.g., (546 / 3) × 165) were among the most consistently correct, especially when the divisor was small and clean (÷3, ÷7). This likely reflects these patterns appearing frequently in training data (fractions, percentages, ratios).
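When the divisor is not clean, exact answers are still cheap in code: Python's `fractions.Fraction` keeps rational results exact instead of accumulating float noise. A sketch using two problems from the appendix:

```python
from fractions import Fraction

# Exact rational arithmetic for the division-then-multiply pattern.
print(Fraction(546, 3) * 165)     # 30030 (the Round 4 problem)
print(Fraction(89, 7) * 54 * 23)  # 110538/7, exact (~15791.14, the Round 5 problem)
```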

Finding 5: The “Close Enough” Trap

In early rounds, the model scored its own performance generously, calling results “very close” and awarding checkmarks for approximate answers. Applying a strict pass/fail rubric — correct or wrong, no partial credit — revealed the true 49% accuracy rate. In financial, scientific, or engineering contexts, “close” is not passing. The model’s self-assessment was systematically optimistic.
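The strict rubric is one line of code; a hypothetical `grade` helper makes the no-partial-credit policy explicit:

```python
def grade(inferred, actual) -> bool:
    """Strict pass/fail rubric: exact equality only, no 'very close' credit."""
    return inferred == actual

# A "very close" Round 5 answer still fails under the strict rubric:
print(grade(4_026_128, 4_025_168))  # False -- off by only ~0.02%, still a fail
print(grade(30_030, 30_030))        # True
```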


Operational Rules (Derived from Test Results)

  1. Never compute arithmetic by inference. Use exec + Python for all calculations.
  2. No exceptions for “simple” problems. The failure mode appears at 2-digit numbers with 4+ steps — a threshold easily crossed in real work.
  3. Compute first, write second. Never report a number that wasn’t produced by code execution.
  4. Do not self-score as “close.” A wrong answer is a wrong answer regardless of magnitude of error.

These rules have been recorded in MEMORY.md, TOOLS.md, and AGENTS.md for persistent enforcement.
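Rule 1 can be enforced mechanically. A minimal sketch (illustrative only, not a hardened sandbox; `calc` is a hypothetical helper) that parses an arithmetic expression and executes it instead of letting the model guess:

```python
import ast
import operator

# Map AST operator nodes to real arithmetic functions.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str):
    """Evaluate +, -, *, / expressions by parsing and executing, not guessing."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("only numbers and +, -, *, / are allowed")
    return walk(ast.parse(expr, mode="eval"))

print(calc("23 * 7 * 35 * 8 * 7 * 9"))  # 2840040
```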


Appendix: Full Test Results

Round 1 — Numbers up to 65,535, 1-2 steps

Score: 3/3

| Problem | Inference Answer | Actual |
|---|---|---|
| 100 + 10,000 + 65,535 | 75,635 | 75,635 |
| 18,365 × 92,568 | 1,700,011,320 | 1,700,011,320 |
| 98,765 ÷ 247 | ≈399.86 | 399.858… |

Note: This round used addition and single multiplication — lower complexity than subsequent rounds.


Round 2 — 5-digit numbers, 2-6 steps

Score: 0/4

| Problem | Inference Answer | Actual |
|---|---|---|
| 89,153 × 68,966 × 15,326 | ~94,178,000,000,000 | 94,232,306,380,148 |
| (89,653 × 15,691) × 62,168 | ~87,500,000,000,000 | 87,454,537,023,464 |
| (15,463 / 3) × 1,654 | ~8,521,000 | 8,525,267.33 |
| 1,655 × 1,316 × 6,546 × 41,216 × 6,515 × 1,651 | ~2.4 × 10²¹ | 6,320,584,226,736,537,139,200 |

Round 3 — 4-digit numbers, 2-6 steps

Score: 0/4

| Problem | Inference Answer | Actual |
|---|---|---|
| 8,953 × 8,966 × 5,326 | ~427,800,000,000 | 427,531,856,948 |
| (9,653 × 1,569) × 6,268 | ~94,950,000,000 | 94,932,351,276 |
| (5,463 / 3) × 1,654 | ~3,010,000 | 3,011,934 |
| 655 × 316 × 546 × 1,216 × 515 × 651 | ~5.8 × 10¹⁶ | 46,072,610,239,219,200 |

Round 4 — 3-digit numbers, 2-6 steps

Score: 1/4

| Problem | Inference Answer | Actual |
|---|---|---|
| 893 × 966 × 326 | 281,481,588 | 281,219,988 |
| (653 × 156) × 628 | 63,933,264 | 63,973,104 |
| (546 / 3) × 165 | 30,030 | 30,030 |
| 55 × 16 × 54 × 216 × 15 × 51 | 330,301,440 | 7,852,204,800 |

Round 5 — 2-3 digit numbers, 2-6 steps

Score: 1/4

| Problem | Inference Answer | Actual |
|---|---|---|
| 83 × 866 × 56 | 4,026,128 | 4,025,168 |
| (53 × 7) × 626 | 232,414 | 232,246 |
| 54 × (89/7) × 23 | 15,822 | 15,791.14 |
| 65 × 36 × 46 × 26 × 55 × 61 | 977,042,400 | 9,389,437,200 |

Round 6 — 1-digit numbers, 2-6 steps

Score: 3/4

| Problem | Inference Answer | Actual |
|---|---|---|
| 8 × 6 × 5 | 240 | 240 |
| 5 × 7 × 2 | 70 | 70 |
| 4 × (8/7) × 3 | 13.714… | 13.7143 |
| 6 × 3 × 6 × 26 × 5 × 6 | 100,440 | 84,240 |

Note: The single failure occurred on the fourth problem, which introduced 26 (a 2-digit operand) into an otherwise single-digit chain.


Round 7 — 1-digit only, 3-5 steps

Score: 4/4

| Problem | Inference Answer | Actual |
|---|---|---|
| 8 × 6 × 5 | 240 | 240 |
| 5 × 7 × 2 × 4 | 280 | 280 |
| 4 × 8 × 33 × 9 | 9,504 | 9,504 |
| 6 × 7 × 6 × 6 × 5 | 7,560 | 7,560 |

Round 8 — 1-2 digit mixed, 4-6 steps

Score: 4/4

| Problem | Inference Answer | Actual |
|---|---|---|
| 4 × 46 × 33 × 9 | 54,648 | 54,648 |
| 9 × 8 × 3 × 8 × 3 | 5,184 | 5,184 |
| 5 × 6 × 3 × 19 × 8 | 13,680 | 13,680 |
| 6 × 7 × 6 × 6 × 5 × 7 | 52,920 | 52,920 |

Round 9 — 1-2 digit, larger 2-digit values, 6 steps

Score: 2/4

| Problem | Inference Answer | Actual |
|---|---|---|
| 8 × 5 × 3 × 8 × 7 × 2 | 13,440 | 13,440 |
| 7 × 4 × 36 × 5 × 2 × 9 | 90,720 | 90,720 |
| 23 × 7 × 35 × 8 × 7 × 9 | 28,282,200 | 2,840,040 |
| 58 × 65 × 23 × 80 × 57 × 32 | 12,643,430,400 | 12,652,723,200 |

Round 10 — 1-2 digit, 5-7 steps

Score: 2/4

| Problem | Inference Answer | Actual |
|---|---|---|
| 8 × 5 × 3 × 8 × 7 × 2 × 3 | 40,320 | 40,320 |
| 9 × 7 × 4 × 36 × 5 × 2 × 9 | 816,480 | 816,480 |
| 23 × 7 × 35 × 8 × 7 | 314,440 | 315,560 |
| 58 × 65 × 23 × 80 × 32 | 221,593,600 | 221,977,600 |

Overall Summary Table

| Round | Conditions | Score |
|---|---|---|
| 1 | Up to 65,535 / 1-2 steps | 3/3 |
| 2 | 5-digit / 2-6 steps | 0/4 |
| 3 | 4-digit / 2-6 steps | 0/4 |
| 4 | 3-digit / 2-6 steps | 1/4 |
| 5 | 2-3 digit / 2-6 steps | 1/4 |
| 6 | 1-digit (with one 2-digit) / 2-6 steps | 3/4 |
| 7 | 1-digit only / 3-5 steps | 4/4 |
| 8 | 1-2 digit mixed / 4-6 steps | 4/4 |
| 9 | 1-2 digit, larger values / 6 steps | 2/4 |
| 10 | 1-2 digit / 5-7 steps | 2/4 |
| Total | | 20/41 = 49% |

All inference answers provided without code execution; calculator answers verified via Python.
