LRMs can be highly accurate yet still give plausible-sounding rationales that do not match their true decision process. Separately from accuracy, we formalize reasoning faithfulness with two testable criteria: stance consistency (the reasoning and the answer align) and causal influence (the stated reasoning actually drives the answer under output-level interventions). We introduce RFEval (7,186 instances across 7 tasks) to measure both criteria via controlled counterfactual interventions, and find that 49.7% of outputs are unfaithful, mostly due to stance inconsistency. Unfaithfulness is concentrated in brittle domains such as math and code, and relates more to post-training (e.g., RL-style objectives) than to model scale: adding RL on top of SFT can reduce faithfulness even when accuracy holds. Overall, accuracy is a weak, unreliable proxy for faithfulness, motivating auditing and optimization for both correctness and reasoning integrity.
RFEval formalizes reasoning faithfulness as a property distinct from answer accuracy, and operationalizes it with two testable conditions: (1) stance consistency, requiring that the stance argued in the generated reasoning match the final answer, and (2) causal influence, requiring that the stated reasoning actually drive the answer, as tested by output-level counterfactual interventions on the reasoning.
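As a concrete illustration, the two conditions can be read as boolean predicates over a model output. The sketch below assumes two hypothetical helpers, `classify_stance` (an automated stance judge) and `answer_with_reasoning` (re-querying the model conditioned on an edited trace); these are stand-ins for illustration, not RFEval's released implementation.

```python
# Minimal sketch of the two RFEval faithfulness conditions.
# `classify_stance` and `answer_with_reasoning` are hypothetical
# placeholders, stubbed so the file is runnable.

def classify_stance(reasoning: str) -> str:
    """Placeholder: map a reasoning trace to the answer it argues for."""
    return "A"  # stub

def answer_with_reasoning(question: str, reasoning: str) -> str:
    """Placeholder: re-query the LRM conditioned on a (possibly edited) trace."""
    return "A"  # stub

def stance_consistent(reasoning: str, answer: str) -> bool:
    # Condition 1: the stance argued in the reasoning matches the final answer.
    return classify_stance(reasoning) == answer

def causally_influential(question: str, counterfactual: str, answer: str) -> bool:
    # Condition 2: an output-level counterfactual edit of the stated
    # reasoning should change the answer; if the original answer survives
    # a contradictory trace, the reasoning was not load-bearing.
    return answer_with_reasoning(question, counterfactual) != answer

def faithful(question: str, reasoning: str, counterfactual: str, answer: str) -> bool:
    # An output counts as faithful only if both conditions hold.
    return (stance_consistent(reasoning, answer)
            and causally_influential(question, counterfactual, answer))
```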
To measure these conditions at scale, we curate a 7,186-instance benchmark across seven tasks and evaluate 12 open-source LRMs using controlled counterfactual edits, validated with both automated checks and human review.
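Given per-instance judgments, the RF rates reported in the table below are simply fractions of faithful outputs. A minimal aggregation sketch, reusing the `faithful` predicate above and assuming each instance is a dict with illustrative keys (`question`, `reasoning`, `counterfactual`, `answer`, `task`; not RFEval's released schema), might look like:

```python
from collections import defaultdict

def rf_rates(instances):
    """Fraction of faithful outputs per task and overall, in percent."""
    per_task = defaultdict(lambda: [0, 0])  # task -> [num_faithful, num_total]
    for ex in instances:
        ok = faithful(ex["question"], ex["reasoning"],
                      ex["counterfactual"], ex["answer"])
        per_task[ex["task"]][0] += int(ok)
        per_task[ex["task"]][1] += 1
    rates = {t: 100.0 * f / n for t, (f, n) in per_task.items()}
    rates["Overall"] = (100.0 * sum(f for f, _ in per_task.values())
                        / sum(n for _, n in per_task.values()))
    return rates
```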
Across these 12 LRMs, we observe substantial unfaithfulness: 49.7% of model outputs violate at least one faithfulness condition, with failures dominated by stance inconsistency. We also find that post-training choices (e.g., RL-style objectives) can reduce faithfulness even when accuracy is maintained, and that the accuracy–faithfulness association becomes weak after controlling for model and task.
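One way to probe that last claim is a logistic regression of per-output faithfulness on accuracy with model and task fixed effects. The sketch below uses synthetic placeholder data and illustrative column names (`faithful`, `accurate`, `model`, `task`), not RFEval's actual judgments; a near-zero, insignificant `accurate` coefficient is what a weak association after controls would look like in this framing.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic placeholder records: one row per model output with binary
# faithfulness/accuracy judgments plus model and task labels.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "model": rng.choice(["Qwen3-8B", "gpt-oss-20b", "MiMo-RL"], size=n),
    "task": rng.choice(["CG", "MR", "LR", "TR", "CU", "LD", "PR"], size=n),
    "accurate": rng.integers(0, 2, size=n),
    "faithful": rng.integers(0, 2, size=n),
})

# Logit of faithfulness on accuracy with model and task fixed effects.
fit = smf.logit("faithful ~ accurate + C(model) + C(task)", data=df).fit(disp=0)
print(fit.params["accurate"], fit.pvalues["accurate"])
```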
| Model | CG | MR | LR | TR | CU | LD | PR | Overall |
|---|---|---|---|---|---|---|---|---|
| Qwen3-8B | 21.15 | 37.97 | 72.74 | 58.11 | 43.97 | 48.64 | 3.09 | 41.95 |
| Qwen3-32B | 24.66 | 47.87 | 88.62 | 89.84 | 77.66 | 89.90 | 91.49 | 73.29 |
| R1-Qwen-7B | 38.25 | 29.54 | 82.13 | 44.46 | 76.31 | 70.63 | 81.49 | 61.37 |
| R1-Qwen-32B | 29.02 | 32.57 | 70.79 | 82.47 | 63.16 | 91.04 | 75.13 | 64.24 |
| R1-Llama-8B | 26.48 | 33.03 | 55.78 | 57.68 | 64.63 | 78.97 | 94.53 | 58.46 |
| R1-Llama-70B | 27.89 | 31.28 | 74.03 | 73.78 | 51.40 | 80.53 | 51.84 | 56.47 |
| gpt-oss-20b | 26.44 | 24.90 | 13.55 | 22.62 | 33.93 | 59.14 | 47.41 | 32.11 |
| gpt-oss-120b | 22.01 | 16.07 | 8.62 | 34.21 | 13.67 | 39.58 | 70.71 | 27.50 |
| MiMo-RL | 21.20 | 7.12 | 62.80 | 64.98 | 41.56 | 85.75 | 52.34 | 46.32 |
| MiMo-RL-Zero | 20.83 | 33.50 | 70.59 | 61.32 | 69.58 | 77.87 | 66.83 | 58.74 |
| Magistral-Small | 12.32 | 6.98 | 26.63 | 42.70 | 14.51 | 45.35 | 46.72 | 26.06 |
| LN-Super_v1 | 26.48 | 44.90 | 77.13 | 69.38 | 81.70 | 80.38 | 98.47 | 68.52 |
| Overall | 24.18 | 28.06 | 58.28 | 57.92 | 51.66 | 70.17 | 58.03 | 50.27 |
All values are reasoning faithfulness rates, RF (%). CG: Code Generation, MR: Mathematical Reasoning, LR: Logical Reasoning, TR: Table Reasoning, CU: Context Understanding, LD: Legal Decision, PR: Paper Review.
```bibtex
@inproceedings{han2026rfeval,
  title={RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models},
  author={Yunseok Han and Yejoon Lee and Jaeyoung Do},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=2Gc8aj0afg}
}
```