RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

ICLR 2026 Poster

AIDAS Laboratory, IPAI & ECE, Seoul National University

Large Reasoning Models (LRMs) can produce plausible-sounding rationales that don’t reflect their true decision process. We propose RFEval, a benchmark to measure reasoning faithfulness via output-level counterfactual interventions.

Abstract

LRMs can be highly accurate yet still produce plausible-sounding rationales that do not match their true decision process. We formalize reasoning faithfulness, separately from accuracy, with two testable criteria: stance consistency (the reasoning and the answer align) and causal influence (the stated reasoning actually drives the answer under output-level interventions). We introduce RFEval (7,186 instances across 7 tasks) to measure faithfulness via controlled counterfactual interventions, and find that 49.7% of outputs are unfaithful, mostly due to stance inconsistency. Unfaithfulness concentrates in brittle domains such as math and code, and relates more to post-training choices (e.g., RL-style objectives) than to model scale: adding RL on top of SFT can reduce faithfulness even when accuracy holds. Overall, accuracy is a weak and unreliable proxy for faithfulness, motivating audits and optimization for both correctness and reasoning integrity.

Method

RFEval formalizes reasoning faithfulness as a property distinct from answer accuracy, and operationalizes it with two testable conditions:

  • Stance consistency: the stated rationale maintains a coherent stance that supports the final answer.
  • Causal influence: under output-level counterfactual interventions that edit the rationale, the answer changes in the causally expected way.

To measure these conditions at scale, we curate a 7,186-instance benchmark spanning seven tasks and evaluate 12 open-source LRMs using controlled counterfactual edits, validated both automatically and by human annotators; a minimal sketch of the per-instance decision rule is given below.
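
Concretely, each instance reduces to a two-bit judgment: is the rationale stance-consistent with the answer, and does an output-level counterfactual edit of the rationale move the answer? The sketch below illustrates that decision rule. It is illustrative only: the callables (generate, stance_of, flip_stance, answer_given) are hypothetical placeholders for model calls and judge components, not RFEval's released API.

from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class FaithfulnessJudgment:
    stance_consistent: bool   # condition 1: rationale's stance supports the answer
    causally_influenced: bool # condition 2: editing the rationale moves the answer

    @property
    def faithful(self) -> bool:
        # An output counts as faithful only when both conditions hold.
        return self.stance_consistent and self.causally_influenced

def judge_instance(
    question: str,
    generate: Callable[[str], Tuple[str, str]],   # question -> (rationale, answer)
    stance_of: Callable[[str], str],              # rationale -> stance label
    flip_stance: Callable[[str], str],            # rationale -> stance-flipped rationale
    answer_given: Callable[[str, str], str],      # (question, rationale) -> answer
) -> FaithfulnessJudgment:
    rationale, answer = generate(question)

    # Stance consistency: the stated rationale takes a coherent stance
    # that matches the final answer.
    stance_consistent = stance_of(rationale) == answer

    # Causal influence: apply an output-level counterfactual edit that flips
    # the rationale's stance, then re-answer conditioned on the edited
    # rationale; a faithful model should change its answer accordingly.
    counterfactual = flip_stance(rationale)
    causally_influenced = answer_given(question, counterfactual) != answer

    return FaithfulnessJudgment(stance_consistent, causally_influenced)

In our results, it is condition 1 (stance consistency) that fails most often, which is why unfaithfulness is dominated by stance inconsistency rather than by missing causal influence.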

RFEval benchmark evaluation pipeline.

Main Results

Across 12 open-source LRMs, we observe substantial unfaithfulness: 49.7% of model outputs violate at least one faithfulness condition, with failures dominated by stance inconsistency. We also find that post-training choices (e.g., RL-style objectives) can reduce faithfulness even when accuracy is maintained, and that the accuracy–faithfulness association becomes weak once we control for model and task.

All values are reasoning faithfulness rates (RF, %); higher is better.

Model                 CG      MR      LR      TR      CU      LD      PR  Overall
Qwen3-8B           21.15   37.97   72.74   58.11   43.97   48.64    3.09    41.95
Qwen3-32B          24.66   47.87   88.62   89.84   77.66   89.90   91.49    73.29
R1-Qwen-7B         38.25   29.54   82.13   44.46   76.31   70.63   81.49    61.37
R1-Qwen-32B        29.02   32.57   70.79   82.47   63.16   91.04   75.13    64.24
R1-Llama-8B        26.48   33.03   55.78   57.68   64.63   78.97   94.53    58.46
R1-Llama-70B       27.89   31.28   74.03   73.78   51.40   80.53   51.84    56.47
gpt-oss-20b        26.44   24.90   13.55   22.62   33.93   59.14   47.41    32.11
gpt-oss-120b       22.01   16.07    8.62   34.21   13.67   39.58   70.71    27.50
MiMo-RL            21.20    7.12   62.80   64.98   41.56   85.75   52.34    46.32
MiMo-RL-Zero       20.83   33.50   70.59   61.32   69.58   77.87   66.83    58.74
Magistral-Small    12.32    6.98   26.63   42.70   14.51   45.35   46.72    26.06
LN-Super_v1        26.48   44.90   77.13   69.38   81.70   80.38   98.47    68.52
Overall            24.18   28.06   58.28   57.92   51.66   70.17   58.03    50.27
CG: Code Generation, MR: Mathematical Reasoning, LR: Logical Reasoning, TR: Table Reasoning, CU: Context Understanding, LD: Legal Decision, PR: Paper Review.
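
To make the "controlling for model and task" claim concrete, the sketch below fits a linear-probability model of a per-instance faithfulness flag on a correctness flag with model and task fixed effects; a small, insignificant coefficient on the correctness flag indicates a weak accuracy–faithfulness association. The toy data and column names are assumptions for illustration, not the paper's analysis code.

import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in for per-instance evaluation records; the real analysis
# would use the 7,186 judged RFEval instances.
records = pd.DataFrame({
    "faithful": [1, 0, 1, 1, 0, 0, 1, 0],   # passed both faithfulness conditions?
    "correct":  [1, 1, 0, 1, 0, 1, 1, 0],   # answered the task correctly?
    "model":    ["A", "A", "B", "B", "A", "B", "A", "B"],
    "task":     ["math", "code"] * 4,
})

# Linear-probability model with model and task fixed effects: the coefficient
# on `correct` is the accuracy-faithfulness association after controlling for
# which model and which task produced each instance.
fit = smf.ols("faithful ~ correct + C(model) + C(task)", data=records).fit()
print(fit.params["correct"], fit.pvalues["correct"])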

BibTeX

@inproceedings{han2026rfeval,
  title={RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models},
  author={Yunseok Han and Yejoon Lee and Jaeyoung Do},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=2Gc8aj0afg}
}