MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering

ACL 2026 Findings

¹Department of Electrical and Computer Engineering, Seoul National University
²Interdisciplinary Program in Artificial Intelligence, Seoul National University
Figure 1. Overview of the MATA multi-agent TableQA workflow. MATA coordinates Chain-of-Thought, Program-of-Thought, and text-to-SQL reasoning paths with lightweight tools for reliable and efficient TableQA.

Abstract

MATA is a multi-agent framework for Table Question Answering that targets reliability, flexibility, and efficiency in practical LLM deployment settings. Instead of depending on one reasoning style, MATA forms candidate answers through complementary Chain-of-Thought, Program-of-Thought, and text-to-SQL paths, then uses lightweight tools and specialized agents to select or refine the final answer.

The framework is designed to avoid unnecessary LLM calls while preserving answer diversity. Experiments on Penguins in a Table and TableBench with ten LLM backbones show that MATA achieves strong accuracy across open-source and proprietary models, with especially large gains on the more challenging TableBench benchmark.

Key Ideas

Model-Agnostic TableQA

MATA is evaluated with ten LLMs, covering small open-source models below 10B parameters and larger or closed-source models.

Diverse Reasoning

The framework combines CoT, PoT, and text-to-SQL so that table questions can benefit from textual, code-based, and SQL-based reasoning.

Efficient Orchestration

Scheduler and Confidence Checker modules reduce unnecessary LLM-agent calls while preserving strong answer selection.

Method

Given a table and a question, MATA first runs the CoT Agent and uses the Scheduler to prioritize either the PoT Agent or the text2SQL Agent. If the selected code-based path agrees with the CoT answer, MATA can skip the remaining code-based path; otherwise, it invokes the other path to increase answer diversity.

The system includes three lightweight tools and six LLM-based agents. The tools are the Scheduler, Confidence Checker, and Format Matcher. The agents are the CoT Agent, PoT Agent, text2SQL Agent, Python Debug Agent, SQL Debug Agent, and Judge Agent. The Scheduler uses MobileBERT with a two-layer MLP and has 24.65M parameters; the Confidence Checker is based on DeBERTaV3-large with about 435M parameters; the Format Matcher uses Qwen2.5-Instruct 0.5B without fine-tuning.
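
As a rough illustration, the Scheduler can be viewed as a small sequence classifier over the serialized question and table: a MobileBERT encoder followed by a two-layer MLP that decides which code-based path to try first. The sketch below is a hypothetical PyTorch implementation; the class name, checkpoint id, label mapping, hidden size, and input serialization are assumptions, not the paper's exact implementation.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SchedulerHead(nn.Module):
    """Hypothetical Scheduler: MobileBERT encoder + two-layer MLP
    that decides whether to prioritize the PoT or text2SQL path."""

    def __init__(self, backbone="google/mobilebert-uncased", hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        dim = self.encoder.config.hidden_size
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # assumed labels: 0 -> PoT first, 1 -> text2SQL first
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.mlp(cls)

# Example usage with an illustrative question/table serialization.
tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
scheduler = SchedulerHead()
enc = tokenizer("question: How many penguins are taller than 70 cm? table: name | age | height ...",
                return_tensors="pt", truncation=True)
logits = scheduler(enc["input_ids"], enc["attention_mask"])
preferred_path = "PoT" if logits.argmax(-1).item() == 0 else "text2SQL"
```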

  • Agent selection: prioritize PoT or text2SQL while CoT runs in parallel.
  • Code generation and debugging: refine PoT and text2SQL outputs with dedicated debugging agents.
  • Answer selection: use the Confidence Checker for early selection and the Judge Agent when candidates require additional verification, as sketched below.
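
Putting the pieces together, a minimal sketch of the control flow looks roughly as follows. The callable names, dictionary-style interfaces, string-equality agreement check, and confidence threshold are illustrative assumptions standing in for the actual MATA agents and tools rather than reproducing their interfaces.

```python
def answers_match(a: str, b: str) -> bool:
    # Hypothetical agreement check: compare normalized answer strings.
    return a.strip().lower() == b.strip().lower()

def answer_table_question(table, question, agents, tools, judge_threshold=0.9):
    """Illustrative MATA-style control flow; agent/tool interfaces are assumed."""
    cot_answer = agents["cot"](table, question)         # CoT Agent runs first (textual reasoning)

    # Scheduler decides which code-based path to prioritize.
    if tools["scheduler"](table, question) == "PoT":
        first, second = agents["pot"], agents["text2sql"]
    else:
        first, second = agents["text2sql"], agents["pot"]

    primary = first(table, question)                     # refined by its dedicated debug agent
    candidates = [cot_answer, primary]

    # If the code-based answer agrees with CoT, skip the remaining code-based path;
    # otherwise run it to increase answer diversity.
    if not answers_match(cot_answer, primary):
        candidates.append(second(table, question))

    # Confidence Checker allows early selection without invoking the Judge Agent.
    best, confidence = tools["confidence_checker"](question, candidates)
    if confidence >= judge_threshold:                     # hypothetical threshold
        return best

    # Otherwise the Judge Agent verifies the candidates and picks the final answer.
    return agents["judge"](table, question, candidates)
```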

Results

MATA is evaluated on two benchmarks using Exact Match (EM), fuzzy matching, and token-level F1. Penguins in a Table represents easier single-table reasoning, while TableBench contains larger tables and more complex questions across 18 subcategories.
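
For intuition, the three metrics can be approximated as below. This is a simplified sketch: the exact answer normalization, fuzzy similarity measure, and matching threshold used in the evaluation may differ.

```python
from collections import Counter
from difflib import SequenceMatcher

def exact_match(pred: str, gold: str) -> float:
    # EM: 1.0 only if the normalized strings are identical.
    return float(pred.strip().lower() == gold.strip().lower())

def fuzzy_match(pred: str, gold: str, threshold: float = 0.9) -> float:
    # Simplified fuzzy matching via a character-level similarity ratio
    # (the threshold is an assumption for illustration).
    ratio = SequenceMatcher(None, pred.strip().lower(), gold.strip().lower()).ratio()
    return float(ratio >= threshold)

def token_f1(pred: str, gold: str) -> float:
    # Token-level F1 over whitespace-separated tokens.
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```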

Benchmark Summary

Penguins in a Table (average over 10 LLMs; best baseline average is SynTQA on all three metrics)
  EM     0.881  (+8.8% over 0.810)
  Fuzzy  0.890  (+6.1% over 0.839)
  F1     0.881  (+8.6% over 0.811)

TableBench (average over 10 LLMs; best baseline per metric is SynTQA for EM and F1, MixSC for fuzzy matching)
  EM     0.451  (+40.1% over 0.322)
  Fuzzy  0.619  (+21.9% over 0.508)
  F1     0.482  (+33.1% over 0.362)

Ablation Overview

Figure 2. Ablation results for Confidence Checker, Judge Agent, Format Matcher, and Scheduler modules.

The Confidence Checker is the most important component in the ablation study. It reduces Judge Agent invocations by 95.8% on Penguins in a Table and 60.6% on TableBench while maintaining or improving final accuracy. The Scheduler further reduces LLM-agent calls by 14.6% and 7.6% on the two benchmarks, respectively.

For locally hosted open-source backbones, MATA reports an average end-to-end latency of 27.55 seconds per query, compared with 48.89 seconds for TabLaP and 44.48 seconds for MixSC. SynTQA is faster at 6.86 seconds, but its fixed low-call budget is less effective on the harder TableBench setting.

Full Benchmark Tables

Table 2. Evaluation results on the Penguins in a Table benchmark. Bold indicates the best performance; underlined scores are the second best.

Each cell lists EM / Fuzzy / F1.

Group      Model               TabLaP                  SynTQA                  MixSC                   MATA
Small LLM  llama3.2-3b         0.188 / 0.290 / 0.247   0.597 / 0.654 / 0.602   0.201 / 0.303 / 0.252   0.736 / 0.766 / 0.736
           mistral-7b          0.049 / 0.231 / 0.102   0.639 / 0.680 / 0.645   0.271 / 0.385 / 0.289   0.861 / 0.880 / 0.861
           phi4-mini-3.8b      0.333 / 0.483 / 0.362   0.813 / 0.827 / 0.813   0.500 / 0.593 / 0.528   0.819 / 0.847 / 0.819
           qwen2.5-3b          0.396 / 0.479 / 0.400   0.694 / 0.737 / 0.694   0.438 / 0.517 / 0.442   0.868 / 0.883 / 0.868
           qwen2.5-7b          0.444 / 0.522 / 0.444   0.813 / 0.866 / 0.815   0.597 / 0.657 / 0.597   0.951 / 0.955 / 0.951
Large LLM  mistral-small-24b   0.764 / 0.784 / 0.773   0.896 / 0.918 / 0.896   0.806 / 0.813 / 0.810   0.896 / 0.896 / 0.896
           cogito-32b          0.931 / 0.934 / 0.931   0.868 / 0.886 / 0.868   0.903 / 0.908 / 0.903   0.903 / 0.903 / 0.903
           qwen2.5-32b         0.611 / 0.687 / 0.656   0.861 / 0.892 / 0.861   0.785 / 0.802 / 0.789   0.917 / 0.917 / 0.917
           GPT-4o              0.653 / 0.655 / 0.653   0.951 / 0.961 / 0.951   0.833 / 0.835 / 0.833   0.903 / 0.903 / 0.903
           Claude-3.7-Sonnet   0.868 / 0.868 / 0.868   0.965 / 0.970 / 0.965   0.924 / 0.924 / 0.924   0.951 / 0.951 / 0.951
Average                        0.524 / 0.593 / 0.544   0.810 / 0.839 / 0.811   0.626 / 0.674 / 0.637   0.881 / 0.890 / 0.881

Table 3. Evaluation results on the TableBench benchmark. Bold and underline follow Table 2.

Each cell lists EM / Fuzzy / F1.

Group      Model               TabLaP                  SynTQA                  MixSC                   MATA
Small LLM  llama3.2-3b         0.067 / 0.357 / 0.130   0.089 / 0.231 / 0.120   0.081 / 0.372 / 0.144   0.354 / 0.563 / 0.381
           mistral-7b          0.036 / 0.331 / 0.119   0.227 / 0.367 / 0.270   0.082 / 0.355 / 0.151   0.294 / 0.473 / 0.321
           phi4-mini-3.8b      0.056 / 0.334 / 0.126   0.202 / 0.366 / 0.253   0.144 / 0.411 / 0.203   0.273 / 0.457 / 0.295
           qwen2.5-3b          0.163 / 0.417 / 0.195   0.208 / 0.364 / 0.245   0.163 / 0.417 / 0.197   0.291 / 0.471 / 0.317
           qwen2.5-7b          0.079 / 0.255 / 0.094   0.302 / 0.450 / 0.336   0.169 / 0.368 / 0.190   0.354 / 0.557 / 0.393
Large LLM  mistral-small-24b   0.322 / 0.478 / 0.352   0.391 / 0.543 / 0.431   0.378 / 0.530 / 0.410   0.573 / 0.724 / 0.606
           cogito-32b          0.440 / 0.614 / 0.483   0.443 / 0.591 / 0.481   0.430 / 0.614 / 0.476   0.577 / 0.723 / 0.609
           qwen2.5-32b         0.268 / 0.533 / 0.317   0.398 / 0.553 / 0.436   0.297 / 0.551 / 0.341   0.577 / 0.721 / 0.607
           GPT-4o              0.556 / 0.722 / 0.595   0.476 / 0.607 / 0.503   0.494 / 0.692 / 0.540   0.595 / 0.740 / 0.629
           Claude-3.7-Sonnet   0.612 / 0.763 / 0.655   0.489 / 0.633 / 0.540   0.619 / 0.767 / 0.659   0.620 / 0.764 / 0.664
Average                        0.260 / 0.480 / 0.307   0.322 / 0.471 / 0.362   0.286 / 0.508 / 0.331   0.451 / 0.619 / 0.482

BibTeX

@misc{hyeon2026mata,
      title={MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering}, 
      author={Sieun Hyeon and Jusang Oh and Sunghwan Steve Cho and Jaeyoung Do},
      year={2026},
      eprint={2602.09642},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.09642}, 
}