MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering

ACL 2026 Findings

¹Department of Electrical and Computer Engineering, Seoul National University
²Interdisciplinary Program in Artificial Intelligence, Seoul National University
Figure 1. Overview of the MATA multi-agent TableQA workflow. MATA coordinates Chain-of-Thought, Program-of-Thought, and text-to-SQL reasoning paths with lightweight tools for reliable and efficient TableQA.

Abstract

MATA is a multi-agent framework for Table Question Answering that targets reliability, flexibility, and efficiency in practical LLM deployment settings. Instead of depending on one reasoning style, MATA forms candidate answers through complementary Chain-of-Thought, Program-of-Thought, and text-to-SQL paths, then uses lightweight tools and specialized agents to select or refine the final answer.

The framework is designed to avoid unnecessary LLM calls while preserving answer diversity. Experiments on Penguins in a Table and TableBench with ten LLM backbones show that MATA achieves strong accuracy across open-source and proprietary models, with especially large gains on the more challenging TableBench benchmark.

Key Ideas

Model-Agnostic TableQA

MATA is evaluated with ten LLMs, covering small open-source models below 10B parameters and larger or closed-source models.

Diverse Reasoning

The framework combines CoT, PoT, and text-to-SQL so that table questions can benefit from textual, code-based, and SQL-based reasoning.

Efficient Orchestration

Scheduler and Confidence Checker modules reduce unnecessary LLM-agent calls while preserving strong answer selection.

Method

Given a table and a question, MATA first runs the CoT Agent and uses the Scheduler to prioritize either the PoT Agent or the text2SQL Agent. If the selected code-based path agrees with the CoT answer, MATA can skip the remaining code-based path; otherwise, it invokes the other path to increase answer diversity.

The system includes three lightweight tools and six LLM-based agents. The tools are the Scheduler, Confidence Checker, and Format Matcher. The agents are the CoT Agent, PoT Agent, text2SQL Agent, Python Debug Agent, SQL Debug Agent, and Judge Agent. The Scheduler uses MobileBERT with a two-layer MLP and has 24.65M parameters; the Confidence Checker is based on DeBERTaV3-large with about 435M parameters; the Format Matcher uses Qwen2.5-Instruct 0.5B without fine-tuning.
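
As a rough illustration, the Scheduler can be viewed as a small sequence classifier over the serialized question and table: a MobileBERT encoder followed by a two-layer MLP that decides which code-based path to try first. The sketch below is a hypothetical PyTorch implementation; the class name, checkpoint id, label mapping, hidden size, and input serialization are assumptions, not the paper's exact implementation.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SchedulerHead(nn.Module):
    """Hypothetical Scheduler: MobileBERT encoder + two-layer MLP
    that decides whether to prioritize the PoT or text2SQL path."""

    def __init__(self, backbone="google/mobilebert-uncased", hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        dim = self.encoder.config.hidden_size
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # assumed labels: 0 -> PoT first, 1 -> text2SQL first
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.mlp(cls)

# Example usage with an illustrative question/table serialization.
tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
scheduler = SchedulerHead()
enc = tokenizer("question: How many penguins are taller than 70 cm? table: name | age | height ...",
                return_tensors="pt", truncation=True)
logits = scheduler(enc["input_ids"], enc["attention_mask"])
preferred_path = "PoT" if logits.argmax(-1).item() == 0 else "text2SQL"
```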

  • Agent selection: prioritize PoT or text2SQL while CoT runs in parallel.
  • Code generation and debugging: refine PoT and text2SQL outputs with dedicated debugging agents.
  • Answer selection: use the Confidence Checker for early selection and the Judge Agent when candidates require additional verification, as sketched below.
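
Putting the pieces together, a minimal sketch of the control flow looks roughly as follows. The callable names, dictionary-style interfaces, string-equality agreement check, and confidence threshold are illustrative assumptions standing in for the actual MATA agents and tools rather than reproducing their interfaces.

```python
def answers_match(a: str, b: str) -> bool:
    # Hypothetical agreement check: compare normalized answer strings.
    return a.strip().lower() == b.strip().lower()

def answer_table_question(table, question, agents, tools, judge_threshold=0.9):
    """Illustrative MATA-style control flow; agent/tool interfaces are assumed."""
    cot_answer = agents["cot"](table, question)         # CoT Agent runs first (textual reasoning)

    # Scheduler decides which code-based path to prioritize.
    if tools["scheduler"](table, question) == "PoT":
        first, second = agents["pot"], agents["text2sql"]
    else:
        first, second = agents["text2sql"], agents["pot"]

    primary = first(table, question)                     # refined by its dedicated debug agent
    candidates = [cot_answer, primary]

    # If the code-based answer agrees with CoT, skip the remaining code-based path;
    # otherwise run it to increase answer diversity.
    if not answers_match(cot_answer, primary):
        candidates.append(second(table, question))

    # Confidence Checker allows early selection without invoking the Judge Agent.
    best, confidence = tools["confidence_checker"](question, candidates)
    if confidence >= judge_threshold:                     # hypothetical threshold
        return best

    # Otherwise the Judge Agent verifies the candidates and picks the final answer.
    return agents["judge"](table, question, candidates)
```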

Results

MATA is evaluated on two benchmarks using Exact Match (EM), fuzzy matching, and token-level F1. Penguins in a Table represents easier single-table reasoning, while TableBench contains larger tables and more complex questions across 18 subcategories.
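
For intuition, the three metrics can be approximated as below. This is a simplified sketch: the exact answer normalization, fuzzy similarity measure, and matching threshold used in the evaluation may differ.

```python
from collections import Counter
from difflib import SequenceMatcher

def exact_match(pred: str, gold: str) -> float:
    # EM: 1.0 only if the normalized strings are identical.
    return float(pred.strip().lower() == gold.strip().lower())

def fuzzy_match(pred: str, gold: str, threshold: float = 0.9) -> float:
    # Simplified fuzzy matching via a character-level similarity ratio
    # (the threshold is an assumption for illustration).
    ratio = SequenceMatcher(None, pred.strip().lower(), gold.strip().lower()).ratio()
    return float(ratio >= threshold)

def token_f1(pred: str, gold: str) -> float:
    # Token-level F1 over whitespace-separated tokens.
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```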

Benchmark Summary

Penguins in a Table (average over 10 LLMs; best baseline average is SynTQA on all three metrics)
  EM     0.881  (+8.8% over 0.810)
  Fuzzy  0.890  (+6.1% over 0.839)
  F1     0.881  (+8.6% over 0.811)

TableBench (average over 10 LLMs; best baseline per metric is SynTQA for EM and F1, MixSC for fuzzy matching)
  EM     0.451  (+40.1% over 0.322)
  Fuzzy  0.619  (+21.9% over 0.508)
  F1     0.482  (+33.1% over 0.362)

Ablation Overview

Figure 2. Ablation results for Confidence Checker, Judge Agent, Format Matcher, and Scheduler modules.

The Confidence Checker is the most important component in the ablation study. It reduces Judge Agent invocations by 95.8% on Penguins in a Table and 60.6% on TableBench while maintaining or improving final accuracy. The Scheduler further reduces LLM-agent calls by 14.6% and 7.6% on the two benchmarks, respectively.

For locally hosted open-source backbones, MATA reports an average end-to-end latency of 27.55 seconds per query, compared with 48.89 seconds for TabLaP and 44.48 seconds for MixSC. SynTQA is faster at 6.86 seconds, but its fixed low-call budget is less effective on the harder TableBench setting.

Full Benchmark Tables

Table 2. Evaluation results on the Penguins in a Table benchmark. Bold indicates the best performance; underlined scores are the second best.

Each cell lists EM / Fuzzy / F1.

Group      Model               TabLaP                  SynTQA                  MixSC                   MATA
Small LLM  llama3.2-3b         0.188 / 0.290 / 0.247   0.597 / 0.654 / 0.602   0.201 / 0.303 / 0.252   0.736 / 0.766 / 0.736
           mistral-7b          0.049 / 0.231 / 0.102   0.639 / 0.680 / 0.645   0.271 / 0.385 / 0.289   0.861 / 0.880 / 0.861
           phi4-mini-3.8b      0.333 / 0.483 / 0.362   0.813 / 0.827 / 0.813   0.500 / 0.593 / 0.528   0.819 / 0.847 / 0.819
           qwen2.5-3b          0.396 / 0.479 / 0.400   0.694 / 0.737 / 0.694   0.438 / 0.517 / 0.442   0.868 / 0.883 / 0.868
           qwen2.5-7b          0.444 / 0.522 / 0.444   0.813 / 0.866 / 0.815   0.597 / 0.657 / 0.597   0.951 / 0.955 / 0.951
Large LLM  mistral-small-24b   0.764 / 0.784 / 0.773   0.896 / 0.918 / 0.896   0.806 / 0.813 / 0.810   0.896 / 0.896 / 0.896
           cogito-32b          0.931 / 0.934 / 0.931   0.868 / 0.886 / 0.868   0.903 / 0.908 / 0.903   0.903 / 0.903 / 0.903
           qwen2.5-32b         0.611 / 0.687 / 0.656   0.861 / 0.892 / 0.861   0.785 / 0.802 / 0.789   0.917 / 0.917 / 0.917
           GPT-4o              0.653 / 0.655 / 0.653   0.951 / 0.961 / 0.951   0.833 / 0.835 / 0.833   0.903 / 0.903 / 0.903
           Claude-3.7-Sonnet   0.868 / 0.868 / 0.868   0.965 / 0.970 / 0.965   0.924 / 0.924 / 0.924   0.951 / 0.951 / 0.951
Average                        0.524 / 0.593 / 0.544   0.810 / 0.839 / 0.811   0.626 / 0.674 / 0.637   0.881 / 0.890 / 0.881

Table 3. Evaluation results on the TableBench benchmark. Bold and underline follow Table 2.

Each cell lists EM / Fuzzy / F1.

Group      Model               TabLaP                  SynTQA                  MixSC                   MATA
Small LLM  llama3.2-3b         0.067 / 0.357 / 0.130   0.089 / 0.231 / 0.120   0.081 / 0.372 / 0.144   0.354 / 0.563 / 0.381
           mistral-7b          0.036 / 0.331 / 0.119   0.227 / 0.367 / 0.270   0.082 / 0.355 / 0.151   0.294 / 0.473 / 0.321
           phi4-mini-3.8b      0.056 / 0.334 / 0.126   0.202 / 0.366 / 0.253   0.144 / 0.411 / 0.203   0.273 / 0.457 / 0.295
           qwen2.5-3b          0.163 / 0.417 / 0.195   0.208 / 0.364 / 0.245   0.163 / 0.417 / 0.197   0.291 / 0.471 / 0.317
           qwen2.5-7b          0.079 / 0.255 / 0.094   0.302 / 0.450 / 0.336   0.169 / 0.368 / 0.190   0.354 / 0.557 / 0.393
Large LLM  mistral-small-24b   0.322 / 0.478 / 0.352   0.391 / 0.543 / 0.431   0.378 / 0.530 / 0.410   0.573 / 0.724 / 0.606
           cogito-32b          0.440 / 0.614 / 0.483   0.443 / 0.591 / 0.481   0.430 / 0.614 / 0.476   0.577 / 0.723 / 0.609
           qwen2.5-32b         0.268 / 0.533 / 0.317   0.398 / 0.553 / 0.436   0.297 / 0.551 / 0.341   0.577 / 0.721 / 0.607
           GPT-4o              0.556 / 0.722 / 0.595   0.476 / 0.607 / 0.503   0.494 / 0.692 / 0.540   0.595 / 0.740 / 0.629
           Claude-3.7-Sonnet   0.612 / 0.763 / 0.655   0.489 / 0.633 / 0.540   0.619 / 0.767 / 0.659   0.620 / 0.764 / 0.664
Average                        0.260 / 0.480 / 0.307   0.322 / 0.471 / 0.362   0.286 / 0.508 / 0.331   0.451 / 0.619 / 0.482

BibTeX

@misc{hyeon2026mata,
      title={MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering}, 
      author={Sieun Hyeon and Jusang Oh and Sunghwan Steve Cho and Jaeyoung Do},
      year={2026},
      eprint={2602.09642},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.09642}, 
}