Spreadsheet-RL

News

2026-06-09 🧪 Added SpreadsheetBench-Verified to the Spreadsheet-RL dataset, including verified spreadsheet artifacts and parser-specific parquet splits.
2026-06-03 🔄 Refreshed the spreadsheet artifacts by removing samples with abnormal recalculation behavior (excessive latency or memory usage); the corresponding parquet splits were updated to match.
2026-05-23 🚀 Released the Spreadsheet-RL-4B model checkpoint on Hugging Face at Spreadsheet-RL/Spreadsheet-RL-4B, the RL-trained Qwen/Qwen3-4B-Thinking-2507 spreadsheet agent used in the paper.
2026-05-22 🌐 The Spreadsheet-RL project page is now live at https://spreadsheet-rl.github.io/, with the paper overview, framework, results, resources, and citation.
2026-05-21 📄 The Spreadsheet-RL arXiv preprint is available at arXiv:2605.22642, and the paper is featured on Hugging Face Daily Papers.
2026-05-17 📦 Code and dataset release for Spreadsheet-RL. The code is available on GitHub at Spreadsheet-RL/Spreadsheet-RL, with training configs, Slurm scripts, the Excel reward service, SandboxFusion setup, and the verl integration. The dataset is available on Hugging Face at Spreadsheet-RL/Spreadsheet-RL, with parquet splits and workbook files.

Abstract

Spreadsheet systems such as Microsoft Excel and Google Sheets are central to modern data-centric workflows, but existing spreadsheet agents often rely on prompt engineering over general-purpose models and struggle with complex, multi-step tasks. Spreadsheet-RL is an RL fine-tuning framework for training specialized spreadsheet agents inside a realistic Microsoft Excel environment.

The framework combines scalable start-goal spreadsheet construction, a multi-turn Spreadsheet Gym with spreadsheet-native tools and sandboxed code execution, and outcome-based GRPO training. On SpreadsheetBench, Spreadsheet-RL improves Qwen3-4B-Thinking-2507 Pass@1 from 12.0% to 23.4%; on Domain-Spreadsheet, it improves Pass@1 from 8.4% to 17.2%.

5,925 released ExcelForum training tasks

23.4% SpreadsheetBench Pass@1 after RL

1,660 Domain-Spreadsheet evaluation rollouts

17.2% Domain-Spreadsheet Pass@1 after RL

Framework

Spreadsheet-RL links realistic data construction, faithful Excel interaction, and verifiable outcome rewards into one reproducible training loop.

Spreadsheet Data Agent

Collects public ExcelForum threads after January 1, 2024, synthesizes oracle final workbooks with coding agents, and filters tasks through rule-based validation.
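The collection and filtering step can be pictured with a small sketch. Everything in it is hypothetical: the `Thread` record, its two boolean checks, and the reading of "after January 1, 2024" as a strict cutoff are illustrative assumptions, not the project's actual pipeline or validation rules.

```python
from dataclasses import dataclass
from datetime import date

# Assumption: "after January 1, 2024" is read as a strict cutoff.
CUTOFF = date(2024, 1, 1)


@dataclass(frozen=True)
class Thread:
    """Hypothetical record for one forum thread and its synthesized oracle."""
    thread_id: str
    posted: date
    has_start_workbook: bool   # an attached workbook exists to start from
    oracle_recalculates: bool  # the oracle workbook recalculates without errors


def passes_rules(thread: Thread) -> bool:
    """Stand-in for rule-based validation; the real rules are not listed here."""
    return thread.has_start_workbook and thread.oracle_recalculates


def select_tasks(threads: list[Thread]) -> list[Thread]:
    """Keep threads posted after the cutoff that also pass validation."""
    return [t for t in threads if t.posted > CUTOFF and passes_rules(t)]


threads = [
    Thread("t1", date(2023, 12, 30), True, True),   # posted before the cutoff
    Thread("t2", date(2024, 3, 5), True, True),     # kept
    Thread("t3", date(2024, 6, 1), True, False),    # oracle fails validation
]
kept = [t.thread_id for t in select_tasks(threads)]
print(kept)
```

Keeping the date filter and the validation rules as separate predicates makes it easy to audit how many threads each stage discards.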

Spreadsheet Gym

Runs multi-turn agent rollouts in Microsoft Excel with isolated workspaces, spreadsheet-native tools, and SandboxFusion-backed code execution.
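One way to picture workspace isolation and bounded multi-turn rollouts is the sketch below. It is an assumption-laden illustration: `isolated_workspace`, `run_rollout`, the `policy` callable, and the `max_turns` default are hypothetical names, and a text file stands in for a workbook so the example runs without Excel or SandboxFusion.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, Iterator


@contextmanager
def isolated_workspace(start_workbook: Path) -> Iterator[Path]:
    """Copy the start workbook into a fresh temp directory for one rollout."""
    root = Path(tempfile.mkdtemp(prefix="rollout_"))
    try:
        yield Path(shutil.copy2(start_workbook, root / start_workbook.name))
    finally:
        shutil.rmtree(root, ignore_errors=True)  # discard the workspace afterwards


def run_rollout(start_workbook: Path,
                policy: Callable[[Path, int], bool],
                max_turns: int = 8) -> int:
    """Call the policy once per turn until it reports completion or turns run out.

    Returns the number of turns used. `policy` is a hypothetical stand-in for
    the agent plus its tool calls; it edits the workspace copy in place.
    """
    with isolated_workspace(start_workbook) as workbook:
        for turn in range(1, max_turns + 1):
            if policy(workbook, turn):
                return turn
    return max_turns


# Demo: a text file stands in for the start workbook.
source_dir = Path(tempfile.mkdtemp(prefix="source_"))
source = source_dir / "start.txt"
source.write_text("start\n")
seen: list[Path] = []


def toy_policy(workbook: Path, turn: int) -> bool:
    """Append one line per turn and declare success on the third turn."""
    seen.append(workbook)
    workbook.write_text(workbook.read_text() + f"turn {turn}\n")
    return turn == 3


turns_used = run_rollout(source, toy_policy)
original_after = source.read_text()  # the source file itself is never edited
shutil.rmtree(source_dir)
```

Because every rollout edits only its own copy, several rollouts of the same task can run side by side without corrupting the start workbook.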

Outcome-Based RL

Uses an asynchronous Excel reward API to recalculate final workbooks and compare target ranges against oracle workbooks for GRPO training.

Results

Spreadsheet-native harnessing, richer tool access, and RL post-training each raise SpreadsheetBench Pass@1 for the same 4B open-source base model.

SpreadsheetBench Pass@1

Setting (Qwen3-4B-Thinking-2507)	Environment	Pass@1 (%)
Base model	Spreadsheet Gym	12.0
+ Spreadsheet-native interaction harness	Spreadsheet Gym	15.6
+ Comprehensive spreadsheet-tool access	Spreadsheet Gym	19.3
+ Spreadsheet-RL post-training	Spreadsheet Gym	23.4
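Reading the rows as cumulative settings, the contribution of each stage is the difference between consecutive rows. The short calculation below uses only the numbers in the table; the variable names are illustrative.

```python
# Pass@1 (%) for each cumulative setting, copied from the table above.
stages = [
    ("Base model", 12.0),
    ("+ Spreadsheet-native interaction harness", 15.6),
    ("+ Comprehensive spreadsheet-tool access", 19.3),
    ("+ Spreadsheet-RL post-training", 23.4),
]

# Gain of each stage over the previous row, in percentage points.
gains = [
    round(curr - prev, 1)
    for (_, prev), (_, curr) in zip(stages, stages[1:])
]
total_gain = round(stages[-1][1] - stages[0][1], 1)
relative = round(stages[-1][1] / stages[0][1], 2)

for (name, _), gain in zip(stages[1:], gains):
    print(f"{name}: +{gain} points")
print(f"Total: +{total_gain} points ({relative}x the base Pass@1)")
```

The three stages add 3.6, 3.7, and 4.1 points respectively, for a total of 11.4 points, so the final Pass@1 is roughly 1.95 times the base value.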

Domain-Spreadsheet Pass@1

Domain	# Eval.	Base Pass@1 (%)	RL Pass@1 (%)
Finance-B	597	15.6	29.3
Finance-I	388	7.7	16.2
Finance-A	135	8.1	19.3
Supply Chain	180	1.1	5.0
HR	185	0.5	3.2
Sales	86	1.2	5.8
Real Estate	89	1.1	1.1
Overall	1,660	8.4	17.2
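The Overall row is consistent with an average weighted by the number of evaluations per domain. The check below assumes that weighting (the page does not state how the row was computed) and uses only the table values.

```python
# (domain, number of evaluations, base Pass@1 %, RL Pass@1 %) from the table.
rows = [
    ("Finance-B", 597, 15.6, 29.3),
    ("Finance-I", 388, 7.7, 16.2),
    ("Finance-A", 135, 8.1, 19.3),
    ("Supply Chain", 180, 1.1, 5.0),
    ("HR", 185, 0.5, 3.2),
    ("Sales", 86, 1.2, 5.8),
    ("Real Estate", 89, 1.1, 1.1),
]

total_evals = sum(n for _, n, _, _ in rows)


def weighted_pass_at_1(column: int) -> float:
    """Average one Pass@1 column, weighting each domain by its evaluation count."""
    return sum(row[1] * row[column] for row in rows) / total_evals


overall_base = round(weighted_pass_at_1(2), 1)
overall_rl = round(weighted_pass_at_1(3), 1)
print(total_evals, overall_base, overall_rl)
```

The counts sum to 1,660 and the weighted averages round to 8.4 and 17.2, matching the reported Overall row. Finance-B alone accounts for more than a third of the evaluations, so it dominates the overall figure.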

**Figure 2.** RL training raises reward and validation accuracy while reducing rollout length and mean number of turns. The panels plot reward, response length, turns, and SpreadsheetBench accuracy over 60 training steps.

Domain-Spreadsheet

Domain-Spreadsheet is a domain-specific benchmark covering finance, supply chain management, human resources, sales, and real estate. It emphasizes professional analytical workflows such as comparable-company analysis, value-at-risk computation, inventory analysis, compensation benchmarking, and property valuation.

The released Hugging Face dataset contains parser-specific parquet files and a workbook archive with ExcelForum training tasks, SpreadsheetBench tasks, SpreadsheetBench-Verified tasks, and Domain-Spreadsheet tasks.

**Figure 3.** Example finance workflow from Domain-Spreadsheet: a task about monitoring collateral under credit support annexes.

Resources

Paper: arXiv:2605.22642

Code: Training, tools, reward service, and verl integration

Dataset: Parquet splits and workbook files

Model: Spreadsheet-RL-4B checkpoint

HF Daily Papers: Paper discussion and upvote page

Citation

BibTeX

@misc{chi2026spreadsheetrl,
  title         = {Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning},
  author        = {Banghao Chi and Yining Xie and Mingyuan Wu and Jingcheng Yang and Jize Jiang and Zhaoheng Li and Shengyi Qian and Minjia Zhang and Klara Nahrstedt and Rui Hou and Xiangjun Fan and Hanchao Yu},
  year          = {2026},
  eprint        = {2605.22642},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  doi           = {10.48550/arXiv.2605.22642},
  url           = {https://arxiv.org/abs/2605.22642}
}