RoomSpace

Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning


1University of Leeds, 2Alan Turing Institute


Example scenes in the RoomSpace benchmark.

🔔News

🔥[2024-05-20]: Our evaluation server for the test set is now available on EvalAI. We welcome all submissions and look forward to your participation! 😆

Introduction

We introduce RoomSpace: a new benchmark designed to evaluate language models on spatial reasoning tasks demanding spatial relation knowledge and multi-hop reasoning. RoomSpace encompasses a comprehensive range of qualitative spatial relationships, including topological, directional, and distance relations. These relationships are presented from various viewpoints, with differing levels of granularity and density of relational constraints to simulate real-world complexities. This approach promotes a more accurate assessment of language models' capabilities in spatial reasoning tasks.
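As an informal illustration of the multi-hop reasoning the benchmark targets, qualitative spatial relations can be viewed as a graph of triples over which conclusions are chained, e.g. if the sofa is left of the table and the table is left of the lamp, then the sofa is left of the lamp. The sketch below is purely illustrative and does not reflect RoomSpace's actual data format; all object names and the `infer_left_of` helper are hypothetical.

```python
# Illustrative sketch (not RoomSpace's data format): directional relations
# as triples, with multi-hop inference by chaining "left-of" facts.

def infer_left_of(facts, a, b):
    """Return True if `a left-of b` follows by transitivity from the facts."""
    frontier = [a]
    seen = set()
    while frontier:  # graph search over the left-of relation
        x = frontier.pop()
        if x == b:
            return True
        if x in seen:
            continue
        seen.add(x)
        frontier.extend(y for (s, r, y) in facts if s == x and r == "left-of")
    return False

facts = [
    ("sofa", "left-of", "table"),
    ("table", "left-of", "lamp"),
    ("lamp", "left-of", "shelf"),
]
```

Answering "Is the sofa left of the shelf?" here requires composing three stated relations, which is the kind of multi-hop step the benchmark probes at increasing depths.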

RoomSpace Benchmark

Overview


Statistics

Experiment Results

Leaderboard

We evaluate various GPT models. Our evaluation is conducted in a zero-shot setting to assess each model's ability to generate accurate answers without fine-tuning or few-shot demonstrations on our benchmark. For all models, we use the story and question directly as the prompt for Yes/No QA.
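The zero-shot setup above can be sketched as follows: concatenate the story and question into a single prompt, then map the model's free-text reply to a Yes/No answer. The function names, prompt wording, and answer-parsing rule are assumptions for illustration, not the benchmark's exact template.

```python
# Hypothetical sketch of the zero-shot Yes/No QA protocol described above.
# Prompt template and parsing heuristic are illustrative assumptions.

def build_prompt(story: str, question: str) -> str:
    """Join story and question into one zero-shot prompt."""
    return f"{story}\nQuestion: {question}\nAnswer with Yes or No."

def parse_answer(reply: str) -> str:
    """Map a model's free-text reply to 'Yes', 'No', or 'Unknown'."""
    head = reply.strip().lower()
    if head.startswith("yes"):
        return "Yes"
    if head.startswith("no"):
        return "No"
    return "Unknown"

story = "The sofa is to the left of the table. The lamp is on the table."
question = "Is the sofa to the left of the table?"
prompt = build_prompt(story, question)
```

The model's reply would then be scored by comparing `parse_answer(reply)` against the gold Yes/No label.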

Model                 | n=3  | n=4  | n=5  | n=6
----------------------|------|------|------|-----
Azure GPT-4*          | 0.56 | 0.34 | 0.21 | 0.16
Azure GPT-3.5-Turbo*  | 0.47 | 0.15 | 0.05 | 0.00
Azure GPT-3 Dalle*    | 0.46 | 0.25 | 0.08 | 0.09

Overall results of different models on the RoomSpace test set. The best-performing model in each category is in bold, and the second best is underlined. *: results provided by the authors.

BibTeX


@article{li2024reframing,
  title={Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning},
  author={Li, Fangjun and Hogg, David C. and Cohn, Anthony G.},
  journal={arXiv preprint arXiv:2405.15064},
  year={2024}
}