VisNumBench: Evaluating Number Sense of Multimodal Large Language Models

1Shenzhen University 
2Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) 
3Shenzhen Technology University 
4Tsinghua Shenzhen International Graduate School 

What is VisNumBench?

VisNumBench comprises seven number sense tasks derived from both synthetic and real-world scenarios. While these tasks can be solved efficiently with human numerical intuition, they pose significant challenges for current multimodal large language models (MLLMs).


Example tasks in VisNumBench. VisNumBench comprises seven visual numerical attributes—angle perception, scale perception, length perception, quantity perception, depth perception, area perception, and volume perception—along with four visual numerical estimation tasks: value comparison, value estimation, range estimation, and multiplicative estimation. These tasks are designed to comprehensively assess the number sense abilities of MLLMs.

VisNumBench -- Characteristics

  • VisNumBench is a benchmark for evaluating the number sense capabilities of MLLMs.

Number sense here refers to how humans intuitively perceive and estimate values such as angle, quantity, length, and scale.

  • VisNumBench encompasses seven visual numerical attributes and four visual numerical estimation tasks. While these problems can be solved effectively with human numerical intuition, they exceed the capabilities of current MLLMs (a minimal evaluation sketch follows below).
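
As a rough illustration of how a VisNumBench-style multiple-choice item could be evaluated, the sketch below defines a hypothetical item record and a small accuracy loop. The field names (image, attribute, task, choices, answer) and the predict callback are assumptions made for illustration only; they are not the official data schema or evaluation code released with the benchmark.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical schema for a VisNumBench-style multiple-choice item.
# Field names are assumptions for illustration, not the official format.
@dataclass
class VisNumItem:
    image: str          # path to the synthetic or real-world image
    attribute: str      # e.g. "angle", "length", "quantity", ...
    task: str           # e.g. "value_comparison", "range_estimation", ...
    question: str
    choices: List[str]  # multiple-choice options, e.g. ["A. 30°", "B. 60°", ...]
    answer: str         # ground-truth option label, e.g. "B"

def evaluate(items: List[VisNumItem],
             predict: Callable[[VisNumItem], str]) -> float:
    """Return multiple-choice accuracy of `predict` over the items."""
    correct = sum(predict(item).strip().upper() == item.answer for item in items)
    return correct / len(items)

# Example usage with a trivial baseline that always answers "A".
if __name__ == "__main__":
    demo = [
        VisNumItem("img_001.png", "angle", "value_estimation",
                   "What is the marked angle?",
                   ["A. 30°", "B. 60°", "C. 90°", "D. 120°"], "B"),
    ]
    print(f"accuracy: {evaluate(demo, lambda item: 'A'):.2f}")
```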

Abstract

Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce the Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested—including open-source models such as Qwen2.5-VL and InternVL2.5, as well as proprietary models like GPT-4o and Gemini 2.0 Flash—perform significantly below human level on number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing MLLMs' number sense abilities.

Results

Results of various models on the VisNumBench dataset. The first table presents accuracy in the synthetic scenario, while the second table reports accuracy in the real-world scenario.

The average accuracy of open-source MLLMs with 3B to 13B parameters ranged from approximately 28% to 42% across both synthetic and real-world scenarios. As model size increased to 38B, 72B, and 78B parameters, the average accuracy improved to around 55%. The performance of proprietary models also remained suboptimal: only Gemini 2.0 Flash achieved a relatively high accuracy of 56%, while other models, such as GPT-4o, attained an average accuracy of merely 40%.
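
The averages quoted above aggregate accuracy over the individual attribute tasks within each scenario. The sketch below shows one simple way such an aggregate could be computed; the unweighted mean is an assumption, and the numbers are placeholders rather than results from the paper.

```python
# Minimal sketch of aggregating per-attribute accuracies into an overall score.
# The unweighted mean is an assumption; the values below are placeholders,
# not VisNumBench results.
def average_accuracy(per_attribute_acc: dict) -> float:
    return sum(per_attribute_acc.values()) / len(per_attribute_acc)

synthetic = {"angle": 0.40, "length": 0.35, "quantity": 0.45}   # placeholder values
real_world = {"angle": 0.38, "length": 0.33, "quantity": 0.42}  # placeholder values

print(f"synthetic avg:  {average_accuracy(synthetic):.2%}")
print(f"real-world avg: {average_accuracy(real_world):.2%}")
```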

BibTeX


        @article{weng2025visnumbench,
          title={VisNumBench: Evaluating Number Sense of Multimodal Large Language Models},
          author={Tengjin Weng and Wenhao Jiang and Jingyi Wang and Zhong Ming},
          journal={arXiv preprint arXiv:2503.14939},
          year={2025}
        }