OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models

Tengjin Weng1,2 Wenhao Jiang2* Jingyi Wang4 Ming Li2 Lin Ma5 Zhong Ming1,2,3*
1College of Computer Science and Software Engineering, Shenzhen University
2Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
3Shenzhen Technology University
4Tsinghua Shenzhen International Graduate School
5Meituan
CVPR 2026

What is OddGridBench?

OddGridBench is a controllable benchmark designed to evaluate the fine-grained visual discrepancy sensitivity of MLLMs. Each instance contains a grid of visually similar icons where one element differs from the others in attributes such as color, size, rotation, or position.

Example tasks in OddGridBench. OddGridBench evaluates visual discrepancy perception across four primary visual attributes: color, size, rotation, and position. It also includes multi-attribute discrepancy compositions (2-Type, 3-Type, and 4-Type), enabling systematic evaluation of fine-grained visual discrimination ability in MLLMs.
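Since each OddGridBench instance is a controllable grid with exactly one deviating element, its construction can be sketched in a few lines. The generator below is a minimal illustration under assumed parameter values (icon attributes, grid size, perturbation magnitudes); it is not the authors' released pipeline.

```python
import random

def make_grid_instance(rows=4, cols=4, attribute="rotation", seed=None):
    """Build a rows x cols grid of identical icons, then perturb exactly
    one cell along the chosen attribute so it becomes the odd one out."""
    rng = random.Random(seed)
    base = {"color": "#3366cc", "size": 1.0,
            "rotation": 0.0, "position_offset": 0.0}
    grid = [[dict(base) for _ in range(cols)] for _ in range(rows)]
    r, c = rng.randrange(rows), rng.randrange(cols)
    # Controlled single-attribute discrepancy (values are illustrative).
    perturb = {"color": ("color", "#cc3366"),
               "size": ("size", 1.2),                 # 20% larger
               "rotation": ("rotation", 15.0),        # degrees
               "position": ("position_offset", 0.1)}  # fraction of a cell
    key, value = perturb[attribute]
    grid[r][c][key] = value
    return grid, (r, c)

grid, odd = make_grid_instance(attribute="rotation", seed=42)
```

Multi-attribute compositions (2-Type, 3-Type, 4-Type) would apply several such perturbations to the same odd cell.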


Abstract

MLLMs have demonstrated strong capabilities in many vision-language tasks. However, their ability to detect fine-grained visual discrepancies remains underexplored. We introduce OddGridBench, a controllable benchmark for evaluating visual discrepancy sensitivity in MLLMs. The benchmark contains over 1,400 grid-based images where a single element differs from others in attributes such as color, size, rotation, or position. Experiments reveal that even state-of-the-art MLLMs perform significantly below human performance in detecting subtle visual discrepancies. To address this limitation, we further propose OddGrid-GRPO, a reinforcement learning framework integrating curriculum learning and distance-aware reward design to improve perceptual discrimination.

Data Processing

OddGridBench is constructed in a controlled manner: each of its 1,400+ grid images contains visually similar icons in which a single element deviates along one of four attributes (color, size, rotation, or position) or along a multi-attribute composition (2-Type, 3-Type, or 4-Type). This controllable design makes the type of every discrepancy explicit for each instance.

Methodology

To address the perceptual gap revealed by OddGridBench, we propose OddGrid-GRPO, a reinforcement learning framework that combines curriculum learning with a distance-aware reward design to improve the perceptual discrimination ability of MLLMs.
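The exact form of OddGrid-GRPO's distance-aware reward is not spelled out on this page, so the following is a hypothetical sketch of the idea: full credit when the model names the odd cell exactly, and partial credit that decays with the grid distance between the predicted and true cells.

```python
def distance_aware_reward(pred, target, rows=4, cols=4):
    """Hypothetical distance-aware reward: 1.0 for the exact cell,
    scaled partial credit decaying with Chebyshev distance otherwise.
    The paper's actual reward design may differ."""
    if pred == target:
        return 1.0
    d = max(abs(pred[0] - target[0]), abs(pred[1] - target[1]))
    max_d = max(rows, cols) - 1
    return max(0.0, 1.0 - d / max_d) * 0.5  # cap partial credit at 0.5
```

A graded reward like this gives the policy a learning signal even on near misses, which a binary correct/incorrect reward would not.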

Results

We evaluate 19 representative MLLMs across open-source and proprietary model families. Results show that current models exhibit limited sensitivity to subtle visual discrepancies, especially under rotation and positional perturbations. Even the strongest models remain far below human performance, highlighting a fundamental gap between machine perception and human visual sensitivity.
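Since results are broken down by discrepancy type (e.g. rotation vs. position), the evaluation reduces to per-attribute accuracy. A minimal sketch, with hypothetical record fields:

```python
from collections import defaultdict

def accuracy_by_attribute(records):
    """records: iterable of dicts with keys 'attribute', 'pred', 'target'
    (field names are assumptions). Returns accuracy per attribute."""
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        total[rec["attribute"]] += 1
        correct[rec["attribute"]] += int(rec["pred"] == rec["target"])
    return {attr: correct[attr] / total[attr] for attr in total}
```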

BibTeX


      @inproceedings{weng2025visnumbench,
        title={VisNumBench: Evaluating Number Sense of Multimodal Large Language Models},
        author={Tengjin Weng and Wenhao Jiang and Jingyi Wang and Zhong Ming},
        booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
        year={2025}
      }