2025 Challenge Track Description

Track 3: Warehouse Spatial Intelligence

Understanding fine-grained spatial relationships in industrial environments remains a critical frontier for artificial intelligence. While AI systems have achieved impressive performance in image-based recognition and short-form spatial tasks, they often struggle to reason about 3D object layouts, dimensions, and spatial relations in real-world, logistics-scale environments.

Previous work in visual question answering and 3D scene understanding has focused primarily on synthetic datasets or domestic scenes, leaving a substantial gap in industrial settings such as warehouses and logistics hubs. These complex environments feature diverse objects, dynamic layouts, and safety-critical conditions that require precise spatial reasoning beyond basic object detection or segmentation.

The lack of datasets and benchmarks targeting physical spatial intelligence in operational contexts has slowed progress in deploying AI for industrial automation, safety monitoring, and inventory management. By focusing on warehouse-scale 3D scene understanding through natural language questions, this challenge aims to bridge that gap — advancing AI systems capable of integrating visual perception, geometric reasoning, and language comprehension in practical, high-impact domains.  

    • Task 

In this challenge, participants are tasked with developing solutions capable of answering spatial reasoning questions such as “What are the dimensions of X?”, “What is the distance between X and Y?”, as well as more complex queries related to safety and logistics within a warehouse environment.

Solutions may fall into, but are not limited to, the following categories:

      • Agentic workflows comprising multiple models and API call functions
      • A 3D vision-language model (3D-VLM) designed to answer spatial reasoning questions
      • An exhaustive scene-graph generation pipeline followed by LLM-based probing (a toy sketch of this option appears below)
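
As a rough illustration of the scene-graph option, the sketch below serializes a scene graph into a prompt for a text-only LLM. All function names and data shapes here are hypothetical placeholders, not a provided API:

import json

def build_scene_graph(detections):
    """Turn 3D detections into a JSON-serializable scene graph.

    `detections` is assumed to be a list of dicts carrying an object id,
    a class label, a 3D center (x, y, z) in meters, and box dimensions.
    """
    nodes = [
        {"id": d["id"], "class": d["class"],
         "center_m": d["center"], "dims_m": d["dims"]}
        for d in detections
    ]
    return {"objects": nodes}

def probe_llm(scene_graph, question, llm_call):
    """Ask any text LLM about the serialized graph.

    `llm_call` is a hypothetical callable: prompt string -> answer string.
    """
    prompt = (
        "You are given a warehouse scene graph (units: meters).\n"
        f"{json.dumps(scene_graph)}\n"
        f"Question: {question}\n"
        "Answer concisely with a number or label."
    )
    return llm_call(prompt)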

Participants are required to submit the textual outputs generated by their workflows in response to a predefined set of questions, which will be used for evaluation. Additionally, participants are encouraged to incorporate supplementary data or enhance annotations through auto-labeling workflows at their discretion.

    • Data 

[Dataset is still being cleaned up. We will share detailed data description soon]

    • Submission Format 

Participants must submit a single JSON file, predictions.json, for the test split. Example entries:

[
    {
        "question_id": "000123",
        "prediction": "1.2 meters"
    },
    {
        "question_id": "ab23dm",
        "prediction": "4"
    },
    …
]

Upload the file to the challenge server; scores are computed automatically, and the leaderboard will update afterwards.
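
For reference, here is a minimal sketch of producing this file, assuming predictions have been collected into a Python dict keyed by question ID (the dict contents below are illustrative only):

import json

# Hypothetical container: maps each test question_id to the model's
# textual prediction.
answers = {"000123": "1.2 meters", "ab23dm": "4"}

records = [
    {"question_id": qid, "prediction": pred}
    for qid, pred in answers.items()
]

with open("predictions.json", "w") as f:
    json.dump(records, f, indent=4)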

    • Evaluation 

The primary leaderboard metric is the weighted average success rate across all questions. Each question is counted as success = 1 if the prediction meets the Acc@25 or class-correct criterion described below; otherwise, success = 0.

Question categories along with examples and primary metrics are explained below:

| Category | Typical Query Example | Primary Metric (counts toward success) | Secondary Metric (for detailed analysis) |
| --- | --- | --- | --- |
| Distance | “How far is pallet A1 from the forklift B1?” | Acc@25 (prediction within ±25% of GT) | Relative error (%) |
| Dimension | “What is the area size of the free space in buffer region D?” | Acc@25 | Relative error (%) |
| Count | “How many pallets are in region D?” | Acc@25 | Relative error (%) |
| Yes-No | “Is box A1 aligned with the guide pallet D2?” | Accuracy | N/A |
| Multiple-Choice-Grounding | “Which buffer zone is closest to robot X1?” | Accuracy | N/A |
| Spatial Relation | “Is box E left of box F?” | Accuracy | N/A |

Fine-grained analysis:

For quantitative categories (distance, dimension, count), relative error statistics will also be reported.
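
For local sanity checks, the quantitative criteria could be computed as sketched below; this is only a sketch, and the official scoring script may differ in edge-case handling:

def acc_at_25(pred: float, gt: float) -> bool:
    """Success if the prediction is within ±25% of the ground truth."""
    return abs(pred - gt) <= 0.25 * abs(gt)

def relative_error(pred: float, gt: float) -> float:
    """Relative error in percent, reported for fine-grained analysis."""
    return 100.0 * abs(pred - gt) / abs(gt)

# Example: a 1.1 m prediction against a 1.2 m ground-truth distance
# passes Acc@25 with a relative error of about 8.3%.
assert acc_at_25(1.1, 1.2)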

Answer normalization:

To handle variability in response formats, a GPT-based judge will automatically normalize predictions to canonical units, standardize entity ordering, and apply the evaluation thresholds. Any semantically correct answer (e.g., “1.2”, “~1.22 meters”, “distance is four feet”) will be credited appropriately. Participants do not need to manually align answer formats.
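
Participants who still want a rough local check can canonicalize simple “<number> <unit>” answers themselves. The sketch below is illustrative only: the unit table and regex are assumptions, not part of the official judge, and spelled-out numbers such as “four feet” are exactly the cases the GPT-based judge exists to handle:

import re

# Rough local normalization only; the official GPT-based judge runs
# server-side. The conversion table and pattern below are assumptions.
UNIT_TO_METERS = {"m": 1.0, "meter": 1.0, "meters": 1.0,
                  "cm": 0.01, "ft": 0.3048, "foot": 0.3048, "feet": 0.3048}

def to_meters(text):
    """Extract the first '<number> <unit>' pair and convert to meters."""
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*([a-zA-Z]+)", text)
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2).lower()
    factor = UNIT_TO_METERS.get(unit)
    return value * factor if factor is not None else None

print(to_meters("~1.22 meters"))  # 1.22
print(to_meters("4 feet"))        # 1.2192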

    • Data Access

[Dataset will be hosted on HuggingFace. We will share the details on data access soon]