2025 Challenge Track Description

Track 3: Warehouse Spatial Intelligence

Understanding fine-grained spatial relationships in industrial environments remains a critical frontier for artificial intelligence. While AI systems have achieved impressive performance in image-based recognition and short-form spatial tasks, they often struggle to reason about 3D object layouts, dimensions, and spatial relations in real-world, logistics-scale environments.

Previous work in visual question answering and 3D scene understanding has focused primarily on synthetic datasets or domestic scenes, leaving a substantial gap in industrial settings such as warehouses and logistics hubs. These complex environments feature diverse objects, dynamic layouts, and safety-critical conditions that require precise spatial reasoning beyond basic object detection or segmentation.

The lack of datasets and benchmarks targeting physical spatial intelligence in operational contexts has slowed progress in deploying AI for industrial automation, safety monitoring, and inventory management. By focusing on warehouse-scale 3D scene understanding through natural language questions, this challenge aims to bridge that gap — advancing AI systems capable of integrating visual perception, geometric reasoning, and language comprehension in practical, high-impact domains.  

    • Task 

In this challenge, participants are tasked with developing solutions capable of answering spatial reasoning questions such as “What are the dimensions of X?”, “What is the distance between X and Y?”, as well as more complex queries related to safety and logistics within a warehouse environment.

Solutions may fall into, but are not limited to, the following categories:

      • Agentic workflows comprising multiple models and API call functions
      • A 3D vision-language model (3D-VLM) designed to answer spatial reasoning questions
      • An exhaustive scene-graph generation pipeline followed by LLM-based probing (a toy sketch of this option appears below)
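
As a rough illustration of the scene-graph option, the sketch below serializes a scene graph into a prompt for a text-only LLM. All function names and data shapes here are hypothetical placeholders, not a provided API:

import json

def build_scene_graph(detections):
    """Turn 3D detections into a JSON-serializable scene graph.

    `detections` is assumed to be a list of dicts carrying an object id,
    a class label, a 3D center (x, y, z) in meters, and box dimensions.
    """
    nodes = [
        {"id": d["id"], "class": d["class"],
         "center_m": d["center"], "dims_m": d["dims"]}
        for d in detections
    ]
    return {"objects": nodes}

def probe_llm(scene_graph, question, llm_call):
    """Ask any text LLM about the serialized graph.

    `llm_call` is a hypothetical callable: prompt string -> answer string.
    """
    prompt = (
        "You are given a warehouse scene graph (units: meters).\n"
        f"{json.dumps(scene_graph)}\n"
        f"Question: {question}\n"
        "Answer concisely with a number or label."
    )
    return llm_call(prompt)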

Participants are required to submit the textual outputs generated by their workflows in response to a predefined set of questions, which will be used for evaluation. Additionally, participants are encouraged to incorporate supplementary data or enhance annotations through auto-labeling workflows at their discretion.

    • Data 

[Dataset is still being cleaned up. We will share detailed data description soon]

    • Submission Format 

Participants must submit a single JSON file, predictions.json, for the test split. Example entries:

[
    {
        "question_id": "000123",
        "prediction": "1.2 meters"
    },
    {
        "question_id": "ab23dm",
        "prediction": "4"
    },
    …
]

Upload the file to the challenge server; scores are computed automatically, and the leaderboard will update afterwards.
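
For reference, here is a minimal sketch of producing this file, assuming predictions have been collected into a Python dict keyed by question ID (the dict contents below are illustrative only):

import json

# Hypothetical container: maps each test question_id to the model's
# textual prediction.
answers = {"000123": "1.2 meters", "ab23dm": "4"}

records = [
    {"question_id": qid, "prediction": pred}
    for qid, pred in answers.items()
]

with open("predictions.json", "w") as f:
    json.dump(records, f, indent=4)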

    • Evaluation 

The primary leaderboard metric is the weighted average success rate across all questions. Each question is counted as success = 1 if the prediction meets the Acc@25 or class-correct criterion described below; otherwise, success = 0.

Question categories along with examples and primary metrics are explained below:

| Category | Typical Query Example | Primary Metric (counts toward success) | Secondary Metric (for detailed analysis) |
| --- | --- | --- | --- |
| Distance | “How far is pallet A1 from the forklift B1?” | Acc@25 (prediction within ±25% of GT) | Relative error (%) |
| Dimension | “What is the area size of the free space in buffer region D?” | Acc@25 | Relative error (%) |
| Count | “How many pallets are in region D?” | Acc@25 | Relative error (%) |
| Yes-No | “Is box A1 aligned with the guide pallet D2?” | Accuracy | N/A |
| Multiple-Choice-Grounding | “Which buffer zone is closest to robot X1?” | Accuracy | N/A |
| Spatial Relation | “Is box E left of box F?” | Accuracy | N/A |

Fine-grained analysis:

For quantitative categories (distance, dimension, count), relative error statistics will also be reported.
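
For local sanity checks, the quantitative criteria could be computed as sketched below; this is only a sketch, and the official scoring script may differ in edge-case handling:

def acc_at_25(pred: float, gt: float) -> bool:
    """Success if the prediction is within ±25% of the ground truth."""
    return abs(pred - gt) <= 0.25 * abs(gt)

def relative_error(pred: float, gt: float) -> float:
    """Relative error in percent, reported for fine-grained analysis."""
    return 100.0 * abs(pred - gt) / abs(gt)

# Example: a 1.1 m prediction against a 1.2 m ground-truth distance
# passes Acc@25 with a relative error of about 8.3%.
assert acc_at_25(1.1, 1.2)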

Answer normalization:

To handle variability in response formats, a GPT-based judge will automatically normalize predictions to canonical units, standardize entity ordering, and apply the evaluation thresholds. Any semantically correct answer (e.g., “1.2”, “~1.22 meters”, “distance is four feet”) will be credited appropriately. Participants do not need to manually align answer formats.
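
Participants who still want a rough local check can canonicalize simple “<number> <unit>” answers themselves. The sketch below is illustrative only: the unit table and regex are assumptions, not part of the official judge, and spelled-out numbers such as “four feet” are exactly the cases the GPT-based judge exists to handle:

import re

# Rough local normalization only; the official GPT-based judge runs
# server-side. The conversion table and pattern below are assumptions.
UNIT_TO_METERS = {"m": 1.0, "meter": 1.0, "meters": 1.0,
                  "cm": 0.01, "ft": 0.3048, "foot": 0.3048, "feet": 0.3048}

def to_meters(text):
    """Extract the first '<number> <unit>' pair and convert to meters."""
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*([a-zA-Z]+)", text)
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2).lower()
    factor = UNIT_TO_METERS.get(unit)
    return value * factor if factor is not None else None

print(to_meters("~1.22 meters"))  # 1.22
print(to_meters("4 feet"))        # 1.2192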

    • Data Access

[Dataset will be hosted on HuggingFace. We will share the details on data access soon]