2025 Challenge Track Description
Track 1: Multi-Camera 3D Perception
Challenge Track 1 uses synthetic data of animated people in multiple indoor settings, generated with the NVIDIA Omniverse Platform. The 2025 edition expands last year's corpus as follows:
| Split | Hours | Cameras | Scenes | Resolution / FPS | Objects in Training GT* | File Size |
|-------|-------|---------|--------|------------------|-------------------------|-----------|
| Train + Val | ≈42 h | 504 | 19 indoor layouts (warehouse, hospital, retail, office) | 1080p @ 30 fps | 363 instances (292 persons + service robots / forklifts) | 74 GB (optional depth maps: >3 TB) |
* Counts refer to the training + validation ground truth only; the test split's ground truth remains hidden until the evaluation system is released.
Each scene provides temporally synchronized RGB video, camera calibration, a top-down map, and per-frame 2D/3D annotations. Depth maps (PNG-in-HDF5) are included but very large; feel free to ignore them if storage or I/O is a concern.
- Task
Teams must detect every object and maintain a consistent identity for it as it moves within and across all cameras in a scene.
- Submission Format
For compatibility with the official evaluation server, results must be a single plain-text file (track1.txt) where each line describes one detection:
〈scene_id〉 〈class_id〉 〈object_id〉 〈frame_id〉 〈x〉 〈y〉 〈z〉 〈width〉 〈length〉 〈height〉 〈yaw〉
| Field | Type | Description |
|-------|------|-------------|
| scene_id | int | Unique identifier for each multi-camera sequence. |
| class_id | int | Zero-based object category: Person→0, Forklift→1, NovaCarter→2, Transporter→3, FourierGR1T2→4, AgilityDigit→5. |
| object_id | int | Positive ID, unique within each scene (no two objects in a scene share an ID, even across classes; see the example below). Remains constant across all cameras in the scene. |
| frame_id | int | Zero-based frame index within the scene. |
| x, y, z | float | 3D coordinates of the bounding-box centroid in the world coordinate system, in meters. |
| width, length, height | float | Box dimensions in meters along the x (width), y (length), and z (height) axes of the object-centered coordinate system, whose origin is at the centroid. |
| yaw | float | Euler angle in radians about the z-axis of the object-centered coordinate system, defining the box's heading in the world coordinate system (pitch and roll are assumed zero). |
Example: in scene 0, if a Person is assigned object_id = 5, a Forklift in the same scene cannot also use object_id = 5; it must take a different ID, e.g. 6.
Archive the text file as track1.zip or track1.tar.gz before uploading.
- Evaluation
Scores are computed with 3D HOTA [1], which jointly balances detection, association, and localization quality. HOTA is computed per class within each scene and averaged over classes; a weighted average is then taken across scenes, weighted by each scene's total number of objects. 3D IoU is used to match ground-truth and predicted objects. (A sketch of this aggregation follows this list.)
- Leaderboard = raw HOTA on the hidden test set.
- Online-tracker bonus: if your paper and code demonstrate that only past frames are used, a +10 % multiplicative bonus is applied when deciding the final winner and runner-up (the public leaderboard itself shows the un-bonused score).
Example: Team A (offline) = 69 % HOTA; Team B (online) = 64 % ⇒ bonus → 70.4 % HOTA. Team B ranks higher in the final award list.
- Data Access
Note: Depth files are huge (> 3 TB). If bandwidth or disk is limited, download only the other files.
By downloading you agree to the Physical AI Smart Spaces license (CC-BY 4.0).
References
[1] J. Luiten et al., “HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking,” IJCV, 2021.