2026 Challenge Track Description

Track 1: Multi-Camera 3D Perception (Sim2Real)

  • Overview

Challenge Track 1 tackles multi-camera 3D perception in large-scale indoor environments, requiring participants to detect and track people and mobile objects, including autonomous mobile robots (AMRs), humanoids, and forklifts, while maintaining consistent identities within and across all cameras in a scene.

Building on the 2025 edition, which introduced over 500 synthetic camera views generated via NVIDIA Omniverse along with 2D/3D bounding boxes, depth maps, and detailed calibration metadata, the 2026 edition advances the benchmark in two key directions. First, the synthetic training corpus is further expanded in scene diversity and annotation fidelity, now generated using the Isaacsim.Replicator.Agent (IRA) and Isaacsim.Replicator.Object (IRO) extensions on the NVIDIA Omniverse platform, covering warehouse, hospital, retail, and office layouts. Second, real-world test sets are introduced to explicitly evaluate Sim2Real generalization, pushing participants beyond purely synthetic benchmarks toward deployable perception systems.

Evaluation continues to use the 3D Higher Order Tracking Accuracy (HOTA) metric, which jointly balances detection, association, and localization quality. Submissions demonstrating online tracking (i.e., relying only on past-frame information) receive a +10% multiplicative bonus when determining the final winner and runner-up.

  • Task

Teams must detect every object and maintain a consistent identity (ID) for it as it moves within and across all cameras in a scene.

  • Submission Format

For compatibility with the official evaluation server, results must be a single plain-text file (track1.txt) where each line describes one detection:

〈scene_id〉 〈class_id〉 〈object_id〉 〈frame_id〉 〈x〉 〈y〉 〈z〉 〈width〉 〈length〉 〈height〉 〈yaw〉

Field                  Type   Description
scene_id               int    Unique identifier for each multi-camera sequence.
class_id               int    Zero-based object category: Person→0, Forklift→1, NovaCarter→2, Transporter→3, FourierGR1T2→4, AgilityDigit→5.
object_id              int    Positive ID, unique within each scene across all classes; remains constant across all cameras in the scene.
frame_id               int    Zero-based frame index within the scene.
x, y, z                float  3D coordinates of the bounding-box centroid, in meters, in the world coordinate system.
width, length, height  float  Box dimensions in meters along the x (width), y (length), and z (height) axes of the object-centered coordinate system, whose origin is at the centroid.
yaw                    float  Euler angle in radians about the y-axis of the object-centered coordinate system, defining the box's heading in the world coordinate system. (Pitch and roll are assumed zero.)

Example: in scene 0, if a Person is assigned object_id = 5, then a Forklift in the same scene cannot use object_id = 5; it must use a different ID (e.g., 6).

Archive the text file as track1.zip or track1.tar.gz before uploading.

Important note on the submission file:

      • All floating-point numbers in the submission file must be rounded to two decimal places.
      • The file size limit for each submission is 50 MB.
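The submission format above can be sketched in Python as follows. This is an illustrative serializer, not part of any official toolkit; the Detection structure and field names are assumptions for demonstration.

```python
# Hypothetical sketch: serialize one detection into the track1.txt line format.
# The Detection dataclass is an assumed structure, not an official API.
from dataclasses import dataclass


@dataclass
class Detection:
    scene_id: int
    class_id: int
    object_id: int
    frame_id: int
    x: float
    y: float
    z: float
    width: float
    length: float
    height: float
    yaw: float


def format_line(d: Detection) -> str:
    # Integer fields first, then all floats rounded to two decimal places,
    # as required by the submission rules.
    ints = f"{d.scene_id} {d.class_id} {d.object_id} {d.frame_id}"
    floats = " ".join(
        f"{v:.2f}" for v in (d.x, d.y, d.z, d.width, d.length, d.height, d.yaw)
    )
    return f"{ints} {floats}"


# Example: a Person (class 0) with object_id 5 at frame 12 of scene 0.
det = Detection(0, 0, 5, 12, 1.25, -3.5, 0.9, 0.62, 1.8, 2.1, 1.57)
line = format_line(det)
print(line)  # 0 0 5 12 1.25 -3.50 0.90 0.62 1.80 2.10 1.57
```

One line per detection would then be written to track1.txt before archiving.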
 
  • Evaluation

Scores are computed with 3D HOTA [1], which jointly balances detection, association, and localization quality. The HOTA score is computed per class within each scene and averaged over classes; a weighted average is then taken across scenes, weighted by each scene's total number of objects. 3D IoU is used to match ground-truth and predicted objects.
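The aggregation described above can be sketched as follows. The per-class HOTA values and object counts are made-up inputs for illustration; the official evaluation server's implementation may differ in detail.

```python
# Illustrative sketch of the two-stage score aggregation:
# (1) average per-class HOTA within each scene,
# (2) weighted average across scenes by total object count.

def aggregate(per_scene_class_hota, scene_object_counts):
    """per_scene_class_hota: {scene_id: {class_id: hota}};
    scene_object_counts: {scene_id: total number of objects in that scene}."""
    # Step 1: per-scene score is the mean over classes present in the scene.
    scene_scores = {
        s: sum(cls.values()) / len(cls)
        for s, cls in per_scene_class_hota.items()
    }
    # Step 2: weight each scene's score by its object count.
    total = sum(scene_object_counts.values())
    return sum(scene_scores[s] * scene_object_counts[s]
               for s in scene_scores) / total


# Made-up example: two scenes, two classes each.
scores = {0: {0: 0.70, 1: 0.60}, 1: {0: 0.50, 1: 0.80}}
counts = {0: 30, 1: 10}
final = aggregate(scores, counts)
print(round(final, 4))  # both scenes average 0.65, so the result is 0.65
```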

      • Leaderboard = raw HOTA on the hidden test set.
      • Online-tracker bonus: if your paper and code demonstrate that only past frames are used, a +10% multiplicative bonus is applied when deciding the final winner and runner-up (the public leaderboard itself shows the un-bonused score).

Example: Team A (offline) scores 66% HOTA; Team B (online) scores 61% HOTA ⇒ with the bonus, 61% × 1.1 = 67.1%. Team B ranks higher in the final award list.
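The ranking rule in the example is a simple multiplier applied at award time. A minimal sketch, assuming raw HOTA in [0, 1]:

```python
# Award-time score: online trackers receive a 1.1x multiplier; the public
# leaderboard still shows the raw (un-bonused) HOTA.

def award_score(raw_hota: float, is_online: bool) -> float:
    return raw_hota * 1.1 if is_online else raw_hota


team_a = award_score(0.66, is_online=False)  # offline: stays 0.66
team_b = award_score(0.61, is_online=True)   # online: 0.61 * 1.1 = 0.671
print(team_b > team_a)  # True: Team B wins despite the lower raw score
```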

  • Data Access

[Details will be provided soon.]

References

[1] J. Luiten et al., “HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking,” IJCV, 2021.