2025 FAQs
General
1. We would like to participate. What do we need to do?
The data access information for this edition is shared on each track's description page under the CHALLENGE tab. Please find the instructions there.
2. I am interested only in submitting a paper but not in the Challenge. Can I do that?
Yes. Please make sure to submit your paper by the submission deadline.
3. How large can a team be?
There are no restrictions on team size.
4. Are we allowed to use other external data/pre-trained models?
External datasets or pre-trained models are allowed only if they are public. Teams that wish to be listed on the public leaderboard and win the challenge awards are NOT allowed to use any private data or private pre-trained models for either training or validation. The winning teams and runners-up are required to submit their training and testing code for verification after the challenge submission deadline, in order to ensure that no private data or private pre-trained models were used for training and that the tasks were performed by algorithms and not humans.
5. What are the prizes?
This information is shared in the Awards section.
6. Will we need to submit our code?
Teams need to make their code publicly accessible to be considered for winning, including a complete, reproducible pipeline for model training/creation. This ensures that no private data was used for training, that the tasks were performed by algorithms and not humans, and that the work contributes back to the community.
7. How will the submissions be evaluated?
The submission formats and evaluation procedures for each track are detailed on each track's description page under the CHALLENGE tab.
8. Are we allowed to use validation sets in training?
Yes, the validation sets are allowed to be used in training.
9. Are we allowed to use test sets in training?
Additional manual annotations on our testing data are strictly prohibited. We also discourage the use of testing data in any way during training, with or without labels, because the task is meant to be evaluated fairly, as in real life where we have no access to the testing data at all. Although it is permitted to run algorithms such as clustering to automatically generate pseudo labels on the testing data, when multiple teams have similar performance (within ~1%) we will choose a winning method that does not use such techniques. Finally, please keep in mind that, as in all previous editions of the AI City Challenge, all winning methods and runners-up will be requested to submit their code for verification purposes. Their performance needs to be reproducible using the training/validation/synthetic data only.
10. Are we allowed to use data/pre-trained models from the previous edition(s) of the AI City Challenge?
Data from previous edition(s) of the AI City Challenge are allowed to be used.
11. Do the winning teams and runners-up need to submit papers and present at the workshop?
Track 1 – Multi-Camera 3D Perception
1. Is calibration available for each camera?
Yes. Comprehensive camera calibration information is available for each camera, including the 3-by-4 camera matrix, intrinsic parameters, extrinsic parameters, etc.
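For reference, here is a minimal sketch (not part of the official toolkit) of how a 3-by-4 camera matrix built from intrinsic and extrinsic parameters projects a world point into pixel coordinates; all numeric values below are placeholders, not taken from the challenge calibration files.

import numpy as np

# Hypothetical calibration values (placeholders only).
K = np.array([[1000.0,    0.0, 960.0],
              [   0.0, 1000.0, 540.0],
              [   0.0,    0.0,   1.0]])       # intrinsic parameters
R = np.eye(3)                                  # extrinsic rotation
t = np.array([[0.0], [0.0], [5.0]])            # extrinsic translation
P = K @ np.hstack([R, t])                      # 3-by-4 camera matrix

X_world = np.array([1.0, 2.0, 10.0, 1.0])      # homogeneous world point
x = P @ X_world                                # homogeneous image point
u, v = x[0] / x[2], x[1] / x[2]                # pixel coordinates
print(f"Projected pixel: ({u:.1f}, {v:.1f})")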
2. What is the standard for labeling visible 2D bounding boxes?
The annotations of the test set are generated based on the same standards as the training and validation set.
- For occluded objects (objects that are blocked by another object within the camera frame), the object must satisfy both the visibility-in-height and visibility-in-width requirements.
- For truncated objects (objects that are cut off by the camera frame), the object must satisfy EITHER the visibility-in-height condition OR the visibility-in-width condition.
- Here are the definitions of visibility in height and width (see the sketch after this list):
  - Visibility for height
    - If the head is visible, label the object when at least 20% of the height is visible.
    - If the head is not visible, label the object when at least 60% of the height is visible.
  - Visibility for width
    - Label the object when more than 60% of the body width is visible.
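A minimal sketch of how these rules combine, assuming the visible height/width fractions and head visibility are known for each object; the function name and signature are illustrative only.

def should_label(truncated: bool, head_visible: bool,
                 height_visible_frac: float, width_visible_frac: float) -> bool:
    # Hypothetical helper encoding the visibility rules above.
    # Visibility for height: 20% threshold if the head is visible, otherwise 60%.
    height_ok = height_visible_frac >= (0.2 if head_visible else 0.6)
    # Visibility for width: more than 60% of the body width must be visible.
    width_ok = width_visible_frac > 0.6
    # Truncated objects need EITHER condition; occluded objects need BOTH.
    return (height_ok or width_ok) if truncated else (height_ok and width_ok)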
3. How are the object IDs used for evaluation? Do the submitted IDs need to be consistent with the ground truths?
We use the HOTA metric for evaluation. The IDs in the submitted results do not need to match the exact IDs in the ground truths. We will use bipartite matching for their comparison, which will be based on IoU of 3D bounding boxes in the global coordinate system.
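For intuition only, here is a minimal sketch of bipartite (Hungarian) matching on a 3D-IoU cost matrix; the axis-aligned IoU, box encoding, and 0.5 threshold are assumptions for illustration, and the official scorer's exact IoU definition may differ (e.g., for oriented boxes).

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_3d_axis_aligned(a, b):
    # Axis-aligned 3D IoU for boxes given as (xmin, ymin, zmin, xmax, ymax, zmax).
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter + 1e-9)

def match_boxes(pred_boxes, gt_boxes, iou_thresh=0.5):
    # Maximize total IoU by minimizing its negation with the Hungarian algorithm.
    cost = np.zeros((len(pred_boxes), len(gt_boxes)))
    for i, p in enumerate(pred_boxes):
        for j, g in enumerate(gt_boxes):
            cost[i, j] = -iou_3d_axis_aligned(p, g)
    rows, cols = linear_sum_assignment(cost)
    # Keep only pairs whose IoU clears the threshold.
    return [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= iou_thresh]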
Track 2 – Traffic Safety Description and Analysis
1. Is it compulsory for participants to use a generative model to generate captions from the videos? Or can we use the training captions as ground truth and apply a retrieval model to retrieve the closest caption from the training database to submit on the test set?
The text should describe the new scenario, which may not be found in the training set. Specifically, this is not a retrieval task, and we are seeking generative solutions. Teams may submit results to the evaluation system and rank on the leaderboard with any method, but we will manually evaluate award contenders, and teams using only retrieval methods will be disqualified from winning the awards. For example, a method that extracts features from the test-set videos and retrieves the “closest-meaning” caption from the training set for submission will not qualify for winning the track, since it is not a generative solution.
Track 3 – Warehouse Spatial Intelligence
[We will add frequently asked questions with answers here for this new track]
Track 4 – Road Object Detection in Fish-Eye Cameras
1. Is calibration available for each camera?
No. There is no calibration information for any of the cameras in the train and test sets.
2. Does the evaluation time for FPS include the time taken for loading the model, loading images, and performing pre-processing tasks?
Model loading is not included in the FPS calculation; however, preprocessing for individual images is included.
3. Is inference using batch processing allowed, or should all images be processed individually?
Batch processing is not permitted; images must be processed individually to simulate real-time application.
4. Is it permissible to initiate another parallel process for loading images while the inference process is running?
To keep the evaluation straightforward, parallel processing of any kind is prohibited; all operations must be performed sequentially across all 1000 images. Please refer to question #13 below for the updated pseudocode, which provides a detailed overview of the evaluation process.
5. Can we convert the image format before the inference phase, or must the conversion be counted as part of inference?
It is not allowed to convert the size or the format of the images beforehand. The input must be the test set images as given.
6. Is the F1 score, as calculated by the harmonic mean formula, normalized to a range of 0 to 1?
7. Could you please clarify if participants are required to use a specific image loading library, such as cv2.imread, or are we allowed to develop our own image loading process or use an alternative library such as PIL.Image.open as long as the loading is done sequentially, image-by-image, and within the evaluation loop?
8. Is there a template available for creating a Docker file?
9. Can you share reference FPS results for any model run on the Jetson AGX Orin?
Using the Docker image template with the YOLOv11n model on the Jetson AGX Orin (64GB), the console output is shown as:
Processed 1000 images in 113.19 seconds.
--- Evaluation Complete ---
Total inference time: 113.19 seconds
FPS: 8.83
Normalized FPS: 0.3534
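For reference, the normalized FPS above follows the normalization in the pseudocode of question #13, which caps FPS at max_fps = 25; the small difference from the printed 0.3534 comes from rounding the displayed FPS:

\[
\text{norm\_fps} = \frac{\min(\text{FPS},\ \text{max\_fps})}{\text{max\_fps}} = \frac{\min(8.83,\ 25)}{25} \approx 0.353
\]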
10. Could you share the exact versions (or version ranges) of JetPack, Operating System, Docker for the competition environment?
11. Could you please provide the average per-image time spent on the following processing steps for the different YOLOv11 variants (e.g., YOLOv11n, YOLOv11s, YOLOv11m): Image Loading, Preprocessing, Inference, Postprocessing, and Result Generation (e.g., JSON output)?
YOLOv11n
Processed 1000 images in 114.73 seconds.
Avg Image Load Time : 80.67 ms
Avg Inference Time : 33.37 ms
Avg Postprocessing Time : 0.68 ms
Avg Total Time : 114.71 ms
--- Evaluation Complete ---
Total inference time: 114.73 seconds
FPS: 8.72
Normalized FPS: 0.3487
YOLOv11s
Processed 1000 images in 122.77 seconds.
Avg Image Load Time : 80.12 ms
Avg Inference Time : 42.12 ms
Avg Postprocessing Time : 0.52 ms
Avg Total Time : 122.76 ms
--- Evaluation Complete ---
Total inference time: 122.77 seconds
FPS: 8.15
Normalized FPS: 0.3258
12. Could you specify which Jetson power mode (nvpmodel setting) was used during your timing and evaluation? Can we set the clock frequencies of the CPU and GPU to a maximum by running: sudo jetson_clocks?
13. Since YOLOv11n achieves approximately 8 FPS, meeting the 10 FPS threshold may be difficult even with the smallest baseline model. Could adjustments be made to the evaluation process to address this challenge?
Yes. In the updated evaluation process below, image loading is excluded from the timed portion of the loop, and the FPS used for scoring is normalized against a 25 FPS cap:
BEGIN
    SET sum_time = 0
    SET results = []
    SET max_fps = 25
    FOR each image IN image_folder (1000 images)
        LOAD image (must use OpenCV imread() function)
        SET timer_start = CURRENT_TIME
        PREPROCESS image
        PERFORM_INFERENCE on image
        POSTPROCESS inference_result
        SAVE result
        SET timer_end = CURRENT_TIME
        SET elapsed_time = timer_end - timer_start
        CALCULATE sum_time = sum_time + elapsed_time
    END_FOR
    CALCULATE fps = 1000 / sum_time
    CALCULATE norm_fps = min(fps, max_fps) / max_fps
    CALCULATE f1_score based on the results
    CALCULATE metric (harmonic mean of norm_fps and f1_score)
    DISPLAY (all metrics)
END
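Below is a minimal Python sketch of the pseudocode above, assuming a model object that exposes preprocess, infer, and postprocess methods and a .png test-set layout; these names are illustrative and this is not the official evaluation harness.

import glob
import time
import cv2

def run_timed_inference(model, image_folder, max_fps=25.0):
    # Sketch of the timing loop above; `model` and its method names are hypothetical.
    image_paths = sorted(glob.glob(f"{image_folder}/*.png"))  # the 1000 test images
    results, sum_time = [], 0.0
    for path in image_paths:
        image = cv2.imread(path)              # image loading (outside the timer)
        t0 = time.perf_counter()              # timer starts after loading
        inp = model.preprocess(image)         # preprocessing is timed
        raw = model.infer(inp)                # inference is timed
        dets = model.postprocess(raw)         # postprocessing is timed
        results.append((path, dets))          # save result
        sum_time += time.perf_counter() - t0
    fps = len(image_paths) / sum_time
    norm_fps = min(fps, max_fps) / max_fps
    return results, fps, norm_fps

def final_metric(f1_score, norm_fps):
    # Harmonic mean of the F1-score and the normalized FPS.
    return 2 * f1_score * norm_fps / (f1_score + norm_fps)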
14. During the inference process, can we predict each image using multiple models running in parallel, meaning several models running at the same time to predict a single image?
15. If I have a library that takes a .png image and exports it to a tensor, does that count toward preprocessing time?
16. Are we allowed to utilize publicly available resources from last year’s winning teams? Additionally, is it permissible to pseudo-label the FishEye1K images or apply data augmentation for training purposes?
17. Could you please provide a detailed timing breakdown (per image) of the evaluation runs for different YOLOv11 variants (e.g., YOLOv11n, YOLOv11s, YOLOv11m) under MAXN mode, which enables all CPU and GPU cores at maximum frequency using: sudo nvpmodel -m 0?
YOLOv11n
Avg Image Preprocess Time: 0.00 ms
Avg Inference Time : 13.12 ms
Avg Postprocess Time: 0.60 ms
Avg Processing Time: 13.734 ms
--- Evaluation Complete ---
Total time: 76.36 seconds
Total processing time: 13.73 seconds
FPS: 72.85
Normalized FPS: 1.000
YOLOv11s
Avg Image Preprocess Time: 0.00 ms
Avg Inference Time : 15.68 ms
Avg Postprocess Time: 0.36 ms
Avg Processing Time: 16.04 ms
--- Evaluation Complete ---
Total time: 78.71 seconds
Total processing time: 16.04 seconds
FPS: 62.33
Normalized FPS: 1.000
Below are FPS measurements for YOLOv11 variants not converted to TensorRT:
- YOLOv11x: 18 fps
- YOLOv11l: 22.8 fps
- YOLOv11m: 30.43 fps
- YOLOv11s: 35.65 fps
- YOLOv11n: 36.21 fps
18. Are there specific deadlines, guidelines, and instructions for Docker submission?
The Docker container submission deadline is July 18th, 11:59 PM Anywhere on Earth. Teams should check the workshop website, particularly the FAQs, for detailed guidelines. The evaluation process for teams in the top 5-10 on the public leaderboard is as follows:
- Verify that the submission was received by the July 18th, 11:59 PM AoE deadline; late submissions will be excluded.
- Exclude submissions reported by teams as non-compliant with the challenge rules, e.g., those whose workshop papers were rejected.
- Confirm the model achieves an FPS greater than 10; teams failing this will be excluded.
- Run the model(s) on the FishEye1K eval set to verify that the leaderboard F1-score is reproduced.
- Evaluate the model on the in-house test set for final ranking, based on the harmonic mean of F1-score and normalized FPS.
- If there are fewer than two winners, we will further examine teams beyond the top 10.