2025 FAQs

General

1. We would like to participate. What do we need to do?

This time the data access information is shared on each track’s description page under CHALLENGE tab. Please find the instructions in there.

2. I am interested only in submitting a paper but not in the Challenge. Can I do that?

Yes. Please make sure to submit your paper by the submission deadline.

3. How large can a team be?

There are no restrictions on team size.

4. Are we allowed to use other external data/pre-trained models?

External dataset or pre-trained models are allowed only if they are public. Teams that wish to be listed in the public leader board and win the challenge awards are NOT allowed to use any private data or private pre-trained model for either training or validation. The winning teams and runners-up are required to submit their training and testing codes for verification after the challenge submission deadline in order to ensure that no private data or private pre-trained model was used for training and the tasks were performed by algorithms and not humans.

5. What are the prizes?

This information is shared in the Awards section.

6. Will we need to submit our code?

Teams need to make there code publicly accessible to be considered for winning (including complete/reproducible pipeline for mode training/creation). This is to ensure that no private data is used for training and the tasks were performed by algorithms and not humans and contribute to the community.

7. How will the submissions be evaluated?

The submission formats for each track are detailed on each track’s description page under CHALLENGE tab.

8. Are we allowed to use validation sets in training?

The validation sets are allowed to be used in training.

9. Are we allowed to use test sets in training?

Additional manual annotations on our testing data are strictly prohibited. We also do not encourage the use of testing data in any way during training, with or without labels, because the task is supposed to be fairly evaluated in real life where we don’t have access to testing data at all. Although it is permitted to perform algorithms like clustering to automatically generate pseudo labels on the testing data, we will choose a winning method without using such techniques when multiple teams have similar performance (~1%). Finally, please keep in mind that, like all the previous editions of the AI City Challenge, all the winning methods and runners-up will be requested to submit their code for verification purposes. Their performance needs to be reproducible using the training/validation/synthetic data only.

10. Are we allowed to use data/pre-trained models from the previous edition(s) of the AI City Challenge?

Data from previous edition(s) of the AI City Challenge are allowed to be used.

11. Do the winning teams and runners-up need to submit papers and present at the workshop?

All the winning teams and runners-up have to submit papers, register and present at the workshop, in order to be qualified for winning.

Track 1 – Multi-Camera 3D Perception

1. Is calibration available for each camera?

The comprehensive camera calibration information is available for each camera, including 3-by-4 camera matrix, intrinsic parameters, extrinsic parameters, etc.

2. What is the standard of labeling visible 2D bounding boxes??

The annotations of the test set are generated based on the same standards as the training and validation set.

- For occluded objects (objects that are blocked by an object within the camera frame), objects must satisfy both the visibility in height and width requirements.
- For objects that are truncated – objects that are cut off via the camera frame, the objects must satisfy EITHER of the conditions in visibility for height OR the visibility for width.
- Here are the definitions for visibility in height and width:
  - Visibility for height
    - If the head is visible and 20% of the height is visible then, label the object.
    - If the head is not visible, then label the object if 60% of the height is visible.
  - Visibility for width
    - More than 60% body width visible should be labeled.

3. How are the object IDs used for evaluation? Do the submitted IDs need to be consistent with the ground truths?

We use the HOTA metric for evaluation. The IDs in the submitted results do not need to match the exact IDs in the ground truths. We will use bipartite matching for their comparison, which will be based on IoU of 3D bounding boxes in the global coordinate system.

Track 2 – Traffic Safety Description and Analysis

1. Is it compulsory for participants to use a generative model to generate captions from the videos? Or we can use the training caption as a ground truth and we can apply a retrieval model to retrieve the closet caption from the training database to submit it on the test set?

The text should describe the new scenario, which may not be found in the training set. Specifically this is not a retrieval task and we are seeking for generative solutions. Teams may submit results to evaluation system and rank on the leaderboard with any method. But we will be manually evaluating award contenders and teams only using retrieval method will be disqualified from winning the awards. For example, considering a method which uses features extracted from the test set videos and retrieves the “closest-meaning” caption from the training set for submission, this will not be qualified for winning the track since it is not a generative solution.

Track 3 – Warehouse Spatial Intelligent

[We will add frequently asked questions with answers here for this new track]

Track 4 – Road Object Detection in Fish-Eye Cameras

1. Is calibration available for each camera?

There is no calibration information for all cameras involved in the train and test sets.

2. Does the evaluation time for FPS include the time taken for loading the model, loading images, and performing pre-processing tasks?

Model loading is not included in the FPS calculation; however, preprocessing for individual images is included.

3. Is inference using batch processing allowed, or should all images be processed individually?

Batch processing is not permitted; images must be processed individually to simulate real-time application.

4. Is it permissible to initiate another parallel process for loading images while the inference process is running?

To have a straightforward evaluation, parallel processing of any kind is prohibited; all operations must be performed sequentially for all 1000 images. Please refer to question #13 below for the updated pseudocode which provides a detailed overview of the evaluation process.

5. Can we prepare image format before go to inference phase or the converting must be calculated in the inference?

It is not allowed to convert the size and the format of the images. Input format should be the test set as given.

6. Is the F1 score, as calculated by the harmonic mean formula, normalized to a range of 0 to 1?

Yes, the F1 score is a normalized, ranging from 0 to 1. The evaluation script has been provided to the teams, and you can see how the F1 score is computed. You can also use that script to compute the score on the validation set.

Please note that, during the challenge, the evaluation system will show a partial evaluation result on a 50% subset of the test set. At the end of the challenge, the evaluation score on the full test set will be revealed. This is to ensure that your model is able to generalize beyond a specific test set.

7. Could you please clarify if participants are required to use a specific image loading library, such as cv2.imread, or are we allowed to develop our own image loading process or use an alternative library such as PIL.Image.open as long as the loading is done sequentially, image-by-image, and within the evaluation loop?

It is allowed to choose or develop the library for reading the image from the FishEye1Keval folder.

8. Is there a template available for creating a Docker file?

Teams competing for the challenge prizes must upload their Docker images to Docker Hub. We have provided the instructions and two Docker file templates here for:

– Measuring FPS on the Jetson AGX Orin (64GB)

– Training the model(s)

9. Can you share reference FPS results for any model run on the Jetson AGX Orin?

Using the Docker image template with the YOLOv11n model on the Jetson AGX Orin (64GB), the console output is shown as:

Processed 1000 images in 113.19 seconds.
--- Evaluation Complete ---
Total inference time: 113.19 seconds
FPS: 8.83
Normalized FPS: 0.3534

10. Could you share the exact versions (or version ranges) of JetPack, Operating System, Docker for the competition environment?

We will use the following versions on the Jetson AGX Orin (64GB):

– JetPack: nvidia-jetpack (6.1)

– Operating System: Ubuntu 22.04.5 LTS

– Docker version 27.5.1

11. Could you please provide the average per-image time spent on the following processing steps for the different YOLOv11 variants (e.g., YOLOv11n, YOLOv11s, YOLOv11m): Image Loading Preprocessing, Inference, Postprocessing, and Result Generation (e.g., JSON output)?

Note: the evaluation rule has changed please refer to QA#13 and #17 for updated information. This QA is kept here for record.

We have updated the run_evaluation_jetson.py to provide more detailed breakdowns for each process. Please see the console results for YOLOv11n and YOLOv11s below:

YOLO11n
Processed 1000 images in 114.73 seconds.
Avg Image Load Time : 80.67 ms
Avg Inference Time : 33.37 ms
Avg Postprocessing Time : 0.68 ms
Avg Total Time : 114.71 ms
--- Evaluation Complete ---
Total inference time: 114.73 seconds
FPS: 8.72
Normalized FPS: 0.3487

YOLO11s
Processed 1000 images in 122.77 seconds.
Avg Image Load Time : 80.12 ms
Avg Inference Time : 42.12 ms
Avg Postprocessing Time : 0.52 ms
Avg Total Time : 122.76 ms
--- Evaluation Complete ---
Total inference time: 122.77 seconds
FPS: 8.15
Normalized FPS: 0.3258

12. Could you specify which Jetson power mode (nvpmodel setting) was used during your timing and evaluation? Can we set the clock frequencies of the CPU and GPU to a maximum by running: sudo jetson_clocks?

30W power mode is used and the clock frequencies are set at maximum.

13. Since YOLOv11n achieves approximately 8 FPS, meeting the 10 FPS threshold may be difficult even with the smallest baseline model. Could adjustments be made to the evaluation process to address this challenge?

We have made the last update to the evaluation process to exclude image loading time from the FPS (frames per second) calculation, focusing only on the time spent on preprocessing, inference, and postprocessing. This change was made because image loading time is substantially longer than processing time in the targeted FPS evaluation scenario. Please review the revised pseudocode:

BEGIN
SET sum_time = 0 
SET results = []
SET max_fps = 25
FOR each image IN image_folder (1000 images)
    LOAD image (must use OpenCV imread() function)
    SET timer_start = CURRENT_TIME
    PREPROCESS image
    PERFORM_INFERENCE on image
    POSTPROCESS inference_result
    SAVE result
    SET timer_end = CURRENT_TIME
    SET elapsed_time = timer_end - timer_start
    CALCULATE sum_time = sum_time + elapsed_time
END_FOR
CALCULATE fps = 1000 / sum_time
CALCULATE norm_fps = min(fps, max_fps) /max_fps
CALCULATE f1-score based on the result
CALCULATE Metric (harmonic mean of norm_fps and f1-score)
DISPLAY (all metrics)
END

14. During the inference process, can we predict each image using multiple models running in parallel, meaning several models running at the same time to predict a single image?

Parallel inferencing is permitted, enabling multiple models to process the same image simultaneously to speed up the process.

15. If I have a library that takes an image .png and exports it to tensor, Does it count to preprocessing time?

To align with the updated FPS calculation and ensure consistency across teams, only the OpenCV imread() function should be used for image loading in the chosen programming language.

16. Are we allowed to utilize publicly available resources from last year’s winning teams? Additionally, is it permissible to pseudo-label the FishEye1K images or apply data augmentation for training purposes?

Using public resources is permitted. However, manually labeling the test set is not allowed. Following the post-challenge deadline, we will assess the top 5-10 teams using an unpublished, in-house test set to determine the final rankings. This test set will fully represent the FishEye8K and FishEye1K datasets. Thus, excessive finetuning on FishEye1Keval may lead to a lower performance on the in-house test set.

17. Could you please provide a detailed timing breakdown (per image) of the evaluation runs for different YOLOv11 variants (e.g., YOLOv11n, YOLOv11s, YOLOv11m) under MAXN mode, which enables all CPU and GPU cores at maximum frequency using: sudo nvpmodel -m 0?

Experiments are executed in 30W mode with maximized clock frequencies. Certain YOLOv11 variants cannot be converted to TensorRT, so we only list the compatible ones below:

YOLOv11n

Avg Image Preprocess Time: 0.00 ms
Avg Inference Time : 13.12 ms
Avg Postprocess Time: 0.60 ms
Avg Processing Time: 13.734 ms

--- Evaluation Complete ---

Total time: 76.36 seconds
Total processing time: 13.73 seconds
FPS: 72.85
Normalized FPS: 1.000

YOLOv11s

Avg Image Preprocess Time: 0.00 ms
Avg Inference Time : 15.68 ms
Avg Postprocess Time: 0.36 ms
Avg Processing Time: 16.04 ms

--- Evaluation Complete ---

Total time: 78.71 seconds
Total processing time: 16.04 seconds
FPS: 62.33
Normalized FPS: 1.000

Below are FPS measurements for YOLOv11 variants not converted to TensorRT:    YOLOv11x    -     18fps,
    YOLOv11l    -      22.8fps,
    YOLOv11m   -     30.43fps,
    YOLOv11s   -     35.65fps,
    YOLOv11n    -     36.21fps

18. Are there specific deadlines, guidelines, and instructions for Docker submission?

The Docker container submission deadline is July 18th, 11:59 PM Anywhere on Earth. Teams should check the workshop website, particularly the FAQs, for detailed guidelines. The evaluation process for teams in the top 5-10 on the public leaderboard is as follows:

1. Verify submission by the July 18th, midnight AoE deadline; late submissions will be excluded.
2. Exclude submissions that teams report as non-compliant with challenge rules, e.g., workshop papers rejected.
3. Confirm the model achieves an FPS greater than 10; teams failing this will be excluded.
4. Train the model(s) on FishEye1Keval to verify the leaderboard F1-score is reproduced.
5. Evaluate the model on the in-house test set for final ranking, based on the harmonic mean of F1-score and normalized FPS.
6. If there are fewer than two winners, we will further examine the teams beyond top 10 teams.