FAQs

General

1. We would like to participate. What do we need to do?

Please fill out this participation intent form to list your institution, your team and the tracks you will participate in. You just need to follow the instructions and submit the form.

2. I am interested only in submitting a paper but not in the Challenge. Can I do that?

Yes. Please make sure to submit your paper by the submission deadline.

3. How large can a team be?

There are no restrictions on team size.

4. What are the rules for downloading the data set?

A participation agreement is available ahead of the data being shared. You need to accept that agreement and submit that response ahead of getting access to the dataset.

5. Can I use any available data set to train models for detecting vehicles in this Challenge?

Yes. There is no constraint in terms of model and approach used for performing your tasks. You are free to use whatever method you deem best.

6. What are the prizes?

This information is shared in the Awards section.

7. Will we need to submit our code?

Winning teams will need to submit their code for verification purposes so that organizers can ensure that the tasks were performed by algorithms and not humans.

8. How will the submissions be evaluated?

The submission formats for each track are detailed on the Data and Evaluation page.

9. When is the deadline to submit our final evaluation results?

Evaluation results are due on May 10 at 9 AM Pacific. Please see the updated timeline. The submission system will be opened again after a few days and will allow teams to submit additional results, but these results will not be considered when choosing winning teams for the challenge.

10. How long should submitted research/challenge papers be?

Both research and challenge papers should be 6-8 pages in length and follow the CVPR format.

11. When is the deadline of CVPRW paper submission for review? We only see the deadline of camera-ready paper submission on the webpage. But there is no deadline given for paper submission for review.

The paper submission deadline for review is May 16. Because papers will be reviewed in a very short time frame, the submitted version should be as close to camera-ready as possible.

12. Are we allowed to use our own annotated data or training data from other datasets in this challenge?

Yes. Teams are encouraged to leverage the state-of-the-art in domain transfer to improve their performance. But please be mindful that the winning teams and runners-up will be requested to make their training code and inference code open-source, for the purpose of validation, like all the previous AI City Challenges. They also need to clearly state the composition of their training set. The organizing committee needs to ensure that there is no manual annotation on the test data of this challenge, and that all the experimental results can be reproduced in an automated manner.

13. Are we allowed to label (part of) the testing data for transfer learning? Or can we treat the testing set(s) as unlabeled data for semi-supervised learning?
 
Additional annotation on our testing data is strictly prohibited. We also do not encourage the use of testing data in any way during training, with or without labels, because the task is supposed to be evaluated fairly, as in real life, where we don’t have access to testing data at all. Finally, please keep in mind that, like all the previous AI City Challenges, all the winning teams and runners-up will be requested to make their code open-source for validation purposes, and the training data used needs to be clearly stated to confirm that their performance is reproducible. That is why the date of determination of winners is later than the challenge submission deadline.
 

Track 1

1. In some scenarios, some misalignment of synchronization is observed even after adding the time offset. Why does it happen?

Note that due to noise in video transmission, which is common in real deployed systems, some frames are skipped within some videos, so they are not perfectly aligned.

2. What is the format of the baseline segmentation results by Mask R-CNN? How can they be decoded?

Each line of the segmentation results corresponds to the output of detection in “train(test)///det/det_mask_rcnn.txt.”

To generate the segmentation results, we adopt the implementation of Mask R-CNN within Detectron: https://github.com/facebookresearch/Detectron

Each segmentation mask is the representation after processing vis_utils.convert_from_cls_format() within detectron/utils/vis.py. It can be visualized/displayed using other functions in the vis_utils.

3. How can we use the file ‘calibration.txt’? It is a matrix from GPS to 2D image pixel location and there are some tools about image-to-world projections in amilan-motchallenge-devkit/utils/camera, but how can we use the code correctly?

Many of you expressed concerns about the inferior projection accuracy of our provided calibration baseline for Track 1. It is mainly caused by the low precision of the calibration parameters (up to 7 digits after a decimal point). We have updated our calibration tool to enable maximum possible precision in output (up to 15 digits after a decimal point). The updated calibration results, configuration parameters and visualization are all available here. Details are described in ReadMe.txt.

The tool we used for calibration is also publicly available here. We mainly rely on the OpenCV libraries for homography operations. For projection from GPS to 2D pixel location, you may simply apply matrix multiplication with our provided homography matrix. For the back projection from 2D pixel location to GPS, first calculate the inverse of the homography matrix (using invert() in OpenCV), and then apply matrix multiplication. The image-to-world projection methods in “amilan-motchallenge-devkit/utils/camera” can also be helpful in a similar way. Please also feel free to use any other calibration technique of your choice to generate your own calibration results.
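The projection and back projection described above can be sketched in plain Python as follows. This is only an illustration: the matrix values are made up and are not real calibration data, the helper names apply_homography and invert_3x3 are hypothetical, and the order of the two GPS coordinates should be taken from ReadMe.txt rather than from this sketch.

```python
# Illustrative sketch of homography projection and back projection.
# H_EXAMPLE is a made-up matrix, NOT one of the provided calibration results.

def apply_homography(h, x, y):
    """Map point (x, y) through the 3x3 matrix h using homogeneous coordinates."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w  # divide by the homogeneous scale


def invert_3x3(m):
    """Invert a 3x3 matrix with the adjugate (cofactor) formula."""
    a, b, c = m[0]
    d, e, f = m[1]
    g, h, i = m[2]
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-15:
        raise ValueError("matrix is singular and cannot be inverted")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


H_EXAMPLE = [
    [2.0, 0.5, 100.0],
    [0.3, 3.0, 50.0],
    [0.001, 0.002, 1.0],
]

# Forward: world (GPS-like) coordinates to pixel coordinates.
pixel = apply_homography(H_EXAMPLE, 1.5, 2.5)
# Backward: pixel coordinates to world coordinates via the inverse matrix.
world = apply_homography(invert_3x3(H_EXAMPLE), pixel[0], pixel[1])
print(pixel, world)
```

With real calibration files, keep the parameters in double precision throughout, since GPS coordinates differ only in their last digits over short distances.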

In the updated calibration results, the intrinsic parameter matrices and distortion coefficients are provided for fish-eye cameras (one in the training set and the other in the test set). Note that although the GPS positions are represented as angular values instead of coordinates on a flat plane, the longest distance between two cameras is very small (3 km) compared to the circumference of the earth (40,075 km), so they can still be safely viewed as a linear coordinate system.

4. Without the intrinsic camera parameters, how can we correct radial distortion for fisheye cameras?

There are many simple ways to approximate intrinsic camera parameters. For example, the focal length can be chosen as the frame width in pixels. The principal point can be assumed to be at the frame center. The aspect ratio and skew can be set as 1 and 0, respectively. For radial distortion correction, you may apply cv::undistort() from the OpenCV library, with the approximate intrinsic camera matrix and provided distortion coefficients. Last but not least, keep in mind that the camera parameters are not provided with the videos, so all the given homography matrices and distortion coefficients are manually derived. Feel free to use your own chosen methods to improve camera calibration if necessary.
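As a rough sketch of the approximation above, the intrinsic matrix can be assembled directly from the frame size. The function names approx_intrinsics and undistort_frame are hypothetical; the second one assumes that the OpenCV Python bindings (cv2) and NumPy are installed and that the frame is a NumPy image array.

```python
def approx_intrinsics(frame_width, frame_height):
    """Approximate 3x3 intrinsic matrix as nested lists.

    Focal length = frame width (pixels), principal point = frame center,
    aspect ratio = 1 (fx == fy), skew = 0.
    """
    focal = float(frame_width)
    cx = frame_width / 2.0
    cy = frame_height / 2.0
    return [
        [focal, 0.0, cx],
        [0.0, focal, cy],
        [0.0, 0.0, 1.0],
    ]


def undistort_frame(frame, dist_coeffs):
    """Undistort one frame using the approximate intrinsics.

    Imports are local so that approx_intrinsics works without OpenCV.
    dist_coeffs is the list of provided distortion coefficients.
    """
    import cv2
    import numpy as np

    height, width = frame.shape[:2]
    camera_matrix = np.array(approx_intrinsics(width, height), dtype=np.float64)
    coeffs = np.array(dist_coeffs, dtype=np.float64)
    return cv2.undistort(frame, camera_matrix, coeffs)


print(approx_intrinsics(1920, 1080))
```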

5. Do we need to consider the condition that a car appears in multiple scenarios? 

In Track 1, there is no need to consider vehicles appearing across scenarios, so that the provided camera geometry can be utilized for cross-camera tracking. But in Track 2, all the IDs across cameras are mixed in the training set and the test set, which is a different problem to solve.

6. How are the car IDs used for evaluation? In the training data there are ~200 IDs. But when working on the test set, the tracker may generate arbitrary car IDs. Do they need to be consistent with the ground truths? Is the evaluation based on the IOU of the tracks?

We use the same metrics as MOTChallenge for the evaluation. Please refer to the evaluation tool in the package for more details. The IDs in the submitted results do not need to match the exact IDs in the ground truths. We will use bipartite matching for their comparison, which will be based on IOU of bounding boxes.
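For reference, the IOU that such matching relies on can be computed as in the sketch below. It assumes boxes are given as (left, top, width, height) in pixels, which is an assumption of this example; the evaluation tool in the package remains the authoritative implementation of the matching itself.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (left, top, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


# The predicted ID plays no role here: only box overlap is compared.
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))
```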

7. We observed some cases that the ground truths and annotated bounding boxes are not accurate. What is the standard of labeling?

Vehicles were NOT labeled when: (1) They did not travel across multiple cameras; (2) They overlapped with other vehicles and were removed by NMS (only vehicles in the front are annotated); (3) They were too small in the FOV (bounding box area smaller than 1,000 pixels); (4) They were cropped at the edge of the frames with less than 2/3 of the vehicle body visible. Additionally, the bounding boxes were usually annotated larger than normal to ensure full coverage of each entire vehicle, so that attributes like vehicle color, type and pose can be reliably extracted to improve re-identification. More specifically, the width and height of each bounding box were both extended by about 20 pixels from the center.

Track 2

1. The ReadMe file in the Track 2 data shows that 333 vehicles are used for training. But the vehicle IDs in the train_label files range from 1 to 478. Why do the vehicle IDs differ between the ReadMe file and the train_label files?

The ranges of the training IDs are: 1-95 & 241-478

The ranges of the testing IDs are: 96-240 & 479-666

Though the maximum ID in the training set is 478, there are actually only 333 IDs.
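The count can be checked directly from the ID ranges listed above; the short sketch below just sums the sizes of the inclusive ranges (the helper name count_ids is made up for this example).

```python
# Inclusive ID ranges as stated in the answer above.
TRAIN_RANGES = [(1, 95), (241, 478)]
TEST_RANGES = [(96, 240), (479, 666)]


def count_ids(ranges):
    """Number of IDs covered by a list of inclusive (low, high) ranges."""
    return sum(high - low + 1 for low, high in ranges)


print(count_ids(TRAIN_RANGES))  # 95 + 238 = 333
print(count_ids(TEST_RANGES))   # 145 + 188 = 333
```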

2. Does the order of the k matches for each query matter in the submission file?

YES. To achieve a higher accuracy score, your algorithm must be able to rank more true positives at the top.
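To see why order matters, consider a rank-sensitive measure such as average precision. The sketch below is an assumption for illustration only: the exact metric is defined on the Data and Evaluation page, and this simplified version normalizes by the number of true matches found in the list rather than by all true matches in the gallery.

```python
def average_precision(relevance):
    """Average precision of one ranked list.

    relevance is a list of 0/1 flags, one per returned match, in ranked order
    (1 = true match for the query, 0 = false match).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(relevance, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


# Same matches, different order: true positives first scores higher.
print(average_precision([1, 1, 0, 0]))
print(average_precision([0, 0, 1, 1]))
```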

3. Why are the camera IDs not available for the test set?

We didn’t release the camera IDs for the test set, because we want to make Track 1 (MTMC tracking) and Track 2 (image-based re-identification) different from each other. You may leverage camera information in the task of Track 1. Nonetheless, we provided the track/trajectory information, so it is still possible to perform video-based re-identification.

Track 3

1. In the instructions of the submission format there is an ambiguity concerning the value . Does it refer to the time at which an anomaly starts (is detected) or the time at which the anomaly score is the highest? Moreover, what happens with the duration of an anomaly? Are we interested in this information during evaluation process? Is this incorporated in a way in the aforementioned timestamp value?

Teams should indicate only the starting point of the anomaly, e.g., when the first vehicle hit another vehicle or ran off the road. The duration of the anomaly does not need to be reported.

2. Concerning videos with multiple anomalies, if a second anomaly occurs while the first anomaly is still in progress should we identify it as a new anomaly?

No, only one anomaly should be reported. We have clarified the evaluation page to reflect this. In particular, if a second anomaly happens within 2 minutes of the first, it should be counted as the same anomaly as the first (i.e., a multi-car pileup is treated as one accident).
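One possible way to apply this rule to your own detections is sketched below. It assumes a list of candidate start times (in seconds) for a single video and drops any candidate that starts within 120 seconds of the last reported anomaly; the function name merge_anomalies is hypothetical.

```python
def merge_anomalies(start_times, window=120.0):
    """Collapse candidate anomaly start times for ONE video.

    A candidate within `window` seconds of the last reported start time is
    treated as part of that anomaly and is not reported again.
    """
    reported = []
    for start in sorted(start_times):
        if not reported or start - reported[-1] > window:
            reported.append(start)
    return reported


# A second event 50 s after the first is merged; one 300 s later is kept.
print(merge_anomalies([100.0, 150.0, 400.0]))
```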

3. Finally in the submission file, should we send only the anomalies detected or the top 100 scores, concerning the most possible abnormal events or could it be less/more?

You should not submit more than 100 predicted anomalies. The evaluation strategy is designed to penalize false positives. As such you should only submit N high confidence results, where N <= 100.
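A simple way to enforce this limit is to sort your predictions by confidence and keep at most 100. In the sketch below, each prediction is assumed to be a (video_id, timestamp, confidence) tuple, and the optional min_confidence threshold is a hypothetical parameter you would tune yourself.

```python
def select_top_predictions(predictions, limit=100, min_confidence=0.0):
    """Keep at most `limit` predictions, highest confidence first.

    predictions: list of (video_id, timestamp_seconds, confidence) tuples.
    """
    confident = [p for p in predictions if p[2] >= min_confidence]
    ranked = sorted(confident, key=lambda p: p[2], reverse=True)
    return ranked[:limit]


# 150 dummy predictions with confidence i / 150 for i = 0..149.
candidates = [(1, float(i), i / 150.0) for i in range(150)]
selected = select_top_predictions(candidates)
print(len(selected))
```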

4. For the “timestamp”, what kind of format should we follow? In your evaluation page, you mention “ is the relative time, in seconds, from the start of the video (e.g., 12.3456)”. But in your training data, you give “2 587 894”. Which one is correct? “587” vs “12.3456”?

The timestamp 587 refers to 587.0 seconds from the onset of the video, i.e. 9 minutes and 47 seconds into the video. If you look at video 2, you will see a car stopping on the side of the road around 9 minutes and 47 seconds into the video. As specified in the evaluation page, the submission file should contain the timestamp as float.
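The conversion between seconds and minutes can be checked with a few lines of Python. The helper name and the four-decimal output format below are only an illustration; the evaluation page specifies the required format.

```python
def to_minutes_seconds(seconds):
    """Split a time in seconds into (whole minutes, remaining seconds)."""
    minutes, remainder = divmod(seconds, 60)
    return int(minutes), remainder


minutes, remainder = to_minutes_seconds(587)
print(minutes, remainder)  # 9 47

# Writing the timestamp as a float, e.g. with four decimals.
print(f"{587:.4f}")  # 587.0000
```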

5. If our confidence scores are all binary values, such as 0 or 1 for each frame, how do you handle this case?

The confidence score should be between 0 and 1. It is not currently used in the evaluation but may be used in the future. As such, it would be beneficial to include confidence scores if possible.

6. Is there a clear definition of an anomaly? If a vehicle suddenly stops on the road, is the start time counted from when the vehicle begins slowing down or from when it stops completely? Also, in training video 11.mp4, I cannot tell where the anomaly is.

There is no exact definition of an anomaly, but basically it refers to anything we don’t expect to happen normally. The most frequent anomalies shown in the training set are stalled vehicles and crashes.

For a stalled vehicle, the anomaly start time is the time when the vehicle comes to a complete stop. For a single vehicle crash or a multiple vehicle crash, the start time is the time instant when the first crash occurs.

In training data 11.mp4, there is a stalled vehicle in the ditch. Please refer to the attached image.

7. Is it possible to provide the GPS locations of the cameras that took the videos in Track 3? It would help to get the calibration done.

The cameras that have been used to record the videos of Track 3 are PTZ type (Pan, Tilt, Zoom). Hence, the camera orientation and zoom can change dynamically. So, the GPS locations of the cameras are not provided.

8. Can you please explain the “10 seconds time window” for true positive detection?
 
For the purpose of computing the F1-score, a true-positive (TP) detection will be considered as the predicted anomaly within 10 seconds of the true anomaly (i.e., up to 10 seconds before or after) that has the highest confidence score. For example, if the groundtruth incident start time and end time in a particular video are 454 seconds and 567 seconds respectively, any submission between 444 seconds and 577 seconds will be considered as a true positive (10 seconds before the groundtruth start time and 10 seconds after the groundtruth end time). However, this does not mean that the maximum RMSE will be 10 seconds.
RMSE does not account for the TP status of a prediction, so a prediction may be 5 minutes or more before or after the true time of the anomaly. If all the predictions are 5 minutes or more away from the true time of the anomaly, the RMSE will be at least 300 and S3 will be 0, irrespective of the F1 score (which would also be 0, as there would be no TP predictions).
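The window test and the effect of a large RMSE can be sketched as follows. The window check follows the 454/567 example above. The formula S3 = F1 x (1 - NRMSE), with NRMSE taken as RMSE capped at 300 seconds and divided by 300, is an assumption of this sketch that is consistent with S3 becoming 0 at an RMSE of 300; please rely on the evaluation page for the authoritative definition.

```python
import math


def is_true_positive(pred_time, gt_start, gt_end, window=10.0):
    """True if the prediction falls between gt_start - window and gt_end + window."""
    return gt_start - window <= pred_time <= gt_end + window


def rmse(errors):
    """Root mean square of a non-empty list of time errors in seconds."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def s3_score(f1, rmse_value, max_rmse=300.0):
    """ASSUMED form of the score: F1 * (1 - NRMSE), NRMSE = capped RMSE / 300."""
    nrmse = min(rmse_value, max_rmse) / max_rmse
    return f1 * (1.0 - nrmse)


print(is_true_positive(444.0, 454.0, 567.0))  # inside the window
print(is_true_positive(440.0, 454.0, 567.0))  # 14 s too early
print(s3_score(0.8, rmse([300.0, 320.0])))    # RMSE above 300 gives 0
```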