2019 Data and Evaluation – AI CITY CHALLENGE

Data and Evaluation Method

IMPORTANT NOTICE: Evaluation system is now open. See ‘Challenge’->’Evaluation System’ for instructions or click here.

Data Sets

We are excited to share that we have a unique data set along with redaction of faces and license plates for this year’s Challenge. Data comes from multiple traffic cameras from a city in the United States, as well as from state highways in Iowa. Specifically, we have time-synchronized video feeds from several traffic cameras spanning major travel arteries of the city. The vantage point of these cameras is for traffic and transportation purposes.

Urban Intersection and Highway Data – Nearly 3 hours of synchronized videos synchronously captured from multiple vantage points at various urban intersections and along highways. Videos are 960p or better, and most have been captured at 10 frames per second.
Iowa State University Data – More than 25 hours of video data captured on highways in Iowa.
Metadata about the collected videos, including GPS locations of cameras, camera calibration information and other derived data from videos.

Download Links

Track 1: City-Scale Multi-Camera Vehicle Tracking (Size: 16.2 GB)

Track1-download

Track 2: City-Scale Multi-Camera Vehicle Re-Identification (Size: 1.7 GB)

Track2-download

Track 3: Traffic Anomaly Detection (Size: 11.5 GB)

Track3-download

Evaluation and Submission

For each of the three challenge tasks, a different dataset will be provided as a set of videos or images. Associated numeric video IDs for each track are obtained by sorting track videos (or the name of the folders in which they are stored) in alphanumeric order, which numbering starting at 1. All the pixel coordinates are 0-based for all the 3 tracks.

Frame Extraction

Submissions for some tracks will require frame IDs for frames that contain information of interest. In order to ensure frame IDs are consistent across teams, we suggest that all teams use the FFmpeg library (https://www.ffmpeg.org/) to extract/count frames.

Submission Policy

Detailed submission policy will be updated soon.

Track 1: City-Scale Multi-Camera Vehicle Tracking

The dataset contains 3.25 hours (195.03 minutes) of videos collected from 40 cameras spanning 10 intersections in a mid-sized U.S. city. The distance between the two furthest simultaneous cameras is 2.5 km. The dataset covers a diverse set of location types, including intersections, stretches of roadways, and highways. The dataset is divided into 5 scenarios. Only 3 of the scenarios are used for training, and the remaining 2 are used for testing. The length of the training videos is 58.43 minutes, while testing videos are 136.60 minutes in length. In total, the dataset contains 229,680 bounding boxes for 666 distinct annotated vehicle identities. Only vehicles passing through at least 2 cameras have been annotated. The resolution of each video is at least 960p and the majority of the videos have a frame rate of 10 FPS. Additionally, in each scenario, the offset from the start time is available for each video, which can be used for synchronization. Please refer to the ReadMe.txt file for more details.

Task

Teams should detect and track targets across multiple cameras. Baseline detection and single-camera tracking results are provided, but teams are also allowed to use their own methods.

Submission Format

One text file should be submitted containing, on each line, details of a detected and tracked vehicle, in the following format. Values are space-delimited.

<camera_id> <obj_id> <frame_id> <xmin> <ymin> <width> <height> <xworld> <yworld>

<camera_id> is the camera numeric identifier, between 1 and 40.
<obj_id> is a numeric identifier for each object. It should be a positive integer and consistent for each object identity across multiple cameras.
<frame_id> represents the frame count for the current frame in the current video, starting with 1.
The axis-aligned rectangular bounding box of the detected object is denoted by its pixel-valued coordinates within the image canvas, <xmin> <ymin> <width> <height>, computed from the top-left corner of the image. All values are integers.
<xworld> <yworld> are the GPS coordinates of the projected bottom points of each object. They are not currently used in the evaluation but may be used in the future. As such, it would be beneficial to include them if possible.

The text file containing all predictions should be named track1.txt and can be archived using Zip (track1.zip) or tar+gz (track1.tar.gz) to reduce upload time.

Evaluation

For MTMC tracking, the IDF1 score [1] will be used to rank the performance of each team. IDF1 measures the ratio of correctly identified detections over the average number of ground-truth and computed detections. The evaluation tool provided with our dataset also computes other evaluation measures adopted by the MOTChallenge [2], [3], such as Multiple Object Tracking Accuracy (MOTA), Multiple Object Tracking Precision (MOTP), mostly tracked targets (MT), and false alarm rate (FAR). However, they will NOT be used for ranking purposes. The measures that will be displayed in the evaluation system are IDF1,IDP, IDR, Precision (detection) and Recall (detection).

Track 2: City-Scale Multi-Camera Vehicle Re-Identification

The dataset contains 56,277 images, where 36,935 of them come from 333 object identities form the training set and 18,290 from the other 333 identities in the test set. An additional 1,052 images are used as queries. On average, each vehicle has 84.50 image signatures from 4.55 camera views. Please refer to the ReadMe.txt file for more details.

Task

Teams should find the image(s) in the test set that are from the same identity as the objects in each query image. The training set may be exploited for supervised learning.

Submission Format

One text file should be submitted containing, on each line, a list of the top 100 matches from the test set for each query object, in ascending order of their distance to the query. The delimiter is space. Each match should be represented as the ID of the test image, which is an integer between 1 and 18,290. An example submission is given below, where ID_q,k denotes the test ID for the k’th match of the q’th query.

ID_1,1 ID_1,2 … ID_1,100

ID_2,1 ID_2,2 … ID_2,100

…

ID_1052,1 ID_1052,2 … ID_1052,100

The text file containing all predictions should be named track2.txt and can be archived using Zip (track2.zip) or tar+gz (track2.tar.gz) to reduce upload time.

Evaluation

The metric used to rank the performance of each team will be the mean Average Precision (mAP) [4] of the top-K matches, which measures the mean of average precision (the area under the Precision-Recall curve) over all the queries. In our case, K=100. Our evaluation server may also provide other measures, such as the rank-1, rank-5 and rank-10 hit rates, which measure the percentage of the queries that have at least one true positive result ranked within the top 1, 5 or 10 positions, respectively.

Track 3: Traffic Anomaly Detection

The dataset contains 100 training and 100 test videos, each approximately 15 minutes in length, recorded at 30 fps and 800×410 resolution. Anomalies can be due to car crashes or stalled vehicles. Please note that regular congestion not caused by any traffic incident does not count as an anomaly. The “train-anomaly-results.txt” file in the dataset contains the anomalies in the training videos present in “train-data” folder. The schema is in the following format. Values are space-delimited, without headers.

<video_id> <start timestamp> <end timestamp>

<video_id> is the video numeric identifier, starting with 1. It represents the position of the video in the list of all track videos, sorted in alphanumeric order.
<start timestamp> is the anomaly start time, in seconds, from the start of the video.
<end timestamp> is the anomaly end time, in seconds, from the start of the video.

For example, a line with “2 587 894” means that the 2.mp4 video in the “train-data” folder contains an anomaly with the start timestamp 587, referring to 587.0 seconds from the onset of the video, i.e. 9 minutes and 47 seconds into the video. Similarly, the anomaly end time is 894.0 seconds, i.e., 14 minutes and 54 seconds into the video.

Task

Teams should identify all anomalies present in all 100 test set videos.

Submission Format

One text file should be submitted containing, on each line, details of a detected anomaly, in the following format. Values are space-delimited.

<video_id> <timestamp> <confidence>

<video_id> is the video numeric identifier, starting with 1. It represents the position of the video in the list of all track videos, sorted in alphanumeric order.
<timestamp> is the relative time, in seconds, from the start of the video, denoted as a float (e.g., 12.3456).
<confidence> denotes the confidence of the prediction.

At most 100 anomalies can be included in the submission. The text file containing all predictions should be named track3.txt and can be archived using Zip (track3.zip) or tar+gz (track3.tar.gz) to reduce upload time.

Evaluation

Evaluation for track 3 will be based on model anomaly detection performance, measured by the F1-score, and detection time error, measured by RMSE. Specifically, the track 3 score will be computed as

where is the F1-score and is the normalized root mean square error (RMSE). The score ranges between 0 and 1, and higher scores are better.

For the purpose of computing the F1-score, a true-positive (TP) detection will be considered as the predicted anomaly within 10 seconds of the true anomaly (i.e., seconds before or after) that has the highest confidence score. Each predicted anomaly will only be a TP for one true anomaly. A false-positive (FP) is a predicted anomaly that is not a TP for some anomaly. Finally, a false-negative (FN) is a true anomaly that was not predicted.

We compute the detection time error as the RMSE of the ground truth anomaly time and predicted anomaly time for all TP predictions. In order to eliminate jitter during submissions, normalization will be done using min-max normalization with a minimum value of 0 and a maximum value of 300, which represents a reasonable range of RMSE values for the task. Teams with RMSE greater than 300 will receive an of 1, and thus an score of 0.

Additional Datasets

Participants are free to use any other datasets and models they wish in the challenge.

References

[1] E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. ECCVW, pages 17–35, 2016.

[2] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. Imageand Video Processing, 2008.

[3] Y. Li, C. Huang, and R. Nevatia. Learning to associate: Hybrid boosted multi-target tracker for crowded scene. CVPR, pages 2953–2960, 2009.

[4] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. ICCV, pages 1116–1124, 2015.