2023 Data and Evaluation Method

To participate, please fill out the online AI City Challenge Datasets Request Form.

Data Download Links

Track 1: Multi-Camera People Tracking

Track 2: Tracked-Vehicle Retrieval by Natural Language Descriptions

Track 3: Naturalistic Driving Action Recognition

Track 4: Multi-Class Product Counting & Recognition for Automated Retail Checkout

Track 5: Detecting Violation of Helmet Rule for Motorcyclists

Evaluation and Submission

For each of the five challenge tracks, a separate data set is provided as a set of videos or images. The numeric video IDs for each track are obtained by sorting the track's videos (or the names of the folders in which they are stored) in alphanumeric order, with numbering starting at 1. All pixel coordinates are 0-based for all tracks.
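For illustration only, a minimal Python sketch of this ID assignment, assuming a hypothetical local folder of .mp4 files (the path and file extension are placeholders, not part of the data set):

import os

# Minimal sketch: assign numeric video IDs by sorting file (or folder) names
# in alphanumeric order and numbering from 1. "data_root" and ".mp4" are
# placeholder assumptions for illustration.
def build_video_id_map(data_root):
    names = sorted(n for n in os.listdir(data_root) if n.endswith(".mp4"))
    return {name: idx for idx, name in enumerate(names, start=1)}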

Frame Extraction

Submissions for some tracks require frame IDs for the frames that contain information of interest. To ensure frame IDs are consistent across teams, we suggest that all teams use the FFmpeg library (https://www.ffmpeg.org/) to extract and count frames.
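For example, a minimal sketch of calling FFmpeg from Python to dump all frames of one video (paths are placeholders; teams may of course run ffmpeg directly from the command line):

import os
import subprocess

# Minimal sketch: extract every frame of a video with FFmpeg so that frame
# counts are consistent with FFmpeg's decoding. Paths are placeholders.
def extract_frames(video_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(["ffmpeg", "-i", video_path, f"{out_dir}/%06d.jpg"], check=True)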

Track 1: Multi-Camera People Tracking

Data for Challenge Track 1 come from multiple cameras in multiple settings, including a real warehouse environment as well as synthetically generated indoor scenes. We have built a synthetic animated-people dataset using the NVIDIA Omniverse Platform. These synthetic videos form large-scale training and test sets to be used along with the real-world data for Track 1. All feeds are high-resolution 1080p at 30 frames per second.

    • Task

Teams should detect and track targets across multiple cameras.

    • Submission Format

One text file should be submitted containing, on each line, details of a detected and tracked person, in the following format. Values are space-delimited.

camera_id obj_id frame_id xmin ymin width height xworld yworld

      • camera_id is the camera numeric identifier.
      • obj_id is a numeric identifier for each object. It should be a positive integer and consistent for each object identity across multiple cameras.
      • frame_id represents the frame count for the current frame in the current video, starting with 0.
      • The axis-aligned rectangular bounding box of the detected object is denoted by its pixel-valued coordinates within the image canvas, xmin ymin width height, computed from the top-left corner of the image. All values are integers.
      • xworld yworld are the GPS coordinates of the projected bottom point of each object. They are not currently used in the evaluation but may be used in the future, so it would be beneficial to include them if possible.

The text file containing all predictions should be named track1.txt and can be archived using Zip (track1.zip) or tar+gz (track1.tar.gz) to reduce upload time.
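For illustration, a minimal sketch of writing these space-delimited lines (the in-memory "detections" structure is a hypothetical example, not part of the challenge toolkit):

# Minimal sketch: write one line per tracked detection in the Track 1 format.
def write_track1(detections, path="track1.txt"):
    with open(path, "w") as f:
        for d in detections:
            f.write(
                f"{d['camera_id']} {d['obj_id']} {d['frame_id']} "
                f"{d['xmin']} {d['ymin']} {d['width']} {d['height']} "
                f"{d['xworld']} {d['yworld']}\n"
            )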

    • Evaluation

For MTMC tracking, the IDF1 score [1] will be used to rank the performance of each team. IDF1 measures the ratio of correctly identified detections over the average number of ground-truth and computed detections. The evaluation tool provided with our dataset also computes other evaluation measures adopted by MOTChallenge [2], [3], such as Multiple Object Tracking Accuracy (MOTA), Multiple Object Tracking Precision (MOTP), mostly tracked targets (MT), and false alarm rate (FAR). However, these will NOT be used for ranking purposes. The measures displayed in the evaluation system are IDF1, IDP, IDR, Precision (detection), and Recall (detection).
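As a reference, a minimal sketch of how IDF1, IDP, and IDR relate to the identity true positive (IDTP), false positive (IDFP), and false negative (IDFN) counts defined in [1]; the counts themselves come from the truth-to-result identity matching computed by the evaluation tool:

# Sketch of the identity measures from [1], given IDTP/IDFP/IDFN counts.
def idf1(idtp, idfp, idfn):
    return 2 * idtp / (2 * idtp + idfp + idfn)

def idp(idtp, idfp):
    return idtp / (idtp + idfp)

def idr(idtp, idfn):
    return idtp / (idtp + idfn)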

Track 2: Tracked-Vehicle Retrieval by Natural Language Descriptions

The dataset for this track is built upon the CityFlow Benchmark by annotating vehicles with natural language descriptions (the Track 1 data also needs to be downloaded in order to work on this track). The dataset contains 2,498 vehicle tracks, each with three unique natural language descriptions. 530 unique vehicle tracks, together with 530 query sets of three descriptions each, are curated for this challenge.

The dataset curated for this challenge track consists of three files: train-tracks.json, test-tracks.json, and test-queries.json. Please refer to the README file in the data set for details.

    • Task

Teams should retrieve and rank the provided vehicle tracks for each of the queries. A baseline retrieval model is provided as a starting point for participating teams.

    • Submission Format

One JSON file should be submitted, containing a dictionary in the following format:

{
    "query-uuid-1": ["track-uuid-i", …, "track-uuid-j"],
    "query-uuid-2": ["track-uuid-m", …, "track-uuid-n"]
}

For each query, teams should submit a list of provided testing tracks ranked by the retrieval model.
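A minimal sketch of producing this JSON file, assuming a hypothetical team-supplied rank_tracks function that returns all test track UUIDs ordered by relevance:

import json

# Minimal sketch: map each test query UUID to the full ranked list of test
# track UUIDs. "rank_tracks" is a hypothetical scoring function.
def write_submission(test_queries, test_tracks, rank_tracks, path="submission.json"):
    result = {q_uuid: rank_tracks(query, test_tracks) for q_uuid, query in test_queries.items()}
    with open(path, "w") as f:
        json.dump(result, f)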

    • Evaluation

The Vehicle Retrieval by NL Descriptions task is evaluated using standard metrics for retrieval tasks. We use the Mean Reciprocal Rank (MRR) [4] as the main evaluation metric. Recall @ 5, Recall @ 10, and Recall @ 25 are also evaluated for all submissions.
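For reference, a minimal sketch of these metrics, assuming each query has exactly one ground-truth track and each submitted list ranks all test tracks:

# Sketch of MRR and Recall@K. "ranked" maps query UUID -> ranked track UUIDs;
# "gt" maps query UUID -> the single correct track UUID.
def mrr(ranked, gt):
    return sum(1.0 / (ranked[q].index(gt[q]) + 1) for q in gt) / len(gt)

def recall_at_k(ranked, gt, k):
    return sum(gt[q] in ranked[q][:k] for q in gt) / len(gt)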

Track 3: Naturalistic Driving Action Recognition

Distracted driving is highly dangerous and is reported to kill about 8 people every day in the United States. Today, naturalistic driving studies and computer vision techniques provide a much-needed solution to identify and eliminate distracted driving behavior on the road. In this challenge track, users will be presented with synthetic naturalistic data of the driver collected from three camera locations inside the vehicle (while the driver is pretending to drive). The objective is to identify the start time, end time, and type of distracted-behavior activities executed by the driver in each video. Participating teams may use any one, two, or all three camera views for the classification of driver tasks. Teams will be provided with labeled training data to develop algorithms and should submit functional code that can be executed on a reserved testing data set. The final winner will be determined by performance on this reserved testing data set.

    • Data 

The data set contains 210 video clips (about 34 hours in total) captured from 35 drivers. Each driver performs each of 16 different tasks (such as talking on the phone, eating, and reaching back) once, in random order. Three cameras mounted in the car record from different angles in synchronization. Each driver completes the data collection twice: once with no appearance block and once with some appearance block (e.g., sunglasses, hat). Thus, 6 videos are collected per driver (3 synchronized videos with no appearance block and 3 with some appearance block), resulting in 210 videos in total.

The 34 hours of video (35 drivers in total) in this track are split into three data sets, A1, A2, and B, containing 25, 5, and 5 drivers, respectively. Teams will be provided data set A1 with manually annotated ground-truth labels (start time, end time, and type of distracted behavior) and data set A2 with no labels. Teams can use both A1 and A2 to develop their algorithms and submit results for data set A2 to our online evaluation server to be shown on the public leader board for performance tracking. The public leader board only provides a way for a team to evaluate and improve their systems; its ranking will NOT determine the winners of this track.

Data set B is reserved for later testing. Top performers on the public leader board will be invited to submit functional code for both training and inference. Organizers will test the submitted code against data set B, and the final winner will be determined by performance on data set B. Teams wishing to be considered for evaluation on data set B must also make their training and inference code publicly available.

    • Submission Format 

To be ranked on the public leader board for data set A2, one text file should be submitted to the online evaluation system containing, on each line, details of one identified activity, in the following format (values are space-delimited; a minimal formatting sketch follows the field descriptions below):

〈video_id〉 〈activity_id〉 〈start_time〉 〈end_time〉 

Where: 

      • 〈video_id〉 is the video numeric identifier, starting with 1. Video IDs have been provided in the video_ids.csv file in the data download link.
      • 〈activity_id〉 is the activity numeric identifier, starting with 0. 
      • 〈start_time〉 is the time the identified activity starts, in seconds. The start_time is an integer value, e.g., 127 represents the 127th second of the video, 02:07 into the video. 
      • 〈end_time〉 is the time the identified activity ends, in seconds. The end_time is an integer value, e.g., 263 represents the 263rd second of the video, 04:23 into the video.
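For illustration, a minimal sketch of formatting one submission line and converting the integer second values to mm:ss for a quick sanity check (e.g., 127 -> 02:07):

# Minimal sketch: one space-delimited submission line plus a mm:ss helper.
def activity_line(video_id, activity_id, start_time, end_time):
    return f"{video_id} {activity_id} {start_time} {end_time}"

def to_mmss(seconds):
    return f"{seconds // 60:02d}:{seconds % 60:02d}"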
    • Evaluation 

Evaluation for Track 3 will be based on model activity identification performance, measured by the average activity overlap score, which is defined as follows. Given a ground-truth activity g with start time gs and end time ge, we find its closest predicted match as the predicted activity p of the same class as g with the highest overlap score os, subject to the condition that its start time ps and end time pe lie in the ranges [gs – 10s, gs + 10s] and [ge – 10s, ge + 10s], respectively. The overlap between g and p is defined as the ratio between the time intersection and the time union of the two activities, i.e.,

os(p, g) = |[gs, ge] ∩ [ps, pe]| / |[gs, ge] ∪ [ps, pe]|
After matching each ground truth activity in order of their start times, all unmatched ground truth activities and all unmatched predicted activities will receive an overlap score of 0. The final score is the average overlap score among all matched and unmatched activities.
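For reference, a minimal sketch of the overlap score between one ground-truth activity g = (gs, ge) and one prediction p = (ps, pe); the same-class check, the ±10 s start/end condition, and the matching in order of ground-truth start times described above are assumed to be handled by the surrounding matching code:

# Sketch of the overlap score: time intersection over time union of the
# ground-truth interval (gs, ge) and the predicted interval (ps, pe).
def overlap_score(gs, ge, ps, pe):
    intersection = max(0.0, min(ge, pe) - max(gs, ps))
    union = max(ge, pe) - min(gs, ps)
    return intersection / union if union > 0 else 0.0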

Track 4: Multi-Class Product Counting & Recognition for Automated Retail Checkout

A growing application of AI and computer vision is in the retail industry. Of the various problems that can be addressed, this track focuses on accurate and automatic checkout in a retail store. In this first version of the track, participating teams will identify/classify products held by a customer in front of the checkout counter. Products may be occluded or very similar to each other.

    • Data 

This data set contains a total of 116,500 synthetic images and several video clips covering over 100 different merchandise items. The synthetic images are created from 3D-scanned object models and are used for training. The 3D assets and generation tool are available at https://github.com/yorkeyao/Automated-Retail-Checkout. We use synthetic data because it can form large-scale training sets under various environments. In our test scenario, the camera is mounted above the checkout counter, facing straight down, while a customer pretends to perform a checkout action by “scanning” objects in front of the counter in a natural manner. Several different customers participated, and each of them scanned slightly differently to add to the complexity. A shopping tray placed under the camera indicates where the AI model should focus. Participating customers might or might not place objects on the tray. One video clip contains several complete scanning actions, involving one or more items. In summary, the dataset contains:

      • Training set – 116,500 synthetic images with classification and segmentation labels.  
      • Test set A – around 40% of recorded test videos 
      • Test set B – around 60% of recorded test videos

Teams will be provided with the training set (with labels) and test set A (without labels). Test set B will be reserved for later testing. 

Participating teams need to train a model using the training set provided and classify the merchandise item held by the customer in each of the video clips. Teams can use test set A to develop inference code. Teams then submit results for test set A to our online evaluation server to be shown on the public leader board for performance tracking. The public leader board only provides a way for a team to evaluate and improve their systems and the ranking will NOT determine the winners of this track. 

Test set B is reserved for later testing. Top performers on the public leader board will be invited to submit functional training and inference code. Organizers will test the submitted code against test set B, and the final winner will be determined by the model’s performance on test set B. If there is a tie between top teams, the efficiency of the inference code will be used as the tie-breaker, with the most efficient model winning. Teams wishing to be considered for evaluation on test set B must also make their training and inference code publicly available.

    • Submission Format 

To be ranked on the public leader board of test set A, one text file should be submitted to the online evaluation system containing, on each line, details of one identified object, in the following format (values are space-delimited):

〈video_id〉 〈class_id〉 〈frame_id〉 

Where: 

      • 〈video_id〉 is the video numeric identifier, starting with 1. It represents the position of the video in the list of all track 4 test set A videos, sorted in alphanumeric order. 
      • 〈class_id〉 is the object numeric identifier, starting with 1. 
      • 〈frame_id〉 is the frame number at which the object was first identified, counting from the beginning of the video. The frame number is an integer and represents a time when the item is within the region of interest, i.e., over the white tray. Each object should only be identified once while it passes through the region of interest.
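A minimal sketch of this single-identification rule, assuming hypothetical per-frame predictions of the form (video_id, class_id, frame_id, in_roi); only the first in-ROI frame of each item is kept:

# Minimal sketch: report each (video_id, class_id) pair exactly once, at the
# first frame where it is predicted inside the region of interest.
def first_identifications(per_frame_preds):
    seen, lines = set(), []
    for video_id, class_id, frame_id, in_roi in sorted(per_frame_preds):
        key = (video_id, class_id)
        if in_roi and key not in seen:
            seen.add(key)
            lines.append(f"{video_id} {class_id} {frame_id}")
    return lines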
         
    • Synthetic Data  

Synthetic data is provided for model training. There are 116,500 synthetic images rendered from over 100 3D objects. Following the generation pipeline in [5], images are rendered with random attributes, i.e., random object orientation, camera pose, and lighting. Random background images, selected from Microsoft COCO [6], are used to increase the dataset's diversity. Labels for the synthetic data are encoded in the image file names; e.g., for the file 00001_697.jpg:

      • 00001 means the object has class ID 1, and 
      • 697 is a counter, i.e., this is the 697th image. 

We also provide segmentation labels for these images. For example, “00001_697_seg.jpg” is the segmentation label for the image “00001_697.jpg”. The white area denotes the object, while the black area denotes the background.
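For illustration, a minimal sketch of recovering the class ID and counter from a synthetic image file name and locating its segmentation label:

import os

# Minimal sketch: "00001_697.jpg" -> class ID 1, counter 697, and the
# matching segmentation label name "00001_697_seg.jpg".
def parse_synthetic_name(filename):
    stem, _ = os.path.splitext(os.path.basename(filename))
    class_str, counter_str = stem.split("_")
    return int(class_str), int(counter_str), f"{stem}_seg.jpg"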

    • Evaluation 

Evaluation for Track 4 will be based on model identification performance, measured by the F1-score. For the purpose of computing the F1-score, a true positive (TP) is counted when an object is correctly identified within the region of interest, i.e., the object class was correctly determined and the object was identified within the time that it was over the white tray. A false positive (FP) is an identified object that is not a TP. Finally, a false negative (FN) is a ground-truth object that was not correctly identified.
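For reference, a minimal sketch of the F1-score given TP/FP/FN counts determined as described above:

# Sketch of the Track 4 ranking metric.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0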

Track 5: Detecting Violation of Helmet Rule for Motorcyclists

Motorcycles are one of the most popular modes of transportation, particularly in developing countries such as India. Because they offer less protection than cars and other standard vehicles, motorcycle riders are exposed to a greater risk of crashes. Wearing a helmet is therefore mandatory for motorcycle riders under traffic rules, and automatic detection of motorcyclists without helmets is a critical task for enforcing strict regulatory traffic safety measures.

    • Data

The training dataset contains 100 videos with ground-truth bounding boxes of motorcycles and motorcycle rider(s) with or without helmets. Each video is 20 seconds long, recorded at 10 fps, with a resolution of 1920×1080.

Each motorcycle in an annotated frame has a bounding box annotation for each rider, with or without helmet information, for up to a maximum of three riders per motorcycle. The class IDs (labels) of the object classes in this dataset are as follows:

  • 1, motorbike: bounding box of motorcycle
  • 2, DHelmet: bounding box of the motorcycle driver, if he/she is wearing a helmet
  • 3, DNoHelmet: bounding box of the motorcycle driver, if he/she is not wearing a helmet
  • 4, P1Helmet: bounding box of the passenger 1 of the motorcycle, if he/she is wearing a helmet
  • 5, P1NoHelmet: bounding box of the passenger 1 of the motorcycle, if he/she is not wearing a helmet
  • 6, P2Helmet: bounding box of the passenger 2 of the motorcycle, if he/she is wearing a helmet
  • 7, P2NoHelmet: bounding box of the passenger 2 of the motorcycle, if he/she is not wearing a helmet.

The ground-truth file contains bounding box information (one object instance per line) for each video. The schema is as follows (values are comma-separated; a minimal parsing sketch follows the field descriptions below):

〈video_id〉, 〈frame〉, 〈track_id〉, 〈bb_left〉, 〈bb_top〉, 〈bb_width〉, 〈bb_height〉, 〈class〉

      • 〈video_id〉 is the video numeric identifier, starting with 1. It represents the position of the video in the list of all videos, sorted in alphanumeric order.
      • 〈frame〉 represents the frame count for the current frame in the current video, starting with 1.
      • 〈track_id〉 is the object track ID in the video.
      • 〈bb_left〉 is the x-coordinate of the top-left point of the bounding box.
      • 〈bb_top〉 is the y-coordinate of the top-left point of the bounding box.
      • 〈bb_width〉 is the width of the bounding box.
      • 〈bb_height〉 is the height of the bounding box.
      • 〈class〉 is the class ID of the object, as given in the labels information above.
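For illustration, a minimal sketch of parsing one ground-truth line into these fields:

# Minimal sketch: parse a comma-separated ground-truth line. All fields are
# integers, in the order documented above.
def parse_gt_line(line):
    video_id, frame, track_id, bb_left, bb_top, bb_width, bb_height, cls = (
        int(v) for v in line.strip().split(","))
    return video_id, frame, track_id, bb_left, bb_top, bb_width, bb_height, cls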

The test dataset will contain 100 videos of 20 seconds each, recorded at 10 fps, similar to the training dataset videos. The test dataset will be released later.

    • Task

Teams should identify motorcycles and motorcycle rider(s) with or without helmets. As in the training dataset, each rider on a motorcycle (i.e., driver, passenger 1, passenger 2) is to be identified separately as wearing a helmet or not.

    • Submission Format

One text file should be submitted, containing, on each line, details of a detected object and the corresponding class ID (as per the labels information). The submission schema is as follows (values are comma-separated):

〈video_id〉, 〈frame〉, 〈bb_left〉, 〈bb_top〉, 〈bb_width〉, 〈bb_height〉, 〈class〉, 〈confidence〉

      • 〈video_id〉 is the video numeric identifier, starting with 1. It represents the position of the video in the list of all videos, sorted in alphanumeric order.
      • 〈frame〉 represents the frame count for the current frame in the current video, starting with 1.
      • 〈bb_left〉 is the x-coordinate of the top-left point of the bounding box.
      • 〈bb_top〉 is the y-coordinate of the top-left point of the bounding box.
      • 〈bb_width〉 is the width of the bounding box.
      • 〈bb_height〉 is the height of the bounding box.
      • 〈class〉 is the class ID of the object, as given in the labels information above.
      • 〈confidence〉 is the confidence score of the bounding box (a value between 0 and 1).

Note that, unlike the training data, the object track ID, denoted as 〈track_id〉 above, is not required for the test data submission.
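For illustration, a minimal sketch of formatting one comma-separated submission line (whether a space follows each comma is not specified here; plain commas are used in this sketch, and writing the confidence with four decimal places is an arbitrary choice):

# Minimal sketch: one submission line for a detected object.
def submission_line(video_id, frame, bb_left, bb_top, bb_width, bb_height, cls, confidence):
    return f"{video_id},{frame},{bb_left},{bb_top},{bb_width},{bb_height},{cls},{confidence:.4f}"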

    • Evaluation

The metric used to rank the performance of each team will be the mean Average Precision (mAP) across all frames in the test videos. mAP is the mean of the average precision (the area under the Precision-Recall curve) over all object classes, as defined in the PASCAL VOC 2012 competition.
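For reference, a minimal sketch of the final averaging step, assuming the per-class average precision values have already been computed in the PASCAL VOC 2012 manner by an existing evaluation tool:

# Sketch: mAP is the mean of the per-class AP values.
def mean_average_precision(ap_per_class):
    return sum(ap_per_class.values()) / len(ap_per_class)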

Additional Datasets

Teams that wish to be listed on the public leader board and win the challenge awards are NOT allowed to use any external data for either training or validation. The winning teams and runners-up are required to submit their training and testing code for verification after the challenge submission deadline, to ensure that no external data was used for training and that the tasks were performed by algorithms and not by humans.

References

[1] E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. ECCVW, pages 17–35, 2016.

[2] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008.

[3] Y. Li, C. Huang, and R. Nevatia.  Learning to associate: Hybrid boosted multi-target tracker for crowded scene. CVPR, pages 2953–2960, 2009.

[4] E. M. Voorhees. The TREC-8 question answering track report. TREC, vol. 99, pages 77–82, 1999.

[5] Y. Yao, L. Zheng, X. Yang, M. Naphade, and T. Gedeon. Attribute descent: Simulating object-centric datasets on the content level and beyond. arXiv:2202.14034, 2022.

[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014.