2022 Data and Evaluation Method

Data Download Links

Track 1: City-Scale Multi-Camera Vehicle Tracking

Track 2: Tracked-Vehicle Retrieval by Natural Language Descriptions

Track 3: Naturalistic Driving Action Recognition

Track 4: Multi-Class Product Counting & Recognition for Automated Retail Checkout

Evaluation and Submission

For each of the four challenge tasks, a different data set will be provided as a set of videos or images. The numeric video ID of each video is obtained by sorting the track's videos (or the names of the folders in which they are stored) in alphanumeric order, with numbering starting at 1. All pixel coordinates are 0-based for all tracks.

Frame Extraction

Submissions for some tracks will require frame IDs for frames that contain information of interest. In order to ensure frame IDs are consistent across teams, we suggest that all teams use the FFmpeg library (https://www.ffmpeg.org/) to extract/count frames.
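
As an illustration of consistent video-ID numbering and frame extraction, here is a minimal sketch (assuming Python 3, an ffmpeg binary on the PATH, and .mp4 files; the folder names are placeholders). FFmpeg numbers the extracted images from 1, which lines up with the 1-based frame IDs used in the submissions:

    import subprocess
    from pathlib import Path

    VIDEO_DIR = Path("track_videos")   # placeholder: folder holding this track's videos
    FRAME_DIR = Path("frames")

    # Video IDs follow the alphanumeric order of the video files, starting at 1.
    videos = sorted(VIDEO_DIR.glob("*.mp4"))
    for video_id, video in enumerate(videos, start=1):
        out_dir = FRAME_DIR / f"{video_id:03d}"
        out_dir.mkdir(parents=True, exist_ok=True)
        # FFmpeg writes 000001.jpg, 000002.jpg, ..., so the image index matches
        # the 1-based frame_id used in the submission files.
        subprocess.run(
            ["ffmpeg", "-i", str(video), "-qscale:v", "2", str(out_dir / "%06d.jpg")],
            check=True,
        )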

Track 1: City-Scale Multi-Camera Vehicle Tracking

The dataset for Track 1, i.e., CityFlowV2, is the same as the Track 3 dataset of the 5th edition, and its validation set is the same as the test set of the original CityFlow dataset. The dataset contains 3.58 hours (215.03 minutes) of video collected from 46 cameras spanning 16 intersections in a mid-sized U.S. city. The longest distance between two simultaneously recording cameras is 4 km. The dataset covers a diverse set of location types, including intersections, stretches of roadway, and highways. The dataset is divided into 6 scenarios: 3 are used for training, 2 for validation, and the remaining 1 for testing. In total, the dataset contains 313,931 bounding boxes for 880 distinct annotated vehicle identities. Only vehicles passing through at least 2 cameras have been annotated. The resolution of each video is at least 960p, and the majority of the videos have a frame rate of 10 FPS. Additionally, in each scenario, the offset from the scenario start time is available for each video and can be used for synchronization. Please refer to the ReadMe.txt file for more details.
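
As a small illustration of using these offsets for synchronization (a sketch only; it assumes the offsets are given in seconds and that frame numbering starts at 1, with the frame rate read from each video):

    def frame_to_global_time(frame_id: int, cam_offset: float, fps: float = 10.0) -> float:
        """Map a 1-based frame_id in one camera's video to a scenario-wide timestamp.

        cam_offset is the camera's offset from the scenario start time (assumed to be
        in seconds, per the ReadMe); fps is the video frame rate (10 FPS for most
        videos in this track).
        """
        return cam_offset + (frame_id - 1) / fps

    # Example: frame 51 of a camera offset by 1.5 s starts at 1.5 + 50/10 = 6.5 s.
    assert abs(frame_to_global_time(51, 1.5) - 6.5) < 1e-9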

    • Task

Teams should detect and track targets across multiple cameras. Baseline detection and single-camera tracking results are provided, but teams are also allowed to use their own methods.

    • Submission Format

One text file should be submitted containing, on each line, details of a detected and tracked vehicle, in the following format. Values are space-delimited.

camera_id obj_id frame_id xmin ymin width height xworld yworld

      • camera_id is the camera numeric identifier, between 1 and 46.
      • obj_id is a numeric identifier for each object. It should be a positive integer and consistent for each object identity across multiple cameras.
      • frame_id represents the frame count for the current frame in the current video, starting with 1.
      • The axis-aligned rectangular bounding box of the detected object is denoted by its pixel-valued coordinates within the image canvas, xmin ymin width height, computed from the top-left corner of the image. All values are integers.
      • xworld yworld are the GPS coordinates of the projected bottom points of each object. They are not currently used in the evaluation but may be used in the future, so it would be beneficial to include them if possible.

The text file containing all predictions should be named track1.txt and can be archived using Zip (track1.zip) or tar+gz (track1.tar.gz) to reduce upload time.
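
A minimal sketch of writing results in this format (the detections list below contains placeholder values; only the line layout follows the specification above):

    # Each detection: (camera_id, obj_id, frame_id, xmin, ymin, width, height, xworld, yworld)
    detections = [
        (1, 17, 1, 523, 341, 110, 84, 42.4987, -90.6754),   # placeholder values
        (2, 17, 48, 612, 300, 98, 77, 42.4990, -90.6759),
    ]

    with open("track1.txt", "w") as f:
        for cam, obj, frame, x, y, w, h, xw, yw in detections:
            # Bounding-box values are integers; xworld/yworld are the optional GPS coordinates.
            f.write(f"{cam} {obj} {frame} {int(x)} {int(y)} {int(w)} {int(h)} {xw} {yw}\n")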

    • Evaluation

For MTMC tracking, the IDF1 score [1] will be used to rank the performance of each team. IDF1 measures the ratio of correctly identified detections over the average number of ground-truth and computed detections. The evaluation tool provided with our dataset also computes other evaluation measures adopted by the MOTChallenge [2], [3], such as Multiple Object Tracking Accuracy (MOTA), Multiple Object Tracking Precision (MOTP), mostly tracked targets (MT), and false alarm rate (FAR). However, these will NOT be used for ranking purposes. The measures displayed in the evaluation system are IDF1, IDP, IDR, Precision (detection), and Recall (detection).
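
For reference, IDF1, IDP, and IDR can be computed from the identity-level true positives (IDTP), false positives (IDFP), and false negatives (IDFN) defined in [1]; a minimal sketch:

    def idf1(idtp: int, idfp: int, idfn: int) -> float:
        """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN), following Ristani et al. [1]."""
        return 2 * idtp / (2 * idtp + idfp + idfn)

    def idp(idtp: int, idfp: int) -> float:
        return idtp / (idtp + idfp)   # identification precision

    def idr(idtp: int, idfn: int) -> float:
        return idtp / (idtp + idfn)   # identification recall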

Track 2: Tracked-Vehicle Retrieval by Natural Language Descriptions

The dataset for this track is built upon the CityFlow Benchmark by annotating vehicles with natural language descriptions (the Track 1 data must also be downloaded in order to work on this track). The dataset contains 2,498 tracks of vehicles, each with three unique natural language descriptions. 530 unique vehicle tracks, together with 530 query sets of three descriptions each, are curated for this challenge.

The dataset curated for this challenge track consists of three files: train-tracks.json, test-tracks.json, and test-queries.json. Please refer to the README file in the data set for details.

    • Task

Teams should retrieve and rank the provided vehicle tracks for each of the queries. A baseline retrieval model is provided as a demo and as a starting point for participating teams.

    • Submission Format

One JSON file should be submitted containing a dictionary in the following format:

{
    "query-uuid-1": ["track-uuid-i", …, "track-uuid-j"],
    "query-uuid-2": ["track-uuid-m", …, "track-uuid-n"],
    …
}

For each query, teams should submit a list of provided testing tracks ranked by the retrieval model.
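
A minimal sketch of producing the submission file (it assumes test-tracks.json and test-queries.json are JSON objects keyed by UUID, as in the provided baseline data; the score function and the output file name are placeholders):

    import json

    with open("test-tracks.json") as f:
        track_ids = list(json.load(f).keys())
    with open("test-queries.json") as f:
        query_ids = list(json.load(f).keys())

    def score(query_id: str, track_id: str) -> float:
        """Placeholder similarity; a real system would call its retrieval model here."""
        return 0.0

    # Every test track should appear in each query's ranked list, best match first.
    submission = {
        q: sorted(track_ids, key=lambda t: score(q, t), reverse=True)
        for q in query_ids
    }

    with open("submission.json", "w") as f:   # file name is a placeholder
        json.dump(submission, f)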

    • Evaluation

The Vehicle Retrieval by NL Descriptions task is evaluated using standard metrics for retrieval tasks. We use the Mean Reciprocal Rank (MRR) [4] as the main evaluation metric. Recall @ 5, Recall @ 10, and Recall @ 25 are also evaluated for all submissions.
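
A minimal sketch of these metrics, assuming one relevant track per query and that every query's ranked list contains all test tracks (ranks are 1-based):

    def mean_reciprocal_rank(ranked_lists, ground_truth):
        """ranked_lists: {query_id: [track_id, ...]}; ground_truth: {query_id: track_id}."""
        rr = [1.0 / (ranked_lists[q].index(ground_truth[q]) + 1) for q in ground_truth]
        return sum(rr) / len(rr)

    def recall_at_k(ranked_lists, ground_truth, k):
        """Fraction of queries whose relevant track appears in the top k results."""
        hits = [ground_truth[q] in ranked_lists[q][:k] for q in ground_truth]
        return sum(hits) / len(hits)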

Track 3: Naturalistic Driving Action Recognition

Distracted driving is highly dangerous and is reported to kill about 8 people every day in the United States. Naturalistic driving studies and computer vision techniques now provide a much-needed means of identifying and eliminating distracted driving behavior on the road. In this challenge track, teams will be presented with synthetic naturalistic driving data collected from three camera locations inside the vehicle (while the driver pretends to drive). The objective is to identify the start time, end time, and type of distracted-behavior activities executed by the driver in each video. Participating teams may use any one, two, or all three camera views for the classification of driver tasks. Teams will be provided with labeled training data to develop algorithms and should submit functional code that can be executed on a reserved testing data set. The final winner will be determined by performance on this reserved testing data set.

    • Data 

The data set contains 90 video clips (about 14 hours in total) captured from 15 drivers. Each driver performs each of 18 different tasks (such as talking on the phone, eating, and reaching back) once, in random order. Three cameras mounted in the car record from different angles in synchronization. Each driver completes the data-collection session twice: once with no appearance block and once with some appearance block (e.g., sunglasses, hat). Thus, 6 videos are collected for each driver, 3 synchronized videos with no appearance block and 3 synchronized videos with some appearance block, resulting in 90 videos in total.

The 14 hours of video (15 drivers in total) in this track are split into three data sets, A1, A2, and B, each containing 5 drivers. Teams will be provided dataset A1 with ground-truth labels for the start time, end time, and type of distracted behaviors (manually annotated), and dataset A2 with no labels. Teams can use both A1 and A2 to develop their algorithms and submit results for dataset A2 to our online evaluation server to appear on the public leader board for performance tracking. The public leader board only provides a way for a team to evaluate and improve their systems; its ranking will NOT determine the winners of this track.

Dataset B is reserved for later testing. Top performers on the public leader board will be invited to submit functional code for both training and inference. Organizers will test the submitted code against dataset B, and the final winner will be determined by performance on dataset B. Teams wishing to be considered for evaluation on dataset B must also make their training and inference code publicly available.

    • Submission Format 

To be ranked on the public leader board of data set A2, one text file should be submitted to the online evaluation system containing, on each line, details of one identified activity, in the following format (values are space-delimited): 

〈video_id〉 〈activity_id〉 〈start_time〉 〈end_time〉 

Where: 

      • 〈video_id〉 is the video numeric identifier, starting with 1. Video IDs have been provided in the video_ids.csv file in the data download link.
      • 〈activity_id〉 is the activity numeric identifier, starting with 0. 
      • 〈start_time〉 is the time the identified activity starts, in seconds. The start_time is an integer value, e.g., 127 represents the 127th second of the video, 02:07 into the video. 
      • 〈end_time〉 is the time the identified activity ends, in seconds. The end_time is an integer value, e.g., 263 represents the 263rd second of the video, 04:23 into the video.

    • Evaluation 

Evaluation for Track 3 will be based on model activity-identification performance, measured by the F1-score. For the purpose of computing the F1-score, a true positive (TP) is an activity that was correctly identified (matching activity_id) as starting within one second of the ground-truth start_time and ending within one second of the ground-truth end_time. An activity should be reported only once, whether a single video or multiple videos are used for inference. A false positive (FP) is an identified activity that is not a TP. Finally, a false negative (FN) is a ground-truth activity that was not correctly identified.
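
A hedged sketch of this scoring rule, reading predictions in the submission format above (greedy one-to-one matching is an assumption here; the organizers' evaluation code is authoritative):

    def parse_lines(path):
        """Parse space-delimited lines: <video_id> <activity_id> <start_time> <end_time>."""
        with open(path) as f:
            return [tuple(int(v) for v in line.split()) for line in f if line.strip()]

    def f1_score(predictions, ground_truth, tol=1.0):
        """Each entry is (video_id, activity_id, start_time, end_time) in seconds.

        A prediction is a TP if an unmatched ground-truth activity has the same
        video_id and activity_id, starts within `tol` seconds of the predicted start,
        and ends within `tol` seconds of the predicted end.
        """
        unmatched = list(ground_truth)
        tp = 0
        for vid, act, start, end in predictions:
            for gt in unmatched:
                g_vid, g_act, g_start, g_end = gt
                if (vid == g_vid and act == g_act
                        and abs(start - g_start) <= tol and abs(end - g_end) <= tol):
                    tp += 1
                    unmatched.remove(gt)
                    break
        fp = len(predictions) - tp
        fn = len(unmatched)
        return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0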

Track 4: Multi-Class Product Counting & Recognition for Automated Retail Checkout

A growing application of AI and computer vision is in the retail industry. Of the various problems that can be addressed, this track focuses on accurate and automatic check-out in a retail store. In this first version of the track, participating teams will identify and classify products held in a customer's hand in front of the checkout counter. Products may be occluded or very similar to one another.

    • Data 

This data set contains a total of 116,500 synthetic images and several video clips covering over 100 different merchandise items. The synthetic images are created from 3D-scanned object models and will be used for training. We use synthetic data because it can form large-scale training sets under a variety of environments. In our test scenario, the camera is mounted above the checkout counter and faces straight down while a customer pretends to perform a checkout by “scanning” objects in front of the counter in a natural manner. Several different customers participated, each scanning slightly differently, which adds to the complexity. A shopping tray is placed under the camera to indicate where the AI model should focus; participating customers might or might not place objects on the tray. One video clip contains several complete scanning actions, involving one or more items. In summary, the dataset contains:

      • Training set – 116,500 synthetic images with classification and segmentation labels.  
      • Test set A – 20% of recorded test video 
      • Test set B – 80% of recorded test video

Teams will be provided with the training set (with labels) and test set A (without labels). Test set B will be reserved for later testing. 

Participating teams need to train a model using the training set provided and classify the merchandise item held by the customer in each of the video clips. Teams can use test set A to develop inference code. Teams then submit results for test set A to our online evaluation server to be shown on the public leader board for performance tracking. The public leader board only provides a way for a team to evaluate and improve their systems and the ranking will NOT determine the winners of this track. 

Test set B is reserved for later testing. Top performers on the public leader board will be invited to submit functional training and inference code. Organizers will test the submitted code against test set B, and the final winner will be determined by the model’s performance on test set B. If there is a tie between top teams, the efficiency of the inference code will be used as the tie-breaker, with the team with the most efficient model declared the winner. Teams wishing to be considered for evaluation on test set B must also make their training and inference code publicly available.

    • Submission Format 

To be ranked on the public leader board for test set A, one text file should be submitted to the online evaluation system containing, on each line, details of one identified item, in the following format (values are space-delimited): 

〈video_id〉 〈class_id〉 〈timestamp〉 

Where: 

      • 〈video_id〉 is the video numeric identifier, starting with 1. It represents the position of the video in the list of all track 4 test set A videos, sorted in alphanumeric order. 
      • 〈class_id〉 is the object numeric identifier, starting with 1. 
      • 〈timestamp〉 is the time in the video when the object was first identified, in seconds. The timestamp is an integer and represents a time when the item is within the region of interest, i.e., over the white tray. Each object should only be identified once while it passes through the region of interest. 
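
Because each item must be reported only once while it is in the region of interest, per-frame classifier outputs typically need to be collapsed into single events before writing the submission file. A minimal sketch of one possible way to do this (the grouping heuristic, the gap threshold, the frame rate, the output file name, and the placeholder detections are all assumptions, not part of the specification):

    def collapse_detections(frame_detections, fps, max_gap=1.0):
        """frame_detections: per-frame outputs [(frame_id, class_id), ...] for one video,
        in frame order. Detections of the same class separated by at most `max_gap`
        seconds are treated as one item; the reported timestamp is the first second at
        which that item was seen (an assumed heuristic, not part of the rules).
        """
        events, last_seen = [], {}            # class_id -> time of its last detection
        for frame_id, class_id in frame_detections:
            t = (frame_id - 1) / fps
            if class_id not in last_seen or t - last_seen[class_id] > max_gap:
                events.append((class_id, int(t)))   # a new item of this class appears
            last_seen[class_id] = t
        return events

    # Placeholder per-frame outputs, one list per test video, in video-ID order.
    per_video_detections = [
        [(40, 3), (41, 3), (42, 3), (95, 7), (96, 7)],
    ]
    FPS = 30.0   # placeholder: use the actual frame rate of the test videos

    with open("track4.txt", "w") as f:        # file name is an assumption
        for video_id, dets in enumerate(per_video_detections, start=1):
            for class_id, timestamp in collapse_detections(dets, FPS):
                f.write(f"{video_id} {class_id} {timestamp}\n")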
         
    • Synthetic Data  

Synthetic data is provided for model training. There are 116,500 synthetic images rendered from over 100 3D objects. Following the generation pipeline in [5], images are rendered with random attributes, i.e., random object orientation, camera pose, and lighting. Random background images, selected from Microsoft COCO [6], are used to increase the dataset's diversity. Labels for the synthetic data are encoded in the image file names; e.g., for the file 00001_697.jpg: 

      • 00001 means the object has class ID 1, and 
      • 697 is a counter, i.e., this is the 697th image. 

We also provide segmentation labels for these images. For example, “00001_697_seg.jpg” is the segmentation label for the image “00001_697.jpg”. The white area denotes the object, while the black area denotes the background.
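
A minimal sketch of parsing these file names and loading the matching mask (Pillow and NumPy are assumptions; any image library would do):

    from pathlib import Path
    from PIL import Image   # assumption: Pillow is available
    import numpy as np

    def parse_label(filename: str):
        """'00001_697.jpg' -> (class_id=1, counter=697)."""
        stem = Path(filename).stem            # '00001_697'
        class_str, counter_str = stem.split("_")
        return int(class_str), int(counter_str)

    def load_mask(image_path: str) -> np.ndarray:
        """Load the '*_seg.jpg' mask; white (object) -> True, black (background) -> False."""
        seg_path = image_path.replace(".jpg", "_seg.jpg")
        mask = np.array(Image.open(seg_path).convert("L"))
        return mask > 127

    assert parse_label("00001_697.jpg") == (1, 697)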

    • Evaluation 

Evaluation for Track 4 will be based on model identification performance, measured by the F1-score. For the purpose of computing the F1-score, a true positive (TP) is an object that was correctly identified within the region of interest, i.e., the object class was correctly determined and the object was identified at a time when it was over the white tray. A false positive (FP) is an identified object that is not a TP. Finally, a false negative (FN) is a ground-truth object that was not correctly identified.

Additional Datasets

Teams that wish to be listed on the public leader board and win the challenge awards are NOT allowed to use any external data for either training or validation. The winning teams and runners-up are required to submit their training and testing code for verification after the challenge submission deadline, in order to ensure that no external data was used for training and that the tasks were performed by algorithms and not by humans.

References

[1] E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. ECCVW, pages 17–35, 2016.

[2] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008.

[3] Y. Li, C. Huang, and R. Nevatia.  Learning to associate: Hybrid boosted multi-target tracker for crowded scene. CVPR, pages 2953–2960, 2009.

[4] E. M. Voorhees. The TREC-8 question answering track report. In TREC, vol. 99, pages 77–82, 1999.

[5] Y. Yao, L. Zheng, X. Yang, M. Naphade, and T. Gedeon. Attribute descent: Simulating object-centric datasets on the content level and beyond. arXiv:2202.14034, 2022.

[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.