2025-track2 – AI CITY CHALLENGE

2025 Challenge Track Description

Track 2: Traffic Safety Description and Analysis

This task revolves around the long fine-grained video captioning of traffic safety scenarios, especially those involving pedestrian accidents. Leveraging multiple cameras and viewpoints, participants will be challenged to describe the continuous moment before the incidents, as well as the normal scene, captioning all pertinent details regarding the surrounding context, attention, location, and behavior of the pedestrian and vehicle. This task provides a new dataset WTS, featuring staged accidents with stunt drivers and pedestrians in a controlled environment, and offers a unique opportunity for detailed analysis in traffic safety scenarios. The analysis result could be valuable for wide usage across industry and society, e.g., it could lead to the streamlining of the inspection process in insurance cases and contribute to the prevention of pedestrian accidents. More features of the dataset can be referred to the dataset homepage.

- Data

The train and validation dataset contains 810 videos and 155 scenarios, each scenario has ~5 segments annotated for capturing detailed behavior changes from pre-recognition, recognition, judgment, action, and avoidance. Each segment has 2 long detailed captions generated from a manual checklist with 170+ items regarding traffic scenarios about pedestrians and vehicles respectively with an average caption length of about 58.7 words. Target pedestrian and vehicle Bounding Box are provided as instance information that the caption related. All videos are provided in a high 1080p resolution at 30 fps. Moreover, this task dataset also provides long fine-grained caption annotations obtained using the same annotation manner for around 3.4K pedestrian-related traffic videos selected from BDD100K, which are offered to be used as part of the external train and validation dataset for generalized performance check.

The description focuses on [location][attention][behavior][context] information about the pedestrian and vehicle respectively, especially about the moments as short segments before the staged accidents along the time directions, as well as normal cases.

The ground truth file contains caption and target instance BBox information. BBox is annotated manually for the first frame of each segment and uses the video object tracking method for generating the left frames BBox.

Adding from last year, 3D Gaze from pedestrians which captures the pedestrian attention is provided. In addition, a Traffic VQA with ~180 kinds of items (e.g., direction, position, action, attributes, etc.) along with the captioning result as a holistic structured information is also included in this spatial-temporal traffic video understanding benchmark .

Caption annotation format is defined as:

{
    "id": 722, ## UUID
    "overhead_videos": [  ## caption related videos
        "20230707_8_SN46_T1_Camera1_0.mp4",
        "20230707_8_SN46_T1_Camera2_1.mp4",
        "20230707_8_SN46_T1_Camera2_2.mp4",
        "20230707_8_SN46_T1_Camera3_3.mp4"
    ],
    "event_phase": [
        {
            "labels": [
                "4"  ## segment number
            ],
            "caption_pedestrian": "The pedestrian stands still on the left, ...",  ## caption for pedestrian during the segment
            "caption_vehicle": "The vehicle was positioned diagonally to ...",  ## caption for vehicle during the segment
            "start_time": "39.395",  ## start time of the segment in seconds, 0.0 is the starting time of the given video.
            "end_time": "44.663"     ## end time of the segment in seconds
        },
...

BBox format follows COCO format, we provided a frame extraction script here (https://github.com/woven-visionai/wts-dataset/tree/main?tab=readme-ov-file#data-preparation) to reproduce the frame ID for associating with our annotation .

{
    "annotations": [
        {
            "image_id": 904, ## frame ID
            "bbox": [
                1004.4933333333333, ## x_min 
                163.28666666666666, ## y_min
                12.946666666666667, ## width
                11.713333333333333  ## height
            ],
            "auto-generated": false,  ## human annotated frame
            "phase_number": "0"  ## segment index
        },
        {
            "image_id": 905,
            "bbox": [
                1007.1933333333333,
                162.20666666666668,
                12.946666666666667,
                11.713333333333333
            ],
            "auto-generated": true,  ##generated bbox annotation for the frame
            "phase_number": "0"
        },
...

Gaze annotation follows the similar structure as BBox, as shown below. The gaze (x, y, z) is in overhead camera coordinates in OpenGL axis convention (x to the right, y up, and z backward). image_id refers to the frame number in the overhead video.

{
    "annotations": [
        {
            "image_id": 0, ## frame ID
            "gaze": [
                0.7267333451506679, ## x
                0.27087537465994793, ## y
                -0.6312568142259175 ## z
            ],
        },
        {
            "image_id": 1,
            "gaze": [
                0.7267333451506679,
                0.27087537465994793,
                -0.6312568142259175
            ],
        },
...

Head annotation also follows the similar structure as BBox, as shown below. The head (x, y) is in image coordinates (absolute pixel values). image_id refers to the frame number in the overhead video.

{
    "annotations": [
        {
            "image_id": 0,   ## frame ID
            "head": [
                32.5444,  ## x
                16.9874   ## y
            ]
        },
        {
            "image_id": 1,
            "head": [
                65.4982,
                76.9873
            ]
        },
        ...
    ]
}

The questions for VQA are provided below. For each video, there are multiple questions provided.

[
  {
    "videos": [
      "20231013_101845_normal_192.168.0.13_4_event_2.mp4"
    ],
    "event_phase": [
      {
        "start_time": "30.000",
        "end_time": "37.223",
        "labels": [
          "avoidance"
        ],
        "conversations": [
          {
            "id": "0d159ab9-d5a3-4088-be0f-bcb7adcd7ff1",
            "question": "What is the orientation of the pedestrian's body?",
            "a": "Opposite direction to the vehicle",
            "b": "Perpendicular to the vehicle and to the right",
            "c": "Diagonally to the left, in the opposite direction to the vehicle",
            "d": "Perpendicular to the vehicle and to the left",
            "correct": ""
          },
          {
            "id": "2ca17cfe-a971-4ace-9ace-73e57661dee3",
            "question": "What is the position of the pedestrian relative to the vehicle?",
            "a": "Diagonally to the right, in front of the vehicle",
            "b": "On the left of the vehicle",
            "c": "Diagonally to the left in front of the vehicle",
            "d": "Directly in front of the vehicle",
            "correct": ""
          }
        ]
      }
    ]
  },
  ...
]

The GT for VQA is provided below:

[
  {
    "id": "3c8c80e3-33f1-4133-a86c-1192c8a26159",
    "correct": "a"
  },
  {
    "id": "be2f113a-c387-4987-befd-32a9c6dc488a",
    "correct": "a"
  },
  ...
]

- Task

Sub Task1: Teams in this challenge will provide two captions about pedestrians and vehicles for each segment in the traffic events in the video involving the accidents and normal scenes. The performance will be evaluated across multiple metrics that measure the fidelity of the predicted description against the ground truth.

Sub Task2: Teams will provide the one correct answer chosen from the given multiple choices regarding the traffic related questions. The performance will be evaluated using the accuracy of chosen answers for each question.

The winner will be decided by using the average of the score from sub-task1 and sub-task2.

- Submission Format

For sub-task1, the test results are required to be provided per scenario. Users could use the multi-view videos in the same scenario folders for validation purposes, as well as multi-view videos in train for training purposes. For the normal scenarios in “normal_trimmed” folder and “BDD_PC_5K” part, every single video in the test set is required to provide the caption results.

{
    "20230707_12_SN17_T1": [  ##scenario index
        {
            "labels": [  ## segment number, this is known information will be given
                "4"
            ],
            "caption_pedestrian": "",  ## caption regarding pedestrian 
            "caption_vehicle": ""      ## caption regarding vehicle
        },
        {
            "labels": [
                "3"
            ],
            "caption_pedestrian": "",
            "caption_vehicle": ""
        },
        {
            "labels": [
                "2"
            ],
            "caption_pedestrian": "",
            "caption_vehicle": ""
        },
        {
            "labels": [
                "1"
            ],
            "caption_pedestrian": "",
            "caption_vehicle": ""
        },
        {
            "labels": [
                "0"
            ],
            "caption_pedestrian": "",
            "caption_vehicle:  ""
        }
    ]
}

Notice that, unlike the training data, the segment label and its timestamp are not required for the test data submission.

For sub-task2, similar to sub-task1, the test results are required to be provided per scenario. Users could use the multi-view videos in the same scenario folders for validation purposes, as well as multi-view videos in train for training purposes. For the normal scenarios in “normal_trimmed” folder and “BDD_PC_5K” part, every single video in the test set is required to provide the Answer results. The task is a multi-choice task that the user needs to provide the correct chosen answer in the format below:

  {
    "id": "3c8c80e3-33f1-4133-a86c-1192c8a26159",
    "correct": "a"
  },
  ...

where id is the question id in the test set and correct is the predicted option label.

Please refer here for more details.

- Evaluation

Data from “BDD_PC_5K” are available to be used as a train part, the user could use it directly into the training or utilize it as pre-train as well. Test includes the “BDD_PC_5K” test part as well as the generalization performance test. The metric used to rank the performance of each team will be an averaged accuracy to compare the predicted descriptions against the ground truth with multiple metrics across all scenarios including the video in “normal_trimmed” folder in the staged WTS dataset and “BDD_PC_5K” part. For sub-task1, 4 metrics for being averaged are BLUE-4, METERO, ROUGE-L, and CIDEr. For sub-task 2, accuracy for the QA, as “correct_answer / total_question_number” is used as the evaluation index. The winner will be decided according to the mean score of sub-task 1 and sub-task 2.

Please pay attention that this is not a retrieval task and we are seeking for generative solutions. Teams may submit results to evaluation system and rank on the leaderboard with any method. But we will be manually evaluating award contenders and teams only using retrieval method will be disqualified from winning the awards. For example, considering a method which uses features extracted from the test set videos and retrieves the “closest-meaning” caption from the training set for submission, this will not be qualified for winning the track since it is not a generative solution.

More details about the dataset explanation can be referred to: https://github.com/woven-visionai/wts-dataset

- Data Access

To access the data please submit the following data request form: https://forms.gle/szQPk1TMR8JXzm327

Dataset will be sent to the provided email address upon approval.

~~[Traffic VQA data will be released after 2025/12/May]~~

[Traffic VQA data has been released on 05/21/2025. For the user who has already been granted access to the data, please download the “annotations.zip” file again to find the VQA train and val data, as well as for the external part.]

Test data can be accessed from here. Please find it once you have already finished the request in the above form.