2026 Challenge Track Description
Track 5: Generative Traffic Video Forecasting
This task focuses on leveraging the video modelling capabilities of recent generative models for traffic video forecasting. Participants are challenged to generate a sequence of realistic and plausible future frames conditioned on a short history of observed frames and the corresponding textual descriptions of the target future. This task is built on the WTS dataset, which contains diverse traffic scenarios and environments, together with detailed behavioural descriptions for both vehicles and pedestrians. More details about the dataset can be found on the dataset homepage.
Data
The training and validation sets together contain 810 videos covering 155 scenarios. Each scenario includes approximately five annotated segments that capture detailed behavioural changes across pre-recognition, recognition, judgment, action, and avoidance. Each segment is associated with two long-form captions, describing pedestrians and vehicles respectively, generated from a manual checklist of more than 170 traffic-scene attributes.
The caption annotation format is defined as follows:
{
    "id": 722,
    "overhead_videos": [
        "20230707_8_SN46_T1_Camera1_0.mp4",
        "20230707_8_SN46_T1_Camera2_1.mp4",
        "20230707_8_SN46_T1_Camera2_2.mp4",
        "20230707_8_SN46_T1_Camera3_3.mp4"
    ],
    "event_phase": [
        {
            "labels": ["4"],
            "caption_pedestrian": "The pedestrian stands still on the left, ...",
            "caption_vehicle": "The vehicle was positioned diagonally to ...",
            "start_time": "39.395",
            "end_time": "44.663"
        }
    ]
}
While the text descriptions are split into separate pedestrian and vehicle captions, participants may freely combine them into a single prompt or use them as multiple inputs when prompting video generation.
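As a minimal sketch of one such combination strategy, the snippet below parses an annotation record in the format shown above and concatenates the two captions into a single prompt per segment. The helper name `build_prompts` and the simple-concatenation strategy are illustrative choices, not prescribed by the challenge; the record here is trimmed to the fields the helper reads.

```python
import json

# Hypothetical helper: build one text prompt per segment by joining the
# pedestrian and vehicle captions from a WTS-style annotation record.
# Simple concatenation is only one possible combination strategy.
def build_prompts(annotation: dict) -> list[str]:
    prompts = []
    for phase in annotation["event_phase"]:
        prompt = " ".join([
            phase["caption_pedestrian"].strip(),
            phase["caption_vehicle"].strip(),
        ])
        prompts.append(prompt)
    return prompts

# Trimmed example record following the caption annotation format above.
record = json.loads("""
{
  "id": 722,
  "event_phase": [
    {
      "labels": ["4"],
      "caption_pedestrian": "The pedestrian stands still on the left.",
      "caption_vehicle": "The vehicle was positioned diagonally.",
      "start_time": "39.395",
      "end_time": "44.663"
    }
  ]
}
""")
print(build_prompts(record)[0])
```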
Task
For each segment in a traffic event, teams are provided with: (i) two captions describing the pedestrian and vehicle behaviors, and (ii) a short sequence of initial frames as the history reference, typically the last frames of the previous segment or the first frames of the current segment. The goal is to generate the future frames conditioned on both (i) and (ii). The generated results will be evaluated using multiple metrics that assess both visual fidelity and forecasting accuracy.
Submission Format
For each test case, participants are required to generate N frames ordered as
0.png, 1.png, 2.png, ..., N-1.png
Each generated frame must have the same resolution as the input history frames.
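A minimal sketch of writing frames under the required naming scheme is shown below. The helper name `save_frames` is hypothetical, and `frames` is assumed to hold already-encoded PNG bytes from your generation pipeline; only the zero-based naming convention comes from the submission format above.

```python
from pathlib import Path

# Hypothetical helper: write N generated frames with the required
# zero-based names 0.png ... N-1.png. `frames` is assumed to be a list
# of PNG-encoded byte strings produced by the model pipeline.
def save_frames(frames: list[bytes], out_dir: str) -> list[str]:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    names = []
    for i, data in enumerate(frames):
        name = f"{i}.png"           # 0.png, 1.png, ..., N-1.png
        (out / name).write_bytes(data)
        names.append(name)
    return names
```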
Evaluation
Data from “BDD_PC_5K” are available as additional training material; teams may use it directly for training or as a pre-training resource. The metrics used to rank each team measure both the accuracy of the generated frames against the ground truth and the quality of the generated frames themselves. The test scenes comprise a subset of scenarios, including the videos in the “normal_trimmed” folder of the staged WTS dataset and the “BDD_PC_5K” part. Five metrics will be evaluated: PSNR, SSIM, LPIPS, CLIP-S, and FVD. The winner will be decided by the average score across all metrics.
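To illustrate the accuracy side of the evaluation, the snippet below computes PSNR between a generated frame and its ground truth using only NumPy. This is a standard PSNR definition for illustration; the evaluation server's exact implementation (and its handling of the other four metrics) is not specified here.

```python
import numpy as np

# Illustrative PSNR between a generated frame and its ground truth
# (uint8 arrays of shape HxWxC). PSNR = 10 * log10(MAX^2 / MSE).
def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```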
Teams may use any pretrained video generation models, as well as additional data sources, provided that the WTS test data are not used for training or adaptation. Any use of pretrained model weights or external data must be clearly declared in the technical report. Failure to comply with this requirement may result in disqualification from award consideration.
Further details on the dataset can be found at: https://github.com/woven-visionai/wts-dataset-tv2v.
Data Access
To access the data, please submit the following data request form: https://forms.gle/szQPk1TMR8JXzm327
The dataset will be sent to the provided email address upon approval.
