2026 Challenge Track Description

Track 3: Anomalous Events in Transportation

This track challenges participants to build a unified video understanding model capable of detecting, reasoning about, and explaining anomalous events in transportation surveillance footage. Unlike traditional anomaly detection benchmarks that focus on binary classification or localization alone, this track requires models to perform a diverse set of reasoning tasks, from simple event verification to open-ended causal analysis, all grounded in explicit chain-of-thought reasoning over temporal and spatial evidence.

The training data is generated by a hierarchical auto-labeling pipeline using state-of-the-art VLMs, while the test data is human-verified, providing a rigorous evaluation of model generalization. The track also includes two out-of-domain evaluations: fisheye traffic monitoring footage (same visual domain, different task) and egocentric dashcam video with pedestrian crossing intent prediction (different visual domain and task formulation), testing model robustness under both domain and task shift.

  • Data 

Training Data

The training dataset consists of 44,040 annotations covering 3,670 CCTV transportation videos (965 anomalous, 2,705 normal; ~26.1 hours total) sourced from eight public open-source datasets. Annotations were generated by a three-stage VLM auto-labeling pipeline (Gemini 3.1 Pro for video captioning and structured event description; Gemma-4 for multi-task Q&A with chain-of-thought reasoning). For 910 videos, existing NVIDIA human annotations (global descriptions, event captions with timestamps, per-object bounding boxes) were used as supplementary context during annotation generation.

Videos are not included in this release. A download script is provided to retrieve the original source videos from their respective public repositories.

Source datasets:

| Source Dataset | Reference |
| --- | --- |
| VAD-R1 | https://arxiv.org/abs/2505.19877 |
| TAD | https://arxiv.org/abs/2008.08944 |
| Accident-Bench | https://arxiv.org/abs/2509.26636 |
| SO-TAD | https://www.sciencedirect.com/science/article/abs/pii/S0925231224018320 |
| TADBenchmark | https://arxiv.org/abs/2209.12386 |
| Highway Traffic Videos Dataset | https://www.kaggle.com/datasets/aryashah2k/highway-traffic-videos-dataset |
| UCF Crime | https://arxiv.org/abs/1801.04264 |
| Barbados Traffic Analysis Challenge | https://zindi.africa/competitions/barbados-traffic-analysis-challenge/data |

Task types (10 tasks across 3 groups):

| Task Group | Task Type | Description | Samples |
| --- | --- | --- | --- |
| Basic | Event Verification (bcq) | Binary Yes/No questions | 7,340 |
| Basic | Event Verification with Explanation (bcq_openended) | Binary Yes/No + explanation | 7,340 |
| Basic | Multiple-Choice QA (mcq) | Select the correct answer | 3,670 |
| Basic | Multiple-Choice QA with Explanation (mcq_openended) | Select the correct answer + explanation | 3,670 |
| Basic | Open-Ended QA (open_qa) | Free-form question about the anomaly | 3,670 |
| Scene | Scene Description (scene_description) | Static description of the scene | 3,670 |
| Scene | Video Summary (video_summarization) | Summary of what happened | 3,670 |
| Temporal | Temporal Localization (temporal_localization) | Identify when the anomaly occurs | 3,670 |
| Temporal | Causal Linkage (causal_linkage) | What caused the anomaly? | 3,670 |
| Temporal | Event Description (temporal_description) | Describe what happened in the interval | 3,670 |

Note: bcq and bcq_openended have 2 samples per video (one Yes, one No answer), hence 7,340 each.
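The sample counts above can be sanity-checked in a few lines (a sketch; the task keys follow the names in parentheses in the table):

```python
# Verify that the per-task sample counts are consistent with the stated
# dataset totals: 3,670 videos and 44,040 annotations.
NUM_VIDEOS = 3_670

# Samples per video for each of the 10 task types; bcq and bcq_openended
# have 2 samples per video (one Yes, one No), all other tasks have 1.
samples_per_video = {
    "bcq": 2,
    "bcq_openended": 2,
    "mcq": 1,
    "mcq_openended": 1,
    "open_qa": 1,
    "scene_description": 1,
    "video_summarization": 1,
    "temporal_localization": 1,
    "causal_linkage": 1,
    "temporal_description": 1,
}

counts = {task: n * NUM_VIDEOS for task, n in samples_per_video.items()}
total = sum(counts.values())

print(counts["bcq"])  # 7340
print(total)          # 44040
```

The ten tasks therefore contribute 12 samples per video (2 + 2 + 8 × 1), which matches the 44,040 total annotations stated in the training data description.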

Annotation format example (Event Verification task):

{
  "version": "metropolis-v3.0",
  "metadata": {
    "type": "bcq",
    "date": "2026-04-14",
    "description": "Binary choice QA (Yes/No answer only)",
    "tags": ["so-tad", "anomaly"]
  },
  "items": [
    {
      "video_id": "main",
      "question": "Does a rear-end collision occur in the video?",
      "reasoning": "The video shows a nighttime scene at a four-way intersection with wet road conditions...",
      "answer": "Yes"
    }
  ]
}
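A minimal reader for files in this format might look like the following (a sketch based only on the fields shown in the example above; the function name is illustrative):

```python
import json

def load_items(path):
    """Parse an annotation file and return (task_type, question,
    reasoning, answer) tuples, one per item."""
    with open(path) as f:
        ann = json.load(f)
    # Guard against files from a different schema version.
    assert ann["version"].startswith("metropolis"), "unexpected schema version"
    task_type = ann["metadata"]["type"]  # e.g. "bcq"
    # Each item pairs a question with chain-of-thought reasoning
    # and the final answer.
    return [
        (task_type, it["question"], it["reasoning"], it["answer"])
        for it in ann["items"]
    ]
```

Note that for the open-ended task types the `answer` field holds free-form text rather than a Yes/No label, so downstream scoring code should branch on `metadata.type`.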
  • Test Sets, Evaluation, and Submission 

    The test sets, evaluation server, submission format, and detailed evaluation metrics will be released in mid-May 2026. The evaluation will cover:

    • In-domain test set: Human-verified traffic anomaly videos, evaluated on the same 10 task types as training

    • Out-of-domain test set 1: Fisheye intersection footage (FishEye8K, https://arxiv.org/abs/2305.17449), testing generalization to a different task formulation

    • Out-of-domain test set 2: Egocentric dashcam video (PSI, https://neurips.cc/virtual/2025/loc/san-diego/poster/121383), testing generalization across visual domains and task types

  • Data Access

Training annotations can be found on Hugging Face. [The dataset URL is coming soon]

A download script is included to retrieve the source videos from their original public repositories. By downloading and using the data, participants agree to comply with each source dataset’s license and terms of use.
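The retrieval step can be sketched as follows. This is illustrative only: the actual download script ships with the release, and the manifest CSV (with `video_id` and `url` columns mapping each video to its source repository) is a hypothetical stand-in for however the real script indexes the sources.

```python
import csv
import pathlib
import urllib.request

def download_videos(manifest_csv, out_dir="videos"):
    """Fetch each video listed in a manifest CSV into out_dir,
    skipping files that are already present (so the script can
    be re-run after interrupted downloads)."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(manifest_csv, newline="") as f:
        for row in csv.DictReader(f):
            dest = out / f"{row['video_id']}.mp4"
            if dest.exists():  # already fetched on a previous run
                continue
            urllib.request.urlretrieve(row["url"], str(dest))
```

A real script would also verify checksums and honor each source repository's access terms; per the paragraph above, participants remain responsible for complying with every source dataset's license.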
