2026 Challenge Track Description
Track 3: Anomalous Events in Transportation
This track challenges participants to build a unified video understanding model capable of detecting, reasoning about, and explaining anomalous events in transportation surveillance footage. Unlike traditional anomaly detection benchmarks that focus on binary classification or localization alone, this track requires models to perform a diverse set of reasoning tasks, from simple event verification to open-ended causal analysis, all grounded in explicit chain-of-thought reasoning over temporal and spatial evidence.
The training data is generated by a hierarchical auto-labeling pipeline using state-of-the-art VLMs, while the test data is human-verified, providing a rigorous measure of how well models generalize from machine-generated supervision. The track also includes two out-of-domain evaluations that test robustness: fisheye traffic monitoring footage (same visual domain, different task) and egocentric dashcam video with pedestrian crossing intent prediction (different visual domain and task formulation).
Data
Training Data
The training dataset consists of 44,040 annotations covering 3,670 CCTV transportation videos (965 anomalous, 2,705 normal; ~26.1 hours total) sourced from eight public open-source datasets. Annotations were generated by a three-stage VLM auto-labeling pipeline (Gemini 3.1 Pro for video captioning and structured event description; Gemma-4 for multi-task Q&A with chain-of-thought reasoning). For 910 videos, existing NVIDIA human annotations (global descriptions, event captions with timestamps, per-object bounding boxes) were used as supplementary context during annotation generation.
Videos are not included in this release. A download script is provided to retrieve the original source videos from their respective public repositories.
Source datasets:
| Source Dataset | Reference |
|---|---|
| VAD-R1 | |
| TAD | |
| Accident-Bench | |
| SO-TAD | https://www.sciencedirect.com/science/article/abs/pii/S0925231224018320 |
| TADBenchmark | |
| Highway Traffic Videos Dataset | https://www.kaggle.com/datasets/aryashah2k/highway-traffic-videos-dataset |
| UCF Crime | |
| Barbados Traffic Analysis Challenge | https://zindi.africa/competitions/barbados-traffic-analysis-challenge/data |
Task types (10 tasks across 3 groups):
| Task Group | Task Type | Description | Samples |
|---|---|---|---|
| Basic | Event Verification (bcq) | Binary Yes/No questions | 7,340 |
| Basic | Event Verification with Explanation (bcq_openended) | Binary Yes/No + explanation | 7,340 |
| Basic | Multiple-Choice QA (mcq) | Select the correct answer | 3,670 |
| Basic | Multiple-Choice QA with Explanation (mcq_openended) | Select the correct answer + explanation | 3,670 |
| Basic | Open-Ended QA (open_qa) | Free-form question about the anomaly | 3,670 |
| Scene | Scene Description (scene_description) | Static description of the scene | 3,670 |
| Scene | Video Summary (video_summarization) | Summary of what happened | 3,670 |
| Temporal | Temporal Localization (temporal_localization) | Identify when the anomaly occurs | 3,670 |
| Temporal | Causal Linkage (causal_linkage) | What caused the anomaly? | 3,670 |
| Temporal | Event Description (temporal_description) | Describe what happened in the interval | 3,670 |
Note: bcq and bcq_openended have 2 samples per video (one Yes, one No answer), hence 7,340 each.
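The per-task counts above can be checked with a few lines of arithmetic; this sketch simply multiplies samples-per-video by the 3,670 training videos and confirms the totals stated in the table.

```python
# Verify the per-task sample counts: bcq and bcq_openended contribute
# 2 samples per video (one Yes, one No); the remaining 8 tasks contribute 1.
NUM_VIDEOS = 3670

samples_per_video = {
    "bcq": 2,
    "bcq_openended": 2,
    "mcq": 1,
    "mcq_openended": 1,
    "open_qa": 1,
    "scene_description": 1,
    "video_summarization": 1,
    "temporal_localization": 1,
    "causal_linkage": 1,
    "temporal_description": 1,
}

task_counts = {task: n * NUM_VIDEOS for task, n in samples_per_video.items()}
total = sum(task_counts.values())
print(task_counts["bcq"], total)  # 7340 44040
```

This reproduces the 44,040 total annotations quoted in the Training Data section (12 samples per video across the 10 tasks).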
Annotation format example (Event Verification task):
```json
{
  "version": "metropolis-v3.0",
  "metadata": {
    "type": "bcq",
    "date": "2026-04-14",
    "description": "Binary choice QA (Yes/No answer only)",
    "tags": ["so-tad", "anomaly"]
  },
  "items": [
    {
      "video_id": "main",
      "question": "Does a rear-end collision occur in the video?",
      "reasoning": "The video shows a nighttime scene at a four-way intersection with wet road conditions...",
      "answer": "Yes"
    }
  ]
}
```
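A minimal sketch of loading and sanity-checking an annotation file in this format. The structural checks are assumptions drawn from the example and the task table above, not an official validator.

```python
import json

# Example annotation in the format shown above (reasoning text truncated).
raw = """
{
  "version": "metropolis-v3.0",
  "metadata": {"type": "bcq", "date": "2026-04-14",
               "description": "Binary choice QA (Yes/No answer only)",
               "tags": ["so-tad", "anomaly"]},
  "items": [
    {"video_id": "main",
     "question": "Does a rear-end collision occur in the video?",
     "reasoning": "The video shows a nighttime scene...",
     "answer": "Yes"}
  ]
}
"""
ann = json.loads(raw)

# The 10 task-type identifiers from the task table.
valid_types = {
    "bcq", "bcq_openended", "mcq", "mcq_openended", "open_qa",
    "scene_description", "video_summarization",
    "temporal_localization", "causal_linkage", "temporal_description",
}

assert ann["metadata"]["type"] in valid_types
for item in ann["items"]:
    if ann["metadata"]["type"] == "bcq":
        assert item["answer"] in {"Yes", "No"}  # bcq answers are binary
print(ann["version"], len(ann["items"]))  # metropolis-v3.0 1
```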
Test Sets, Evaluation, and Submission
The test sets, evaluation server, submission format, and detailed evaluation metrics will be released in mid-May 2026. The evaluation will cover:
- In-domain test set: Human-verified traffic anomaly videos, evaluated on the same 10 task types as training
- Out-of-domain test set 1: Fisheye intersection footage (FishEye8K, https://arxiv.org/abs/2305.17449), testing generalization to a different task formulation
- Out-of-domain test set 2: Egocentric dashcam video (PSI, https://neurips.cc/virtual/2025/loc/san-diego/poster/121383), testing generalization across visual domains and task types
Data Access
Training annotations will be hosted on Hugging Face. [The dataset URL is coming soon]
A download script is included to retrieve the source videos from their original public repositories. By downloading and using the data, participants agree to comply with each source dataset’s license and terms of use.
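For illustration only, here is a hedged sketch of the kind of bookkeeping such a download step involves; the actual script, manifest format, URLs, and file layout are defined by the challenge release, and everything below (the manifest, video IDs, and `local_path` helper) is hypothetical.

```python
import urllib.parse
from pathlib import Path

# Hypothetical manifest mapping video IDs to public source URLs.
# Placeholder URLs only; the real repositories are listed in the table above.
manifest = {
    "so-tad_0001": "https://example.org/so-tad/0001.mp4",
    "tad_0042": "https://example.org/tad/0042.mp4",
}

def local_path(video_id: str, url: str, root: str = "videos") -> Path:
    """Derive a local target path from the video ID and the URL's extension."""
    ext = Path(urllib.parse.urlparse(url).path).suffix or ".mp4"
    return Path(root) / f"{video_id}{ext}"

# Plan the downloads without touching the network.
targets = {vid: local_path(vid, url) for vid, url in manifest.items()}
print(targets["so-tad_0001"].as_posix())  # videos/so-tad_0001.mp4
```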
