2026 Challenge Track Description
Track 3: Anomalous Events in Transportation
This track challenges participants to build a unified video understanding model or system capable of detecting, reasoning about, and explaining anomalous events in transportation surveillance footage. Unlike traditional anomaly detection benchmarks that focus on binary classification or localization alone, this track requires models to perform a diverse set of reasoning tasks, from simple event verification to open-ended causal analysis, all grounded in explicit chain-of-thought reasoning over temporal and spatial evidence.
The training data is generated by a hierarchical auto-labeling pipeline using state-of-the-art VLMs, while the test data is human-verified, providing a rigorous evaluation of model generalization. The track also includes two out-of-domain evaluations: fisheye traffic monitoring footage (same visual domain, different task) and egocentric dashcam video with pedestrian crossing intent prediction (different visual domain and task formulation), testing model robustness and generalization.
Data
Training Data
The training dataset consists of 44,040 annotations covering 3,670 CCTV transportation videos (965 anomalous, 2,705 normal; ~26.1 hours total) sourced from eight public open-source datasets. Annotations were generated by a three-stage VLM auto-labeling pipeline (Gemini 3.1 Pro for video captioning and structured event description; Gemma-4 for multi-task Q&A with chain-of-thought reasoning). For 910 videos, existing NVIDIA human annotations (global descriptions, event captions with timestamps, per-object bounding boxes) were used as supplementary context during annotation generation.
Videos are not included in this release. A download script is provided to retrieve the original source videos from their respective public repositories.
Source datasets:
Source Dataset | Reference |
VAD-R1 | |
TAD | |
Accident-Bench | |
SO-TAD | https://www.sciencedirect.com/science/article/abs/pii/S0925231224018320 |
TADBenchmark | https://arxiv.org/abs/2209.12386 |
Highway Traffic Videos Dataset | https://www.kaggle.com/datasets/aryashah2k/highway-traffic-videos-dataset |
UCF Crime | https://arxiv.org/abs/1801.04264 |
Barbados Traffic Analysis Challenge | https://zindi.africa/competitions/barbados-traffic-analysis-challenge/data |
Task types (10 tasks across 3 groups):
Task Group | Task Type | Description | Samples |
Basic | Event Verification (bcq) | Binary Yes/No questions | 7,340 |
Basic | Event Verification with Explanation (bcq_openended) | Binary Yes/No + explanation | 7,340 |
Basic | Multiple-Choice QA (mcq) | Select the correct answer | 3,670 |
Basic | Multiple-Choice QA with Explanation (mcq_openended) | Select the correct answer + explanation | 3,670 |
Basic | Open-Ended QA (open_qa) | Free-form question about the anomaly | 3,670 |
Scene | Scene Description (scene_description) | Static description of the scene | 3,670 |
Scene | Video Summary (video_summarization) | Summary of what happened | 3,670 |
Temporal | Temporal Localization (temporal_localization) | Identify when the anomaly occurs | 3,670 |
Temporal | Causal Linkage (causal_linkage) | What caused the anomaly? | 3,670 |
Temporal | Event Description (temporal_description) | Describe what happened in the interval | 3,670 |
Note: bcq and bcq_openended have 2 samples per video (one Yes, one No answer), hence 7,340 each.
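As a quick sanity check, the per-task counts in the table can be reproduced from the number of videos. The snippet below is only an arithmetic sketch; the dictionary name samples_per_video is illustrative and not part of the release.

```python
# Number of training videos stated in the Training Data section.
NUM_VIDEOS = 3670

# Samples per video for each task: the two bcq variants have one Yes and one
# No question per video, every other task has a single sample per video.
samples_per_video = {
    "bcq": 2,
    "bcq_openended": 2,
    "mcq": 1,
    "mcq_openended": 1,
    "open_qa": 1,
    "scene_description": 1,
    "video_summarization": 1,
    "temporal_localization": 1,
    "causal_linkage": 1,
    "temporal_description": 1,
}

counts = {task: n * NUM_VIDEOS for task, n in samples_per_video.items()}
total = sum(counts.values())

print(counts["bcq"], counts["mcq"], total)  # 7340 3670 44040
```

The total matches the 44,040 annotations quoted for the training set.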
Annotations ship as 10 JSON files, one per task (bcq.json, bcq_openended.json, mcq.json, mcq_openended.json, open_qa.json, scene_description.json, video_summarization.json, temporal_localization.json, causal_linkage.json, temporal_description.json). Each file is a flat JSON list. Every item shares the same four identity and question fields, joined on [video_id, task_type, item_index].
Annotation format example (Event Verification task):
{
  "format": "tao-vl-reason-v1.0",
  "metadata": {
    "type": "annotation",
    "task": "bcq",
    "license": "CC-BY-4.0"
  },
  "media_root": null,
  "items": [
    {
      "video_id": "TAD/01_Accident_001.mp4",
      "question": "Does a rear-end collision occur in the video?\nAnswer with Yes or No.",
      "answer": "Yes",
      "reasoning": "The video shows a nighttime scene at a four-way intersection...",
      "item_index": "0"
    }
  ]
}
video_id is a relative path made up of a source-dataset directory and an .mp4 file name (for example, TAD/01_Accident_001.mp4), matching the layout produced by the download script. Set the loader's media_root to the script's --out directory.
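The following is a minimal loading sketch, assuming the envelope layout shown in the example above (a top-level object with an "items" list). The function name load_items and the added "task" and "video_path" fields are hypothetical conveniences, not part of the released schema or of any official loader.

```python
import json
import tempfile
from pathlib import Path


def load_items(annotation_path, media_root):
    """Load one annotation file and attach a resolved video path to every item."""
    with open(annotation_path, encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: tolerate both the envelope shown above and a bare list of items.
    if isinstance(data, dict):
        items = data["items"]
        task = data.get("metadata", {}).get("task")
    else:
        items, task = data, None
    root = Path(media_root)
    records = []
    for item in items:
        record = dict(item)
        # Hypothetical convenience fields added by this sketch.
        record["task"] = task
        record["video_path"] = root / item["video_id"]
        records.append(record)
    return records


# Demo with a one-item file mirroring the example above.
sample = {
    "format": "tao-vl-reason-v1.0",
    "metadata": {"type": "annotation", "task": "bcq", "license": "CC-BY-4.0"},
    "media_root": None,
    "items": [
        {
            "video_id": "TAD/01_Accident_001.mp4",
            "question": "Does a rear-end collision occur in the video?\nAnswer with Yes or No.",
            "answer": "Yes",
            "item_index": "0",
        }
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bcq.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    # "videos" stands in for the directory passed to the download script.
    records = load_items(path, media_root="videos")

print(records[0]["video_path"])
```

Joining video_id onto media_root with pathlib keeps the relative layout intact, so the same annotation files work regardless of where the videos were downloaded.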
Data Access
Training annotations can be found on Hugging Face.
A download script is included to retrieve the source videos from their original public repositories. All eight sources are fetched automatically, and the script runs the per-source post-processing needed to match the annotation paths. By downloading and using the data, participants agree to comply with each source dataset's license and terms of use, and to cite the corresponding paper or dataset release for each source they use in addition to the AI City Challenge 2026 Track 3 release.
Test Sets
Our in-domain TAR (Traffic Anomaly Reasoning) test set comprises 960 human-curated annotations covering 80 short clips trimmed from 17 public YouTube videos, evaluated on the same 10 task types as the TAR training set.
The TAR test set is released under the test split of the TAR Hugging Face dataset (https://huggingface.co/datasets/nvidia/PhysicalAI-Traffic-Anomaly-Reasoning#test-split):
- test.json — 960 items in tao-vl-reason-v1.0 format, answers redacted.
- clip_manifest.csv — per-clip YouTube source URL and start/end timestamps.
- download_test_videos.py — yt-dlp + ffmpeg helper that downloads each source video once and trims it into the per-clip files referenced by test.json.
- evaluate.py — submission validator + scorer (auto-detects the redacted answers and runs format validation only on the public test.json).
- submission.example.csv — reference rows showing the expected (item_index, prediction) shape for each prediction format.
Submission Format
Submit a single CSV file with two columns: item_index (the 16-hex sample id from test/test.json, used as the join key) and prediction (the output text from the model or the system). Multi-line predictions are fine; pandas CSV quoting handles them.
item_index,prediction
bfaa0b67a0385860,Yes.
b944ca5ad1567362,Yes. There is a collision.
fcc257a9dfd308b8,A
b22a0fcaac174951,"```json
{""start"": ""00:00"", ""end"": ""00:01""}
```"
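A submission file of this shape can be written with Python's standard csv module, which applies the same quoting rules as the example above (fields containing quotes or newlines are wrapped in double quotes, with inner quotes doubled). This is a sketch only: the function name write_submission is hypothetical, and the item ids are the ones from the example rows.

```python
import csv
import io


def write_submission(predictions, fileobj):
    """Write {item_index: prediction} pairs as a two-column CSV."""
    writer = csv.writer(fileobj, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(["item_index", "prediction"])
    for item_index, prediction in predictions.items():
        writer.writerow([item_index, prediction])


# Three backticks, built indirectly so this listing stays readable.
fence = "`" * 3

predictions = {
    "bfaa0b67a0385860": "Yes.",
    "fcc257a9dfd308b8": "A",
    # Multi-line prediction: a fenced JSON object, as in the example above.
    "b22a0fcaac174951": fence + 'json\n{"start": "00:00", "end": "00:01"}\n' + fence,
}

buffer = io.StringIO()
write_submission(predictions, buffer)
csv_text = buffer.getvalue()

# Round-trip check: reading the CSV back recovers the multi-line prediction.
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[0])
print(len(rows))
```

When writing to a real file instead of an in-memory buffer, open it with newline="" so the csv module controls line endings.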
Evaluation
Task type | Metric |
BCQ | Yes/No accuracy (regex extraction) |
MCQ | Letter accuracy (regex extraction) |
Temporal localization | Mean IoU over {"start", "end"} JSON predictions |
Open-ended tasks (bcq_openended, mcq_openended, open_qa, causal_linkage, scene_description, temporal_description, video_summarization) | BERTScore F1 (roberta-large, rescale_with_baseline=True) |
Overall | Unweighted mean of the per-task metrics above |
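For local development, the non-BERTScore metrics in the table can be approximated as follows. This is a rough sketch under stated assumptions, not the official scorer: the regular expressions, the A-D option range, the MM:SS timestamp parsing, and all function names are hypothetical choices, and evaluate.py may extract answers differently.

```python
import json
import re


def extract_yes_no(text):
    """Return 'yes' or 'no' for the first standalone yes/no token, else None."""
    match = re.search(r"\b(yes|no)\b", text, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def extract_letter(text):
    """Return the first standalone option letter, else None (A-D is an assumption)."""
    match = re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else None


def to_seconds(stamp):
    """Convert an 'MM:SS' string to seconds (assumed timestamp format)."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def parse_interval(text):
    """Pull the first {...} object out of a prediction; return (start, end) in seconds."""
    match = re.search(r"\{.*?\}", text, flags=re.DOTALL)
    if not match:
        return None
    try:
        obj = json.loads(match.group(0))
        return to_seconds(obj["start"]), to_seconds(obj["end"])
    except (ValueError, KeyError, TypeError, AttributeError):
        return None


def interval_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    if pred is None or pred[1] <= pred[0]:
        return 0.0
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def unweighted_mean(per_task_scores):
    """Average the per-task metrics with equal weight."""
    return sum(per_task_scores.values()) / len(per_task_scores)


# Worked example: prediction 00:02-00:06 against ground truth 00:04-00:08.
pred = parse_interval('{"start": "00:02", "end": "00:06"}')
iou = interval_iou(pred, (4.0, 8.0))  # overlap 2 s, union 6 s
scores = {"bcq": 1.0, "mcq": 0.5, "temporal_localization": iou}
overall = unweighted_mean(scores)
print(round(iou, 4), round(overall, 4))
```

Unparseable temporal predictions score an IoU of 0 in this sketch, which is one plausible convention; the BERTScore-based tasks are omitted because they require a model download.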
Here is a reference implementation of the scoring above. You can point it at any tao-vl-reason-v1.0 GT file with real answers (e.g. a held-out validation subset) to compute the per-task metrics locally during development.
The public leaderboard is live, with baseline submissions posted: https://eval.aicitychallenge.org/aicity2026/submission/leaderboard?trackId=3&type=general
