2026 Challenge Track Description

Track 3: Anomalous Events in Transportation

This track challenges participants to build a unified video understanding model or system capable of detecting, reasoning about, and explaining anomalous events in transportation surveillance footage. Unlike traditional anomaly detection benchmarks that focus on binary classification or localization alone, this track requires models to perform a diverse set of reasoning tasks, from simple event verification to open-ended causal analysis, all grounded in explicit chain-of-thought reasoning over temporal and spatial evidence.

The training data is generated by a hierarchical auto-labeling pipeline using state-of-the-art VLMs, while the test data is human-verified, providing a rigorous evaluation of model generalization. The track also includes two out-of-domain evaluations: fisheye traffic monitoring footage (same visual domain, different task) and egocentric dashcam video with pedestrian crossing intent prediction (different visual domain and task formulation), testing model robustness and generalization.

  • Data 

Training Data

The training dataset consists of 44,040 annotations covering 3,670 CCTV transportation videos (965 anomalous, 2,705 normal; ~26.1 hours total) sourced from eight public open-source datasets. Annotations were generated by a three-stage VLM auto-labeling pipeline (Gemini 3.1 Pro for video captioning and structured event description; Gemma-4 for multi-task Q&A with chain-of-thought reasoning). For 910 videos, existing NVIDIA human annotations (global descriptions, event captions with timestamps, per-object bounding boxes) were used as supplementary context during annotation generation.

Videos are not included in this release. A download script is provided to retrieve the original source videos from their respective public repositories.

Source datasets:

Source Dataset | Reference
--- | ---
VAD-R1 | https://arxiv.org/abs/2505.19877
TAD | https://arxiv.org/abs/2008.08944
Accident-Bench | https://arxiv.org/abs/2509.26636
SO-TAD | https://www.sciencedirect.com/science/article/abs/pii/S0925231224018320
TADBenchmark | https://arxiv.org/abs/2209.12386
Highway Traffic Videos Dataset | https://www.kaggle.com/datasets/aryashah2k/highway-traffic-videos-dataset
UCF Crime | https://arxiv.org/abs/1801.04264
Barbados Traffic Analysis Challenge | https://zindi.africa/competitions/barbados-traffic-analysis-challenge/data

Task types (10 tasks across 3 groups):

Task Group | Task Type | Description | Samples
--- | --- | --- | ---
Basic | Event Verification (bcq) | Binary Yes/No questions | 7,340
Basic | Event Verification with Explanation (bcq_openended) | Binary Yes/No + explanation | 7,340
Basic | Multiple-Choice QA (mcq) | Select the correct answer | 3,670
Basic | Multiple-Choice QA with Explanation (mcq_openended) | Select the correct answer + explanation | 3,670
Basic | Open-Ended QA (open_qa) | Free-form question about the anomaly | 3,670
Scene | Scene Description (scene_description) | Static description of the scene | 3,670
Scene | Video Summary (video_summarization) | Summary of what happened | 3,670
Temporal | Temporal Localization (temporal_localization) | Identify when the anomaly occurs | 3,670
Temporal | Causal Linkage (causal_linkage) | What caused the anomaly? | 3,670
Temporal | Event Description (temporal_description) | Describe what happened in the interval | 3,670

Note: bcq and bcq_openended have 2 samples per video (one Yes, one No answer), hence 7,340 each.
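As a quick consistency check, the per-task sample counts can be reproduced from the video counts and the note above. The snippet below is only an arithmetic sketch; the dictionary and function names are illustrative and not part of the release.

```python
# Samples generated per video for each task, read off the task table:
# bcq and bcq_openended have two (one Yes, one No); all others have one.
SAMPLES_PER_VIDEO = {
    "bcq": 2,
    "bcq_openended": 2,
    "mcq": 1,
    "mcq_openended": 1,
    "open_qa": 1,
    "scene_description": 1,
    "video_summarization": 1,
    "temporal_localization": 1,
    "causal_linkage": 1,
    "temporal_description": 1,
}

NUM_VIDEOS = 965 + 2705  # anomalous + normal videos


def expected_counts(num_videos: int = NUM_VIDEOS) -> dict:
    """Return the expected number of annotations per task."""
    return {task: k * num_videos for task, k in SAMPLES_PER_VIDEO.items()}


counts = expected_counts()
total = sum(counts.values())
print(counts["bcq"], counts["mcq"], total)
```

The total matches the 44,040 annotations stated for the training set (12 samples per video across 3,670 videos).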

Annotations ship as 10 JSON files, one per task (bcq.json, bcq_openended.json, mcq.json, mcq_openended.json, open_qa.json, scene_description.json, video_summarization.json, temporal_localization.json, causal_linkage.json, temporal_description.json). Each file is a flat JSON list. Every item shares the same four identity and question fields and is joined on [video_id, task_type, item_index].

Annotation format example (Event Verification task):

{
  "format": "tao-vl-reason-v1.0",
  "metadata": {
    "type": "annotation",
    "task": "bcq",
    "license": "CC-BY-4.0"
  },
  "media_root": null,
  "items": [
    {
      "video_id": "TAD/01_Accident_001.mp4",
      "question": "Does a rear-end collision occur in the video?\nAnswer with Yes or No.",
      "answer": "Yes",
      "reasoning": "The video shows a nighttime scene at a four-way intersection...",
      "item_index": "0"
    }
  ]
}

video_id is a relative path of the form /.mp4 (as in TAD/01_Accident_001.mp4 in the example above), matching the layout produced by the download script (set the loader’s media_root to the script’s --out directory).
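A minimal loading sketch, assuming the annotation file looks like the example above. The function name load_items and the added video_path and task_type keys are hypothetical, and taking task_type from metadata.task is an assumption. Because the text describes a flat list while the example shows a wrapper object with an "items" key, the sketch accepts both.

```python
import json
from pathlib import Path


def load_items(annotation_path, media_root):
    """Load one task file and resolve each video_id against media_root.

    media_root is assumed to be the download script's output directory.
    """
    with open(annotation_path, encoding="utf-8") as f:
        doc = json.load(f)

    # Accept either the wrapper object shown in the example or a flat list.
    if isinstance(doc, dict):
        items = doc["items"]
        task = doc.get("metadata", {}).get("task")
    else:
        items = doc
        task = None

    root = Path(media_root)
    resolved = []
    for item in items:
        row = dict(item)
        row["task_type"] = task  # assumption: task comes from file metadata
        row["video_path"] = root / item["video_id"]
        resolved.append(row)
    return resolved
```

For example, load_items("bcq.json", "videos") would map the video_id TAD/01_Accident_001.mp4 to videos/TAD/01_Accident_001.mp4, provided the videos were downloaded into a directory named videos.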

  • Data Access

Training annotations can be found on Hugging Face.

A download script is included to retrieve the source videos from their original public repositories — all eight sources are fetched automatically (the script runs the per-source post-processing needed to match annotation paths). By downloading and using the data, participants agree to comply with each source dataset’s license and terms of use, and to cite the corresponding paper or dataset release for each source they use in addition to the AI City Challenge 2026 Track 3 release.

  • Test Sets 

Our in-domain TAR test set comprises 960 human-curated annotations covering 80 short clips trimmed from 17 public YouTube videos, evaluated on the same 10 task types as the TAR training set.

The TAR test set is released under the test split in the TAR Hugging Face dataset (https://huggingface.co/datasets/nvidia/PhysicalAI-Traffic-Anomaly-Reasoning#test-split):

    • test.json — 960 items in tao-vl-reason-v1.0 format, answers redacted.
    • clip_manifest.csv — per-clip YouTube source URL and start/end timestamps.
    • download_test_videos.py — yt-dlp + ffmpeg helper that downloads each source video once and trims it into the per-clip files referenced by test.json.
    • evaluate.py — submission validator + scorer (auto-detects the redacted answers and runs format validation only on the public test.json).
    • submission.example.csv — reference rows showing the expected (item_index, prediction) shape for each prediction format.
  • Submission Format 

A single CSV with two columns: item_index (the 16-character hexadecimal sample ID from test/test.json, used as the join key) and prediction (the output text from the model or the system). Multi-line predictions are fine; pandas CSV quoting handles them.

item_index,prediction
bfaa0b67a0385860,Yes.
b944ca5ad1567362,Yes. There is a collision.
fcc257a9dfd308b8,A
b22a0fcaac174951,"```json
{""start"": ""00:00"", ""end"": ""00:01""}
```"
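The quoting shown in the example can be reproduced with any standard CSV writer. The sketch below uses Python's built-in csv module rather than pandas and writes to an in-memory buffer. The helper names write_submission and read_submission are illustrative, and the official validator remains evaluate.py.

```python
import csv
import io


def write_submission(predictions, fileobj):
    """Write {item_index: prediction} as a two-column CSV.

    QUOTE_MINIMAL quotes only fields containing commas, quotes, or newlines,
    and doubles embedded quotes, as in the example rows above.
    """
    writer = csv.writer(fileobj, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(["item_index", "prediction"])
    for item_index, prediction in predictions.items():
        writer.writerow([item_index, prediction])


def read_submission(fileobj):
    """Read the CSV back into {item_index: prediction}."""
    return {row["item_index"]: row["prediction"] for row in csv.DictReader(fileobj)}


fence = "`" * 3  # built at runtime to avoid a literal fence inside this block
preds = {
    "bfaa0b67a0385860": "Yes.",
    "b22a0fcaac174951": fence + 'json\n{"start": "00:00", "end": "00:01"}\n' + fence,
}

buf = io.StringIO()
write_submission(preds, buf)
buf.seek(0)
roundtrip = read_submission(buf)
print(roundtrip == preds)
```

When writing to a real file instead of a buffer, open it with newline="" so that the csv module controls line endings inside quoted fields.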
  • Evaluation 

Task type | Metric
--- | ---
BCQ | Yes/No accuracy (regex extraction)
MCQ | Letter accuracy (regex extraction)
Temporal localization | Mean IoU over {"start", "end"} JSON predictions
Open-ended tasks (bcq_openended, mcq_openended, open_qa, causal_linkage, scene_description, temporal_description, video_summarization) | BERTScore F1 (roberta-large, rescale_with_baseline=True)
Overall | Unweighted mean of the per-task metrics above
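For intuition, the non-BERTScore metrics can be sketched as below. This is not the official scorer. The regular expressions, the A to D option range, the MM:SS timestamp parsing, and all function names are assumptions made for illustration, and evaluate.py may differ in each of these details.

```python
import json
import re


def extract_yes_no(text):
    """Return 'yes' or 'no' from the first standalone match, else None."""
    m = re.search(r"\b(yes|no)\b", text, flags=re.IGNORECASE)
    return m.group(1).lower() if m else None


def extract_letter(text):
    """Return the first standalone option letter (assumed range A-D), else None."""
    m = re.search(r"\b([A-D])\b", text)
    return m.group(1) if m else None


def to_seconds(stamp):
    """Convert 'SS', 'MM:SS', or 'HH:MM:SS' to seconds."""
    secs = 0.0
    for part in stamp.split(":"):
        secs = secs * 60 + float(part)
    return secs


def parse_interval(text):
    """Pull the first {...} JSON object out of a prediction; None if unusable."""
    m = re.search(r"\{.*?\}", text, flags=re.DOTALL)
    if not m:
        return None
    try:
        obj = json.loads(m.group(0))
        return to_seconds(str(obj["start"])), to_seconds(str(obj["end"]))
    except (ValueError, KeyError, TypeError):
        return None


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    if pred is None:
        return 0.0
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # The hull equals the union whenever the intervals overlap;
    # when they do not, inter is 0 and the result is 0 anyway.
    hull = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / hull if hull > 0 else 0.0


def overall(per_task):
    """Unweighted mean of per-task scores."""
    return sum(per_task.values()) / len(per_task)


pred = parse_interval('{"start": "00:02", "end": "00:06"}')
print(temporal_iou(pred, (4.0, 8.0)))
```

In the usage line, the predicted interval 2 s to 6 s overlaps the ground-truth interval 4 s to 8 s for 2 s out of a 6 s union, giving an IoU of one third. Whether the overall mean is taken over the four metric rows or over all ten tasks is not specified here, so the overall helper simply averages whatever per-task scores it is given.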

A reference implementation of the scoring above is available. You can point it at any tao-vl-reason-v1.0 ground-truth file with real answers (e.g. a held-out validation subset) to compute the per-task metrics locally during development.

The public leaderboard is live, with baseline submissions posted: https://eval.aicitychallenge.org/aicity2026/submission/leaderboard?trackId=3&type=general 
