2026 Challenge Track Description

Track 3: Anomalous Events in Transportation

This track challenges participants to build a unified video understanding model or system capable of detecting, reasoning about, and explaining anomalous events in transportation surveillance footage. Unlike traditional anomaly detection benchmarks that focus on binary classification or localization alone, this track requires models to perform a diverse set of reasoning tasks, from simple event verification to open-ended causal analysis, all grounded in explicit chain-of-thought reasoning over temporal and spatial evidence.

The training data is generated by a hierarchical auto-labeling pipeline using state-of-the-art VLMs, while the test data is human-verified, providing a rigorous evaluation of model generalization. The track also includes two out-of-domain evaluations: fisheye traffic monitoring footage (same visual domain, different task) and egocentric dashcam video with pedestrian crossing intent prediction (different visual domain and task formulation), testing model robustness and generalization.

  • Data 

Training Data

The training dataset consists of 44,040 annotations covering 3,670 CCTV transportation videos (965 anomalous, 2,705 normal; ~26.1 hours total) sourced from eight public open-source datasets. Annotations were generated by a three-stage VLM auto-labeling pipeline (Gemini 3.1 Pro for video captioning and structured event description; Gemma-4 for multi-task Q&A with chain-of-thought reasoning). For 910 videos, existing NVIDIA human annotations (global descriptions, event captions with timestamps, per-object bounding boxes) were used as supplementary context during annotation generation.

Videos are not included in this release. A download script is provided to retrieve the original source videos from their respective public repositories.

Source datasets:

| Source Dataset | Reference |
| --- | --- |
| VAD-R1 | https://arxiv.org/abs/2505.19877 |
| TAD | https://arxiv.org/abs/2008.08944 |
| Accident-Bench | https://arxiv.org/abs/2509.26636 |
| SO-TAD | https://www.sciencedirect.com/science/article/abs/pii/S0925231224018320 |
| TADBenchmark | https://arxiv.org/abs/2209.12386 |
| Highway Traffic Videos Dataset | https://www.kaggle.com/datasets/aryashah2k/highway-traffic-videos-dataset |
| UCF Crime | https://arxiv.org/abs/1801.04264 |
| Barbados Traffic Analysis Challenge | https://zindi.africa/competitions/barbados-traffic-analysis-challenge/data |

Task types (10 tasks across 3 groups):

| Task Group | Task Type | Description | Samples |
| --- | --- | --- | --- |
| Basic | Event Verification (bcq) | Binary Yes/No questions | 7,340 |
| Basic | Event Verification with Explanation (bcq_openended) | Binary Yes/No + explanation | 7,340 |
| Basic | Multiple-Choice QA (mcq) | Select the correct answer | 3,670 |
| Basic | Multiple-Choice QA with Explanation (mcq_openended) | Select the correct answer + explanation | 3,670 |
| Basic | Open-Ended QA (open_qa) | Free-form question about the anomaly | 3,670 |
| Scene | Scene Description (scene_description) | Static description of the scene | 3,670 |
| Scene | Video Summary (video_summarization) | Summary of what happened | 3,670 |
| Temporal | Temporal Localization (temporal_localization) | Identify when the anomaly occurs | 3,670 |
| Temporal | Causal Linkage (causal_linkage) | What caused the anomaly? | 3,670 |
| Temporal | Event Description (temporal_description) | Describe what happened in the interval | 3,670 |

Note: bcq and bcq_openended have 2 samples per video (one Yes, one No answer), hence 7,340 each.
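The per-task counts follow directly from the number of videos and the samples-per-video rule in the note above. The short sketch below recomputes them as a sanity check; the names NUM_VIDEOS, SAMPLES_PER_VIDEO, and expected_counts are illustrative and not part of the released tooling.

```python
# Sanity check of the sample counts listed in the task table.
# All names here are illustrative, not part of the official release.
NUM_VIDEOS = 3670  # videos covered by the training annotations

# Samples per video for each task: the two bcq variants carry one Yes
# and one No question per video, every other task has a single sample.
SAMPLES_PER_VIDEO = {
    "bcq": 2,
    "bcq_openended": 2,
    "mcq": 1,
    "mcq_openended": 1,
    "open_qa": 1,
    "scene_description": 1,
    "video_summarization": 1,
    "temporal_localization": 1,
    "causal_linkage": 1,
    "temporal_description": 1,
}


def expected_counts(num_videos=NUM_VIDEOS):
    """Return the expected number of samples for each task."""
    return {task: n * num_videos for task, n in SAMPLES_PER_VIDEO.items()}


counts = expected_counts()
total = sum(counts.values())
print(counts["bcq"], counts["mcq"], total)
```

The total across the 10 tasks is 2 × 7,340 + 8 × 3,670 = 44,040, which matches the annotation count stated for the training data.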

Annotations ship as 10 JSON files, one per task (bcq.json, bcq_openended.json, mcq.json, mcq_openended.json, open_qa.json, scene_description.json, video_summarization.json, temporal_localization.json, causal_linkage.json, temporal_description.json). Each file is a flat JSON list. Every item shares the same four identity and question fields, joined on [video_id, task_type, item_index].

Annotation format example (Event Verification task):

{
  "format": "tao-vl-reason-v1.0",
  "metadata": {
    "type": "annotation",
    "task": "bcq",
    "license": "CC-BY-4.0"
  },
  "media_root": null,
  "items": [
    {
      "video_id": "TAD/01_Accident_001.mp4",
      "question": "Does a rear-end collision occur in the video?\nAnswer with Yes or No.",
      "answer": "Yes",
      "reasoning": "The video shows a nighttime scene at a four-way intersection...",
      "item_index": "0"
    }
  ]
}

video_id is a relative path to an .mp4 file (for example, TAD/01_Accident_001.mp4), matching the layout produced by the download script; set the loader’s media_root to the script’s --out directory. item_index disambiguates multiple items for the same (video_id, metadata.task) pair. Only bcq and bcq_openended use item_index ∈ {“0”, “1”}; all other tasks use item_index = “0”.
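As a minimal sketch of how these fields fit together, the snippet below indexes items by (video_id, task, item_index) and resolves a video path against media_root. It is not the official loader: the function names index_annotations, load_annotation_file, and resolve_video are hypothetical, and because the exact file layout is an assumption, the code accepts both the wrapped object shown in the example and a bare list of items.

```python
import json
from pathlib import Path


def index_annotations(doc, task=None):
    """Index items by (video_id, task, item_index).

    `doc` is parsed JSON: either the wrapped object from the example
    (with "metadata" and "items") or a bare list of items. Accepting
    both layouts is an assumption, not a documented guarantee.
    """
    if isinstance(doc, dict):
        items = doc.get("items", [])
        task = doc.get("metadata", {}).get("task", task)
    else:
        items = doc
    index = {}
    for item in items:
        # Prefer a per-item task_type if present, else the file-level task.
        key = (item["video_id"], item.get("task_type", task), item["item_index"])
        if key in index:
            raise ValueError(f"duplicate annotation key: {key}")
        index[key] = item
    return index


def load_annotation_file(path):
    """Read one annotation file from disk and index its items."""
    with open(path, encoding="utf-8") as f:
        return index_annotations(json.load(f))


def resolve_video(media_root, video_id):
    """Join the download directory with the relative video_id."""
    return Path(media_root) / video_id


# In-memory copy of the Event Verification example above.
example = {
    "format": "tao-vl-reason-v1.0",
    "metadata": {"type": "annotation", "task": "bcq", "license": "CC-BY-4.0"},
    "media_root": None,
    "items": [
        {
            "video_id": "TAD/01_Accident_001.mp4",
            "question": "Does a rear-end collision occur in the video?\nAnswer with Yes or No.",
            "answer": "Yes",
            "reasoning": "The video shows a nighttime scene at a four-way intersection...",
            "item_index": "0",
        }
    ],
}

index = index_annotations(example)
item = index[("TAD/01_Accident_001.mp4", "bcq", "0")]
print(item["answer"], resolve_video("videos", item["video_id"]))
```

Here "videos" stands in for whatever directory was passed to the download script. Keeping item_index as a string mirrors the example, where it is stored as "0" rather than 0.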

  • Test Sets, Evaluation, and Submission 

The test sets, evaluation server, submission format, and detailed evaluation metrics will be released in mid-May 2026. The evaluation will cover:

    • In-domain test set: Human-verified traffic anomaly videos, evaluated on the same 10 task types as training
    • Out-of-domain test set 1: Fisheye intersection footage (FishEye8K, https://arxiv.org/abs/2305.17449), testing generalization to a different task formulation
    • Out-of-domain test set 2: Egocentric dashcam video (PSI, https://neurips.cc/virtual/2025/loc/san-diego/poster/121383), testing generalization across visual domains and task types
  • Data Access

Training annotations can be found on Hugging Face.

A download script is included to retrieve the source videos from their original public repositories. By downloading and using the data, participants agree to comply with each source dataset’s license and terms of use.
