2026-track3 – AI CITY CHALLENGE

2026 Challenge Track Description

Track 3: Anomalous Events in Transportation

- [Important note — Jul 1, 2026] Because of a potential overlap between certain task prompts for the same video, we have decided to remove temporal_localization from the TAR leaderboard score. The in-domain (TAR) overall score is now the unweighted mean over the remaining 9 task types. The released annotations are unchanged; temporal_localization items remain in the dataset and may still be used for training and analysis, but they no longer contribute to the leaderboard ranking. See the Evaluation section below.

This track challenges participants to build a unified video understanding model or system capable of detecting, reasoning about, and explaining anomalous events in transportation surveillance footage. Unlike traditional anomaly detection benchmarks that focus on binary classification or localization alone, this track requires models to perform a diverse set of reasoning tasks, from simple event verification to open-ended causal analysis, all grounded in explicit chain-of-thought reasoning over temporal and spatial evidence.

The training data is generated by a hierarchical auto-labeling pipeline using state-of-the-art VLMs, while the test data is human-verified, providing a rigorous evaluation of model generalization. The track also includes two out-of-domain evaluations: fisheye traffic monitoring footage (same visual domain, different task) and egocentric dashcam video with pedestrian crossing intent prediction (different visual domain and task formulation), testing model robustness and generalization.

Track 3 is organized around three separate leaderboards, each with its own submission and scoring. Scores are not combined across them:

- TAR (in-domain) — the main Track 3 leaderboard. Human-verified CCTV anomaly-reasoning test set across the 10 task types.
- FETV (out-of-domain 1) — fisheye traffic-violation recognition. Submitted and scored through the evaluation server as Track 7.
- PSI VQA (out-of-domain 2) — egocentric dashcam pedestrian crossing-intent reasoning. Submitted and scored through the evaluation server as Track 8.

The two out-of-domain leaderboards are optional and are intended to demonstrate that a model trained for the in-domain TAR task can generalize to new visual domains and task formulations. Prize eligibility is described in the Awards and Eligibility section below.

Training Data

The training dataset consists of 44,040 annotations covering 3,670 CCTV transportation videos (965 anomalous, 2,705 normal; ~26.1 hours total) sourced from eight public open-source datasets. Annotations were generated by a three-stage VLM auto-labeling pipeline (Gemini 3.1 Pro for video captioning and structured event description; Gemma-4 for multi-task Q&A with chain-of-thought reasoning). For 910 videos, existing NVIDIA human annotations (global descriptions, event captions with timestamps, per-object bounding boxes) were used as supplementary context during annotation generation.

Videos are not included in this release. A download script is provided to retrieve the original source videos from their respective public repositories.

Source datasets:

Source Dataset	Reference
VAD-R1	https://arxiv.org/abs/2505.19877
TAD	https://arxiv.org/abs/2008.08944
Accident-Bench	https://arxiv.org/abs/2509.26636
SO-TAD	https://www.sciencedirect.com/science/article/abs/pii/S0925231224018320
TADBenchmark	https://arxiv.org/abs/2209.12386
Highway Traffic Videos Dataset	https://www.kaggle.com/datasets/aryashah2k/highway-traffic-videos-dataset
UCF Crime	https://arxiv.org/abs/1801.04264
Barbados Traffic Analysis Challenge	https://zindi.africa/competitions/barbados-traffic-analysis-challenge/data

Task types (10 tasks across 3 groups):

Task Group	Task Type	Description	Samples
Basic	Event Verification (bcq)	Binary Yes/No questions	7,340
Basic	Event Verification with Explanation (bcq_openended)	Binary Yes/No + explanation	7,340
Basic	Multiple-Choice QA (mcq)	Select the correct answer	3,670
Basic	Multiple-Choice QA with Explanation (mcq_openended)	Select the correct answer + explanation	3,670
Basic	Open-Ended QA (open_qa)	Free-form question about the anomaly	3,670
Scene	Scene Description (scene_description)	Static description of the scene	3,670
Scene	Video Summary (video_summarization)	Summary of what happened	3,670
Temporal	Temporal Localization (temporal_localization)	Identify when the anomaly occurs	3,670
Temporal	Causal Linkage (causal_linkage)	What caused the anomaly?	3,670
Temporal	Event Description (temporal_description)	Describe what happened in the interval	3,670

Note: bcq and bcq_openended have 2 samples per video (one Yes, one No answer), hence 7,340 each.
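
As a quick consistency check, the per-task sample counts in the table can be summed and compared with the 44,040 annotations and 3,670 videos stated above. The snippet below is only a sanity check on the published numbers; the variable names are illustrative and not part of the release.

```python
# Per-task sample counts copied from the task table above.
SAMPLES = {
    "bcq": 7340,
    "bcq_openended": 7340,
    "mcq": 3670,
    "mcq_openended": 3670,
    "open_qa": 3670,
    "scene_description": 3670,
    "video_summarization": 3670,
    "temporal_localization": 3670,
    "causal_linkage": 3670,
    "temporal_description": 3670,
}
NUM_VIDEOS = 3670  # videos in the training set, as stated in the Training Data section

# Total annotations across all 10 task files.
total = sum(SAMPLES.values())

# Samples per video for each task: 2 for the two bcq variants, 1 elsewhere.
per_video = {task: count // NUM_VIDEOS for task, count in SAMPLES.items()}

print(total)
print(per_video["bcq"], per_video["mcq"])
```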

Annotations ship as 10 JSON files, one per task (bcq.json, bcq_openended.json, mcq.json, mcq_openended.json, open_qa.json, scene_description.json, video_summarization.json, temporal_localization.json, causal_linkage.json, temporal_description.json). Each file is a flat JSON list. Every item shares the same four identity and question fields, joined on [video_id, task_type, item_index].

Annotation format example (Event Verification task):

{
  "format": "tao-vl-reason-v1.0",
  "metadata": {
    "type": "annotation",
    "task": "bcq",
    "license": "CC-BY-4.0"
  },
  "media_root": null,
  "items": [
    {
      "video_id": "TAD/01_Accident_001.mp4",
      "question": "Does a rear-end collision occur in the video?\nAnswer with Yes or No.",
      "answer": "Yes",
      "reasoning": "The video shows a nighttime scene at a four-way intersection...",
      "item_index": "0"
    }
  ]
}

video_id is a relative path made of a source-dataset directory followed by an .mp4 file name (as in TAD/01_Accident_001.mp4 above), matching the layout produced by the download script (set the loader’s media_root to the script’s --out directory).
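
A minimal loading sketch follows, assuming the file layout shown in the example above. Because the text describes each file as a flat list while the example shows a wrapper object with an "items" key, the hypothetical helper load_items accepts either shape. The helper names and the "videos" directory are illustrative and are not part of the official tooling.

```python
import json
from pathlib import Path


def load_items(annotation_path):
    """Return the list of annotation items stored in one task file."""
    with open(annotation_path, encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: a wrapper object keeps its items under "items";
    # a bare list is returned unchanged.
    if isinstance(data, dict):
        return data["items"]
    return data


def resolve_video(item, media_root):
    """Join the download directory with the item's relative video_id."""
    return Path(media_root) / item["video_id"]


# Illustrative item; "videos" stands in for the download script's output directory.
example_item = {"video_id": "TAD/01_Accident_001.mp4", "item_index": "0"}
print(resolve_video(example_item, "videos").as_posix())
```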

Data Access

Training annotations can be found on Hugging Face.

A download script is included to retrieve the source videos from their original public repositories — all eight sources are fetched automatically (the script runs the per-source post-processing needed to match annotation paths). By downloading and using the data, participants agree to comply with each source dataset’s license and terms of use, and to cite the corresponding paper or dataset release for each source they use in addition to the AI City Challenge 2026 Track 3 release.

In-Domain Test Sets

Our in-domain TAR test set comprises 960 human-curated annotations covering 80 short clips trimmed from 17 public YouTube videos, evaluated on the same 10 task types as the TAR training set.

The TAR test set is released under the test split of the TAR Hugging Face dataset (https://huggingface.co/datasets/nvidia/PhysicalAI-Traffic-Anomaly-Reasoning#test-split):

- test.json — 960 items in tao-vl-reason-v1.0 format, answers redacted.
- clip_manifest.csv — per-clip YouTube source URL and start/end timestamps.
- download_test_videos.py — yt-dlp + ffmpeg helper that downloads each source video once and trims it into the per-clip files referenced by test.json.
- evaluate.py — submission validator + scorer (auto-detects the redacted answers and runs format validation only on the public test.json).
- submission.example.csv — reference rows showing the expected (item_index, prediction) shape for each prediction format.

Submission Format

A single CSV with two columns — item_index (the 16-hex sample id from test/test.json; the join key) and prediction (the output text from the model or the system). Multi-line predictions are fine; pandas CSV quoting handles them.

item_index,prediction
bfaa0b67a0385860,Yes.
b944ca5ad1567362,Yes. There is a collision.
fcc257a9dfd308b8,A
b22a0fcaac174951,"```json
{""start"": ""00:00"", ""end"": ""00:01""}
```"
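
One way to produce such a file is sketched below with Python's standard csv module, which quotes fields containing commas, quotes, or newlines in the same way as the example rows. The function name write_submission and the second sample prediction are illustrative; only the two column names come from the format described above.

```python
import csv
import io


def write_submission(rows, fileobj):
    """Write (item_index, prediction) pairs as a two-column CSV."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["item_index", "prediction"])
    for item_index, prediction in rows:
        writer.writerow([item_index, prediction])


rows = [
    ("bfaa0b67a0385860", "Yes."),
    # A multi-line prediction containing double quotes; the writer
    # wraps it in quotes and doubles the embedded quote characters.
    ("b22a0fcaac174951", '{"start": "00:00",\n "end": "00:01"}'),
]

buffer = io.StringIO()
write_submission(rows, buffer)

# Reading the text back recovers the original multi-line prediction.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed[2][1] == rows[1][1])
```

To write to disk instead of an in-memory buffer, open the target file with newline="" as recommended in the csv module documentation.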

Evaluation

Task type	Metric
BCQ	yes/no accuracy (regex extraction)
MCQ	Letter accuracy (regex extraction)
Temporal localization	Not scored on the leaderboard (excluded from the overall score as of Jul 1, 2026). Previously: Mean IoU over {“start”, “end”} JSON predictions. Items remain in the dataset.
Open-ended tasks (bcq_openended, mcq_openended, open_qa, causal_linkage, scene_description, temporal_description, video_summarization)	BERTScore F1 (roberta-large, rescale_with_baseline=True)
Overall	Unweighted mean of the per-task metrics above, over the 9 scored task types (temporal_localization excluded).
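
The sketch below illustrates how the regex-based accuracies and the unweighted overall mean could be computed. It is not the official evaluate.py: the two regular expressions, the assumption that MCQ options are lettered A to D, and all function names are illustrative, and the BERTScore F1 values for the open-ended tasks are assumed to be computed separately and passed in as numbers.

```python
import re
from statistics import mean

# The 9 task types that count toward the overall score.
SCORED_TASKS = [
    "bcq", "mcq", "bcq_openended", "mcq_openended", "open_qa",
    "causal_linkage", "scene_description", "temporal_description",
    "video_summarization",
]


def extract_yes_no(text):
    """Return 'yes' or 'no' for the first standalone match, else None (assumed pattern)."""
    match = re.search(r"\b(yes|no)\b", text, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def extract_letter(text):
    """Return a leading option letter A-D, else None (assumed pattern and option range)."""
    match = re.match(r"\s*\(?([A-D])\b", text)
    return match.group(1) if match else None


def accuracy(predictions, answers, extract):
    """Fraction of predictions whose extracted label equals the extracted answer."""
    hits = 0
    for prediction, answer in zip(predictions, answers):
        label = extract(prediction)
        hits += label is not None and label == extract(answer)
    return hits / len(answers)


def overall_score(per_task):
    """Unweighted mean over the 9 scored task types; other keys are ignored."""
    return mean(per_task[task] for task in SCORED_TASKS)


bcq_acc = accuracy(["Yes.", "No, nothing happens", "maybe"], ["Yes", "Yes", "No"], extract_yes_no)
print(bcq_acc)
```

Note that a prediction from which no label can be extracted counts as incorrect in this sketch, and that temporal_localization is ignored by overall_score even if a value for it is supplied.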

Here is a reference implementation of the scoring above. You can point it at any tao-vl-reason-v1.0 GT file with real answers (e.g. a held-out validation subset) to compute the per-task metrics locally during development.

The public leaderboard is live, with baseline submissions posted: https://eval.aicitychallenge.org/aicity2026/submission/leaderboard?trackId=3&type=general. Please check the Evaluation System section for submission guidelines and for the Public vs General leaderboards.

Out-of-Domain Test Sets

Out-of-Domain Test Set 1: FishEye Traffic Violation (FETV) dataset

FETV is an optional out-of-domain leaderboard, submitted and scored through the evaluation server as Track 7. The test set consists of 200 short video clips extracted from Fisheye8K source videos, featuring fisheye footage of traffic violations at intersections. The dataset covers 7 violation types: no_violation, lane_discipline, jaywalking, wrong_way, lane_use_control, red_light, and uturn (U-turn).

Annotations

Each clip is annotated by human annotators with 12 structured target variables plus a free-form caption (initially drafted by LLMs, then manually revised and verified by human annotators). For each clip, teams must produce all 12 structured target variables and a caption.

Field	Options
date	YYYY-MM-DD
time	HH:MM:SS
violation_type	wrong_way, uturn, jaywalking, red_light, lane_use_control, lane_discipline, no_violation
violator_type	car, motorcycle, pedestrian, bus, truck, na
color	dark, light, red, green, yellow, blue, mixed, na
initial_position	Top-Left, Top-Center, Top-Right, Middle-Left, Middle-Center, Middle-Right, Bottom-Left, Bottom-Center, Bottom-Right, na
final_position	Top-Left, Top-Center, Top-Right, Middle-Left, Middle-Center, Middle-Right, Bottom-Left, Bottom-Center, Bottom-Right, na
initial_lane	1, 2, 3, 4, na
final_lane	1, 2, 3, 4, na
intersection_type	T-intersection, four-way intersection
weather	clear, rainy, cloudy
light	daylight, night

Access to the dataset and more detailed documentation can be found here: https://github.com/MoyoG/FETV. Participants are also free to use the previously released Fisheye8K dataset as supplementary training data, subject to the data license.

Submission Format

A flat JSON array, one object per clip, containing clip_name and the 12 answer_* fields plus answer_description.

[
  {
    "clip_name": "001_000.mp4",
    "answer_date": "2026-01-01",
    "answer_time": "12:34:56",
    "answer_violation_type": "wrong_way",
    "answer_violator_type": "car",
    "answer_color": "light",
    "answer_initial_position": "Top-Left",
    "answer_initial_lane": "1",
    "answer_final_position": "Middle-Right",
    "answer_final_lane": "2",
    "answer_intersection_type": "T-intersection",
    "answer_weather": "clear",
    "answer_light": "daylight",
    "answer_description": "Dummy event."
  }
]
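
Before uploading, each entry can be checked against the option lists in the annotation table. The validator below is a sketch under the assumption that every answer_* field must use exactly the spellings listed there; the function name validate_entry and the wording of the error messages are illustrative, and the evaluation server may be stricter or more lenient.

```python
from datetime import datetime

POSITIONS = {f"{row}-{col}" for row in ("Top", "Middle", "Bottom")
             for col in ("Left", "Center", "Right")} | {"na"}
LANES = {"1", "2", "3", "4", "na"}

# Allowed values per categorical field, copied from the annotation table.
ALLOWED = {
    "answer_violation_type": {"wrong_way", "uturn", "jaywalking", "red_light",
                              "lane_use_control", "lane_discipline", "no_violation"},
    "answer_violator_type": {"car", "motorcycle", "pedestrian", "bus", "truck", "na"},
    "answer_color": {"dark", "light", "red", "green", "yellow", "blue", "mixed", "na"},
    "answer_initial_position": POSITIONS,
    "answer_final_position": POSITIONS,
    "answer_initial_lane": LANES,
    "answer_final_lane": LANES,
    "answer_intersection_type": {"T-intersection", "four-way intersection"},
    "answer_weather": {"clear", "rainy", "cloudy"},
    "answer_light": {"daylight", "night"},
}
TIMESTAMP_FORMATS = {"answer_date": "%Y-%m-%d", "answer_time": "%H:%M:%S"}


def validate_entry(entry):
    """Return a list of human-readable problems; an empty list means the entry looks fine."""
    problems = []
    for key in ("clip_name", "answer_description"):
        if not isinstance(entry.get(key), str):
            problems.append(f"{key}: missing or not a string")
    for key, fmt in TIMESTAMP_FORMATS.items():
        try:
            datetime.strptime(entry.get(key, ""), fmt)
        except (ValueError, TypeError):
            problems.append(f"{key}: expected format {fmt}")
    for key, options in ALLOWED.items():
        if entry.get(key) not in options:
            problems.append(f"{key}: {entry.get(key)!r} is not an allowed option")
    return problems


entry = {
    "clip_name": "001_000.mp4", "answer_date": "2026-01-01", "answer_time": "12:34:56",
    "answer_violation_type": "wrong_way", "answer_violator_type": "car",
    "answer_color": "light", "answer_initial_position": "Top-Left",
    "answer_initial_lane": "1", "answer_final_position": "Middle-Right",
    "answer_final_lane": "2", "answer_intersection_type": "T-intersection",
    "answer_weather": "clear", "answer_light": "daylight",
    "answer_description": "Dummy event.",
}
print(validate_entry(entry))
```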

Evaluation

Categorical fields are scored with macro-averaged F1. The date field is scored as an exact match; the time field is scored as a binary match with a 7-second tolerance. The caption is scored with the average of normalized CIDEr and BERTScore. The final FETV score is:

S_FETV = 0.25 · CIDEr_norm + 0.25 · BERTScore + 0.5 · MacroF1

where MacroF1 is the macro-averaged F1 over the categorical target variables together with the date-match and 7-second time-tolerance scores. The leaderboard reports the final score and the per-field F1 scores so teams can see which variables they do well on.
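
The weighting can be sketched as follows. Two details are assumptions rather than statements from the organizers: the 7-second tolerance is treated as inclusive, and MacroF1 is taken to be the plain mean of the per-field scores (categorical F1 values plus the date and time matches). All function names are illustrative.

```python
from datetime import datetime
from statistics import mean


def time_match(pred, gt, tolerance_s=7):
    """Return 1.0 if two HH:MM:SS times differ by at most tolerance_s seconds, else 0.0."""
    fmt = "%H:%M:%S"
    try:
        delta = datetime.strptime(pred, fmt) - datetime.strptime(gt, fmt)
    except (ValueError, TypeError):
        return 0.0  # malformed predictions are treated as a miss
    return 1.0 if abs(delta.total_seconds()) <= tolerance_s else 0.0


def date_match(pred, gt):
    """Exact string match on the YYYY-MM-DD date."""
    return 1.0 if pred == gt else 0.0


def macro_over_fields(field_scores):
    """Assumed aggregation: unweighted mean over all per-field scores."""
    return mean(field_scores.values())


def fetv_score(cider_norm, bertscore, macro_f1):
    """S_FETV = 0.25 * CIDEr_norm + 0.25 * BERTScore + 0.5 * MacroF1."""
    return 0.25 * cider_norm + 0.25 * bertscore + 0.5 * macro_f1


# Illustrative numbers only, not real results.
print(time_match("12:34:56", "12:35:02"))
print(fetv_score(0.4, 0.8, 0.6))
```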

The FETV leaderboard is here: https://eval.aicitychallenge.org/aicity2026/submission/leaderboard?trackId=7&type=general. Please check the Evaluation System section for submission guidelines and for the Public vs General leaderboards. Please read the “Awards and Eligibility” section below for prize eligibility.

Out-of-Domain Test Set 2: Egocentric Dashcam Pedestrian Intent (PSI VQA)

PSI VQA is an optional out-of-domain leaderboard, submitted and scored through the evaluation server as Track 8. The test set consists of 40 egocentric dashcam video clips from the PSI 2.0 dataset, featuring pedestrian crossing scenarios from the driver’s perspective. This is a significant domain shift from the CCTV training data in both visual domain (egocentric vs. overhead surveillance) and task formulation (pedestrian intent reasoning vs. anomaly reasoning).

Annotations

PSI VQA defines four sub-tasks that reuse the in-domain task types directly, so the same unified model can be evaluated across the domain shift. In each clip the target pedestrian is marked with a red bounding box during the first second of the clip.

- PSI-T1: BCQ — Crossing-Intent Binary Classification.
  - Videos where annotators agreed on the target pedestrian’s intent. Output “Yes” or “No”: does this pedestrian intend to cross in front of the ego-vehicle within the observation window?
- PSI-T2: Open QA — Ambiguous-Intent Cue Articulation.
  - Videos where annotators disagreed about intent. For each video, three independent sub-questions are asked: why the pedestrian might intend to cross, why the pedestrian might NOT intend to cross, and why the intent might be uncertain. The model answers each with a bulleted list of visual cues, or “None” if no supporting cues exist.
- PSI-T3: MCQ — Cue Identification with Mixed Distractors.
  - Same ambiguous-intent videos as PSI-T2, plus a specific intent sub-question and four options. Output a single letter A, B, C, or D for the option that best describes the visual evidence for the queried intent.
- PSI-T4: Temporal Localization — Driver-Decision Critical Interval.
  - All test videos. Output JSON in the form {“start”: “MM:SS”, “end”: “MM:SS”} for the interval during which a road user (pedestrian, cyclist, or vehicle) or road factor (signal, sign, or road condition) most influences the driver’s decision-making.

The PSI VQA test set is gated under the TASI Benchmark Data Sharing Agreement, which restricts use to academic and non-commercial research: https://huggingface.co/datasets/ise-ice-lab/PSI_VQA. Participants are also free to use the PSI_VQA training split and/or the previously released PSI 2.0 dataset as supplementary training data, subject to the data license.

Submission Format

PSI predictions follow the same CSV schema as in-domain submissions. The submission is expected to be a single CSV with columns item_index (the per-task item id; the join key) and prediction (raw model output). The expected prediction per sub-task:

- PSI-T1 (BCQ): text starting with “Yes” or “No”.
- PSI-T2 (Open QA): a bulleted cue list, or “None”.
- PSI-T3 (MCQ): a single letter A, B, C, or D.
- PSI-T4 (Temporal Localization): JSON in the form {“start”: “MM:SS”, “end”: “MM:SS”}.

item_index,prediction
af612fe6c7a21ab1,Yes
490bc8f79dc97e68,A
81e15aa0170b2aa9,"- Bullet point 1.
- Bullet point 2."
bf38384086599009,None
46242a7a89cabe21,"{""start"": ""00:01"", ""end"": ""00:02""}"
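
A lightweight pre-submission check of the four prediction shapes might look like the sketch below. The sub-task labels are passed in explicitly because item_index alone does not reveal the sub-task. The "- " bullet marker and the two-digit MM:SS pattern are assumptions taken from the sample rows, and looks_valid is a hypothetical helper, not part of the official validator.

```python
import json
import re


def looks_valid(sub_task, prediction):
    """Return True if the prediction has the shape expected for the given PSI sub-task."""
    text = prediction.strip()
    if sub_task == "PSI-T1":  # text starting with Yes or No
        return re.match(r"(yes|no)\b", text, flags=re.IGNORECASE) is not None
    if sub_task == "PSI-T2":  # bulleted cue list, or the literal None
        if text == "None":
            return True
        lines = [line for line in text.splitlines() if line.strip()]
        return bool(lines) and all(line.lstrip().startswith("- ") for line in lines)
    if sub_task == "PSI-T3":  # a single option letter
        return text in {"A", "B", "C", "D"}
    if sub_task == "PSI-T4":  # JSON object with MM:SS start and end
        try:
            obj = json.loads(text)
        except ValueError:
            return False
        if not isinstance(obj, dict):
            return False
        return all(
            isinstance(obj.get(key), str) and re.fullmatch(r"\d{2}:\d{2}", obj[key])
            for key in ("start", "end")
        )
    raise ValueError(f"unknown sub-task: {sub_task}")


print(looks_valid("PSI-T4", '{"start": "00:01", "end": "00:02"}'))
```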

Evaluation

Each sub-task is scored with its own metric, which in some cases differs from the metric used for the corresponding in-domain task type:

- PSI-T1 (BCQ): Macro-F1, with Accuracy reported as secondary.
- PSI-T2 (Open QA): Cue-level F1 using sentence-transformer (all-MiniLM-L6-v2) semantic matching at cosine threshold 0.55. Prediction cues are matched against GT cues; precision and recall are computed at the cue level and averaged into F1. Where the GT is “None”, predicting “None” scores 1.0 and predicting cues scores 0.0.
- PSI-T3 (MCQ): Accuracy.
- PSI-T4 (Temporal Localization): Mean Temporal IoU; unparseable predictions count as IoU = 0.
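
The cue-level F1 for PSI-T2 can be sketched as below. The matching rule is an assumption: a predicted cue counts toward precision if its similarity to any GT cue reaches the threshold, and a GT cue counts toward recall if any predicted cue reaches it. The similarity function is injected; for the real metric it would be the cosine similarity of all-MiniLM-L6-v2 sentence embeddings, while the token-overlap function used here is only a dependency-free stand-in for demonstration.

```python
def parse_cues(text):
    """Split a bulleted answer into cue strings; 'None' (or empty text) yields no cues."""
    text = text.strip()
    if text.lower() == "none":
        return []
    return [line.lstrip("-*• ").strip() for line in text.splitlines() if line.strip()]


def cue_f1(pred_text, gt_text, similarity, threshold=0.55):
    """Cue-level F1 between a predicted and a ground-truth cue list."""
    pred, gt = parse_cues(pred_text), parse_cues(gt_text)
    if not gt:  # GT is "None": only an empty prediction scores 1.0
        return 1.0 if not pred else 0.0
    if not pred:
        return 0.0
    matched_pred = sum(any(similarity(p, g) >= threshold for g in gt) for p in pred)
    matched_gt = sum(any(similarity(p, g) >= threshold for p in pred) for g in gt)
    precision, recall = matched_pred / len(pred), matched_gt / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_overlap(a, b):
    """Stand-in similarity (Jaccard over lowercase tokens), NOT the official embedding model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


pred = "- pedestrian looks at the car\n- dog barks"
gt = "- pedestrian looks at the car\n- pedestrian steps forward"
print(cue_f1(pred, gt, token_overlap))
```

With the stand-in similarity, one of two predicted cues matches and one of two GT cues is recovered, so precision and recall are both 0.5.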

The four sub-task scores are normalized to [0, 100] and combined with equal weight into the overall PSI VQA score:

S_PSI = 0.25 · PSI-T1 + 0.25 · PSI-T2 + 0.25 · PSI-T3 + 0.25 · PSI-T4
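
A sketch of the temporal IoU for PSI-T4 and of the equal-weight combination follows. It assumes that each sub-task score lies in [0, 1] before being scaled to [0, 100], and that a prediction with a reversed or empty interval scores 0; the function names are illustrative and this is not the official scorer.

```python
import json


def to_seconds(mmss):
    """Convert an MM:SS string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def temporal_iou(pred_text, gt):
    """IoU between a predicted JSON interval and a ground-truth interval dict."""
    try:
        pred = json.loads(pred_text)
        p_start, p_end = to_seconds(pred["start"]), to_seconds(pred["end"])
    except (ValueError, KeyError, TypeError, AttributeError):
        return 0.0  # unparseable predictions count as IoU = 0
    g_start, g_end = to_seconds(gt["start"]), to_seconds(gt["end"])
    if p_end <= p_start:
        return 0.0  # assumption: reversed or empty intervals score 0
    intersection = max(0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


def psi_score(t1, t2, t3, t4):
    """Equal-weight combination of the four sub-task scores, scaled to [0, 100]."""
    return 100 * (0.25 * t1 + 0.25 * t2 + 0.25 * t3 + 0.25 * t4)


gt_interval = {"start": "00:02", "end": "00:04"}
print(temporal_iou('{"start": "00:01", "end": "00:03"}', gt_interval))
print(psi_score(1.0, 0.5, 0.5, 0.0))
```

In the first call the intervals overlap for 1 second out of a 3-second union, so the IoU is 1/3.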

The PSI VQA leaderboard is here: https://eval.aicitychallenge.org/aicity2026/submission/leaderboard?trackId=8&type=general. Please check the Evaluation System section for submission guidelines and for the Public vs General leaderboards. Please read the “Awards and Eligibility” section below for prize eligibility.

Awards and Eligibility

Track 3 offers prizes on each of the three leaderboards:

- TAR (main, in-domain): winner receives an RTX GPU; runner-up receives an NVIDIA Jetson Orin Nano Super Developer Kit.
- FETV (OOD 1): the top team receives an NVIDIA Jetson Orin Nano Super Developer Kit.
- PSI VQA (OOD 2): the top team receives an NVIDIA Jetson Orin Nano Super Developer Kit.

To keep each out-of-domain award a demonstration of generalization from the in-domain TAR task, eligibility is as follows:

- TAR prize: a valid submission to the TAR leaderboard.
- FETV prize: valid submissions to both TAR and FETV.
- PSI VQA prize: valid submissions to both TAR and PSI VQA.

Teams may enter any subset of the three leaderboards; entering all three is not required.
