University of Southern California

GameplayQA

A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Towards Multi-Agent Video Understanding

Why synchronized multi-viewpoint reasoning matters

Many real-world tasks demand simultaneous reasoning across multiple synchronized viewpoints. Esports and 3D gameplay offer a uniquely controlled testbed for these challenges.

🤖

Robotic & Autonomous Fleets

Self-driving cars, delivery robots, and surveillance drones share egocentric camera streams to coordinate lane changes, avoid hazards, and plan collectively across agents that each see only part of the scene.

📹

Human Teams

Law enforcement officers wearing bodycams capture the same incident from different angles, requiring cross-POV referencing of feeds to reconstruct a coherent timeline.

🏟️

Sports & Esports Analytics

Broadcast cameras and player POV feeds must be fused to understand formations, predict plays, and generate highlight commentary across synchronized viewpoints.

Reasoning Taxonomy

We organize perception around a triadic Self-Other-World entity decomposition. For each entity, we distinguish dynamic and static properties—Action vs. State for agents, Event vs. Object for the world—yielding six primitive label types.

These primitives compose into 15 task categories across three entity perspectives (see the code sketch below the list):

  • S Self — questions about the agent’s own actions, states, and decisions as seen from its first-person POV.
  • O Other — questions about the behavior, intent, and actions of other agents observed in the gameplay footage.
  • W World — questions about environmental objects, events, and state changes in the 3D game world.
Self-Other-World reasoning taxonomy
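
As a compact summary, here is a minimal sketch of the taxonomy in code. The grouping of category codes under L1-L3 mirrors the leaderboard's column groups below; the mapping is our reading of the taxonomy, not an official schema:

from enum import Enum

class Entity(Enum):
    SELF = "S"    # the agent's own first-person POV
    OTHER = "O"   # other agents observed in the footage
    WORLD = "W"   # the 3D game world

# Dynamic vs. static properties per entity: Action vs. State for agents,
# Event vs. Object for the world -- the six primitive label types.
PRIMITIVES = {
    Entity.SELF:  ("Action", "State"),
    Entity.OTHER: ("Action", "State"),
    Entity.WORLD: ("Event", "Object"),
}

# The 15 task categories, grouped by cognitive level using the
# category codes from the leaderboard below.
LEVELS = {
    "L1 Single Reference": ["ActRec", "StaRec", "ObjRec", "EvtRec", "SOC", "X-Ent"],
    "L2 Temporal":         ["TsRef", "TimLoc", "AbsRec", "OccCnt", "Order", "Intent"],
    "L3 Cross-Video":      ["SyncRef", "X-VOrd", "POV-ID"],
}

assert sum(len(v) for v in LEVELS.values()) == 15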

Annotation Software

We developed a custom multi-track timeline annotation tool purpose-built for dense, synchronized multi-POV gameplay captioning.
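
The tool itself is interactive, but the kind of record it emits can be sketched roughly as follows. Field names such as pov_id are hypothetical illustrations, not the tool's actual schema:

from dataclasses import dataclass, field

@dataclass
class Caption:
    start: float   # seconds on the shared session clock
    end: float
    text: str      # dense gameplay caption for this span

@dataclass
class Track:
    pov_id: str    # hypothetical identifier, e.g. "player_1"
    video_path: str
    captions: list[Caption] = field(default_factory=list)

@dataclass
class Session:
    game: str
    tracks: list[Track] = field(default_factory=list)  # one per synchronized POV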

Example Questions

Browse diagnostic QA pairs across cognitive levels. Each slide pairs a gameplay video clip with its corresponding question.

L1 · SA-Identify

What weapon is the player currently holding at timestamp 0:12?

  • A AK-47
  • B M4A1-S
  • C AWP
  • D Desert Eagle
L2 · OA-Temporal

Which action did the enemy agent perform immediately before throwing the flashbang?

  • A Switched to knife
  • B Stopped moving behind cover
  • C Reloaded rifle
  • D Jumped onto the box
L2 · WE-Change

What world event occurred between 0:08 and 0:15 in the gameplay clip?

  • A Bomb was planted at site A
  • B Round timer expired
  • C Smoke grenade detonated at mid
  • D Teammate was eliminated
L3 · Cross-Video

Comparing Player 1's and Player 2's POVs, which agent initiated the engagement at bombsite B?

  • A Player 1 (CT side)
  • B Player 2 (T side)
  • C A third unseen agent
  • D Neither — it was simultaneous
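
A minimal sketch of how such an item might be represented as data, using the first example above. This is our illustrative format, not the benchmark's released schema, and the answer key shown is a placeholder:

from dataclasses import dataclass

@dataclass
class QAItem:
    level: str                 # "L1", "L2", or "L3"
    category: str              # e.g. "SA-Identify", "OA-Temporal", "WE-Change"
    videos: list[str]          # one clip for L1/L2; multiple synced POVs for L3
    question: str
    options: dict[str, str]    # option key -> option text
    answer: str                # correct option key

item = QAItem(
    level="L1",
    category="SA-Identify",
    videos=["player1_pov.mp4"],   # hypothetical path
    question="What weapon is the player currently holding at timestamp 0:12?",
    options={"A": "AK-47", "B": "M4A1-S", "C": "AWP", "D": "Desert Eagle"},
    answer="A",                   # placeholder, not the dataset's answer key
)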

Leaderboard

Model performance across task categories. Click any column header to sort; the best and second-best results in each column are highlighted.

Columns: Model, All, and the 15 task categories grouped by level:

  • L1 · Single Reference: ActRec, StaRec, ObjRec, EvtRec, SOC, X-Ent
  • L2 · Temporal: TsRef, TimLoc, AbsRec, OccCnt, Order, Intent
  • L3 · Cross-Video: SyncRef, X-VOrd, POV-ID

Error Source Analysis

Fine-grained error analysis across four dimensions reveals that models fail primarily on temporal and cross-video grounding rather than scene-level perception, and that game pace, video length, and number of synchronized perspectives all compound errors.

  • Error Rate by Distractor Type: cross-video and temporal distractors cause the most errors.
  • Error Rate by Game: fast-paced shooters are the hardest.
  • Error Rate by Video Duration: error increases with video length.
  • Error Rate by Number of Videos: error scales with the number of synchronized videos.
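
Each dimension above amounts to a group-by over per-question evaluation records. A minimal sketch of such a breakdown, with hypothetical record fields:

from collections import defaultdict

# Hypothetical evaluation records: one per answered question.
records = [
    {"distractor": "cross-video", "duration_s": 45,  "num_videos": 2, "correct": False},
    {"distractor": "temporal",    "duration_s": 120, "num_videos": 4, "correct": True},
]

def error_rate_by(records, key):
    total, wrong = defaultdict(int), defaultdict(int)
    for r in records:
        total[r[key]] += 1
        wrong[r[key]] += not r["correct"]
    return {k: wrong[k] / total[k] for k in total}

for dim in ("distractor", "duration_s", "num_videos"):
    print(dim, error_rate_by(records, dim))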

Language Prior and Temporal Ablation

To disentangle visual grounding from temporal reasoning, we evaluate GPT-5 Mini under degraded input conditions: no video, a single random frame, and shuffled frames.

Condition All L1 L2 L3
Baseline (Full Video) 62.7 67.2 61.9 60.6
No Video 29.4 36.0 29.1 24.2
Random Frame 41.7 52.9 40.9 33.7
Shuffled Frames 54.8 63.1 52.6 53.4
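
A rough sketch of how the three degraded conditions could be constructed from a decoded frame sequence; this illustrates the setup described above, not the exact evaluation harness:

import random

def degrade(frames, condition, seed=0):
    """Return the frames a model sees under each ablation condition."""
    rng = random.Random(seed)
    if condition == "no_video":
        return []                      # question text only, no visual input
    if condition == "random_frame":
        return [rng.choice(frames)]    # static content, no temporal dynamics
    if condition == "shuffled_frames":
        shuffled = list(frames)
        rng.shuffle(shuffled)          # same content, temporal order destroyed
        return shuffled
    return list(frames)                # baseline: full video in order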

  • −33.3 (No Video vs. Baseline): language priors alone are insufficient.
  • −21.0 (Random Frame vs. Baseline): static content helps but cannot replace temporal dynamics.
  • −7.9 (Shuffled Frames vs. Baseline): temporal ordering is critical for reasoning tasks.

Team

  • Yunzhe Wang · PhD Student, CS
  • Runhui Xu · MS Student, CS
  • Kexin Zheng · PhD Student, CS
  • Tianyi Zhang · PhD Student, CS
  • Jayavibhav Niranjan Kogundi · MS Student, CS
  • Soham Hans · PhD Student, CS
  • Volkan Ustun · Director, HATS — ICT

Citation

@article{wang2026gameplayqa,
  title={GameplayQA: A Benchmarking Framework for Decision-Dense
         POV-Synced Multi-Video Understanding of 3D Virtual Agents},
  author={Wang, Yunzhe and Xu, Runhui and Zheng, Kexin and
          Zhang, Tianyi and Kogundi, Jayavibhav Niranjan and
          Hans, Soham and Ustun, Volkan},
  year={2026}
}