University of Southern California

GameplayQA

A Benchmarking Framework for Decision-Dense
POV-Synced Multi-Video Understanding
of 3D Virtual Agents

Yunzhe Wang · Runhui Xu · Kexin Zheng · Tianyi Zhang · Jayavibhav N. Kogundi · Soham Hans · Volkan Ustun

arXiv Benchmark Code Annotation Software Annotation Live Demo 🤗Dataset Annotation Demo

Towards Multi-Agent Video Understanding

Why synchronized multi-viewpoint reasoning matters

Many real-world tasks demand reasoning across multiple synchronized viewpoints simultaneously, where Esports and 3D gameplay offer a uniquely controlled testbed for these challenges.

🤖

Robot & Autonomous Fleets

Self-driving cars, delivery robots, and surveillance drones share egocentric camera streams to coordinate lane changes, avoid hazards, and plan collectively across agents that each see only part of the scene.

📹

Human Teams

Law enforcement officers wearing bodycams capture the same incident from different angles, requiring cross-pov referencing feeds to reconstruct a coherent timeline.

🏟️

Sports & Esports Analytics

Broadcast cameras and player POV feeds must be fused to understand formations, predict plays, and generate highlight commentary across synchronized viewpoints.

Reasoning Taxonomy

We organize perception around a triadic Self–Other–World entity decomposition. For each entity, we distinguish dynamic and static properties—Action vs. State for agents, Object vs. Event for the world—yielding six primitive label types.

These primitives compose into 15 task categories across three entity perspectives:

S Self — questions about the agent’s own actions, states, and decisions as seen from its first-person POV.
O Other — questions about the behavior, intent, and actions of other agents observed in the gameplay footage.
W World — questions about environmental objects, events, and state changes in the 3D game world.

Annotation Software

We developed a custom multi-track timeline annotation tool purpose-built for dense, synchronized multi-POV gameplay captioning.

View on GitHub Play around the app live

Example Questions

Browse diagnostic QA pairs across cognitive levels. Each slide pairs a gameplay video clip with its corresponding question.

L1 · SA-Identify

What weapon is the player currently holding at timestamp 0:12?

A AK-47
B M4A1-S
C AWP
D Desert Eagle

L2 · OA-Temporal

Which action did the enemy agent perform immediately before throwing the flashbang?

A Switched to knife
B Stopped moving behind cover
C Reloaded rifle
D Jumped onto the box

L2 · WE-Change

What world event occurred between 0:08 and 0:15 in the gameplay clip?

A Bomb was planted at site A
B Round timer expired
C Smoke grenade detonated at mid
D Teammate was eliminated

L3 · Cross-Video

Comparing Player 1's and Player 2's POVs, which agent initiated the engagement at bombsite B?

A Player 1 (CT side)
B Player 2 (T side)
C A third unseen agent
D Neither — it was simultaneous

Leaderboard

Model performance across task categories. Click any column header to sort. = best, = second best.

Model	All	ActRec	StaRec	ObjRec	EvtRec	SOC	X-Ent	TsRef	TimLoc	AbsRec	OccCnt	Order	Intent	SyncRef	X-VOrd	POV-ID
		L1 · Single Reference					L2 · Temporal							L3 · Cross-Video

Error Source Analysis

Fine-grained error analysis across four dimensions reveals that models fail primarily on temporal and cross-video grounding rather than scene-level perception, and that game pace, video length, and number of synchronized perspectives all compound errors.

Error Rate by Distractor Type

Cross-video and temporal distractors cause the most errors

Error Rate by Game

Fast-paced shooters are the hardest

Error Rate by Video Duration

Error increases with video length

Error Rate by Number of Videos

Error scales with number of synchronized videos

Language Prior and Temporal Ablation

To disentangle visual grounding from temporal reasoning, we evaluate GPT-5 Mini under degraded input conditions: no video, a single random frame, and shuffled frames.

Condition	All	L1	L2	L3
Baseline (Full Video)	62.7	67.2	61.9	60.6
No Video	29.4	36.0	29.1	24.2
Random Frame	41.7	52.9	40.9	33.7
Shuffled Frames	54.8	63.1	52.6	53.4

−33.3

No Video vs. Baseline

Language priors alone are insufficient

−21.0

Random Frame vs. Baseline

Static content helps but can’t replace temporal dynamics

−7.9

Shuffled Frames vs. Baseline

Temporal ordering is critical for reasoning tasks

Team

Yunzhe Wang

PhD Student, CS

Runhui Xu

MS Student, CS

Kexin Zheng

PhD Student, CS

Tianyi Zhang

PhD Student, CS

Jayavibhav N. Kogundi

MS Student, CS

Soham Hans

PhD Student, CS

Volkan Ustun

Director, HATS — ICT

Citation

@article{wang2026gameplayqa,
  title={GameplayQA: A Benchmarking Framework for Decision-Dense
         POV-Synced Multi-Video Understanding of 3D Virtual Agents},
  author={Wang, Yunzhe and Xu, Runhui and Zheng, Kexin and
          Zhang, Tianyi and Kogundi, Jayavibhav Niranjan and
          Hans, Soham and Ustun, Volkan},
  year={2026}
}