ACL 2026 · University of Southern California
GameplayQA
A Benchmarking Framework for Decision-Dense
POV-Synced Multi-Video Understanding
of 3D Virtual Agents
Towards Multi-Agent Video Understanding
Why synchronized multi-viewpoint reasoning matters
Many real-world tasks demand reasoning across multiple synchronized viewpoints simultaneously. Esports and 3D gameplay offer a uniquely controlled testbed for these challenges.
Robot & Autonomous Fleets
Self-driving cars, delivery robots, and surveillance drones share egocentric camera streams to coordinate lane changes, avoid hazards, and plan collectively across agents that each see only part of the scene.
Human Teams
Law enforcement officers wearing bodycams capture the same incident from different angles; reconstructing a coherent timeline requires cross-referencing their POV feeds.
Sports & Esports Analytics
Broadcast cameras and player POV feeds must be fused to understand formations, predict plays, and generate highlight commentary across synchronized viewpoints.
Reasoning Taxonomy
We organize perception around a triadic Self–Other–World entity decomposition. For each entity, we distinguish dynamic and static properties—Action vs. State for agents, Object vs. Event for the world—yielding six primitive label types.
These primitives compose into 15 task categories across three entity perspectives:
- S Self — questions about the agent’s own actions, states, and decisions as seen from its first-person POV. (SA: Self Action, SS: Self State)
- O Other — questions about the behavior, intent, and actions of other agents observed in the gameplay footage. (OA: Other Action, OS: Other State)
- W World — questions about environmental objects, events, and state changes in the 3D game world. (WO: World Object, WE: World Event)
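The six primitives can be encoded compactly. The sketch below is illustrative only (the enum and property names are our own, not from the paper): Self and Other carry dynamic Actions and static States, while the World carries static Objects and dynamic Events.

```python
from enum import Enum

class Primitive(Enum):
    """The six primitive label types from the Self-Other-World decomposition."""
    SA = ("Self", "Action")    # dynamic property of the observing agent
    SS = ("Self", "State")     # static property of the observing agent
    OA = ("Other", "Action")   # dynamic property of another agent
    OS = ("Other", "State")    # static property of another agent
    WO = ("World", "Object")   # static property of the environment
    WE = ("World", "Event")    # dynamic property of the environment

    @property
    def entity(self) -> str:
        return self.value[0]

    @property
    def kind(self) -> str:
        return self.value[1]

# The two Self primitives:
print([p.name for p in Primitive if p.entity == "Self"])  # ['SA', 'SS']
```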
Combinatorial QA Generation
From timeline labels to question-answer pairs
The simplified interactive demo below, focused on Cross-Entity Referring questions, shows how questions are combinatorially generated from multi-track timeline labels and question templates. Select a question template and click an anchor label on the timeline to see how a question is composed.
1 Select Template
2 Timeline Annotation
3 Generated QA Pair
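The template-filling step above can be sketched in a few lines. This is a minimal illustration under our own assumptions: the `Label` fields, the template wording, and the overlap rule are hypothetical stand-ins for the paper's actual annotation schema.

```python
from dataclasses import dataclass

@dataclass
class Label:
    track: str     # which POV track the label sits on, e.g. a player ID
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # the timeline caption, e.g. "reloads weapon"

# Hypothetical cross-entity referring template; slot names are illustrative.
TEMPLATE = "While {anchor} in {anchor_track}'s view, what is {other_track} doing?"

def compose_qa(anchor: Label, labels: list[Label]) -> tuple[str, str]:
    """Fill the template with the anchor label and a co-occurring label
    from another track; that label's caption becomes the answer."""
    for other in labels:
        overlaps = other.start < anchor.end and anchor.start < other.end
        if other.track != anchor.track and overlaps:
            question = TEMPLATE.format(
                anchor=anchor.text,
                anchor_track=anchor.track,
                other_track=other.track,
            )
            return question, other.text
    raise ValueError("no co-occurring label on another track")

labels = [
    Label("P1", 10.0, 14.0, "reloads weapon"),
    Label("P2", 11.0, 13.0, "captures the objective"),
]
q, a = compose_qa(labels[0], labels)
# a == "captures the objective"
```

Because every (template, anchor label, co-occurring label) triple yields a candidate question, the number of QA pairs grows combinatorially with the density of timeline labels.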
Annotation Software
We developed a custom multi-track timeline annotation tool purpose-built for dense, synchronized multi-POV gameplay captioning.
Example Questions
Leaderboard
Model performance across task categories. Click any column header to sort. Highlighting marks the best and second-best scores.
| Model | All | ActRec | StaRec | ObjRec | EvtRec | SOC | X-Ent | TsRef | TimLoc | AbsRec | OccCnt | Order | Intent | SyncRef | X-VOrd | POV-ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Task categories group into three levels: L1 (Single Reference), L2 (Temporal), and L3 (Cross-Video).
Error Source Analysis
Fine-grained error analysis across four dimensions reveals that models fail primarily on temporal and cross-video grounding rather than scene-level perception, and that game pace, video length, and number of synchronized perspectives all compound errors.
Error Rate by Distractor Type
Cross-video and temporal distractors cause the most errors
Error Rate by Game
Fast-paced shooters are the hardest
Error Rate by Video Duration
Error increases with video length
Error Rate by Number of Videos
Error scales with number of synchronized videos
Language Prior and Temporal Ablation
To disentangle visual grounding from temporal reasoning, we evaluate GPT-5 Mini under degraded input conditions: no video, a single random frame, and shuffled frames.
| Condition | All | L1 | L2 | L3 |
|---|---|---|---|---|
| Baseline (Full Video) | 62.7 | 67.2 | 61.9 | 60.6 |
| No Video | 29.4 | 36.0 | 29.1 | 24.2 |
| Random Frame | 41.7 | 52.9 | 40.9 | 33.7 |
| Shuffled Frames | 54.8 | 63.1 | 52.6 | 53.4 |
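The three degraded conditions can be expressed as a simple input-preparation helper. This is an illustrative sketch (the function and condition names are ours, not the paper's evaluation code):

```python
import random

def degrade(frames: list, condition: str, seed: int = 0) -> list:
    """Return the frame sequence for one ablation condition."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    if condition == "no_video":
        return []                    # text-only: measures the language prior
    if condition == "random_frame":
        return [rng.choice(frames)]  # static content, no temporal dynamics
    if condition == "shuffled_frames":
        out = frames[:]
        rng.shuffle(out)             # content intact, temporal order destroyed
        return out
    return frames                    # full-video baseline

frames = list(range(8))  # stand-in for 8 sampled video frames
```

Comparing each condition's score against the full-video baseline isolates how much the model relies on visual content at all (no video), on static appearance (random frame), and on temporal order (shuffled frames).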
- −33.3 (No Video vs. Baseline): language priors alone are insufficient.
- −21.0 (Random Frame vs. Baseline): static content helps but can’t replace temporal dynamics.
- −7.9 (Shuffled Frames vs. Baseline): temporal ordering is critical for reasoning tasks.
Team
Citation
@article{wang2026gameplayqa,
title = {GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents},
author = {Wang, Yunzhe and Xu, Runhui and Zheng, Kexin and Zhang, Tianyi and Kogundi, Jayavibhav Niranjan and Hans, Soham and Ustun, Volkan},
year = {2026},
eprint = {2603.24329},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2603.24329}
}