ACL 2026 · University of Southern California

GameplayQA

A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Towards Multi-Agent Video Understanding

Why synchronized multi-viewpoint reasoning matters

Many real-world tasks demand reasoning across multiple synchronized viewpoints simultaneously; esports and 3D gameplay offer a uniquely controlled testbed for these challenges.

🤖

Robot & Autonomous Fleets

Self-driving cars, delivery robots, and surveillance drones share egocentric camera streams to coordinate lane changes, avoid hazards, and plan collectively across agents that each see only part of the scene.

📹

Human Teams

Law enforcement officers wearing bodycams capture the same incident from different angles, requiring cross-referencing of POV feeds to reconstruct a coherent timeline.

🏟️

Sports & Esports Analytics

Broadcast cameras and player POV feeds must be fused to understand formations, predict plays, and generate highlight commentary across synchronized viewpoints.

Reasoning Taxonomy

We organize perception around a triadic Self-Other-World entity decomposition. For each entity, we distinguish dynamic and static properties (Action vs. State for agents, Object vs. Event for the world), yielding six primitive label types.

These primitives compose into 15 task categories across three entity perspectives:

  • Self (S) — questions about the agent’s own actions, states, and decisions as seen from its first-person POV. Primitives: SA (Self Action), SS (Self State).
  • Other (O) — questions about the behavior, intent, and actions of other agents observed in the gameplay footage. Primitives: OA (Other Action), OS (Other State).
  • World (W) — questions about environmental objects, events, and state changes in the 3D game world. Primitives: WO (World Object), WE (World Event).
Self-Other-World reasoning taxonomy
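The six primitive label types follow mechanically from the entity-property decomposition above. A minimal sketch (the dictionary and function names are illustrative, not from the benchmark's codebase):

```python
# Hypothetical encoding of the Self-Other-World taxonomy: each entity is
# paired with its dynamic and static property, yielding six primitives.
ENTITY_PROPERTIES = {
    "Self":  ["Action", "State"],   # SA, SS
    "Other": ["Action", "State"],   # OA, OS
    "World": ["Object", "Event"],   # WO, WE
}

def primitive_labels():
    """Enumerate the six primitive label types, e.g. 'SA' for Self Action."""
    return [
        entity[0] + prop[0]
        for entity, props in ENTITY_PROPERTIES.items()
        for prop in props
    ]

print(primitive_labels())  # ['SA', 'SS', 'OA', 'OS', 'WO', 'WE']
```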

Combinatorial QA Generation

From timeline labels to question-answer pairs

The simplified interactive demo below, focused on Cross-Entity Referring questions, shows how questions are combinatorially composed from multi-track timeline labels and question templates. Select a question template and click an anchor label on the timeline to see how a question is composed.
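The core mechanism can be sketched in a few lines: anchor a label on one track, then pair it with temporally overlapping labels on other tracks to fill a question template. All field names and the template wording here are illustrative assumptions, not the benchmark's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Label:
    """One timeline annotation; field names are illustrative."""
    track: str      # which POV/video track the label lives on
    ltype: str      # primitive label type, e.g. 'SA', 'WO'
    start: float    # span start, in seconds
    end: float      # span end, in seconds
    text: str       # natural-language caption of the labeled span

def cross_entity_questions(anchor: Label, labels: list[Label]) -> list[str]:
    """Compose Cross-Entity Referring questions: pair the anchor label with
    labels on other tracks whose time spans overlap it."""
    questions = []
    for lab in labels:
        overlaps = lab.start < anchor.end and anchor.start < lab.end
        if lab.track != anchor.track and overlaps:
            questions.append(
                f"While the player in {anchor.track} was {anchor.text}, "
                f"what was the player in {lab.track} doing at the same time?"
            )
    return questions

anchor = Label("Video 1", "SA", 12.0, 14.5, "switching to a Molotov cocktail")
others = [Label("Video 2", "SA", 11.0, 15.0, "running forward with their knife out")]
print(cross_entity_questions(anchor, others)[0])
```

Distractor options can then be drawn combinatorially from non-overlapping or wrong-track labels, which is what makes the generation scale with the density of the timeline annotations.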


Annotation Software

We developed a custom multi-track timeline annotation tool purpose-built for dense, synchronized multi-POV gameplay captioning.

Example Questions

L1 · WO-Count

How many Rowa Fruit bushes are visible in this scene?

  • A 4 bushes
  • B 2 bushes
  • C 3 bushes
  • D 5 bushes
L1 · OA-IDENT

What action did the teammate perform in the video?

  • A Standing still
  • B Jumping while shooting
  • C Deploying the Dome of Protection
  • D Shooting through an open doorway
L2 · WO-TIME

When did the red laser beams appear?

  • A From 00:00 to 00:03
  • B From 00:13 to 00:15
  • C From 00:05 to 00:08
  • D From 00:09 to 00:12
L2 · SA-Intent

Why did the POV player place a torch?

  • A Prevent hostile mobs from spawning
  • B Signal teammates nearby
  • C Mark the path for reference
  • D Light up the surrounding area
L3 · SA-POV-ID

In which video does the POV player reach inside the ambulance?

  • A Video 1
  • B Video 2
  • C Video 3
  • D Video 4
L3 · SA2V-SA-IDENT

While the player in Video 1 was switching to a Molotov cocktail, what was the player in Video 2 doing at the same time?

  • A Planting the bomb at the A site
  • B Throwing a smoke grenade
  • C Running forward with their rifle out
  • D Running forward with their knife out
L3 · SA-Order-MV

Which of the following sequences shows the correct order of events?

  • A 1. The POV player in Video 2 exits the dark tunnel; 2. The POV player in Video 2 ascends the elevator shaft via zipline; 3. The POV player in Video 1 runs to the outside area
  • B 1. The POV player in Video 2 exits the dark tunnel; 2. The POV player in Video 1 runs to the outside area; 3. The POV player in Video 2 ascends the elevator shaft via zipline
  • C 1. The POV player in Video 2 ascends the elevator shaft via zipline; 2. The POV player in Video 2 exits the dark tunnel; 3. The POV player in Video 1 runs to the outside area
  • D 1. The POV player in Video 1 runs to the outside area; 2. The POV player in Video 2 exits the dark tunnel; 3. The POV player in Video 2 ascends the elevator shaft via zipline

Leaderboard

Model performance across task categories. Click any column header to sort; the best and second-best scores in each column are highlighted.

Columns: Model, All, and the 15 per-task scores grouped by level:
  • L1 · Single Reference — ActRec, StaRec, ObjRec, EvtRec, SOC, X-Ent
  • L2 · Temporal — TsRef, TimLoc, AbsRec, OccCnt, Order, Intent
  • L3 · Cross-Video — SyncRef, X-VOrd, POV-ID

Error Source Analysis

Fine-grained error analysis across four dimensions reveals that models fail primarily on temporal and cross-video grounding rather than scene-level perception, and that game pace, video length, and number of synchronized perspectives all compound errors.

Error Rate by Distractor Type

Cross-video and temporal distractors cause the most errors

Error Rate by Game

Fast-paced shooters are the hardest

Error Rate by Video Duration

Error increases with video length

Error Rate by Number of Videos

Error scales with number of synchronized videos

Language Prior and Temporal Ablation

To disentangle visual grounding from temporal reasoning, we evaluate GPT-5 Mini under degraded input conditions: no video, a single random frame, and shuffled frames.
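The three degraded conditions can be sketched as simple transformations of the ordered frame list. The function and condition names below are illustrative; the paper's exact sampling protocol may differ:

```python
import random

def degrade(frames, condition, seed=0):
    """Produce model input for one ablation condition.

    frames: ordered list of video frames (any objects).
    condition: 'full', 'no_video', 'random_frame', or 'shuffled'.
    """
    rng = random.Random(seed)
    if condition == "full":
        return list(frames)
    if condition == "no_video":
        return []                    # question text only, no visual input
    if condition == "random_frame":
        return [rng.choice(frames)]  # static content only, no dynamics
    if condition == "shuffled":
        shuffled = list(frames)
        rng.shuffle(shuffled)        # same content, temporal order destroyed
        return shuffled
    raise ValueError(f"unknown condition: {condition}")
```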

Condition All L1 L2 L3
Baseline (Full Video) 62.7 67.2 61.9 60.6
No Video 29.4 36.0 29.1 24.2
Random Frame 41.7 52.9 40.9 33.7
Shuffled Frames 54.8 63.1 52.6 53.4
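The drops reported in the callouts follow directly from the "All" column of this table:

```python
# "All" scores from the ablation table; drops are relative to the full-video baseline.
scores = {
    "Baseline (Full Video)": 62.7,
    "No Video": 29.4,
    "Random Frame": 41.7,
    "Shuffled Frames": 54.8,
}
baseline = scores["Baseline (Full Video)"]
drops = {
    cond: round(s - baseline, 1)
    for cond, s in scores.items()
    if cond != "Baseline (Full Video)"
}
print(drops)  # {'No Video': -33.3, 'Random Frame': -21.0, 'Shuffled Frames': -7.9}
```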

  • −33.3 (No Video vs. Baseline): language priors alone are insufficient.
  • −21.0 (Random Frame vs. Baseline): static content helps but cannot replace temporal dynamics.
  • −7.9 (Shuffled Frames vs. Baseline): temporal ordering is critical for reasoning tasks.

Team

Yunzhe Wang · PhD Student, CS

Runhui Xu · MS Student, CS

Kexin Zheng · PhD Student, CS

Tianyi Zhang · PhD Student, CS

Jayavibhav Niranjan Kogundi · MS Student, CS

Soham Hans · PhD Student, CS

Volkan Ustun · Director, HATS (ICT)

Citation

@article{wang2026gameplayqa,
  title   = {GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents},
  author  = {Wang, Yunzhe and Xu, Runhui and Zheng, Kexin and Zhang, Tianyi and Kogundi, Jayavibhav Niranjan and Hans, Soham and Ustun, Volkan},
  year    = {2026},
  eprint  = {2603.24329},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url     = {https://arxiv.org/abs/2603.24329}
}