Abstract
Language-driven generative agents have enabled large-scale social simulations with transformative uses, from interpersonal training to aiding global policy-making. However, recent studies indicate that generative agent behaviors often deviate from expert expectations and real-world data—a phenomenon we term the Behavior-Realism Gap. To address this, we introduce a theoretical framework called Persona-Environment Behavioral Alignment (PEBA), formulated as a distribution matching problem grounded in Lewin's behavior equation stating that behavior is a function of the person and their environment. Leveraging PEBA, we propose PersonaEvolve (PEvo), an LLM-based optimization algorithm that iteratively refines agent personas, implicitly aligning their collective behaviors with realistic expert benchmarks within a specified environmental context. We validate PEvo in an active shooter incident simulation we developed, achieving an 84% average reduction in distributional divergence compared to no steering and a 34% improvement over explicit instruction baselines. Results also show PEvo-refined personas generalize to novel, related simulation scenarios. Our method greatly enhances behavioral realism and reliability in high-stakes social simulations. More broadly, the PEBA-PEvo framework provides a principled approach to developing trustworthy LLM-driven social simulations.
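As a minimal formal sketch of the framing above: Lewin's equation and the PEBA objective can be written as follows, where the symbols (a persona pool P, an environment E, and a divergence measure D) are illustrative notation introduced here and may not match the paper's exact definitions.

% Illustrative sketch only; the notation is assumed, not the paper's formal definition.
% Lewin's behavior equation: behavior is a function of the person and their environment.
\[
  B = f(P, E)
\]
% PEBA as distribution matching: choose the persona pool whose induced collective behavior
% distribution in environment E best matches an expert benchmark distribution.
\[
  P^{*} \;=\; \arg\min_{P} \; D\!\big(\, p_{\text{agents}}(B \mid P, E) \;\big\|\; p_{\text{expert}}(B \mid E) \,\big)
\]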
Key Takeaways
1. Generative Agents Enable Scalable Social Science
LLM-driven generative agents make it possible to simulate and test social theories at scale, offering a practical realization of Generative Social Science, enabling research on scenarios that are otherwise infeasible or unethical with human participants.
2. Unity-Based Active Shooter Incident Simulator
We developed a Unity-based Active Shooter Incident simulation to explore high-stakes crowd dynamics, providing a flexible environment for testing generative agent behaviors under stress.
3. Closing the Behavior-Realism Gap with PEBA–PEvo
To address the Behavior-Realism Gap, we introduced the PEBA-PEvo framework, which refines agent personas for implicit alignment with expert behavioral distributions in critical simulations.
Implicit Enforcing Achieves Superior Contextual Realism

We evaluate three behavior-enforcing approaches: (1) No Enforcing, where agents act solely according to their initial personas, exposing the raw Behavior-Realism Gap; (2) Explicit Enforcing, which directly instructs agents to perform specific behaviors (e.g., "always choose Hide") but harms contextual realism, as agents follow orders regardless of environmental context; and (3) Implicit Enforcing (PEvo), which iteratively refines agent personas (e.g., updating a security guard to a "former combat medic with high assertiveness") so that target behaviors emerge naturally from the agents' own reasoning. Across four LLMs, PEvo achieves superior behavioral alignment, reducing the average distributional gap from 0.47 (No Enforcing) to 0.16, an 84% improvement over the no-enforcing baseline and 34% over explicit methods. The implicit approach preserves contextual realism: agents still reason naturally about their environment while the desired collective behavior distribution is achieved.
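For concreteness, the following is a minimal Python sketch of what such an implicit persona-refinement loop could look like. It is not the actual PEvo implementation: the callables (run_simulation, summarize_behaviors, refine_personas), the KL-based stopping rule, and the default thresholds are placeholders introduced here for illustration.

import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two behavior distributions given as dicts of label -> probability."""
    return sum(p[b] * math.log((p[b] + eps) / (q.get(b, 0.0) + eps)) for b in p)

def pevo_sketch(personas, run_simulation, summarize_behaviors, refine_personas,
                expert_distribution, iterations=15, tol=0.05):
    """Illustrative implicit persona-refinement loop in the spirit of PEvo.

    The caller supplies the pieces: run_simulation(personas) returns episode logs,
    summarize_behaviors(logs) returns a collective behavior distribution (e.g., P(Run),
    P(Hide), P(Fight)), and refine_personas(personas, observed, target) asks an LLM to
    rewrite persona traits -- never explicit behavior orders -- to close the gap.
    """
    for _ in range(iterations):
        logs = run_simulation(personas)            # simulate with the current persona pool
        observed = summarize_behaviors(logs)       # collective behavior distribution
        if kl_divergence(expert_distribution, observed) < tol:
            break                                  # close enough to the expert benchmark
        personas = refine_personas(personas, observed, expert_distribution)
    return personas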
Learned Persona Pools Are Cross-Environment Transferable
We tested personas optimized in school environments by transferring them to a novel office building scenario. Transferred personas achieved a 97.5% reduction in KL divergence compared to unoptimized baselines. While fully retrained personas performed slightly better, transferred personas retained 57% of the optimization benefit, demonstrating effective cross-environment generalization.
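To make the "retained benefit" statistic above concrete, the following toy calculation shows one way such a number could be derived from KL divergences; the three values are made-up placeholders, not measurements from the paper.

# Hypothetical KL-divergence values (placeholders, not the paper's measurements).
kl_unoptimized = 0.40   # unrefined personas evaluated in the new office environment
kl_retrained   = 0.04   # personas re-optimized from scratch in the new environment
kl_transferred = 0.10   # personas transferred unchanged from the school environment

# Fraction of the retraining benefit that transfer alone recovers.
retained_benefit = (kl_unoptimized - kl_transferred) / (kl_unoptimized - kl_retrained)
print(f"Retained optimization benefit: {retained_benefit:.0%}")  # ~83% for these placeholders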

PEvo Is Cost-Effective; Efficiency Is Model-Dependent


Cost analysis reveals that prompt tokens dominate usage (over 95% of the total), making model pricing the primary driver of expense. Per iteration, GPT-4.1 Mini is the most expensive (~$0.60), while DeepSeek-V3 is the most economical (~$0.20), with GPT-4o Mini and Gemini 2.5 Flash falling in between. When normalized as KL improvement per dollar, DeepSeek-V3 and Gemini 2.5 Flash provide roughly three times the efficiency of both GPT variants. Even the most expensive model requires under $10 for a complete 15-iteration optimization, while a shortened 7-iteration run (achieving 90% of the attainable alignment) halves this cost. This demonstrates that behavioral alignment with PEvo is not only more accurate but also economically practical for medium-scale crowd simulations.
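The back-of-the-envelope calculation below illustrates how these cost and efficiency figures combine; the per-iteration prices are the approximate values quoted above, while the KL-improvement figure is a placeholder taken from the 0.47 to 0.16 gap reduction reported earlier.

# Approximate per-iteration costs quoted above (USD); other values are placeholders.
cost_per_iteration = {
    "GPT-4.1 Mini": 0.60,
    "DeepSeek-V3": 0.20,
}
iterations_full = 15    # full optimization run
iterations_short = 7    # shortened run reaching ~90% of attainable alignment
kl_improvement = 0.31   # placeholder: average gap reduction from 0.47 to 0.16

for model, cost in cost_per_iteration.items():
    full_run = cost * iterations_full
    short_run = cost * iterations_short
    per_dollar = kl_improvement / full_run
    print(f"{model}: 15 iterations ~${full_run:.2f}, "
          f"7 iterations ~${short_run:.2f}, "
          f"{per_dollar:.3f} KL improvement per dollar")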
BibTeX
@inproceedings{wang2025implicit,
  title     = {Implicit Behavioral Alignment of Language Agents in High-Stakes Crowd Simulations},
  author    = {Wang, Yunzhe and Lucas, Gale M and Becerik-Gerber, Burcin and Ustun, Volkan},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
  publisher = {Association for Computational Linguistics}
}