Recent work has shown that contemporary machine learning models can achieve highly human-like performance on a variety of real-world tasks, including complex multi-agent simulations. However, three crucial questions remain unanswered: First, under what circumstances can deploying AI models in these settings go wrong? Second, how can we build dependable models that are robust to such failures? Third, how can we evaluate models in light of this understanding? This paper proposes a novel approach to answering these questions for a realistic multi-agent environment by benchmarking high-fidelity AI models trained on open data. Drawing on prior work in adversarial robustness, we provide a method both to simulate such complex benchmark failures and to verify that improved models are more robust to them. Our model-independent adversarial simulation method, Verification Exploratory Adversarial Disturbances (V.E.A.D.), enriches the toolbox available for robustness evaluation across many application areas and exposes AI system vulnerabilities that current methods do not uncover. We additionally find that incorporating robust training into imitation learning objectives can incentivize models to improve and to handle situations of increasing complexity, such as when the number of interactive agents in the road environment is gradually increased. We hope that promoting research into active verification will push future AI systems not only to answer the standard difficult questions, but also to provide answers we can depend on.