Abstract:
Nonrecurrent traffic disruptions, such as crashes, work zones, adverse weather, and special events, account for a large share of delay and are precisely the conditions under which prediction errors are most costly. Yet most traffic forecasting benchmarks report aggregate errors over all times and locations, which can systematically hide failures during disruptions. This article explains why aggregate evaluation often overstates robustness under nonrecurrent conditions and identifies five masking mechanisms: label dilution, temporal averaging, spatial pooling, feature availability assumptions, and heterogeneous ground truths. We synthesize current practices into a minimum reporting bar for nonrecurrent evaluation and propose a taxonomy of benchmark design routes that clarifies what each evaluation can and cannot claim about deployment readiness. The result is an actionable checklist that helps researchers, practitioners, and reviewers interpret published results and design benchmarks that better reflect operational risk.