Technical overview

Accelerated Prompt Stress Testing (APST)

A depth-oriented framework for evaluating LLM safety and reliability under repeated inference. APST moves beyond single-shot benchmarking to measure how models behave when queried repeatedly, retried, and embedded into stochastic execution paths.

Abstract

Large Language Models (LLMs) exhibit stochastic behavior that single-shot evaluation protocols fail to capture. Accelerated Prompt Stress Testing (APST) is a framework for estimating empirical failure probability under repeated inference across production-representative prompt distributions. By stress-testing models at deployment depth rather than benchmark breadth, APST surfaces failure hotspots that standard evaluations miss. The framework has been extended with graph-guided discovery (APST-G) for semantic neighborhood exploration and prompt optimization (APST-PO) for generating safer system prompts and workflow policies.

Core methodology

01 Profile reliability on a representative prompt distribution drawn from production traffic or target use cases.
02 Execute repeated inference with controlled temperature, retry, and sampling parameters to model deployment-stochastic behavior.
03 Judge outputs against safety and correctness criteria to compute empirical failure probability (p_fail).
04 Apply APST technology: graph-guided exploration (APST-G) discovers semantic neighborhoods where failures cluster, and prompt optimization (APST-PO) generates safer system prompts and policy constraints that reduce measured risk.
05 Validate reduction by re-running APST against the same distribution with optimized configurations.
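
Steps 01 to 03 can be pictured as a loop that samples each prompt many times and counts judged failures. The following is a minimal sketch under stated assumptions, not the APST implementation: `stress_test`, `toy_generate`, and `toy_judge` are hypothetical names, the model call is replaced by a seeded random stub, and the judge is a trivial string check. A judge verdict of `None` stands for an output that could not be judged and is left out of the denominator.

```python
import random
from typing import Callable, Optional


def stress_test(
    prompts: list[str],
    generate: Callable[[str], str],
    judge: Callable[[str, str], Optional[bool]],
    repeats: int,
) -> dict[str, float]:
    """Return the empirical failure probability per prompt over repeated inference."""
    results: dict[str, float] = {}
    for prompt in prompts:
        failures = 0
        valid = 0
        for _ in range(repeats):
            output = generate(prompt)
            # Assumed convention: True = failure, False = pass, None = not judgeable.
            verdict = judge(prompt, output)
            if verdict is None:
                continue  # excluded from "valid judged generations"
            valid += 1
            failures += int(verdict)
        results[prompt] = failures / valid if valid else float("nan")
    return results


# Hypothetical stand-ins for a sampled model call and a safety judge.
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    # Assumption for illustration: "risky" prompts fail far more often.
    rate = 0.2 if "risky" in prompt else 0.01
    return "UNSAFE" if rng.random() < rate else "OK"


def toy_judge(prompt: str, output: str) -> Optional[bool]:
    return output == "UNSAFE"


p_fail_by_prompt = stress_test(
    ["benign question", "risky question"], toy_generate, toy_judge, repeats=500
)
for prompt, p in p_fail_by_prompt.items():
    print(f"{prompt}: p_fail = {p:.3f}")
```

Step 05 would amount to calling the same function again on the same prompts with the optimized configuration and comparing the two results. In a real run, `generate` would wrap the deployed model with its actual temperature and retry settings.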

References

Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

Keita Broadwater. arXiv:2602.11786, 2026.

Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

Keita Broadwater. CCAI 2026. arXiv:2604.09606.

Key metrics

p_fail = failures / valid judged generations

Empirical failure probability

Δp = p_fail_baseline − p_fail_optimized

Validated reduction

Expected failures = p_fail × n

Deployment-scale projection, where n is the number of inference calls expected in deployment
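
As a worked illustration of the three metrics, the sketch below plugs in made-up counts. The function names and all numbers are hypothetical and chosen only to make the arithmetic easy to follow. Here n is taken to mean the number of inference calls expected in deployment, which is an interpretation of the projection formula rather than a definition given above.

```python
def p_fail(failures: int, valid_judged: int) -> float:
    """Empirical failure probability: failures / valid judged generations."""
    if valid_judged <= 0:
        raise ValueError("need at least one valid judged generation")
    return failures / valid_judged


def validated_reduction(p_baseline: float, p_optimized: float) -> float:
    """Delta p: baseline failure probability minus optimized failure probability."""
    return p_baseline - p_optimized


def expected_failures(p: float, n: int) -> float:
    """Projected number of failures over n inference calls."""
    return p * n


# Hypothetical counts, for illustration only.
baseline = p_fail(failures=30, valid_judged=1000)       # 30 / 1000
optimized = p_fail(failures=5, valid_judged=1000)       # 5 / 1000
delta = validated_reduction(baseline, optimized)        # about 0.025
projected = expected_failures(optimized, n=1_000_000)   # about 5000 failures

print(f"baseline p_fail  = {baseline:.4f}")
print(f"optimized p_fail = {optimized:.4f}")
print(f"delta p          = {delta:.4f}")
print(f"expected failures at n = 1,000,000: {projected:.0f}")
```

With these invented numbers, a failure rate that looks small per call (0.5%) still projects to thousands of failures at deployment volume, which is the point of the projection metric.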

SafeFlow does not guarantee safety. APST estimates, measures, and projects LLM reliability to inform deployment decisions.