LLM Error Handling & Retry: Best Practices Design

Scope: Structured error classification, automatic retry mechanisms, fallback strategies, and recovery patterns for LLM client libraries and agent frameworks.

Synthesized from: pydantic-ai, langchain, pi-mono, kosong, republic


Core Philosophy

Errors are data. Recovery is strategy. Decisions are context-dependent.

This design philosophy combines:

  • Type safety (compile-time) for precise error handling
  • Strategy flexibility (runtime) for adaptable recovery
  • Observability for production debugging
  • Testability for chaos engineering

1. Dual-Layer Error System

1.1 Type Layer: Precise Error Types

/// Hierarchical error types for match-based handling
pub enum LLMError {
    /// Developer misuse (bad API key, invalid model name)
    User {
        kind: UserErrorKind,
        message: String,
    },

    /// Runtime errors during LLM interaction
    Runtime(RuntimeError),

    /// Wrapped error with recovery strategy attached
    Retryable {
        source: Box<LLMError>,
        strategy: RetryStrategy,
    },
}

pub enum RuntimeError {
    Connection {
        endpoint: String,
        source: Option<Box<dyn std::error::Error>>,
    },
    Status {
        code: u16,
        body: Option<String>,
        provider: ProviderId,
    },
    Validation {
        field: String,
        reason: String,
    },
    TokenLimit {
        requested: usize,
        max_tokens: Option<usize>,
    },
    ContentFilter {
        provider: ProviderId,
        reason: Option<String>,
    },
    ToolCallIncomplete {
        partial: ToolCall,
    },
}

pub enum UserErrorKind {
    InvalidApiKey,
    ModelNotFound,
    InvalidParameter,
    UnsupportedFeature,
}

Design Rationale:

  • Explicit types enable exhaustive match handling
  • Rich context (status codes, provider IDs) aids debugging
  • Separate user errors from runtime errors for different handling paths

1.2 Classification Layer: Strategy-Driven Categories

/// Error classifications for recovery strategy selection
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorClass {
    /// Unrecoverable, abort immediately
    Fatal,

    /// Temporary failure, retry with same model (429, 408, 503)
    Transient,

    /// Configuration issue, abort and alert (401, 403)
    Config,

    /// Provider issue, may fallback or retry (other 5xx, connection timeout)
    Switchable,

    /// Token limit, special compaction handling
    Capacity,

    /// Content policy violation
    Policy,
}

/// Trait for classifying errors into strategy categories
pub trait ErrorClassifier: Send + Sync {
    fn classify(&self, error: &LLMError) -> ErrorClass;
}

Design Rationale:

  • Classification separates "what happened" from "what to do"
  • Enables user-injectable policy (business-specific rules)
  • Simplifies decision logic by working with enums instead of types

2. Multi-Level Classification Pipeline

┌─────────────────────────────────────────────────────────────┐
│ Level 1: User Classifier (highest priority) │
│ - Business-specific mappings │
│ - Provider-specific quirks │
├─────────────────────────────────────────────────────────────┤
│ Level 2: Library Exception Mapping │
│ - openai::APIStatusError → Status { code } │
│ - anthropic::RateLimitError → Transient │
│ - any_llm::AnyLLMError → mapped variants │
├─────────────────────────────────────────────────────────────┤
│ Level 3: HTTP Status Classification │
│ - 429 / 408 → Transient (rate limit / timeout) │
│ - 401 / 403 → Config (auth / permission) │
│ - 5xx → Switchable (server errors) │
├─────────────────────────────────────────────────────────────┤
│ Level 4: Text Signature Matching │
│ - Regex patterns for non-standard providers │
│ - "rate limit" / "too many requests" / "quota exceeded" │
└─────────────────────────────────────────────────────────────┘
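
/// A tier that may decline to classify; `None` means "no opinion, ask the next tier".
/// `ErrorClass` has no `Unknown` variant, so abstention is modelled with `Option`.
/// (`TierClassifier` is an assumed helper trait, implemented by every tier.)
pub trait TierClassifier: Send + Sync {
    fn try_classify(&self, error: &LLMError) -> Option<ErrorClass>;
}

pub struct TieredClassifier {
    user: Option<Box<dyn TierClassifier>>,
    http: HttpStatusClassifier,
    text: TextSignatureClassifier,
}

impl ErrorClassifier for TieredClassifier {
    fn classify(&self, error: &LLMError) -> ErrorClass {
        // Level 1: user-defined rules win whenever they have an opinion
        self.user
            .as_ref()
            .and_then(|user| user.try_classify(error))
            // Levels 2-4: built-in tiers, tried in order
            .or_else(|| self.http.try_classify(error))
            .or_else(|| self.text.try_classify(error))
            // Nothing matched: treat as unrecoverable
            .unwrap_or(ErrorClass::Fatal)
    }
}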
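
The Level 4 tier could look roughly like the sketch below. It uses case-insensitive substring matching rather than regexes so that it needs no external crate; `with_defaults` and `classify_text` are hypothetical names, and the signature list is just the three phrases from the diagram.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorClass { Fatal, Transient, Config, Switchable, Capacity, Policy }

/// Level 4 sketch: map well-known phrases in an error message to a class.
pub struct TextSignatureClassifier {
    signatures: Vec<(&'static str, ErrorClass)>,
}

impl TextSignatureClassifier {
    pub fn with_defaults() -> Self {
        Self {
            signatures: vec![
                ("rate limit", ErrorClass::Transient),
                ("too many requests", ErrorClass::Transient),
                ("quota exceeded", ErrorClass::Transient),
            ],
        }
    }

    /// Returns `None` when no signature matches, so the caller can fall
    /// through to its own default.
    pub fn classify_text(&self, message: &str) -> Option<ErrorClass> {
        let lowered = message.to_lowercase();
        self.signatures
            .iter()
            .find(|(needle, _)| lowered.contains(*needle))
            .map(|(_, class)| *class)
    }
}
```

Returning `Option` instead of a concrete class keeps "no match" distinct from "matched and fatal", which is what lets the tiers be chained.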

3. Recoverability as a Trait

/// Recoverable errors can provide recovery strategies
pub trait Recoverable: std::error::Error {
    /// Determine the recovery strategy for this error
    fn recovery_strategy(&self, ctx: &RecoveryContext) -> RecoveryStrategy;

    /// Whether this error can be fed back to LLM for correction
    /// (inspired by langchain's send_to_llm)
    fn is_llm_fixable(&self) -> bool {
        false
    }

    /// Get suggested fixes for user-facing display
    fn suggestions(&self) -> Vec<String> {
        vec![]
    }
}

pub struct RecoveryContext {
    pub attempt_count: u32,
    pub fallback_available: bool,
    pub max_retries: u32,
    pub elapsed: Duration,
}

pub enum RecoveryStrategy {
    /// Retry with exponential backoff
    Retry {
        backoff: BackoffConfig,
        max_attempts: u32,
    },

    /// Switch to fallback model
    Fallback {
        target: ModelId,
        propagate_error: bool, // langchain's exception_key pattern
    },

    /// Compact context and retry (token limit special case)
    Compaction {
        strategy: CompactionStrategy,
    },

    /// Delegate to external handler (human approval, etc.)
    Delegate {
        handler: HandlerId,
        timeout: Duration,
    },

    /// Abort the operation
    Abort {
        reason: AbortReason,
    },
}
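
A minimal sketch of one implementor, assuming a trimmed `RecoveryStrategy` (no backoff config or model IDs) and a hypothetical `OutputValidationError` for model output that fails schema validation:

```rust
use std::fmt;
use std::time::Duration;

pub struct RecoveryContext {
    pub attempt_count: u32,
    pub fallback_available: bool,
    pub max_retries: u32,
    pub elapsed: Duration,
}

// Trimmed stand-in for the full strategy enum above.
#[derive(Debug, PartialEq)]
pub enum RecoveryStrategy {
    Retry { max_attempts: u32 },
    Fallback,
    Abort,
}

pub trait Recoverable: std::error::Error {
    fn recovery_strategy(&self, ctx: &RecoveryContext) -> RecoveryStrategy;
    fn is_llm_fixable(&self) -> bool {
        false
    }
}

/// Hypothetical error: the model's output failed schema validation.
#[derive(Debug)]
pub struct OutputValidationError {
    pub field: String,
    pub reason: String,
}

impl fmt::Display for OutputValidationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "field `{}` is invalid: {}", self.field, self.reason)
    }
}

impl std::error::Error for OutputValidationError {}

impl Recoverable for OutputValidationError {
    fn recovery_strategy(&self, ctx: &RecoveryContext) -> RecoveryStrategy {
        if ctx.attempt_count < ctx.max_retries {
            RecoveryStrategy::Retry { max_attempts: ctx.max_retries }
        } else if ctx.fallback_available {
            RecoveryStrategy::Fallback
        } else {
            RecoveryStrategy::Abort
        }
    }

    // The Display text can be sent back to the model as a correction prompt.
    fn is_llm_fixable(&self) -> bool {
        true
    }
}
```

The same error yields different strategies as the context changes, which is the point of passing `RecoveryContext` rather than deciding from the error alone.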

4. Pluggable Backoff Strategies

pub trait BackoffStrategy: Send + Sync {
    /// Calculate next delay, return None to stop retrying
    fn next_delay(&self, ctx: &RetryContext<'_>) -> Option<Duration>;
}

/// Borrows the error, so the struct carries an explicit lifetime
pub struct RetryContext<'a> {
    pub attempt: u32,
    pub error: &'a LLMError,
    pub last_delay: Option<Duration>,
    pub server_requested_delay: Option<Duration>,
}

/// Fixed interval backoff
pub struct FixedBackoff {
    pub delay: Duration,
}

/// Exponential backoff with optional jitter
pub struct ExponentialBackoff {
    pub initial: Duration,
    pub multiplier: f64,
    pub max_delay: Duration,
    pub jitter: JitterMode,
}

pub enum JitterMode {
    None,
    Full,         // Random [0, calculated]
    Equal,        // Random [calculated/2, calculated]
    Decorrelated, // min(max_delay, random in [initial, last_delay * 3])
}

/// Respect server's Retry-After header (inspired by pi-mono + pydantic-ai)
pub struct RespectRetryAfter {
    pub fallback: Box<dyn BackoffStrategy>,
    pub max_delay: Duration,
    pub respect_header: bool, // false = use fallback only
}

impl BackoffStrategy for RespectRetryAfter {
    fn next_delay(&self, ctx: &RetryContext<'_>) -> Option<Duration> {
        // Priority: server_requested_delay > fallback calculation
        let delay = ctx
            .server_requested_delay
            .filter(|_| self.respect_header)
            .or_else(|| self.fallback.next_delay(ctx))?
            .min(self.max_delay);

        Some(delay)
    }
}
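
The delay arithmetic behind `ExponentialBackoff` might be sketched as follows. To keep the example deterministic and std-only, the random sample is passed in by the caller as `unit` (an assumed parameter, expected in [0, 1)); `delay_for` is a hypothetical method name, and the decorrelated mode is omitted.

```rust
use std::time::Duration;

#[derive(Clone, Copy)]
pub enum JitterMode {
    None,
    Full,  // random in [0, calculated]
    Equal, // random in [calculated/2, calculated]
}

pub struct ExponentialBackoff {
    pub initial: Duration,
    pub multiplier: f64,
    pub max_delay: Duration,
    pub jitter: JitterMode,
}

impl ExponentialBackoff {
    /// `attempt` is zero-based; `unit` is a caller-supplied sample in [0, 1).
    pub fn delay_for(&self, attempt: u32, unit: f64) -> Duration {
        let cap = self.max_delay.as_secs_f64();
        // initial * multiplier^attempt, capped before jitter is applied
        let calculated =
            (self.initial.as_secs_f64() * self.multiplier.powi(attempt as i32)).min(cap);
        let secs = match self.jitter {
            JitterMode::None => calculated,
            JitterMode::Full => unit * calculated,
            JitterMode::Equal => calculated / 2.0 + unit * calculated / 2.0,
        };
        Duration::from_secs_f64(secs)
    }
}
```

In production code `unit` would come from a random number generator; injecting it here makes the schedule easy to unit-test.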

Retry-After Extraction (comprehensive)

use std::time::{Duration, SystemTime};
// Assumed external items: `HeaderMap` from the `http` crate and
// `parse_http_date` from the `httpdate` crate.

pub fn extract_retry_delay(error: &LLMError, headers: &HeaderMap) -> Option<Duration> {
    // 1. Standard Retry-After header (seconds or HTTP date)
    if let Some(value) = headers.get("retry-after") {
        if let Ok(text) = value.to_str() {
            // Try parsing as integer seconds
            if let Ok(seconds) = text.trim().parse::<u64>() {
                return Some(Duration::from_secs(seconds));
            }
            // Try parsing as HTTP date (yields None if the date is already past)
            if let Ok(date) = parse_http_date(text) {
                return date.duration_since(SystemTime::now()).ok();
            }
        }
    }

    // 2. Provider-specific headers
    if let Some(value) = headers.get("x-ratelimit-reset") {
        // Unix timestamp
        if let Some(ts) = value.to_str().ok().and_then(|s| s.parse::<u64>().ok()) {
            let reset_time = SystemTime::UNIX_EPOCH + Duration::from_secs(ts);
            return reset_time.duration_since(SystemTime::now()).ok();
        }
    }

    if let Some(value) = headers.get("x-ratelimit-reset-after") {
        // Seconds from now
        if let Some(seconds) = value.to_str().ok().and_then(|s| s.parse::<u64>().ok()) {
            return Some(Duration::from_secs(seconds));
        }
    }

    // 3. Error message pattern matching (pi-mono approach)
    if let Some(text) = error.error_message() {
        // "Your quota will reset after 18h31m10s"
        if let Some(caps) = RE_RESET_DURATION.captures(text) {
            return Some(parse_duration(&caps));
        }
        // "Please retry in 2s"
        if let Some(caps) = RE_RETRY_IN.captures(text) {
            return Some(parse_duration(&caps));
        }
        // "retryDelay": "34.074s" (JSON)
        if let Some(caps) = RE_JSON_RETRY_DELAY.captures(text) {
            return Some(parse_duration(&caps));
        }
    }

    None
}
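
The duration strings quoted in the comments above ("18h31m10s", "2s", "34.074s") can be parsed without a regex. The following is a std-only sketch under the assumption that the duration text has already been isolated from the message; `parse_duration_text` is a hypothetical helper, not the `parse_duration` used above.

```rust
use std::time::Duration;

/// Parse Go-style duration text such as "18h31m10s", "2s" or "34.074s".
/// Returns `None` on unknown units or malformed numbers.
pub fn parse_duration_text(text: &str) -> Option<Duration> {
    let mut total_secs = 0.0_f64;
    let mut number = String::new();
    let mut saw_unit = false;
    let mut chars = text.trim().chars().peekable();

    while let Some(c) = chars.next() {
        if c.is_ascii_digit() || c == '.' {
            number.push(c);
            continue;
        }
        // "ms" is the only two-letter unit handled here.
        let scale = match c {
            'h' => 3600.0,
            'm' if chars.peek() == Some(&'s') => {
                chars.next();
                0.001
            }
            'm' => 60.0,
            's' => 1.0,
            _ => return None,
        };
        total_secs += number.parse::<f64>().ok()? * scale;
        number.clear();
        saw_unit = true;
    }

    // Reject empty input and trailing digits that have no unit.
    if !saw_unit || !number.is_empty() {
        return None;
    }
    Some(Duration::from_secs_f64(total_secs))
}
```

Whatever the parser returns should still be clamped by the configured maximum delay, since a quota reset hours away is rarely worth waiting for.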

5. Decision Engine

/// Central decision logic for error recovery
pub struct DecisionEngine {
    classifier: Box<dyn ErrorClassifier>,
    max_retries: u32,
    backoff: Box<dyn BackoffStrategy>,
    fallbacks: Vec<ModelId>,
}

pub enum Decision {
    Retry { delay: Duration },
    Fallback { target: ModelId, carry_error: bool },
    Compact { strategy: CompactionStrategy },
    Abort { reason: AbortReason },
}

impl DecisionEngine {
    pub fn decide(&self, error: &LLMError, ctx: &ExecutionContext) -> Decision {
        let class = self.classifier.classify(error);
        let attempts = ctx.current_attempts();
        let has_fallback =
            !self.fallbacks.is_empty() && ctx.fallback_index() < self.fallbacks.len();

        match class {
            ErrorClass::Fatal => Decision::Abort {
                reason: AbortReason::FatalError,
            },

            ErrorClass::Config => Decision::Abort {
                reason: AbortReason::Configuration,
            },

            ErrorClass::Capacity => Decision::Compact {
                strategy: CompactionStrategy::SummarizeOldest,
            },

            ErrorClass::Policy => Decision::Abort {
                reason: AbortReason::ContentPolicy,
            },

            ErrorClass::Transient if attempts < self.max_retries => {
                let retry_ctx = RetryContext {
                    attempt: attempts,
                    error,
                    last_delay: ctx.last_delay(),
                    server_requested_delay: ctx.server_requested_delay(),
                };
                match self.backoff.next_delay(&retry_ctx) {
                    Some(delay) => Decision::Retry { delay },
                    None => Decision::Abort { reason: AbortReason::BackoffExhausted },
                }
            }

            ErrorClass::Transient | ErrorClass::Switchable if has_fallback => {
                Decision::Fallback {
                    target: self.fallbacks[ctx.fallback_index()].clone(),
                    carry_error: true, // Allow fallback to see the error
                }
            }

            _ => Decision::Abort { reason: AbortReason::Exhausted },
        }
    }
}
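
One way a caller might consume these decisions is a small driver loop. The sketch below is synchronous and std-only, uses a trimmed `Decision` (string model names, no compaction), and takes the decision logic as a closure; `run_with_recovery` is a hypothetical name.

```rust
use std::thread::sleep;
use std::time::Duration;

// Trimmed stand-in for the `Decision` enum above.
#[derive(Debug, Clone, PartialEq)]
pub enum Decision {
    Retry { delay: Duration },
    Fallback { target: String },
    Abort { reason: String },
}

/// Run `call` against `primary`, letting `decide` choose what happens after
/// each failure. `decide` receives the error and the number of failed
/// attempts on the current model.
pub fn run_with_recovery<T, E>(
    primary: &str,
    mut call: impl FnMut(&str) -> Result<T, E>,
    mut decide: impl FnMut(&E, u32) -> Decision,
) -> Result<T, String> {
    let mut model = primary.to_string();
    let mut attempts = 0u32;
    loop {
        let err = match call(&model) {
            Ok(value) => return Ok(value),
            Err(err) => err,
        };
        attempts += 1;
        match decide(&err, attempts) {
            Decision::Retry { delay } => sleep(delay),
            Decision::Fallback { target } => {
                model = target;
                attempts = 0; // each model gets its own retry budget
            }
            Decision::Abort { reason } => return Err(reason),
        }
    }
}
```

The loop terminates only when the decision function eventually returns `Abort` or a call succeeds, so the retry cap and the fallback chain length must be bounded inside `decide`.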

6. Callback & Observability System

/// Callback trait for observing error handling (inspired by langchain)
pub trait CallbackHandler: Send + Sync {
    fn on_llm_start(&self, model: &ModelId, request: &Request);
    fn on_llm_end(&self, model: &ModelId, response: &Response);

    fn on_llm_error(&self, model: &ModelId, error: &LLMError);

    fn on_retry(
        &self,
        model: &ModelId,
        error: &LLMError,
        attempt: u32,
        next_delay: Duration,
    );

    fn on_fallback(
        &self,
        from: &ModelId,
        to: &ModelId,
        error: &LLMError,
    );

    fn on_compaction(
        &self,
        strategy: &CompactionStrategy,
        tokens_removed: usize,
    );
}

/// Tracing integration
#[derive(Debug)]
pub struct RetryEvent {
    pub run_id: Uuid,
    pub parent_run_id: Option<Uuid>,
    pub model: ModelId,
    pub attempt: u32,
    pub error_class: ErrorClass,
    pub delay_ms: u64,
    pub timestamp: SystemTime,
}
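
As an illustration of what a handler might do, the sketch below counts retries and fallbacks with atomic counters. The trait is cut down to two hooks with string model names, and `RetryMetrics` and `snapshot` are hypothetical.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
use std::time::Duration;

/// Trimmed stand-in for the trait above: only the retry and fallback hooks.
pub trait CallbackHandler: Send + Sync {
    fn on_retry(&self, model: &str, attempt: u32, next_delay: Duration);
    fn on_fallback(&self, from: &str, to: &str);
}

/// Hypothetical handler that keeps lock-free counters for a metrics exporter.
#[derive(Default)]
pub struct RetryMetrics {
    retries: AtomicU32,
    fallbacks: AtomicU32,
    total_backoff_ms: AtomicU64,
}

impl RetryMetrics {
    /// Returns (retries, fallbacks, total backoff in milliseconds).
    pub fn snapshot(&self) -> (u32, u32, u64) {
        (
            self.retries.load(Ordering::Relaxed),
            self.fallbacks.load(Ordering::Relaxed),
            self.total_backoff_ms.load(Ordering::Relaxed),
        )
    }
}

impl CallbackHandler for RetryMetrics {
    fn on_retry(&self, _model: &str, _attempt: u32, next_delay: Duration) {
        self.retries.fetch_add(1, Ordering::Relaxed);
        self.total_backoff_ms
            .fetch_add(next_delay.as_millis() as u64, Ordering::Relaxed);
    }

    fn on_fallback(&self, _from: &str, _to: &str) {
        self.fallbacks.fetch_add(1, Ordering::Relaxed);
    }
}
```

Because the hooks take `&self`, a single handler can be shared behind an `Arc` across concurrent requests, which is why the counters are atomics rather than plain integers.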

7. Context Overflow Handling

/// Specialized handling for token limit errors (inspired by pydantic-ai)
pub struct OverflowHandler {
    pub patterns: Vec<Regex>, // Provider-specific error patterns
}

impl OverflowHandler {
    /// Detect if error is a context overflow
    pub fn is_overflow(&self, error: &LLMError) -> bool {
        let message = match error {
            LLMError::Runtime(RuntimeError::TokenLimit { .. }) => return true,
            _ => error.to_string(),
        };

        // Pattern matching for various providers
        self.patterns.iter().any(|re| re.is_match(&message))
    }

    /// Default patterns (pi-mono's comprehensive list)
    pub fn default_patterns() -> Vec<Regex> {
        vec![
            r"prompt is too long",                      // Anthropic
            r"input is too long for requested model",   // Amazon Bedrock
            r"exceeds the context window",              // OpenAI
            r"input token count.*exceeds the maximum",  // Google
            r"maximum prompt length is \d+",            // xAI
            r"reduce the length of the messages",       // Groq
            r"maximum context length is \d+ tokens",    // OpenRouter
            r"exceeded model token limit",              // Kimi
            r"context[_ ]length[_ ]exceeded",           // Generic
        ]
        .into_iter()
        .map(|p| Regex::new(p).unwrap())
        .collect()
    }
}

/// Compaction strategies
pub enum CompactionStrategy {
    /// Remove oldest messages
    DropOldest { keep_recent: usize },

    /// Summarize oldest messages
    SummarizeOldest,

    /// Compress via summary model
    CompressWithModel { model: ModelId },

    /// User-defined strategy
    Custom(Box<dyn CompactionFn>),
}
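
A minimal sketch of what `DropOldest { keep_recent }` could do to a conversation history. The `Message` and `Role` types are assumptions made for this example, as is the choice to always preserve system messages.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Role {
    System,
    User,
    Assistant,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: Role,
    pub text: String,
}

/// Keep every system message plus the `keep_recent` most recent non-system
/// messages. Returns how many messages were dropped.
pub fn drop_oldest(history: &mut Vec<Message>, keep_recent: usize) -> usize {
    let non_system = history.iter().filter(|m| m.role != Role::System).count();
    let dropped = non_system.saturating_sub(keep_recent);

    // `retain` visits elements in order, so the oldest non-system messages
    // are the first ones removed.
    let mut remaining_to_drop = dropped;
    history.retain(|m| {
        if m.role != Role::System && remaining_to_drop > 0 {
            remaining_to_drop -= 1;
            false
        } else {
            true
        }
    });
    dropped
}
```

A real implementation would likely count tokens rather than messages, and would need to keep tool calls paired with their results.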

8. Testing & Chaos Engineering

/// Chaos testing configuration (inspired by kosong)
pub struct ChaosConfig {
    pub error_probability: f64,
    pub error_types: Vec<InjectedError>,
    pub latency_mean: Option<Duration>,
    pub latency_stddev: Option<Duration>,
}

pub enum InjectedError {
    Status { code: u16, body: String },
    Timeout,
    ConnectionReset,
    CorruptResponse,
}

/// Test helper for simulating error scenarios.
/// No RNG is stored: `complete` takes `&self`, and `ThreadRng` is neither
/// `Send` nor usable without `&mut`, so one is obtained per call instead.
pub struct ChaosProvider<P: LLMProvider> {
    inner: P,
    config: ChaosConfig,
}

impl<P: LLMProvider> LLMProvider for ChaosProvider<P> {
    async fn complete(&self, request: Request) -> Result<Response, LLMError> {
        // Draw all random values up front so the RNG is dropped before any `.await`
        let (jitter_ms, inject_error) = {
            let mut rng = rand::thread_rng();
            let jitter_ms = self.config.latency_stddev
                .map(|s| s.as_millis() as f64 * rng.sample::<f64, _>(StandardNormal))
                .unwrap_or(0.0);
            (jitter_ms, rng.gen::<f64>() < self.config.error_probability)
        };

        // Inject latency (clamped at zero, since a normal sample can be negative)
        if let Some(mean) = self.config.latency_mean {
            let total_ms = (mean.as_millis() as f64 + jitter_ms).max(0.0) as u64;
            sleep(Duration::from_millis(total_ms)).await;
        }

        // Inject error
        if inject_error {
            return Err(self.generate_error());
        }

        self.inner.complete(request).await
    }
}
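
For reproducible chaos tests it can help to seed the randomness. The sketch below is a synchronous, std-only variant with a small xorshift64 generator standing in for a real RNG crate; the `Provider` trait, `SeededRng`, `Echo`, and the injected error text are all assumptions for illustration.

```rust
use std::cell::Cell;

/// Tiny xorshift64 generator so the sketch stays std-only and reproducible.
pub struct SeededRng(Cell<u64>);

impl SeededRng {
    pub fn new(seed: u64) -> Self {
        // xorshift state must be non-zero
        Self(Cell::new(seed.max(1)))
    }

    /// Next sample in [0, 1).
    pub fn next_unit(&self) -> f64 {
        let mut x = self.0.get();
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0.set(x);
        // Use the top 53 bits so the result fits an f64 mantissa exactly.
        (x >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Simplified synchronous provider interface.
pub trait Provider {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Trivial inner provider used for demonstration.
pub struct Echo;

impl Provider for Echo {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        Ok(prompt.to_string())
    }
}

pub struct ChaosProvider<P: Provider> {
    pub inner: P,
    pub error_probability: f64,
    pub rng: SeededRng,
}

impl<P: Provider> Provider for ChaosProvider<P> {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        if self.rng.next_unit() < self.error_probability {
            return Err("injected: HTTP 503".to_string());
        }
        self.inner.complete(prompt)
    }
}
```

With a fixed seed, a failing retry scenario can be replayed exactly, which is harder to do with a thread-local RNG.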

9. Complete Configuration

pub struct ErrorHandlingConfig {
    /// Maximum retries per model
    pub max_retries: u32,

    /// Backoff strategy
    pub backoff: Box<dyn BackoffStrategy>,

    /// Maximum delay to wait (pi-mono's maxRetryDelayMs)
    pub max_retry_delay: Duration,

    /// Fallback model chain
    pub fallbacks: Vec<ModelId>,

    /// Whether to pass errors to fallbacks (langchain pattern)
    pub propagate_errors_to_fallbacks: bool,

    /// Custom classifier
    pub classifier: Option<Box<dyn ErrorClassifier>>,

    /// Callback handlers
    pub callbacks: Vec<Arc<dyn CallbackHandler>>,

    /// Token limit handling
    pub compaction: Option<CompactionConfig>,

    /// Whether to feed errors back to LLM for correction
    pub enable_llm_error_recovery: bool,
}

impl Default for ErrorHandlingConfig {
    fn default() -> Self {
        Self {
            max_retries: 3,
            backoff: Box::new(ExponentialBackoff {
                initial: Duration::from_secs(1),
                multiplier: 2.0,
                max_delay: Duration::from_secs(60),
                jitter: JitterMode::Decorrelated,
            }),
            max_retry_delay: Duration::from_secs(60),
            fallbacks: vec![],
            propagate_errors_to_fallbacks: true,
            classifier: None,
            callbacks: vec![],
            compaction: None,
            enable_llm_error_recovery: false,
        }
    }
}

10. Summary: Design Decision Matrix

| Decision | Recommendation | Primary Source |
|---|---|---|
| Error Types | Hierarchical enum with rich context | pydantic-ai |
| Classification | Separate ErrorClass enum | republic |
| Classifier Architecture | 4-tier pipeline (user → library → HTTP → text) | republic + pi-mono |
| Recoverability | Trait-based, context-aware | kosong + pydantic-ai |
| Backoff | Pluggable with Retry-After support | pi-mono + pydantic-ai |
| Decision Logic | Centralized DecisionEngine | republic |
| Fallback | Model chain with error propagation | langchain |
| Token Limit | Special Compaction strategy | pydantic-ai |
| Observability | Callback system + structured events | langchain |
| Testing | Built-in chaos injection | kosong |

  • /Users/dylan/DylanLi/repo/agent-group/infra-LLM/learns/structured-errors-retry.md - Comparative analysis of all frameworks
  • /Users/dylan/DylanLi/repo/agent-group/infra-LLM/pydantic-ai/pydantic_ai_slim/pydantic_ai/exceptions.py
  • /Users/dylan/DylanLi/repo/agent-group/infra-LLM/republic/src/republic/core/execution.py
  • /Users/dylan/DylanLi/repo/agent-group/infra-LLM/pi-mono/packages/ai/src/providers/google-gemini-cli.ts
  • /Users/dylan/DylanLi/repo/agent-group/infra-LLM/kimi-cli/packages/kosong/src/kosong/chat_provider/chaos.py

Created: 2026-02-26