Webhook Retry Mechanism
Overview
We attempt to deliver each webhook message based on a retry schedule with exponential backoff. This ensures that temporary failures (like network issues or brief service outages) don't result in permanently lost webhook events.
Retry Schedule
Each message is attempted on the following schedule. Each delay starts when the preceding attempt fails:
- Immediately (first attempt)
- 5 seconds after first failure
- 5 minutes after second failure
- 30 minutes after third failure
- 2 hours after fourth failure
- 5 hours after fifth failure
- 10 hours after sixth failure
- 10 hours after seventh failure (final attempt)
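The schedule above can be written down as data to see when each attempt lands relative to the first. This is only a sketch of the arithmetic: the names `RETRY_DELAYS_SECONDS`, `attemptOffsetsSeconds`, and `formatOffset` are made up for illustration, and the calculation assumes each request fails instantly, which real deliveries will not.

```typescript
// Delay before each attempt, in seconds, copied from the schedule above.
// Index 0 is the first attempt; index i is the wait after failure number i.
const RETRY_DELAYS_SECONDS: readonly number[] = [
  0,         // first attempt, immediately
  5,         // after first failure
  5 * 60,    // after second failure
  30 * 60,   // after third failure
  2 * 3600,  // after fourth failure
  5 * 3600,  // after fifth failure
  10 * 3600, // after sixth failure
  10 * 3600, // after seventh failure (final attempt)
];

// Cumulative offset of each attempt from the first one.
// Assumption: the duration of each failed request is ignored.
function attemptOffsetsSeconds(delays: readonly number[]): number[] {
  const offsets: number[] = [];
  let elapsed = 0;
  for (const delay of delays) {
    elapsed += delay;
    offsets.push(elapsed);
  }
  return offsets;
}

// Formats a number of seconds as, for example, "35m5s" or "27h35m5s".
function formatOffset(totalSeconds: number): string {
  const h = Math.floor(totalSeconds / 3600);
  const m = Math.floor((totalSeconds % 3600) / 60);
  const s = totalSeconds % 60;
  const parts = [h ? `${h}h` : '', m ? `${m}m` : '', s ? `${s}s` : ''];
  return parts.join('') || '0s';
}

const offsets = attemptOffsetsSeconds(RETRY_DELAYS_SECONDS);
console.log(offsets.map((o) => `T+${formatOffset(o)}`).join(', '));
```

Under these assumptions the eighth and final attempt lands about 27 hours, 35 minutes, and 5 seconds after the first.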
Example Timeline
A message that fails three times before succeeding is delivered roughly 35 minutes and 5 seconds after the first attempt (5 seconds + 5 minutes + 30 minutes):
- T+0: First attempt fails
- T+5s: Second attempt fails
- T+5m5s: Third attempt fails
- T+35m5s: Fourth attempt succeeds ✓
What Triggers a Retry?
A webhook delivery is considered failed and is retried in any of the following cases:
HTTP Response Codes and Network Errors
- 4xx errors (except 410 Gone) - Client errors like 400, 404, 429
- 5xx errors - Server errors like 500, 502, 503, 504
- Network timeouts - No response within 15 seconds
- Connection failures - DNS resolution failures, connection refused, etc.
Special Case: 410 Gone
- 410 Gone responses are treated as permanent failures and will not be retried
- Use this response code when you want to permanently disable webhook delivery to an endpoint
Success Indicators
A webhook delivery is considered successful when both of the following are true:
- Your endpoint returns a 2xx status code (200-299)
- The response is received within 15 seconds
Important: We interpret any 2xx response as successful delivery, even if your response payload indicates a failure. Make sure to use the correct HTTP status codes to control retry behavior.
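The rules in this section can be summarized as a small decision function. This is a sketch of the documented behavior, not the sender's actual implementation; `classifyDelivery` and `DeliveryOutcome` are hypothetical names. Status codes this page does not mention (1xx and 3xx) are reported as `'unspecified'` rather than guessed.

```typescript
type DeliveryOutcome = 'success' | 'retry' | 'no-retry' | 'unspecified';

// Response deadline stated in this section.
const RESPONSE_TIMEOUT_MS = 15_000;

// status is null when no HTTP response arrived at all
// (DNS resolution failure, connection refused, and so on).
function classifyDelivery(status: number | null, elapsedMs: number): DeliveryOutcome {
  // Connection failures and responses slower than 15 seconds are retried.
  if (status === null || elapsedMs > RESPONSE_TIMEOUT_MS) return 'retry';
  // Any 2xx counts as delivered, regardless of the response body.
  if (status >= 200 && status <= 299) return 'success';
  // 410 Gone is a permanent failure and is not retried.
  if (status === 410) return 'no-retry';
  // All other 4xx and 5xx responses are retried.
  if (status >= 400 && status <= 599) return 'retry';
  // Assumption: anything else is not covered by this page.
  return 'unspecified';
}

console.log(classifyDelivery(200, 120));
console.log(classifyDelivery(429, 80));
console.log(classifyDelivery(410, 80));
```

A function like this can be handy in your own integration tests to check that each handler path returns a status that triggers the retry behavior you intend.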
Automatic Endpoint Disabling
If every delivery attempt to an endpoint fails for 5 consecutive days, the endpoint is automatically disabled to prevent further failed attempts.
When an endpoint is disabled:
- ❌ No new webhook deliveries will be attempted
- ❌ The endpoint will not receive any events until manually re-enabled
- ✅ You'll be notified about the disabled endpoint
- ✅ You can re-enable it manually from the dashboard
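If you want a warning before an endpoint is disabled, you could track the failure window on your side. The sketch below assumes the 5-day window starts at the first failure after the last successful delivery; this page does not spell out how the window is measured, so treat that as an approximation. `EndpointHealth`, `recordAttempt`, and `msUntilDisabled` are hypothetical names.

```typescript
// 5 consecutive days, in milliseconds.
const DISABLE_AFTER_MS = 5 * 24 * 60 * 60 * 1000;

interface EndpointHealth {
  // Time of the earliest failure since the last success,
  // or null when the most recent delivery succeeded.
  failingSince: Date | null;
}

// Returns the updated health after one delivery attempt.
function recordAttempt(health: EndpointHealth, succeeded: boolean, at: Date): EndpointHealth {
  if (succeeded) return { failingSince: null };
  return { failingSince: health.failingSince ?? at };
}

// Milliseconds left before the assumed 5-day window elapses,
// or null if the endpoint is not currently failing.
function msUntilDisabled(health: EndpointHealth, now: Date): number | null {
  if (health.failingSince === null) return null;
  const elapsed = now.getTime() - health.failingSince.getTime();
  return Math.max(0, DISABLE_AFTER_MS - elapsed);
}

let health: EndpointHealth = { failingSince: null };
health = recordAttempt(health, false, new Date('2024-01-01T00:00:00Z'));
health = recordAttempt(health, false, new Date('2024-01-03T00:00:00Z'));
console.log(msUntilDisabled(health, new Date('2024-01-05T00:00:00Z')));
```

Alerting when the remaining time drops below a threshold of your choosing leaves room to fix the endpoint before it needs to be re-enabled manually.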
Re-enabling a Disabled Endpoint
To re-enable a disabled endpoint:
- Go to the webhook dashboard
- Find the disabled endpoint in the list
- Click on the endpoint
- Select "Enable Endpoint" from the options menu
Manual Retries and Recovery
Single Message Retry
If you want to replay a specific event:
- Find the message in the webhook dashboard UI
- Click the options menu (⋯) next to any of the delivery attempts
- Click "Resend" to send the same message to your endpoint again
Bulk Recovery Options
Option 1: Recover All Failed Messages Since Date
- Go to your endpoint's details page
- Click "Options" → "Recover Failed Messages"
- Choose a time window to recover from
- All failed messages in that timeframe will be retried
Option 2: Recover from Specific Message
- Find any message on the endpoint page
- Click the options menu (⋯) next to the message
- Click "Replay..."
- Choose "Replay all failed messages since this time"
This method gives you more granular control over exactly which messages to retry.
Best Practices for Reliable Webhook Handling
1. Return Correct Status Codes
import { Body, HttpException, HttpStatus, Post } from '@nestjs/common';

// Methods of a NestJS controller. WebhookEvent and processEvent are
// defined elsewhere in your application.
@Post('webhooks/chargebacks')
async handleWebhook(@Body() event: WebhookEvent) {
  try {
    await this.processEvent(event);
    // Success - any 2xx status tells the sender not to retry
    return { received: true };
  } catch (error) {
    if (this.isPermanentError(error)) {
      // Permanent failure - 410 Gone is not retried (see "Special Case: 410 Gone")
      throw new HttpException('Permanent failure', HttpStatus.GONE); // 410
    } else {
      // Temporary failure - allow retries
      throw new HttpException('Temporary failure', HttpStatus.INTERNAL_SERVER_ERROR); // 500
    }
  }
}

private isPermanentError(error: unknown): boolean {
  // Examples of permanent errors that shouldn't be retried:
  // - Malformed payload that will never be valid
  // - Business logic violations that won't change
  // - Authentication issues that require manual intervention
  // Read the code defensively: a thrown value is not guaranteed to be an object.
  const code = (error as { code?: string } | null | undefined)?.code;
  return code === 'INVALID_PAYLOAD' || code === 'AUTHENTICATION_REQUIRED';
}
2. Implement Idempotent Processing
Since webhooks may be retried, ensure your processing is idempotent:
async processChargebackEvent(event: WebhookEvent) {
  // Check if we've already processed this webhook
  const existingRecord = await this.webhookLogRepository.findOne({
    webhookId: event.webhookId
  });
  if (existingRecord?.status === 'processed') {
    // Already processed successfully - return success
    return { status: 'already_processed' };
  }

  // Process the event
  try {
    await this.businessLogic.handleChargeback(event.data);
    // Mark as processed
    await this.webhookLogRepository.save({
      webhookId: event.webhookId,
      status: 'processed',
      processedAt: new Date()
    });
    return { status: 'processed' };
  } catch (error) {
    // Mark as failed, then rethrow so the caller returns a non-2xx
    // status and the delivery is retried
    await this.webhookLogRepository.save({
      webhookId: event.webhookId,
      status: 'failed',
      // A thrown value is not guaranteed to be an Error
      error: error instanceof Error ? error.message : String(error),
      failedAt: new Date()
    });
    throw error;
  }
}
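A check followed by a later save can let two concurrent deliveries of the same webhook both pass the check. One way to close that gap is to claim the webhook ID before doing any work. The sketch below uses an in-memory `Map`, which only protects a single process; in a real deployment the claim would more likely be a unique constraint or an atomic insert in your database. `InMemoryIdempotencyStore` and `processOnce` are hypothetical names.

```typescript
type ClaimResult = 'claimed' | 'in_progress' | 'already_processed';
type ProcessResult = 'processed' | 'in_progress' | 'already_processed';

class InMemoryIdempotencyStore {
  private readonly states = new Map<string, 'in_progress' | 'processed'>();

  // Check and mark in one synchronous step, so no other caller in this
  // process can slip in between the check and the mark.
  claim(webhookId: string): ClaimResult {
    const state = this.states.get(webhookId);
    if (state === 'processed') return 'already_processed';
    if (state === 'in_progress') return 'in_progress';
    this.states.set(webhookId, 'in_progress');
    return 'claimed';
  }

  // Record that the work finished successfully.
  complete(webhookId: string): void {
    this.states.set(webhookId, 'processed');
  }

  // Drop the claim so a later retry can try again.
  release(webhookId: string): void {
    this.states.delete(webhookId);
  }
}

// Runs handler at most once per webhook ID within this process.
async function processOnce(
  store: InMemoryIdempotencyStore,
  webhookId: string,
  handler: () => Promise<void>,
): Promise<ProcessResult> {
  const claim = store.claim(webhookId);
  if (claim !== 'claimed') return claim;
  try {
    await handler();
    store.complete(webhookId);
    return 'processed';
  } catch (error) {
    store.release(webhookId);
    throw error;
  }
}
```

For `'already_processed'` a 2xx response is appropriate. For `'in_progress'`, one option is to return a non-2xx status so the delivery is retried after the first one has finished.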
3. Implement Graceful Timeout Handling
Ensure your webhook endpoint responds within 15 seconds:
@Post('webhooks/chargebacks')
async handleWebhook(@Body() event: WebhookEvent) {
  // Give up waiting slightly before the 15-second limit
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('Processing timeout')), 12000);
  });
  try {
    // Race between processing and timeout
    await Promise.race([this.processWebhookEvent(event), timeoutPromise]);
    return { received: true };
  } catch (error) {
    if (error instanceof Error && error.message === 'Processing timeout') {
      // Queue for async processing and return success.
      // The original call is not cancelled and may still finish, so the
      // queued job must be idempotent (see "2. Implement Idempotent Processing").
      await this.webhookQueue.add('process-webhook', event);
      return { received: true, queued: true };
    }
    throw error;
  } finally {
    // Stop the timer once the race has settled
    clearTimeout(timer);
  }
}
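The race-against-a-timer pattern can be pulled out into a reusable helper. This is a minimal sketch; `withTimeout`, `TimeoutError`, and `isTimeoutError` are hypothetical names, not part of any library. Note that a timeout only stops the wait: the underlying work keeps running, because a promise cannot be cancelled from the outside.

```typescript
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Timed out after ${ms} ms`);
    this.name = 'TimeoutError';
  }
}

// Checks by name so the test also works when compiling to older targets.
function isTimeoutError(error: unknown): boolean {
  return error instanceof Error && error.name === 'TimeoutError';
}

// Resolves or rejects with the result of work, or rejects with a
// TimeoutError if work has not settled within ms milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  // Clear the timer whichever side wins, so it does not keep the process alive.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Usage: a slow task that loses the race.
const slowTask = new Promise<string>((resolve) => setTimeout(() => resolve('done'), 200));
withTimeout(slowTask, 20).catch((error) => {
  if (isTimeoutError(error)) console.log('timed out, queue the event instead');
});
```

Matching on an error class or name rather than on the message string avoids confusing a timeout with an unrelated error that happens to carry the same text.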
4. Monitor and Alert on Failures
Set up monitoring for webhook failures:
@Post('webhooks/chargebacks')
async handleWebhook(@Body() event: WebhookEvent) {
  const startTime = Date.now();
  try {
    await this.processWebhookEvent(event);
    // Count successful processing
    this.metricsService.incrementCounter('webhook.success', {
      event_type: event.event
    });
    return { received: true };
  } catch (error) {
    // A thrown value is not guaranteed to be an Error
    const err = error instanceof Error ? error : new Error(String(error));
    // Count the failure for monitoring
    this.metricsService.incrementCounter('webhook.failure', {
      event_type: event.event,
      error_type: err.constructor.name
    });
    // Alert on critical failures
    if (this.isCriticalError(err)) {
      await this.alertingService.sendAlert({
        type: 'webhook_failure',
        event: event.event,
        error: err.message,
        webhookId: event.webhookId
      });
    }
    // Rethrow so the sender sees a non-2xx status and retries
    throw error;
  } finally {
    // Track processing time for both successes and failures
    const processingTime = Date.now() - startTime;
    this.metricsService.recordHistogram('webhook.processing_time', processingTime, {
      event_type: event.event
    });
  }
}
Monitoring Webhook Health
Keep track of your webhook endpoint health by monitoring:
- Success rate - Percentage of webhooks that succeed on first attempt
- Retry rate - Percentage of webhooks that require retries
- Average processing time - How long your endpoint takes to respond
- Error patterns - Common error types and their frequency
- Endpoint availability - Uptime of your webhook endpoints
Regular monitoring helps you identify and fix issues before they cause webhook endpoints to be disabled.
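As a rough illustration, the first three metrics can be computed from per-message delivery records. The `DeliveryRecord` shape and the `summarize` function below are assumptions made for this sketch; your logs or the dashboard may expose this data differently.

```typescript
interface DeliveryRecord {
  attempts: number;          // total attempts made for this message (at least 1)
  delivered: boolean;        // whether any attempt succeeded
  responseTimesMs: number[]; // response time of each attempt, in milliseconds
}

interface HealthSummary {
  firstAttemptSuccessRate: number; // fraction of messages delivered on the first attempt
  retryRate: number;               // fraction of messages that needed more than one attempt
  averageResponseTimeMs: number;   // mean response time across all attempts
}

function summarize(records: DeliveryRecord[]): HealthSummary {
  if (records.length === 0) {
    return { firstAttemptSuccessRate: 0, retryRate: 0, averageResponseTimeMs: 0 };
  }
  let firstTry = 0;
  let retried = 0;
  let totalMs = 0;
  let samples = 0;
  for (const record of records) {
    if (record.delivered && record.attempts === 1) firstTry += 1;
    if (record.attempts > 1) retried += 1;
    for (const ms of record.responseTimesMs) {
      totalMs += ms;
      samples += 1;
    }
  }
  return {
    firstAttemptSuccessRate: firstTry / records.length,
    retryRate: retried / records.length,
    averageResponseTimeMs: samples > 0 ? totalMs / samples : 0,
  };
}

const sampleRecords: DeliveryRecord[] = [
  { attempts: 1, delivered: true, responseTimesMs: [100] },
  { attempts: 3, delivered: true, responseTimesMs: [200, 300, 100] },
  { attempts: 1, delivered: true, responseTimesMs: [300] },
  { attempts: 2, delivered: false, responseTimesMs: [200, 200] },
];
console.log(summarize(sampleRecords));
```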