Webhook Retry Mechanism

Overview

We attempt to deliver each webhook message based on a retry schedule with exponential backoff. This ensures that temporary failures (like network issues or brief service outages) don't result in permanently lost webhook events.

Retry Schedule

Delivery of each message is attempted on the following schedule. Each delay is counted from the failure of the preceding attempt:

  1. Immediately (first attempt)
  2. 5 seconds after first failure
  3. 5 minutes after second failure
  4. 30 minutes after third failure
  5. 2 hours after fourth failure
  6. 5 hours after fifth failure
  7. 10 hours after sixth failure
  8. 10 hours after seventh failure (final attempt)
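
The schedule above can be modelled as a simple lookup table. The following TypeScript is a minimal sketch for illustration, not the actual delivery implementation; the names `RETRY_DELAYS_SECONDS`, `delayBeforeAttempt`, and `maxRetryWindowSeconds` are hypothetical. It also works out the longest possible span between the first and the final attempt, ignoring the time each attempt itself takes.

```typescript
// Delay in seconds before each attempt, counted from the failure of the
// previous attempt. Index 0 is the first attempt, which is sent immediately.
const RETRY_DELAYS_SECONDS: readonly number[] = [
  0,            // attempt 1: immediately
  5,            // attempt 2: 5 seconds
  5 * 60,       // attempt 3: 5 minutes
  30 * 60,      // attempt 4: 30 minutes
  2 * 60 * 60,  // attempt 5: 2 hours
  5 * 60 * 60,  // attempt 6: 5 hours
  10 * 60 * 60, // attempt 7: 10 hours
  10 * 60 * 60, // attempt 8: 10 hours (final attempt)
];

// Returns the wait before the given 1-based attempt, or null when the
// schedule is exhausted and no further attempt is made.
function delayBeforeAttempt(attempt: number): number | null {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  const delay = RETRY_DELAYS_SECONDS[attempt - 1];
  return delay === undefined ? null : delay;
}

// Longest possible span from the first attempt to the final one.
function maxRetryWindowSeconds(): number {
  return RETRY_DELAYS_SECONDS.reduce((total, delay) => total + delay, 0);
}

console.log(delayBeforeAttempt(4));   // 1800
console.log(delayBeforeAttempt(9));   // null
console.log(maxRetryWindowSeconds()); // 99305 (27 h 35 min 5 s)
```

Under this model, a message whose every attempt fails is given up on a little over 27.5 hours after the first attempt.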

Example Timeline

A message that fails three times before succeeding on the fourth attempt is delivered roughly 35 minutes and 5 seconds after the first attempt:

  • T+0: First attempt fails
  • T+5s: Second attempt fails
  • T+5m5s: Third attempt fails
  • T+35m5s: Fourth attempt succeeds ✓
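
The timeline above is the running sum of the schedule's delays. A small sketch of that arithmetic follows; `DELAYS_AFTER_FAILURE`, `formatOffset`, and `timeline` are hypothetical helper names, and the time each attempt takes to fail is assumed to be negligible.

```typescript
// Seconds to wait after the 1st, 2nd, ... 7th failure (from the retry schedule).
const DELAYS_AFTER_FAILURE: readonly number[] = [5, 300, 1800, 7200, 18000, 36000, 36000];

// Formats a number of seconds as e.g. "5s", "35m5s" or "27h35m5s".
function formatOffset(totalSeconds: number): string {
  const h = Math.floor(totalSeconds / 3600);
  const m = Math.floor((totalSeconds % 3600) / 60);
  const s = totalSeconds % 60;
  const parts = [h > 0 ? `${h}h` : "", m > 0 ? `${m}m` : "", s > 0 ? `${s}s` : ""];
  return parts.join("") || "0";
}

// Returns the offset label of every attempt for a message that fails
// `failures` times before the next attempt is made.
function timeline(failures: number): string[] {
  if (!Number.isInteger(failures) || failures < 0 || failures > DELAYS_AFTER_FAILURE.length) {
    throw new RangeError("failures must be an integer between 0 and 7");
  }
  const labels = ["T+0"];
  let elapsed = 0;
  for (let i = 0; i < failures; i++) {
    elapsed += DELAYS_AFTER_FAILURE[i];
    labels.push(`T+${formatOffset(elapsed)}`);
  }
  return labels;
}

console.log(timeline(3).join(", ")); // T+0, T+5s, T+5m5s, T+35m5s
```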

What Triggers a Retry?

A webhook delivery is considered failed and will be retried in any of the following cases:

Failure Conditions

  • 4xx errors (except 410 Gone) - Client errors like 400, 404, 429
  • 5xx errors - Server errors like 500, 502, 503, 504
  • Network timeouts - No response within 15 seconds
  • Connection failures - DNS resolution failures, connection refused, etc.

Special Case: 410 Gone

  • 410 Gone responses are treated as permanent failures and will not be retried
  • Use this response code when you want to permanently disable webhook delivery to an endpoint

Success Indicators

A webhook delivery is considered successful when both of the following hold:

  • Your endpoint returns a 2xx status code (200-299)
  • The response is received within 15 seconds

Important: We interpret any 2xx response as successful delivery, even if your response payload indicates a failure. Make sure to use the correct HTTP status codes to control retry behavior.
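
Taken together, the rules above amount to a three-way classification of each delivery attempt. The sketch below restates them in TypeScript. The types `DeliveryResult` and `Outcome` and the function `classifyDelivery` are hypothetical names, and treating any status outside 2xx and 410 (for example 1xx or 3xx) as retryable is an assumption, since this page only describes 2xx, 4xx, and 5xx responses.

```typescript
// What happened when a delivery was attempted.
type DeliveryResult =
  | { kind: "response"; status: number } // a response arrived within 15 seconds
  | { kind: "timeout" }                  // no response within 15 seconds
  | { kind: "connection_error" };        // DNS failure, connection refused, etc.

type Outcome = "success" | "retry" | "permanent_failure";

function classifyDelivery(result: DeliveryResult): Outcome {
  // Timeouts and connection failures are always retried.
  if (result.kind !== "response") return "retry";

  // 410 Gone is the one status that stops retries.
  if (result.status === 410) return "permanent_failure";

  // Any 2xx counts as delivered, regardless of the response body.
  if (result.status >= 200 && result.status <= 299) return "success";

  // Other 4xx and all 5xx are retried. Assumption: so is anything else.
  return "retry";
}

console.log(classifyDelivery({ kind: "response", status: 204 })); // success
console.log(classifyDelivery({ kind: "response", status: 429 })); // retry
console.log(classifyDelivery({ kind: "response", status: 410 })); // permanent_failure
console.log(classifyDelivery({ kind: "timeout" }));               // retry
```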

Automatic Endpoint Disabling

If every delivery attempt to an endpoint fails for 5 consecutive days, the endpoint is automatically disabled to prevent further failed attempts.

When an endpoint is disabled:

  • ❌ No new webhook deliveries will be attempted
  • ❌ The endpoint will not receive any events until manually re-enabled
  • ✅ You'll be notified about the disabled endpoint
  • ✅ You can re-enable it manually from the dashboard
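
If you log delivery attempts on your side, you can estimate how close an endpoint is to being disabled. The following is a rough sketch under stated assumptions: the `Attempt` shape and both function names are hypothetical, and the 5-day window is assumed to start at the first failure of an unbroken run of failures. How the service itself measures the window is not specified on this page.

```typescript
const DISABLE_AFTER_MS = 5 * 24 * 60 * 60 * 1000; // 5 days

interface Attempt {
  at: Date;           // when the attempt was made
  succeeded: boolean; // whether the endpoint returned 2xx in time
}

// Returns the start of the current unbroken run of failures, or null if
// there are no attempts or the most recent attempt succeeded.
function failureStreakStart(attempts: readonly Attempt[]): Date | null {
  const ordered = [...attempts].sort((a, b) => a.at.getTime() - b.at.getTime());
  let start: Date | null = null;
  for (const attempt of ordered) {
    if (attempt.succeeded) {
      start = null; // any success resets the streak
    } else if (start === null) {
      start = attempt.at;
    }
  }
  return start;
}

// True when the current failure streak has lasted at least 5 days.
function wouldBeDisabled(attempts: readonly Attempt[], now: Date): boolean {
  const start = failureStreakStart(attempts);
  return start !== null && now.getTime() - start.getTime() >= DISABLE_AFTER_MS;
}
```

A check like this can drive an early warning, for example alerting once the streak passes three or four days.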

Re-enabling a Disabled Endpoint

To re-enable a disabled endpoint:

  1. Go to the webhook dashboard
  2. Find the disabled endpoint in the list
  3. Click on the endpoint
  4. Select "Enable Endpoint" from the options menu

Manual Retries and Recovery

Single Message Retry

If you want to replay a specific event:

  1. Find the message in the webhook dashboard UI
  2. Click the options menu (⋯) next to any of the delivery attempts
  3. Click "Resend" to send the same message to your endpoint again

Bulk Recovery Options

Option 1: Recover All Failed Messages Since Date

  1. Go to your endpoint's details page
  2. Click "Options" → "Recover Failed Messages"
  3. Choose a time window to recover from
  4. All failed messages in that timeframe will be retried

Option 2: Recover from Specific Message

  1. Find any message on the endpoint page
  2. Click the options menu (⋯) next to the message
  3. Click "Replay..."
  4. Choose "Replay all failed messages since this time"

This method gives you more granular control over exactly which messages to retry.

Best Practices for Reliable Webhook Handling

1. Return Correct Status Codes

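import { Body, HttpException, HttpStatus, Post } from '@nestjs/common';

// The methods below belong to a NestJS controller class. WebhookEvent and
// processEvent are defined by your application.
@Post('webhooks/chargebacks')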
async handleWebhook(@Body() event: WebhookEvent) {
  try {
    await this.processEvent(event);
    
    // Success - return 2xx status
    return { received: true };
    
  } catch (error) {
    if (this.isPermanentError(error)) {
      // Permanent failure - don't retry
      throw new HttpException('Permanent failure', HttpStatus.GONE); // 410
    } else {
      // Temporary failure - allow retries
      throw new HttpException('Temporary failure', HttpStatus.INTERNAL_SERVER_ERROR); // 500
    }
  }
}

private isPermanentError(error: unknown): boolean {
  // Examples of permanent errors that shouldn't be retried:
  // - Malformed payload that will never be valid
  // - Business logic violations that won't change
  // - Authentication issues that require manual intervention
  // Read the code defensively: a thrown value is not guaranteed to be an object.
  const code = (error as { code?: string } | null | undefined)?.code;
  return code === 'INVALID_PAYLOAD' ||
         code === 'AUTHENTICATION_REQUIRED';
}

2. Implement Idempotent Processing

Since webhooks may be retried, ensure your processing is idempotent:

async processChargebackEvent(event: WebhookEvent) {
  // Check if we've already processed this webhook
  const existingRecord = await this.webhookLogRepository.findOne({
    webhookId: event.webhookId
  });
  
  if (existingRecord && existingRecord.status === 'processed') {
    // Already processed successfully - return success
    return { status: 'already_processed' };
  }
  
  // Process the event
  try {
    await this.businessLogic.handleChargeback(event.data);
    
    // Mark as processed
    await this.webhookLogRepository.save({
      webhookId: event.webhookId,
      status: 'processed',
      processedAt: new Date()
    });
    
    return { status: 'processed' };
    
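  } catch (error) {
    // Mark as failed so the next retry processes the event again.
    // Assumes webhookId is the record's unique key, so save() updates the
    // existing row instead of inserting a duplicate.
    await this.webhookLogRepository.save({
      webhookId: event.webhookId,
      status: 'failed',
      error: error instanceof Error ? error.message : String(error),
      failedAt: new Date()
    });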
    
    throw error;
  }
}

3. Implement Graceful Timeout Handling

Ensure your webhook endpoint responds within 15 seconds:

@Post('webhooks/chargebacks')
async handleWebhook(@Body() event: WebhookEvent) {
  // Give up waiting slightly before the 15-second limit
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('Processing timeout')), 12000);
  });

  try {
    // Race between processing and timeout
    await Promise.race([
      this.processWebhookEvent(event),
      timeoutPromise
    ]);

    return { received: true };

  } catch (error) {
    if (error instanceof Error && error.message === 'Processing timeout') {
      // Queue for async processing and return success.
      // The original call keeps running after it loses the race, so the
      // event may be handled twice; processing must be idempotent.
      await this.webhookQueue.add('process-webhook', event);
      return { received: true, queued: true };
    }
    throw error;
  } finally {
    // Stop the timer so it does not linger after a fast response
    clearTimeout(timer);
  }
}

4. Monitor and Alert on Failures

Set up monitoring for webhook failures:

@Post('webhooks/chargebacks')
async handleWebhook(@Body() event: WebhookEvent) {
  const startTime = Date.now();
  
  try {
    await this.processWebhookEvent(event);
    
    // Log successful processing
    this.metricsService.incrementCounter('webhook.success', {
      event_type: event.event
    });

    return { received: true };

  } catch (error) {
    // Log failure for monitoring
    this.metricsService.incrementCounter('webhook.failure', {
      event_type: event.event,
      error_type: error instanceof Error ? error.constructor.name : typeof error
    });

    // Alert on critical failures
    if (this.isCriticalError(error)) {
      await this.alertingService.sendAlert({
        type: 'webhook_failure',
        event: event.event,
        error: error instanceof Error ? error.message : String(error),
        webhookId: event.webhookId
      });
    }
    
    throw error;
  } finally {
    // Track processing time
    const processingTime = Date.now() - startTime;
    this.metricsService.recordHistogram('webhook.processing_time', processingTime, {
      event_type: event.event
    });
  }
}

Monitoring Webhook Health

Keep track of your webhook endpoint health by monitoring:

  • Success rate - Percentage of webhooks that succeed on first attempt
  • Retry rate - Percentage of webhooks that require retries
  • Average processing time - How long your endpoint takes to respond
  • Error patterns - Common error types and their frequency
  • Endpoint availability - Uptime of your webhook endpoints

Regular monitoring helps you identify and fix issues before they cause webhook endpoints to be disabled.
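
The first two metrics in the list above can be computed from a simple per-message log. This is a minimal sketch, assuming you record the number of attempts and the final result for each message; `DeliveryRecord`, `HealthSummary`, and `summarizeHealth` are hypothetical names, not part of any dashboard or API.

```typescript
interface DeliveryRecord {
  attempts: number;   // total attempts made for this message (at least 1)
  delivered: boolean; // whether any attempt eventually succeeded
}

interface HealthSummary {
  firstAttemptSuccessRate: number; // percent of messages delivered on attempt 1
  retryRate: number;               // percent of messages that needed a retry
}

function summarizeHealth(records: readonly DeliveryRecord[]): HealthSummary {
  if (records.length === 0) {
    // No traffic yet: report zeros rather than dividing by zero.
    return { firstAttemptSuccessRate: 0, retryRate: 0 };
  }
  const firstTry = records.filter((r) => r.delivered && r.attempts === 1).length;
  const retried = records.filter((r) => r.attempts > 1).length;
  return {
    firstAttemptSuccessRate: (firstTry / records.length) * 100,
    retryRate: (retried / records.length) * 100,
  };
}

const sample: DeliveryRecord[] = [
  { attempts: 1, delivered: true },
  { attempts: 1, delivered: true },
  { attempts: 3, delivered: true },
  { attempts: 8, delivered: false },
];
console.log(summarizeHealth(sample)); // firstAttemptSuccessRate: 50, retryRate: 50
```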