Troubleshooting & Failure Recovery

Common Webhook Failures

Webhook endpoints usually fail for a small number of reasons. Understanding them helps you debug and fix problems quickly.

Most Common Issues

1. Not Using the Raw Payload Body

This is the most common issue. When generating the signed content, we use the raw string body of the message payload.

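Problem: If you parse the JSON payload and then convert it back into a string with a method like JSON.stringify(), the result can differ from what was sent (for example in whitespace or number formatting), and different implementations may serialize the same object differently. The signature is computed over the original string, so any such difference causes verification to fail.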

Solution: It's crucial to verify the payload exactly as it was sent, byte-for-byte or string-for-string, to ensure accurate verification.

// ❌ WRONG - This will cause signature verification to fail
@Post('webhooks')
async handleWebhook(@Body() body: any, @Headers() headers: any) {
  const payload = JSON.stringify(body); // Don't do this!
  this.verifySignature(payload, headers);
}

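// ✅ CORRECT - Use raw body for signature verification
// Requires NestFactory.create(AppModule, { rawBody: true }) so that req.rawBody is populated
@Post('webhooks')
async handleWebhook(@Req() req: RawBodyRequest<Request>, @Headers() headers: any) {
  if (!req.rawBody) {
    throw new HttpException('Raw request body is not available', HttpStatus.INTERNAL_SERVER_ERROR);
  }
  const payload = req.rawBody.toString('utf8'); // Exact body as it was sent
  this.verifySignature(payload, headers);
}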
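
To see why re-serializing breaks verification, the following sketch signs the same payload twice: once as received and once after a JSON.parse/JSON.stringify round trip. The secret, the payload, and the sign helper are hypothetical. HMAC-SHA256 with base64 output is an assumption used only for illustration, and the signing scheme your endpoint actually uses may differ.

```typescript
import { createHmac } from 'node:crypto';

// Hypothetical secret and payload, for illustration only.
const secret = 'example-secret';
const rawBody = '{"event": "chargeback.created",  "amount": 10.0}';

// Illustrative signing helper (assumed scheme: HMAC-SHA256, base64-encoded).
function sign(content: string): string {
  return createHmac('sha256', secret).update(content, 'utf8').digest('base64');
}

// Round trip through the JSON parser: whitespace is dropped and 10.0 becomes 10.
const reserialized = JSON.stringify(JSON.parse(rawBody));

const rawSignature = sign(rawBody);
const reserializedSignature = sign(reserialized);

console.log(reserialized);
console.log(rawSignature === reserializedSignature); // false
```

Both strings describe the same JSON object, but they are different strings, so their signatures differ.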

2. Missing or Wrong Secret Key

Problem: Using the incorrect secret key or forgetting to configure it entirely.

Solution: Remember that signing secrets are unique to each endpoint. Double-check your endpoint's signing secret in the webhook dashboard.

// ❌ WRONG - Using a hardcoded or wrong secret
const secret = "wrong-secret-key";

// ✅ CORRECT - Get secret from environment/config
const secret = process.env.WEBHOOK_SECRET; // From your endpoint configuration
if (!secret) {
  throw new Error('WEBHOOK_SECRET environment variable is required');
}
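
Because signing secrets are unique to each endpoint, an application that exposes several webhook endpoints needs one secret per endpoint. The sketch below shows one possible way to look them up. The endpoint names, the environment variable names, and the getSigningSecret helper are all hypothetical.

```typescript
// Hypothetical mapping from endpoint name to the environment variable
// that holds that endpoint's signing secret.
const SECRET_ENV_VARS: Record<string, string> = {
  chargebacks: 'WEBHOOK_SECRET_CHARGEBACKS',
  refunds: 'WEBHOOK_SECRET_REFUNDS',
};

// Returns the secret for one endpoint, failing loudly if it is not configured.
function getSigningSecret(
  endpoint: string,
  env: Record<string, string | undefined>,
): string {
  const variable = SECRET_ENV_VARS[endpoint];
  if (!variable) {
    throw new Error(`Unknown webhook endpoint: ${endpoint}`);
  }
  const secret = env[variable];
  if (!secret) {
    throw new Error(`${variable} environment variable is required`);
  }
  return secret;
}
```

In a Node.js application you would typically call it as getSigningSecret('chargebacks', process.env) at startup, so a missing secret is caught before any webhook arrives.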

3. Sending the Wrong Response Codes

Problem: When we receive a response with a 2xx status code, we interpret that as a successful delivery even if you indicate a failure in the response payload.

Solution: Make sure to use the correct HTTP response status codes to control retry behavior.

@Post('webhooks')
async handleWebhook(@Body() event: WebhookEvent) {
  try {
    await this.processEvent(event);
    
    // ✅ Success - return 2xx status
    return { received: true };
    
  } catch (error) {
    // ❌ WRONG - Don't return 200 with error in payload
    // return { error: 'Processing failed' }; // Still returns 200!
    
    // ✅ CORRECT - Return appropriate error status codes
    if (error.code === 'PERMANENT_FAILURE') {
      throw new HttpException('Cannot process', HttpStatus.GONE); // 410 - Don't retry
    } else {
      throw new HttpException('Temporary failure', HttpStatus.INTERNAL_SERVER_ERROR); // 500 - Retry
    }
  }
}

4. Response Timeouts

Problem: If your endpoint does not respond within 15 seconds, we treat the delivery attempt as failed.

Solution: Endpoints that run complicated workflows inside the request handler may time out, which results in failed messages. We suggest having your endpoint simply receive the message and add it to a queue for asynchronous processing, so that it can respond promptly and avoid timing out.

@Post('webhooks')
async handleWebhook(@Body() event: WebhookEvent) {
  try {
    // ✅ CORRECT - Quick validation and queue for async processing
    this.validateEvent(event);
    
    // Add to queue for async processing (options assume a Bull/BullMQ-style queue)
    await this.webhookQueue.add('process-chargeback', event, {
      attempts: 3,
      backoff: { type: 'exponential', delay: 1000 } // 1s base delay is an example value
    });
    
    // Respond immediately
    return { received: true, queued: true };
    
  } catch (error) {
    console.error('Webhook queuing failed:', error);
    throw new HttpException('Failed to queue webhook', HttpStatus.INTERNAL_SERVER_ERROR);
  }
}

Advanced Troubleshooting

Network and Connectivity Issues

DNS Resolution Problems

# Test if your endpoint is accessible
curl -I https://your-endpoint.com/webhooks/chargebacks

# Check DNS resolution
nslookup your-endpoint.com
dig your-endpoint.com

SSL/TLS Certificate Issues

# Check SSL certificate validity
openssl s_client -connect your-endpoint.com:443 -servername your-endpoint.com

# Verify certificate chain
curl -vvI https://your-endpoint.com/webhooks/chargebacks

Firewall and Security Groups

  • Ensure your server accepts incoming connections on the webhook port
  • Check that webhook requests aren't being blocked by firewalls
  • Verify security groups allow inbound HTTPS traffic (port 443)

Payload and Parsing Issues

JSON Parsing Errors

@Post('webhooks')
async handleWebhook(@Req() req: Request) {
  let event: WebhookEvent;
  
  try {
    // Parse JSON safely
    event = typeof req.body === 'string' 
      ? JSON.parse(req.body) 
      : req.body;
  } catch (error) {
    console.error('JSON parsing failed:', error);
    throw new HttpException('Invalid JSON payload', HttpStatus.BAD_REQUEST);
  }
  
  // Validate required fields
  if (!event.event || !event.data || !event.webhookId) {
    throw new HttpException('Missing required fields', HttpStatus.BAD_REQUEST);
  }
  
  // Process the event...
}

Character Encoding Issues

// Ensure proper UTF-8 handling
@Post('webhooks')
async handleWebhook(@Req() req: RawBodyRequest<Request>) {
  // Make sure to use UTF-8 encoding
  const payload = req.rawBody.toString('utf8');
  
  // Process with proper encoding
  await this.processWebhook(payload);
}
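
As a small illustration of why the decoding step matters, the sketch below decodes the same bytes once as UTF-8 and once as Latin-1. The payload is a made-up example. Any non-ASCII character in the body is enough to make the two results differ, and a signature check on the wrongly decoded string would then fail.

```typescript
import { Buffer } from 'node:buffer';

// Hypothetical payload containing non-ASCII characters.
const original = '{"merchant":"Café Zoë"}';

// The bytes as they would arrive on the wire (UTF-8 encoded).
const bytes = Buffer.from(original, 'utf8');

const decodedUtf8 = bytes.toString('utf8');
const decodedLatin1 = bytes.toString('latin1');

console.log(decodedUtf8 === original);   // true
console.log(decodedLatin1 === original); // false: multi-byte characters are misread
```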

Database and Persistence Issues

Connection Pool Exhaustion

@Injectable()
export class WebhookService {
  constructor(
    @InjectRepository(ChargebackEntity)
    private chargebackRepo: Repository<ChargebackEntity>
  ) {}
  
  async processChargeback(data: ChargebackDto) {
    // Use transactions for atomic operations. manager.transaction() uses a single
    // pooled connection and releases it when the callback finishes.
    return await this.chargebackRepo.manager.transaction(async (manager) => {
      // Your database operations here
      const chargeback = await manager.save(ChargebackEntity, data);

      // Additional operations...

      // If anything above throws, the transaction is rolled back automatically
      return chargeback;
    });
  }
}

Deadlock Prevention

// Process webhooks with proper locking to prevent deadlocks
async processChargebackUpdate(chargebackId: string, data: Partial<ChargebackDto>) {
  return await this.chargebackRepo.manager.transaction(async (manager) => {
    // Lock the record to prevent concurrent updates
    const chargeback = await manager
      .createQueryBuilder(ChargebackEntity, 'cb')
      .setLock('pessimistic_write')
      .where('cb.id = :id', { id: chargebackId })
      .getOne();
    
    if (!chargeback) {
      throw new Error(`Chargeback ${chargebackId} not found`);
    }
    
    // Apply updates
    Object.assign(chargeback, data);
    
    return await manager.save(chargeback);
  });
}

Failure Recovery Strategies

Re-enable a Disabled Endpoint

If all attempts to a specific endpoint fail for a period of 5 days, the endpoint will be disabled.

To re-enable a disabled endpoint:

  1. Go to the webhook dashboard
  2. Find the endpoint from the list
  3. Select "Enable Endpoint"

Recovering/Resending Failed Messages

Single Message Recovery

If you want to replay a single event:

  1. Find the message in the UI
  2. Click the options menu next to any of the attempts
  3. Click "resend" to have the same message sent to your endpoint again

Bulk Recovery from Service Outage

If you need to recover from a service outage and want to replay all events since a given time:

  1. Go to the Endpoint details page
  2. Click "Options" → "Recover Failed Messages"
  3. Choose a time window to recover from

Granular Recovery

For more granular recovery (e.g., if you know the exact timestamp):

  1. Click the options menu on any message from the endpoint page
  2. Click "Replay..."
  3. Choose "Replay all failed messages since this time"

Emergency Procedures

Complete Service Outage Recovery

// 1. Fix your service issues first
// 2. Test with a single webhook to ensure it's working
// 3. Bulk recover failed messages from the outage period
// 4. Monitor recovery progress

@Injectable()
export class WebhookRecoveryService {
  async handleRecoveryPeriod() {
    // Log recovery start
    this.logger.log('Starting webhook recovery process');
    
    // Temporarily increase processing capacity
    await this.scaleUpProcessingWorkers();
    
    // Monitor recovery progress
    this.startRecoveryMonitoring();
  }
  
  private async scaleUpProcessingWorkers() {
    // Increase worker concurrency during recovery. How concurrency is changed
    // depends on your queue library; this assignment is illustrative.
    this.webhookQueue.concurrency = 10; // Increase from normal 3
  }
  
  private startRecoveryMonitoring() {
    // Monitor recovery metrics
    setInterval(async () => {
      // Count methods below assume a Bull/BullMQ-style queue API
      const pendingJobs = await this.webhookQueue.getWaitingCount();
      const failedJobs = await this.webhookQueue.getFailedCount();
      
      this.logger.log(`Recovery progress: ${pendingJobs} pending, ${failedJobs} failed`);
      
      // Alert if recovery is stalling
      if (pendingJobs > 1000) {
        await this.alertingService.sendAlert({
          type: 'recovery_stalling',
          pendingJobs,
          failedJobs
        });
      }
    }, 30000); // Check every 30 seconds
  }
}

Debugging Checklist

When troubleshooting webhook failures, work through this checklist:

✅ Basic Connectivity

  • Endpoint URL is accessible via HTTPS
  • DNS resolves correctly
  • SSL certificate is valid and not expired
  • Firewall allows inbound HTTPS traffic
  • Server is running and responding to requests

✅ Request Handling

  • Endpoint accepts POST requests
  • Content-Type: application/json is handled correctly
  • Raw request body is preserved for signature verification
  • Request processing completes within 15 seconds
  • Proper HTTP status codes are returned

✅ Signature Verification

  • Webhook signing secret is correctly configured
  • Using the raw request body (not re-stringified JSON)
  • Headers are being read correctly (webhook-id, webhook-timestamp, webhook-signature)
  • Signature verification logic is implemented correctly
  • Timestamp tolerance allows for reasonable clock skew
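
The timestamp tolerance item above can be sketched as a small helper. This is only an illustration: it assumes the webhook-timestamp header carries whole seconds since the Unix epoch, and the 5-minute default tolerance and the isTimestampFresh name are hypothetical choices, not documented values.

```typescript
// Returns true if the timestamp is within the tolerance of the current time,
// in either direction, so modest clock skew is accepted.
function isTimestampFresh(
  timestampHeader: string,
  nowMs: number,
  toleranceSeconds = 300, // assumed default, adjust to your needs
): boolean {
  const timestampSeconds = Number(timestampHeader);
  if (!Number.isInteger(timestampSeconds)) {
    return false; // malformed header
  }
  const nowSeconds = Math.floor(nowMs / 1000);
  return Math.abs(nowSeconds - timestampSeconds) <= toleranceSeconds;
}
```

A handler would call it as isTimestampFresh(headers['webhook-timestamp'], Date.now()) and reject the request when it returns false.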

✅ Error Handling

  • Transient errors return 5xx status codes (for retries)
  • Permanent errors return 410 Gone (to stop retries)
  • Success cases return 2xx status codes
  • Proper logging for debugging failed requests

✅ Application Logic

  • Webhook events are processed idempotently
  • Database operations are atomic and handle concurrency
  • Long-running operations are queued for async processing
  • Memory and resource usage are within limits
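
Idempotent processing matters because resends and retries can deliver the same message more than once. A minimal sketch, assuming each message carries a stable identifier such as the webhook-id header value: record the identifier before doing the work and skip repeats. The SeenMessages class and handleOnce function are hypothetical, and the in-memory Set is for illustration only. A real deployment would more likely rely on a unique constraint in the database so that deduplication survives restarts and works across instances.

```typescript
// Tracks message IDs that have been claimed for processing (in memory only).
class SeenMessages {
  private readonly ids = new Set<string>();

  // Returns true the first time an ID is claimed, false for repeats.
  claim(id: string): boolean {
    if (this.ids.has(id)) {
      return false;
    }
    this.ids.add(id);
    return true;
  }

  // Forgets an ID so that a later retry can process it.
  release(id: string): void {
    this.ids.delete(id);
  }
}

// Runs the work at most once per message ID. If the work fails, the claim is
// released so that a retry of the same message is processed again.
async function handleOnce(
  store: SeenMessages,
  messageId: string,
  work: () => Promise<void>,
): Promise<'processed' | 'duplicate'> {
  if (!store.claim(messageId)) {
    return 'duplicate';
  }
  try {
    await work();
    return 'processed';
  } catch (error) {
    store.release(messageId);
    throw error;
  }
}
```

A duplicate delivery should still be answered with a 2xx status, since the message has already been handled.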

Getting Help

If you're still experiencing issues after working through this troubleshooting guide:

  1. Check the webhook dashboard for detailed error logs and retry information
  2. Review your application logs for any error messages or stack traces
  3. Test with the webhook testing feature to isolate the issue
  4. Verify your implementation matches the code examples in this documentation

The webhook dashboard provides detailed logs, delivery attempts, and error messages that can help you pinpoint exactly where the failure is occurring.