Distributed system debugging strategies
This document explores debugging techniques for complex distributed systems involving multiple services, covering systematic log analysis across service boundaries, identifying what changed between working and failing states, rollback strategies, load balancer troubleshooting, removing faulty servers from pools, and managing cloud-based infrastructure with resource limits and automated deployment pipelines.
Troubleshooting problems on a single computer differs significantly from debugging complex systems with many interacting services. When multiple computers and services work together to provide functionality, problems can arise from any component or their interactions. Effective debugging requires understanding the bigger picture, analyzing logs across services, identifying changes, and managing infrastructure at scale.
Complex systems consist of multiple interconnected services working together to provide functionality. Examples include:
| Layer | Components | Purpose |
|---|---|---|
| Front-end | Web servers, application servers | User interface and initial request handling |
| Load balancers | Traffic distribution, health checks | Distribute requests across servers |
| Application layer | Business logic services | Process requests and implement functionality |
| Back-end services | Authentication, inventory, billing, procurement | Specialized functionality |
| Data layer | Databases, caches, storage | Data persistence and retrieval |
| External services | APIs, third-party integrations | Extended functionality |
Complex systems have interdependencies where one service relies on others:
```text
User Request
     ↓
Load Balancer
     ↓
Front-end Server → Authentication Service
     ↓                    ↓
Back-end Server  → Database
     ↓
Inventory Service → Cache
     ↓
Billing Service → External Payment API
```
Failure in any component can cascade through the system.
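To make the cascade concrete, the following is a minimal sketch that walks a dependency map and lists every service that would be affected, directly or indirectly, when one component fails. The `DEPENDS_ON` map is only one reading of the diagram above, and the service names and the `affected_by` helper are illustrative assumptions rather than part of any real tool.

```python
from collections import deque

# Hypothetical dependency map (one reading of the diagram above):
# each service lists the services it calls directly.
DEPENDS_ON = {
    "load-balancer": ["front-end"],
    "front-end": ["authentication", "back-end"],
    "authentication": ["database"],
    "back-end": ["database", "inventory"],
    "inventory": ["cache", "billing"],
    "billing": ["payment-api"],
    "database": [],
    "cache": [],
    "payment-api": [],
}


def affected_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map so that each service knows who calls it.
    callers = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            callers[dep].append(service)

    # Breadth-first walk upward from the failed service.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers[current]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


print(sorted(affected_by("cache", DEPENDS_ON)))
# ['back-end', 'front-end', 'inventory', 'load-balancer']
```

Even a failure in a leaf component such as the cache reaches all the way up to the load balancer in this model, which is why the investigation often has to look beyond the service that reports the error.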
An e-commerce site recently started responding with an internal server error (HTTP 500) to approximately 20% of all requests. Users experience intermittent failures when browsing products or checking out.
| Observation | Details |
|---|---|
| Error rate | 20% of requests |
| Error type | Internal server error (500) |
| Onset | Recent (no specific timing initially known) |
| Affected users | Random subset |
| Pattern | Intermittent, not consistent |
Apply the same troubleshooting principles used for single-computer problems, but at larger scale. Start with the logs, which in a multi-service system live in several places:
| Log Type | Location | Information Provided |
|---|---|---|
| Service-specific logs | Application log directory | Service-level errors and warnings |
| System logs | /var/log/syslog, /var/log/messages | General system problems |
| Web server logs | /var/log/apache2/, /var/log/nginx/ | HTTP errors, access patterns |
| Database logs | Database-specific locations | Query errors, connection issues |
| Load balancer logs | Load balancer configuration | Traffic distribution, health checks |
For the e-commerce example, examine logs for the failing service:
```bash
tail -n 1000 /var/log/ecommerce/app.log | grep ERROR
```
Output reveals:
```text
2025-11-13 10:23:15 ERROR: invalid response from server
2025-11-13 10:23:18 ERROR: invalid response from server
2025-11-13 10:23:22 ERROR: invalid response from server
```
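The error message “invalid response from server” indicates that the application received a response it could not use from some other server, but it does not say which server was called or what was wrong with the response.

Note: Vague error messages like “invalid response” are common but unhelpful. While debugging, record them as candidates for improvement so that they can later include specific details about what was invalid.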
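When the log is long, grouping errors by message shows quickly whether there is one dominant failure or several. The sketch below assumes log lines in the `timestamp LEVEL: message` shape shown above; the `summarize_errors` helper and the `INFO` sample line are illustrative assumptions.

```python
import re
from collections import Counter

# Assumed line shape: "YYYY-MM-DD HH:MM:SS LEVEL: message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+): (.*)$")


def summarize_errors(lines):
    """Count ERROR lines, grouped by their message text."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(2) == "ERROR":
            counts[match.group(3)] += 1
    return counts


sample = [
    "2025-11-13 10:23:15 ERROR: invalid response from server",
    "2025-11-13 10:23:16 INFO: request served",  # hypothetical non-error line
    "2025-11-13 10:23:18 ERROR: invalid response from server",
    "2025-11-13 10:23:22 ERROR: invalid response from server",
]
print(summarize_errors(sample).most_common())
# [('invalid response from server', 3)]
```

For a real file, the same function can be fed an open file handle, since it only iterates over lines.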
When the service was last working correctly:
| Time Period | Status |
|---|---|
| Previous week | Service deployed, functioning normally |
| Monday | No issues reported |
| Tuesday morning | Errors begin appearing |
| Change Type | Examples | Investigation Method |
|---|---|---|
| Code deployments | New application versions | Check deployment logs, version control |
| Configuration changes | Settings, feature flags | Review configuration management |
| Infrastructure changes | Server additions, network changes | Check infrastructure logs |
| Dependency updates | Library versions, OS patches | Review package management logs |
| External service changes | API updates, third-party services | Check service provider announcements |
For the e-commerce example:
```bash
# Check recent deployments
git log --since="1 week ago" --oneline

# Check configuration changes from the past week
# (HEAD~7 would mean seven commits back, not seven days)
git log -p --since="1 week ago" -- config/

# Check infrastructure changes
tail -n 500 /var/log/infrastructure-changes.log
```
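Findings: these checks turn up no change in the e-commerce service itself that lines up with the onset of the errors, so the investigation expands to its dependencies: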
| System | Change Date | Details |
|---|---|---|
| Database | No recent changes | Last update 2 weeks ago |
| Authentication service | No recent changes | Stable |
| Inventory system | No recent changes | Stable |
| Load balancer | Tuesday morning | Multiple configuration changes |
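Load balancer changes are suspicious because they were made on Tuesday morning, when the errors began, and because every request passes through the load balancer, so a misconfigured pool can affect a fraction of all traffic.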
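Matching change timestamps against the onset of the errors can be automated once the change history is collected in one place. This is a minimal sketch: the `changes_before_onset` helper, the 24-hour window, the onset time, and the database change date are assumptions for illustration; only the load balancer deployment time follows the example in this document.

```python
from datetime import datetime, timedelta


def changes_before_onset(changes, onset, window=timedelta(hours=24)):
    """Return changes made within `window` before `onset`, most recent first."""
    recent = [
        (when, what)
        for when, what in changes
        if onset - window <= when <= onset
    ]
    return sorted(recent, reverse=True)


# Hypothetical change history gathered from several systems.
changes = [
    (datetime(2025, 10, 30, 9, 0), "database update"),
    (datetime(2025, 11, 13, 8, 0), "load balancer config v2.3.5"),
]
onset = datetime(2025, 11, 13, 10, 0)  # assumed time of the first errors

for when, what in changes_before_onset(changes, onset):
    print(when, what)
# 2025-11-13 08:00:00 load balancer config v2.3.5
```

A short list of changes that landed just before the onset is a list of suspects, not a verdict; each entry still has to be confirmed or ruled out.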
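Rollback is appropriate when a recent change coincides with the onset of the problem and the previous version can be restored safely. Its benefits: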
| Benefit | Description |
|---|---|
| Immediate resolution | Restores service if change was the cause |
| Eliminates suspects | Rules out change if problem persists |
| Buys investigation time | Allows thorough debugging without user impact |
| Reduces risk | Prevents further degradation |
```bash
# Check current load balancer configuration version
load-balancer-cli --show-version
# Output: v2.3.5 (deployed 2025-11-13 08:00)

# View previous configuration
load-balancer-cli --show-config --version v2.3.4

# Rollback to previous version
load-balancer-cli --rollback --version v2.3.4

# Monitor error rate
watch -n 5 "tail -n 100 /var/log/ecommerce/app.log | grep -c ERROR"
```
| Scenario | Action | Reason |
|---|---|---|
| Rollback fixes issue | Investigate change that caused problem | Identified root cause |
| Rollback doesn’t help | Continue investigating other components | Eliminated one suspect |
| Can’t rollback easily | Investigate before making changes | Avoid making situation worse |
Important: Even without certainty that a change caused the issue, attempt a rollback when one is possible. It either fixes the problem immediately or eliminates a suspect, and both outcomes are valuable.
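The two outcomes of a rollback can be told apart by comparing the error rate before and after it. The following sketch assumes a list of HTTP status codes sampled from access logs; the function names and the 1% tolerance are illustrative assumptions, not a recommended threshold.

```python
def error_rate(status_codes):
    """Fraction of responses that are server errors (HTTP 5xx)."""
    if not status_codes:
        return 0.0
    failures = sum(1 for code in status_codes if 500 <= code <= 599)
    return failures / len(status_codes)


def rollback_verdict(rate_after, tolerance=0.01):
    """Classify the outcome of a rollback from the error rate measured after it."""
    if rate_after <= tolerance:
        return "rollback fixed the issue: investigate the reverted change"
    return "problem persists: change ruled out, keep investigating"


# Simulated samples: 20% errors before the rollback, none after.
before = [500] * 20 + [200] * 80
after = [200] * 100

print(f"{error_rate(before):.0%} -> {error_rate(after):.0%}")
# 20% -> 0%
print(rollback_verdict(error_rate(after)))
# rollback fixed the issue: investigate the reverted change
```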
After rollback resolves the issue, examine what specifically caused the problem:
```bash
# Compare configurations
diff /etc/load-balancer/config.v2.3.4 /etc/load-balancer/config.v2.3.5
```
Difference found:
```diff
+ backend_server pool inventory {
+     server 192.168.1.45:8080;
+ }
```
Problem identified: 192.168.1.45 was added to the inventory system pool.

Original error:
```text
ERROR: invalid response from server
```
What it should have included:
```text
ERROR: invalid response from inventory service
Request: GET /api/inventory/items/12345
Expected: 200 OK with JSON inventory data
Received: 404 Not Found
Server: 192.168.1.45:8080
Reason: Expected inventory response, got 404 error page
```
Better error messages should include:
| Information | Purpose |
|---|---|
| Service being called | Identifies which dependency failed |
| Request details | Shows what was attempted |
| Expected response | Clarifies what should have happened |
| Actual response | Shows what actually happened |
| Server address | Helps identify specific server issues |
| Reason for invalidity | Explains why response was rejected |
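One way to make such messages routine is to raise an exception type that cannot be constructed without the fields in the table. The sketch below is an assumption about how an application might do this; the `InvalidResponseError` class is hypothetical and the values are taken from the example message above.

```python
class InvalidResponseError(Exception):
    """Error that carries enough context to identify the failing dependency."""

    def __init__(self, service, request, expected, received, server, reason):
        self.service = service
        self.request = request
        self.expected = expected
        self.received = received
        self.server = server
        self.reason = reason
        super().__init__(self._format())

    def _format(self):
        # One field per line, mirroring the improved log entry.
        return "\n".join([
            f"invalid response from {self.service}",
            f"Request: {self.request}",
            f"Expected: {self.expected}",
            f"Received: {self.received}",
            f"Server: {self.server}",
            f"Reason: {self.reason}",
        ])


err = InvalidResponseError(
    service="inventory service",
    request="GET /api/inventory/items/12345",
    expected="200 OK with JSON inventory data",
    received="404 Not Found",
    server="192.168.1.45:8080",
    reason="Expected inventory response, got 404 error page",
)
print(f"ERROR: {err}")
```

Because every field is a required argument, a caller that omits the server address or the reason fails at development time instead of producing a vague log line in production.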
Two weeks later, internal server errors appear again in the same e-commerce service.
Warning: It is tempting to assume the load balancer is at fault again, but each incident should be investigated independently. The same symptom does not guarantee the same cause.
Checking logs with improved error messages:
```bash
grep ERROR /var/log/ecommerce/app.log | tail -n 50
```
Output:
```text
2025-11-27 14:15:33 ERROR: database connection timeout
Server: db-server-03 (192.168.2.33)
Query: SELECT * FROM products WHERE id = ?
Timeout: 30 seconds elapsed
Connection pool: 95/100 connections in use
```
New finding: Only one front-end server (frontend-02) shows errors.
```bash
# Check which servers are experiencing errors
# (count ERROR entries in the last 100 log lines on each server)
for server in frontend-01 frontend-02 frontend-03 frontend-04; do
    printf '%s: ' "$server"
    ssh "$server" "tail -n 100 /var/log/ecommerce/app.log | grep -c ERROR"
done
```
Output:
```text
frontend-01: 0
frontend-02: 47
frontend-03: 0
frontend-04: 0
```
Only frontend-02 is affected.
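With four servers the outlier is obvious by eye, but with dozens it helps to compare each server against the rest of the fleet. This is a rough heuristic sketch, not a statistical test: the `outlier_servers` helper, the factor of 5, and the floor of 10 errors are all assumptions chosen for illustration.

```python
from statistics import median


def outlier_servers(error_counts, factor=5, floor=10):
    """Return servers whose error count is far above the fleet median."""
    typical = median(error_counts.values())
    # The floor keeps a handful of errors from counting as an outlier
    # when the median is zero.
    threshold = max(floor, typical * factor)
    return sorted(name for name, count in error_counts.items() if count > threshold)


counts = {
    "frontend-01": 0,
    "frontend-02": 47,
    "frontend-03": 0,
    "frontend-04": 0,
}
print(outlier_servers(counts))
# ['frontend-02']
```

If every server shows a similar count, the function returns an empty list, which points toward a shared dependency rather than a single faulty machine.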
When a specific server is identified as problematic:
```bash
# Remove server from active pool
load-balancer-cli --remove-server frontend-02 --pool ecommerce-frontend

# Verify removal
load-balancer-cli --list-servers --pool ecommerce-frontend
```
Output:
```text
Active servers in ecommerce-frontend pool:
- frontend-01 (192.168.1.21) - healthy
- frontend-03 (192.168.1.23) - healthy
- frontend-04 (192.168.1.24) - healthy

Removed servers:
- frontend-02 (192.168.1.22) - removed for investigation
```
| Benefit | Description |
|---|---|
| User protection | Prevents users from hitting faulty server |
| Service continuity | Other servers continue serving traffic |
| Investigation time | Allows thorough diagnosis without urgency |
| Testing safety | Can test fixes without affecting users |
With the server removed from the pool, investigate safely:
```bash
# SSH to problem server
ssh frontend-02

# Check system resources
top
df -h
free -m

# Count connections to the database (3306 is the MySQL default port;
# use the port your database listens on, e.g. 5432 for PostgreSQL)
netstat -an | grep :3306 | wc -l

# Check application logs
tail -n 500 /var/log/ecommerce/app.log
```
Diagnosis: Database connection pool exhausted on this server due to connection leak.
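A connection leak typically happens when an error path skips the code that returns a connection to the pool. The sketch below reproduces that pattern with a toy in-memory pool; `TinyPool`, `leaky_handler`, and `safe_handler` are hypothetical stand-ins and do not represent any real database driver.

```python
import queue
from contextlib import contextmanager


class TinyPool:
    """Toy stand-in for a database connection pool (not a real driver)."""

    def __init__(self, size):
        self.size = size
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self):
        try:
            return self._free.get_nowait()
        except queue.Empty:
            raise RuntimeError("connection pool exhausted") from None

    def release(self, conn):
        self._free.put(conn)

    def in_use(self):
        return self.size - self._free.qsize()

    @contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn
        finally:
            self.release(conn)  # runs even when the body raises


def leaky_handler(pool, fail):
    conn = pool.acquire()
    if fail:
        raise ValueError("query failed")  # conn is never released: leak
    pool.release(conn)


def safe_handler(pool, fail):
    with pool.connection():
        if fail:
            raise ValueError("query failed")


for handler in (leaky_handler, safe_handler):
    pool = TinyPool(size=3)
    for _ in range(3):
        try:
            handler(pool, fail=True)
        except ValueError:
            pass
    print(handler.__name__, pool.in_use())
# leaky_handler 3
# safe_handler 0
```

After three failed requests the leaky version has exhausted the pool, so the next request on that server fails even though the database itself is healthy.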
| Logging Feature | Purpose | Implementation |
|---|---|---|
| Centralized logging | Aggregate logs from all services | ELK Stack, Splunk, CloudWatch |
| Structured logs | Machine-readable log format | JSON logging |
| Log levels | Separate debug, info, warning, error | Standard logging libraries |
| Request tracing | Track requests across services | Distributed tracing (Jaeger, Zipkin) |
| Correlation IDs | Link related log entries | UUID in request headers |
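Structured logs and correlation IDs from the table above can be combined with the standard `logging` module. This is a minimal sketch: the `JsonFormatter` class, the `make_logger` helper, the field names, and the service name are assumptions, and a production setup would more likely use an established JSON logging library.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            # Set via `extra=`; None when the caller did not supply one.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)


def make_logger(service):
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


# One ID is generated per incoming request and passed to every service
# that handles it, so related entries can be found with a single search.
correlation_id = str(uuid.uuid4())
log = make_logger("inventory")
log.error("database connection timeout", extra={"correlation_id": correlation_id})
```

In a real system the ID would travel between services in a request header, as the table suggests, and each service would attach it to every log entry it writes for that request.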
| Monitoring Type | Metrics | Tools |
|---|---|---|
| Service health | Uptime, response time, error rate | Prometheus, Datadog, New Relic |
| Resource usage | CPU, memory, disk, network | Grafana, CloudWatch |
| Business metrics | Transactions, revenue, user activity | Custom dashboards |
| Alerting | Automated notifications | PagerDuty, Opsgenie |
```bash
# Infrastructure as Code example
git log --oneline infrastructure/

# Output shows all infrastructure changes
a1b2c3d Update load balancer config - add new backend server
d4e5f6g Scale database connection pool
g7h8i9j Deploy application version 2.3.1
```
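Benefits: with infrastructure kept in version control, every change is recorded with its author and date, so it is possible to see what changed before an incident and to revert it.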
Complex systems require the ability to quickly deploy new machines when a faulty server has to be replaced or when more capacity is needed.
| Strategy | Description | Use Case |
|---|---|---|
| Standby servers | Pre-configured machines ready to activate | Immediate failover needs |
| Automated pipelines | Scripts deploy new servers on demand | Scalable, cost-effective |
| Container orchestration | Kubernetes, Docker Swarm manage containers | Microservices architecture |
| Serverless functions | Cloud provider manages infrastructure | Event-driven workloads |
```bash
# Deploy new front-end server
deploy-script --service ecommerce-frontend --count 1 --region us-east

# Pipeline actions:
# 1. Provision virtual machine in cloud
# 2. Install operating system and dependencies
# 3. Deploy application code
# 4. Configure networking and security
# 5. Run health checks
# 6. Add to load balancer pool
# 7. Monitor for successful deployment
```
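The important property of such a pipeline is ordering: a server must not reach the load balancer pool unless every earlier step, including the health checks, succeeded. The sketch below models only that control flow; `run_pipeline` is a hypothetical helper and the lambdas stand in for real provisioning actions.

```python
def run_pipeline(steps):
    """Run steps in order and stop at the first one that fails.

    Returns (completed_step_names, failed_step_name_or_None).
    """
    completed = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


# Placeholder actions: each returns True on success, False on failure.
steps = [
    ("provision VM", lambda: True),
    ("install dependencies", lambda: True),
    ("deploy application", lambda: True),
    ("run health checks", lambda: False),  # simulated failure
    ("add to load balancer pool", lambda: True),
]

completed, failed = run_pipeline(steps)
print("completed:", completed)
print("failed at:", failed)
# completed: ['provision VM', 'install dependencies', 'deploy application']
# failed at: run health checks
```

Because the run stops at the failed health check, the unhealthy server never receives user traffic.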
Time to deployment:
```bash
# Scale up during high traffic
deploy-script --service ecommerce-frontend --count 5 --region us-east

# Scale down during low traffic
decommission-script --service ecommerce-frontend --count 3
```
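Benefits: capacity follows demand, so the service stays responsive under load without paying for idle servers.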
Cloud and virtual environments often have artificial limits:
| Resource | Limit Type | Example |
|---|---|---|
| CPU time | Percentage cap | 80% CPU usage maximum |
| RAM | Memory allocation | 8 GB maximum per instance |
| Network bandwidth | Throughput cap | 1 Gbps maximum |
| Disk I/O | IOPS limit | 3000 operations/second |
| Storage | Capacity limit | 500 GB per volume |
| Service Type | Limit | Impact |
|---|---|---|
| Database connections | Maximum concurrent connections | Connection exhaustion errors |
| API rate limits | Requests per second | Throttling errors |
| Data storage | Storage quota | Write failures |
| Network connections | Open socket limit | Connection refused errors |
```python
# Application hitting connection limit
import psycopg2

try:
    conn = psycopg2.connect(
        host="db-server",
        database="ecommerce",
        user="app_user",
        password="password"
    )
except psycopg2.OperationalError as e:
    # Error: FATAL: remaining connection slots are reserved
    # for non-replication superuser connections
    print(f"Connection failed: {e}")
```
Solution approaches:
| Approach | Description | Trade-off |
|---|---|---|
| Increase limit | Request higher connection quota | May incur additional cost |
| Connection pooling | Reuse connections efficiently | Requires code changes |
| Optimize queries | Reduce connection hold time | Development effort |
| Scale horizontally | Add more database replicas | Complexity increases |
```bash
# Check if hitting CPU limit
top
# If CPU stuck at specific percentage (e.g., 80%), may be capped

# Check if hitting memory limit
free -m
# If consistently at maximum, may need more RAM

# Check if hitting network limit
iftop
# If throughput plateaus at specific rate, may be bandwidth cap

# Check database connection limit
psql -c "SELECT count(*) FROM pg_stat_activity;"
# Compare to max_connections setting
psql -c "SHOW max_connections;"
```
Caution: Cloud resource limits can be difficult to detect because applications may appear slow or fail intermittently rather than showing obvious error messages about limits.
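Since a cap rarely announces itself, one rough way to spot it is to look for a metric that keeps returning to the same ceiling. The following is a heuristic sketch only: the `looks_capped` helper, its tolerance, and the sample values are assumptions, and a plateau can also have other causes.

```python
def looks_capped(samples, tolerance=1.0, min_fraction=0.5):
    """True when at least `min_fraction` of the samples sit within
    `tolerance` of the highest observed value."""
    if not samples:
        return False
    peak = max(samples)
    near_peak = sum(1 for value in samples if peak - value <= tolerance)
    return near_peak / len(samples) >= min_fraction


# Hypothetical CPU usage samples in percent.
capped = [79.8, 80.0, 79.9, 80.0, 62.0, 80.0]
varied = [35.0, 80.0, 52.0, 41.0, 67.0, 48.0]

print(looks_capped(capped))  # True
print(looks_capped(varied))  # False
```

A positive result is a prompt to check the provider's documented limits for the instance type, not proof that a limit is in effect.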
| Technique | Purpose | When to Use |
|---|---|---|
| Log analysis | Identify error patterns | Always first step |
| Change investigation | Find what broke | When timing suggests recent change |
| Rollback | Restore service quickly | When safe rollback available |
| Server removal | Isolate faulty component | When specific server identified |
| New server deployment | Replace or scale | When infrastructure supports it |
| Resource monitoring | Detect limits and bottlenecks | Performance degradation |
Complex systems also require coordination between the people and teams responsible for each service.
Debugging complex systems requires applying single-computer troubleshooting principles at larger scale across multiple interconnected services. Essential techniques include systematic log analysis, identifying recent changes, rolling back suspicious changes even without certainty, removing faulty servers from pools to protect users, and leveraging automated deployment pipelines. Good logging, monitoring, version control, and quick deployment capabilities are foundational requirements. Cloud-based systems introduce additional considerations around resource limits that may artificially cap performance. Success requires combining technical investigation with effective communication and documentation.
Complex distributed systems present debugging challenges that extend beyond single-computer troubleshooting. Effective debugging requires understanding service dependencies, analyzing logs across multiple components, identifying changes through version control, implementing rollback strategies when possible, and managing infrastructure at scale. Modern cloud-based systems enable rapid deployment and scaling but introduce resource constraints that must be understood and managed. The next critical aspect of handling larger incidents involves communication and documentation to ensure team coordination and knowledge preservation.