Troubleshooting Slow Web Server

This document demonstrates practical troubleshooting of a slow web server using benchmarking tools, process monitoring, priority adjustment, and script optimization to identify and resolve CPU overload caused by parallel video transcoding processes.

This document walks through a real-world web server performance investigation, from initial user report to resolution. It demonstrates using Apache Benchmark for performance measurement, analyzing process loads with top, adjusting process priorities, investigating script automation, and implementing sequential processing to eliminate CPU overload.


Initial Problem Report and Verification

A user has reported that one of the web servers is slow. The investigation begins by loading the website in a browser. The page loads but seems slow, although that is hard to judge subjectively.

Baseline Performance Measurement

To quantify the slowness, the Apache Benchmark tool (ab) is used. This tool is useful for checking whether a website is behaving as expected: it sends a batch of requests and summarizes the results once it’s done.

Command Syntax:

ab -n 500 http://site.example.com/

This command sends 500 requests to the site. Many more options are available, such as controlling how many requests run concurrently or setting a timeout for requests that do not complete.

Making 500 requests allows for calculating an average response time, which provides a reliable baseline for determining if performance is actually degraded.
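
The concurrency and timeout options mentioned above could be combined roughly as follows. This is a sketch rather than the command used in the investigation: the wrapper name run_benchmark and its default values are assumptions chosen for illustration. The -n, -c, and -s flags are standard ab options, though -s is only available in ab 2.4.4 and later.

```shell
# Hypothetical wrapper around ab; the default values below are illustrative.
run_benchmark() {
  local url=$1
  local requests=${2:-500}     # -n: total number of requests
  local concurrency=${3:-10}   # -c: requests kept in flight at once
  local timeout=${4:-30}       # -s: seconds to wait on a socket before giving up
  ab -n "$requests" -c "$concurrency" -s "$timeout" "$url"
}

# Example call (not executed here):
#   run_benchmark http://site.example.com/ 500 10 30
```

Note that ab expects the URL to include a path, so the trailing slash after the hostname matters.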


Performance Analysis Results

After the tool finishes running the 500 requests, the data can be examined to determine if the server is actually slow.

Initial Benchmark Results:

| Metric | Value | Assessment |
| --- | --- | --- |
| Mean time per request | 155 milliseconds | Abnormally high for a simple website |
| Total requests | 500 | Completed successfully |
| Expected response time | < 50 milliseconds | Based on site complexity |

While 155 milliseconds is not a huge number, it’s definitely more than expected for such a simple website. Something is going on with the web server, and further investigation is needed.
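
When the benchmark is repeated several times, it can help to pull just the mean out of ab's report. The sketch below assumes the usual "Time per request: ... [ms] (mean)" summary line; the function name mean_ms is hypothetical, and the sample report is illustrative input built from the 155 ms figure above, not captured output.

```shell
# Print the overall mean time per request (ms) from an ab report read on stdin.
mean_ms() {
  # Match "Time per request: ... (mean)" but not the
  # "(mean, across all concurrent requests)" line that follows it.
  awk '/^Time per request:/ && /\(mean\)/ { print $4; exit }'
}

# Illustrative input shaped like ab's summary, using the 155 ms figure above.
sample_report='Time per request:       155.000 [ms] (mean)
Time per request:       155.000 [ms] (mean, across all concurrent requests)'

printf '%s\n' "$sample_report" | mean_ms
```

Against a live server the same function would be used as: ab -n 500 http://site.example.com/ | mean_ms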


System Resource Investigation

The next step is to connect to the web server and check what’s happening. The investigation starts by examining the output of the top command to identify suspicious activity.

Process Analysis with top

The top output reveals critical information:

Observed Issues:

| Observation | Details | Significance |
| --- | --- | --- |
| ffmpeg processes | Multiple instances running | Using all available CPU |
| Load average | ~30 | Severely overloaded system |
| CPU count | 2 processors | Normal load should be ≤ 2 |

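The load average on Linux reflects how many processes were running or waiting for processor time (Linux also counts processes blocked in uninterruptible I/O), averaged over the last 1, 5, and 15 minutes. On a single-processor machine, a value of one means the processor was busy the whole time. This computer has two processors, so any number above two means that it’s overloaded: on average, more processes needed processor time than the processors could provide. A load average of around 30 is far beyond that limit.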
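
The comparison described above can be scripted. This is a minimal sketch assuming a Linux system where /proc/loadavg and nproc are available; the function name is_overloaded is hypothetical.

```shell
# Succeeds (exit 0) when the load average exceeds the number of processors.
is_overloaded() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { exit !(load + 0 > cpus + 0) }'
}

# On Linux, the first field of /proc/loadavg is the 1-minute load average.
if [ -r /proc/loadavg ]; then
  load1=$(cut -d ' ' -f 1 /proc/loadavg)
  cpus=$(nproc)
  if is_overloaded "$load1" "$cpus"; then
    echo "overloaded: load $load1 on $cpus CPUs"
  else
    echo "ok: load $load1 on $cpus CPUs"
  fi
fi
```

With the values observed on this server, is_overloaded 30 2 succeeds, matching the conclusion that the machine is overloaded.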

Root Cause Identification

The ffmpeg program is used for video transcoding, which means converting files from one video format to another. This is a CPU-intensive process and seems like the likely culprit for the server being overloaded.


First Mitigation Attempt: Process Priority Adjustment

One approach to try is changing the process priorities so that the web server takes precedence. The process priorities in Linux are structured so that the lower the number, the higher the priority. The full range runs from -20 (highest priority) to 19 (lowest), but ordinarily only root can set negative values, so typical numbers go from 0 to 19. By default, processes start with a priority of zero.

Priority Management Commands

| Command | Purpose | Usage |
| --- | --- | --- |
| nice | Start a process with a different priority | nice -n 19 command |
| renice | Change the priority of a running process | renice 19 PID |
| pidof | Get process IDs by name | pidof process_name |
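
A quick way to see what nice does is to have it launch itself: run without a command, nice prints the priority it is currently running at. This is a small illustration, assuming the shell itself starts at the default priority of zero.

```shell
# With no command, nice prints the niceness it is running at.
nice

# Start a child at niceness 19; here the child is nice itself,
# so it reports the priority it was started with.
nice -n 19 nice
```

The same check applies to any command: whatever follows nice -n 19 runs at the lowest priority from the moment it starts.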

Automated Priority Adjustment

Rather than adjusting each process one by one (which would be tedious and error-prone), a short shell loop can automate this:

for pid in $(pidof ffmpeg); do renice 19 $pid; done

Script Breakdown:

  • pidof ffmpeg: Returns all process IDs that have the name ffmpeg
  • for pid in $(...): Iterates over each returned process ID
  • renice 19 $pid: Sets priority to 19 (lowest possible priority)

The priorities for those processes are successfully updated.

Re-testing After Priority Change

Running the benchmarking software again to check if priority adjustment made any difference:

Post-Renice Results:

| Metric | Before | After Priority Change | Change |
| --- | --- | --- | --- |
| Mean time per request | 155 ms | 153 ms | -2 ms (negligible) |

The renice didn’t help significantly. Apparently, the OS is still giving these ffmpeg processes way too much processor time. The website is still slow.


Alternative Approach: Sequential Processing

These transcoding processes are CPU intensive, and running them in parallel is overloading the computer. A better approach is to modify whatever’s triggering them to run one after the other instead of all at the same time.

Investigating Process Origins

To implement this change, the investigation needs to find out how these processes got started.

Examining process details with ps ax | less shows all running processes on the computer. Using less allows scrolling through the output.

Within less, use /ffmpeg to search. The results show multiple ffmpeg processes converting videos from webm format to mp4 format.

Since the location of these videos on the hard drive is unknown, the locate command can help:

locate static/001.webm

Result: The static directory is located in /server/deploy/videos/

Script Discovery and Analysis

Changing into the deploy directory and searching for the automation script:

cd /server/deploy/videos/
grep -r ffmpeg *

Search Results:

The deploy.sh file contains multiple mentions of ffmpeg.
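
When only the file names are of interest, grep's -l option trims the output to one line per matching file. The sketch below builds a throwaway directory so it can be run anywhere; the files in it are hypothetical stand-ins, not the real contents of the deploy directory.

```shell
# Build a throwaway tree so the example is self-contained; these files are
# hypothetical stand-ins, not the real contents of the deploy directory.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/static"
printf 'daemonize ffmpeg -i input.webm output.mp4\n' > "$demo_dir/deploy.sh"
printf 'placeholder\n' > "$demo_dir/static/001.webm"

# -r recurses into subdirectories; -l prints only the names of matching files.
matches=$(cd "$demo_dir" && grep -rl ffmpeg .)
echo "$matches"

rm -rf "$demo_dir"
```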

Examining the Problematic Script

Using vim (a command-line editor, since the connection is remote) to examine the file:

vim deploy.sh

Script Analysis:

The script starts ffmpeg processes in parallel using a tool called daemonize that runs each program separately as if it were a daemon. This might be okay for converting a couple of videos, but launching one separate process for each video in the static directory is overloading the server.

| Approach | Behavior | Impact |
| --- | --- | --- |
| Parallel (daemonize) | All videos convert simultaneously | CPU overload, slow web responses |
| Sequential | One video converts at a time | Manageable CPU load, server responsive |

Script Modification

The fix is to change the script to run only one video conversion process at a time. This is accomplished by deleting the daemonize prefix and keeping the ffmpeg call itself.

Before:

daemonize ffmpeg -i input.webm output.mp4

After:

ffmpeg -i input.webm output.mp4

The file is then saved and the editor closed.
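
The actual contents of deploy.sh are not shown in this document, so the following is only a sketch of what a sequential conversion loop might look like. The function name convert_sequentially and the TRANSCODER override are hypothetical, added so the loop can be dry-run without ffmpeg installed.

```shell
# Hypothetical sketch: the actual contents of deploy.sh are not shown here.
convert_sequentially() {
  local dir=$1
  local transcoder=${TRANSCODER:-ffmpeg}   # overridable so the loop can be dry-run
  local video
  for video in "$dir"/*.webm; do
    [ -e "$video" ] || continue            # skip the literal pattern when nothing matches
    # Runs in the foreground: the next iteration starts only after this one exits.
    "$transcoder" -i "$video" "${video%.webm}.mp4"
  done
}

# Dry run with echo standing in for ffmpeg:
#   TRANSCODER=echo convert_sequentially /server/deploy/videos/static
```

Because each ffmpeg call runs in the foreground, at most one conversion is active at any moment, which is the property the fix relies on.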


Managing Running Processes

The script has been modified, but this won’t change the processes that are already running. These processes need to be stopped, but not canceled completely, as doing so would mean that the videos being converted right now will be incomplete.

Pausing All ffmpeg Processes

The killall command with the -STOP flag sends a stop signal but doesn’t kill the processes completely:

killall -STOP ffmpeg

This suspends all running ffmpeg processes without terminating them.
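
The effect of the STOP and CONT signals can be observed on a harmless stand-in process. This sketch assumes Linux, since it reads the process state from /proc; the helper name proc_state is hypothetical, and sleep stands in for ffmpeg. A state of T means the process is stopped.

```shell
# Report the kernel's one-letter state for a PID (Linux-only: reads /proc).
proc_state() {
  awk '/^State:/ { print $2 }' "/proc/$1/status"
}

sleep 30 &                 # stand-in for a long-running ffmpeg process
demo_pid=$!

kill -STOP "$demo_pid"     # suspend it, as killall -STOP does by name
sleep 0.2                  # give the kernel a moment to apply the signal
state_stopped=$(proc_state "$demo_pid")

kill -CONT "$demo_pid"     # resume it exactly where it left off
sleep 0.2
state_resumed=$(proc_state "$demo_pid")

echo "after STOP: $state_stopped, after CONT: $state_resumed"
kill "$demo_pid"           # clean up the stand-in process
```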

Sequential Process Resumption

The goal is to run these processes one at a time. This could be done by sending the CONT signal to one process, waiting until it’s done, and then sending it to the next one. But that’s a lot of manual work that can be automated.

Automated Sequential Processing Script:

for pid in $(pidof ffmpeg); do
  while kill -CONT $pid 2>/dev/null; do
    sleep 1
  done
done

Script Logic Breakdown:

| Component | Purpose |
| --- | --- |
| for pid in $(pidof ffmpeg) | Iterate through all ffmpeg process IDs |
| while kill -CONT $pid 2>/dev/null | Send CONT signal; succeeds while process exists |
| sleep 1 | Wait one second before next check |
| Loop exit | When process finishes, kill command fails, exits while loop |

How It Works:

  1. Iterate through the list of processes using the same for loop with pidof command
  2. Inside the for loop, send the CONT signal and wait until the process is done
  3. The shell’s wait builtin only works for the shell’s own child processes, so a while loop is used to poll until the process is gone
  4. The while loop sends the CONT signal to the process, which succeeds as long as the process exists and fails once the process goes away
  5. Inside the while loop, sleep 1 waits one second until the next check
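
A possible variant of the same idea sends CONT only once and then polls with kill -0, which checks that a process exists without delivering any signal. This is a sketch, not the script used in the investigation: the function name resume_and_wait is hypothetical, and like the original loop it assumes the PID is not reused by another process while waiting.

```shell
# Resume one suspended process, then block until it has exited.
resume_and_wait() {
  local pid=$1
  local interval=${2:-1}
  kill -CONT "$pid" 2>/dev/null || return 0   # already gone: nothing to wait for
  # kill -0 sends no signal; it only tests whether the PID still exists.
  while kill -0 "$pid" 2>/dev/null; do
    sleep "$interval"
  done
}

# Applied to the suspended ffmpeg processes, one at a time:
#   for pid in $(pidof ffmpeg); do resume_and_wait "$pid"; done
```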

The server is now running one ffmpeg process at a time.


Final Performance Verification

Running the benchmark one more time to verify the fix:

ab -n 500 http://site.example.com/

Final Results:

| Metric | Initial | After Renice | After Sequential Processing | Improvement |
| --- | --- | --- | --- | --- |
| Mean time per request | 155 ms | 153 ms | 33 ms | 78.7% lower than initial |

The mean time is now 33 milliseconds, much lower than before. The web server is once again replying promptly to requests.
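
The improvement figure is the reduction in mean response time relative to the initial measurement: (155 - 33) / 155 ≈ 78.7%. The arithmetic can be checked with a small helper; the function name pct_reduction is hypothetical.

```shell
# Percentage reduction from a "before" value to an "after" value.
pct_reduction() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

pct_reduction 155 153   # effect of renice: prints 1.3
pct_reduction 155 33    # effect of sequential processing: prints 78.7
```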


Solution Summary

Several different approaches were demonstrated for situations where the code can’t be fixed:

| Approach | Implementation | Result |
| --- | --- | --- |
| Process priority adjustment (renice) | Change process priorities to favor web server | Minimal improvement (155 ms → 153 ms) |
| Sequential processing | Run CPU-intensive tasks one at a time | Significant improvement (155 ms → 33 ms) |

The key lesson is that when parallel CPU-intensive processes overload a system, sequential execution is often necessary. Process priority adjustment alone is insufficient when resource contention is severe.

Techniques Demonstrated:

  • Apache Benchmark (ab) for performance measurement
  • top command for process monitoring
  • nice/renice for priority management
  • ps command for detailed process information
  • locate for finding files by name
  • grep for searching files for the script that launches ffmpeg
  • vim for remote file editing
  • killall with signals for process control
  • Shell scripting for process automation

Conclusion

Troubleshooting the slow web server required systematic investigation from initial performance measurement through root cause identification to solution implementation. The Apache Benchmark tool provided quantifiable metrics showing 155ms average response time. Investigation with top revealed massive CPU overload from parallel ffmpeg video transcoding processes, with load averages around 30 on a 2-CPU system. The first mitigation attempt using renice to lower ffmpeg process priorities yielded minimal improvement. Deeper investigation using ps, locate, and grep revealed a deployment script launching all video conversions in parallel using daemonize. The solution involved modifying the script to remove parallelization, managing already-running processes with killall -STOP, and implementing a shell script to resume processes sequentially. Final benchmarking confirmed success with response times dropping to 33ms, a 78.7% improvement. This case study demonstrates that addressing root causes (parallel resource contention) is far more effective than symptomatic treatments (priority adjustment), and showcases essential Linux troubleshooting tools and techniques for performance optimization.


FAQ