Why Things Do Not Work

This document covers practical troubleshooting: gathering enough information to understand a problem, asking the essential diagnostic questions, isolating the cause through systematic elimination, and diagnosing server performance with Linux tools. A worked example of a failing internal website shows why a problem must be understood before a fix is attempted, and why assumptions are costly.


Gathering Sufficient Information

The first step in solving any problem is gathering enough information to understand the current state of things, which starts with identifying the actual issue that needs solving. Issues typically come to attention through ticketing systems, user reports, or direct encounters with the problem.

The “It Doesn’t Work” Problem

When working with users, problem reports frequently amount to simply stating “It doesn’t work.” While these reports lack detailed information, acknowledging and addressing reported problems remains essential. The usefulness of specific information varies depending on the problem, but certain fundamental questions apply universally to vague problem reports.


Essential Diagnostic Questions

When receiving a report that something doesn’t work, four critical questions help establish a foundation for troubleshooting:

Question | Purpose
What were you trying to do? | Identifies the intended action or goal
What steps did you follow? | Reveals the sequence of actions taken
What was the expected result? | Establishes normal behavior expectations
What was the actual result? | Documents the actual observed behavior

If the ticketing system permits, building these questions into the problem reporting form saves time and allows more specific follow-up questions right away. Otherwise, they become the opening questions of any troubleshooting conversation.


The Principle of Simplicity

When debugging, it pays to consider the simplest explanations first and to avoid jumping prematurely to complex or time-consuming solutions. This is why, when a device fails to turn on, checking that it is plugged in and receiving power comes before disassembling it or replacing it with a new unit.


Case Study: Sales Website Failure

Consider a scenario where a user reports that the internal website used by the sales team to track customer interactions doesn’t work. The user is under stress because they need the information for an imminent meeting.

Initial Information Gathering

After assuring the user of immediate attention, the four essential questions reveal the following:

Question | User Response
What were you trying to do? | Access the website
What steps did you follow? | Opened the website URL and entered credentials
What was the expected result? | See the sales system’s landing page
What was the actual result? | The web page keeps loading and stays blank indefinitely

This transforms the vague “it doesn’t work” into a specific symptom: “when attempting to log in, the page continues loading indefinitely and never displays the landing page.”

Root Cause Investigation Through Elimination

With a basic understanding of the problem, the process of elimination begins, starting with the simplest explanations and testing systematically until isolating the root cause.

Reproduction Testing

The first test involves attempting to reproduce the issue on another computer. Navigating to the website and entering credentials confirms the problem: the page continues loading without ever displaying the landing page. This simple action rules out the user or the user’s computer as the cause, immediately cutting the troubleshooting scope in half. The problem clearly resides with the service itself.

Network and Service Isolation

Before investigating the server hosting the application, quick checks verify whether the problem affects only the specific website or extends more broadly:

Testing external Internet access by loading an external website confirms that Internet connectivity functions correctly. Checking other internal websites reveals that the ticketing system loads without issues, but the inventory website also fails to complete loading. Further investigation shows both affected websites are hosted on the same server.

System Tested | Status | Implication
External website | Working | Internet connection functional
Ticketing system | Working | Internal network operational
Sales website | Failing | Problem isolated to specific server
Inventory website | Failing | Problem isolated to specific server

These quick verification checks help isolate the root cause. By examining simple explanations first, time is not wasted pursuing incorrect problems.
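The isolation checks above can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function name check_site and every URL shown are hypothetical placeholders, and the check only confirms that a URL answers within ten seconds, not that a login succeeds.

```shell
#!/bin/sh
# check_site NAME URL
# Prints "NAME: working" if URL answers with a successful HTTP status
# within 10 seconds, otherwise "NAME: failing".
# Hypothetical helper: it does not log in, so it only tests reachability.
check_site() {
    name=$1
    url=$2
    # --fail: treat HTTP error statuses (400 and above) as failures
    # --max-time 10: give up on pages that hang instead of waiting forever
    if curl --silent --fail --max-time 10 --output /dev/null "$url"; then
        echo "$name: working"
    else
        echo "$name: failing"
    fi
}

# Example calls (placeholder URLs, to be replaced with real ones):
# check_site "External website"  "https://www.example.com/"
# check_site "Ticketing system"  "https://tickets.example.com/"
# check_site "Sales website"     "https://sales.example.com/"
# check_site "Inventory website" "https://inventory.example.com/"
```

The timeout matters here because the reported symptom is a page that never finishes loading; without it, the check itself would hang.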


Server-Level Diagnosis

With the problem isolated to websites running on a specific server while other systems and Internet connectivity work correctly, the next step involves investigating what is occurring on that server.

SSH Connection and Performance Analysis

The server runs Linux, so the connection is made over SSH. Running the top command shows the overall state of the machine and the processes consuming the most CPU resources. The output shows that the machine is severely overloaded.

    # Connect to the server
    ssh server.example.com

    # Check system performance
    top

Understanding Load Average

The load average shown in the first line of the top output is 40. On Linux, the load average is the average number of processes that are running, waiting for a CPU, or blocked in uninterruptible I/O, reported over the last 1, 5, and 15 minutes. A value of 1 corresponds to one processor being continuously busy, so the number should normally not exceed the number of processors in the machine; a higher value indicates that the system is overloaded. With four cores, a load average of 40 represents severe overload.

Metric | Value | Normal Range | Status
Load average | 40 | ≤ 4 (number of cores) | Severely overloaded
CPU time | Mostly waiting | Balanced processing | Processes stuck in I/O
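The comparison between load average and core count can be sketched in shell. The function name load_status is a hypothetical helper, and the simple "greater than the core count" threshold is the rule of thumb described above rather than a precise measure of overload.

```shell
#!/bin/sh
# load_status LOAD CORES
# Prints "overloaded" if LOAD is greater than CORES, otherwise "ok".
# awk does the comparison because POSIX shell arithmetic is integer-only.
load_status() {
    awk -v load="$1" -v cores="$2" \
        'BEGIN { if (load + 0 > cores + 0) print "overloaded"; else print "ok" }'
}

# On a Linux server, the 1-minute load average is the first field of
# /proc/loadavg, and nproc prints the number of available processors:
# load_status "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```

With the values from this case study, load_status 40 4 prints overloaded.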

Identifying the Problem Process

The top output also shows that most CPU time is spent waiting (the wa value in the CPU summary line). This means processes are blocked waiting for the operating system to return from system calls, which typically happens when they are reading or writing data on the hard drive or over the network. The process list shows that the backup system is currently running on the server and using a significant share of processing time.
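One way to see which processes are blocked is to look at their state in the ps output, where a state beginning with D means uninterruptible sleep (usually waiting on I/O). The filter below is a minimal sketch, not the method used in the case study; the name filter_blocked is hypothetical.

```shell
#!/bin/sh
# filter_blocked
# Reads lines of "PID STAT COMMAND" on standard input and prints the PID and
# command of every process whose state starts with D (uninterruptible sleep,
# which usually means waiting on disk or network I/O).
filter_blocked() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# On the server (the trailing "=" signs suppress the ps header line):
# ps -eo pid=,stat=,comm= | filter_blocked
```

A single snapshot can miss short waits, so running the command a few times gives a more reliable picture.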


Immediate Remediation

While backing up system data is critically important, the backup is currently making the entire system unusable. The decision is made to suspend the backup process with the kill -STOP command, which pauses the program until it is explicitly continued (with kill -CONT) or terminated.

    # Suspend the backup process
    kill -STOP <backup_process_pid>

After suspending the backup process, running top again confirms the load is decreasing and processes are no longer stuck waiting for I/O operations. Testing website login now successfully loads the landing page.

The user receives notification that the website is accessible again. At this point, immediate remediation has been applied.
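A process suspended with SIGSTOP stays suspended until it receives SIGCONT, so the backup still has to be resumed at some point. The sketch below wraps both signals; the function names and the PID are hypothetical, and when to resume (for example, outside business hours) is an assumption rather than something the case study specifies.

```shell
#!/bin/sh
# suspend_proc PID: pause a process without terminating it (SIGSTOP).
suspend_proc() {
    kill -STOP "$1"
}

# resume_proc PID: let a previously suspended process continue (SIGCONT).
resume_proc() {
    kill -CONT "$1"
}

# Example with a placeholder PID:
# suspend_proc 12345    # the process stops running but keeps its state
# resume_proc 12345     # later, the process carries on where it left off
```

Suspending rather than terminating means the backup does not have to start over from the beginning.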


The Importance of Clear Problem Definition

Consider a scenario occurring the following week when another user reports that the sales website doesn’t work. Remembering the previous incident, the immediate response involves SSHing onto the server to find and stop the backup process. However, the backup process is not running.

This reveals a critical oversight: failing to ask the user what they meant by “doesn’t work.” When calling back to clarify, the user explains they are attempting to generate a monthly sales report and receive an error stating “the product category column doesn’t exist.” This represents an entirely different problem requiring completely different actions.

Assumption | Reality | Lesson
Same problem as before (loading issue) | Database schema error | Never assume based on previous incidents
Backup process causing overload | Missing database column | Always gather specific information first

Conclusion

Effective troubleshooting begins with gathering comprehensive information through systematic questioning. The four essential questions (what was attempted, what steps were followed, what was expected, and what actually occurred) turn vague reports into actionable problem descriptions. Applying the principle of simplicity and systematic elimination isolates root causes efficiently, and server diagnostics with tools like top provide critical insight into performance issues. Most importantly, clear communication and avoiding assumptions based on past incidents ensure that the solution applied matches the current problem. Every problem deserves fresh analysis before remediation begins.

