This document walks through practical troubleshooting: gathering enough information to understand a problem, isolating its cause through systematic elimination, and diagnosing server performance with Linux tools. Using real-world examples of failing websites, it emphasizes asking the essential diagnostic questions, avoiding assumptions, and understanding a problem before attempting a solution.
The first step to solving any problem involves gathering enough information to understand the current state of things. This requires identifying the actual issue that needs solving. Issues typically come to attention through ticketing systems, user reports, or direct encounters with problems.
When working with users, problem reports frequently amount to simply stating “It doesn’t work.” While these reports lack detailed information, acknowledging and addressing reported problems remains essential. The usefulness of specific information varies depending on the problem, but certain fundamental questions apply universally to vague problem reports.
When receiving a report that something doesn’t work, four critical questions help establish a foundation for troubleshooting:
| Question | Purpose |
|---|---|
| What were you trying to do? | Identifies the intended action or goal |
| What steps did you follow? | Reveals the sequence of actions taken |
| What was the expected result? | Establishes normal behavior expectations |
| What was the actual result? | Documents the actual observed behavior |
If the ticketing system permits, incorporating these questions into the problem reporting form saves time and enables more specific follow-up questions immediately. Otherwise, these questions invariably become the initial inquiry in any troubleshooting conversation.
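As a minimal sketch of building the four questions into a report form, the shell function below checks a plain-text report for the four answers and lists any that are missing. The field names (`goal`, `steps`, `expected`, `actual`) and the `key: value` file layout are assumptions made for illustration; they do not come from any particular ticketing system.

```shell
# Hypothetical report format: one "key: value" line per answer.
# The four keys are illustrative names for the four questions.

missing_answers() {
    # $1: path to a report file.
    # Prints each required key that has no non-empty answer.
    local report="$1" key
    for key in goal steps expected actual; do
        # A key counts as answered only if some text follows the colon.
        if ! grep -Eq "^${key}:[[:space:]]*[^[:space:]]" "$report"; then
            echo "$key"
        fi
    done
}
```

A form handler could call `missing_answers report.txt` and ask the user only for the keys it prints, which keeps the follow-up conversation short.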
When debugging, consider the simplest explanations first rather than jumping to complex or time-consuming solutions. This is why, when a device fails to turn on, checking that it is plugged in and receiving power comes before disassembling it or replacing it with a new unit.
Important
Always start with the simplest possible explanations and test them systematically before moving to more complex hypotheses. This approach saves time and prevents chasing incorrect solutions.
Consider a scenario where a user reports that the internal website used by the sales team to track customer interactions doesn’t work. The user is stressed because they need the information for an imminent meeting.
After assuring the user of immediate attention, the four essential questions reveal the following:
| Question | User Response |
|---|---|
| What were you trying to do? | Access the website |
| What steps did you follow? | Opened the website URL and entered credentials |
| What was the expected result? | See the sales system’s landing page |
| What was the actual result? | The web page keeps loading and stays blank indefinitely |
This transforms the vague “it doesn’t work” into a specific symptom: “when attempting to log in, the page continues loading indefinitely and never displays the landing page.”
With a basic understanding of the problem, the process of elimination begins, starting with the simplest explanations and testing systematically until isolating the root cause.
The first test involves attempting to reproduce the issue on another computer. Navigating to the website and entering credentials confirms the problem: the page continues loading without ever displaying the landing page. This simple action rules out the user or the user’s computer as the cause, immediately cutting the troubleshooting scope in half. The problem clearly resides with the service itself.
Before investigating the server hosting the application, quick checks verify whether the problem affects only the specific website or extends more broadly:
Testing external Internet access by loading an external website confirms that Internet connectivity functions correctly. Checking other internal websites reveals that the ticketing system loads without issues, but the inventory website also fails to complete loading. Further investigation shows both affected websites are hosted on the same server.
| System Tested | Status | Implication |
|---|---|---|
| External website | Working | Internet connection functional |
| Ticketing system | Working | Internal network operational |
| Sales website | Failing | Problem isolated to specific server |
| Inventory website | Failing | Problem isolated to specific server |
These quick verification checks narrow down the root cause. Examining the simple explanations first avoids wasting time on the wrong problem.
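Checks like these can also be scripted. The sketch below probes a list of sites with `curl` under a time limit, so that a page that never finishes loading is reported as a timeout instead of hanging the check. The function names and the 10-second limit are assumptions for illustration, and any URLs passed in are placeholders for the real sites.

```shell
# Sketch: probe several sites with a time limit and label each result.

classify_curl_exit() {
    # Map a curl exit status to a short label.
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "dns-failure" ;;      # could not resolve host
        7)  echo "connect-failure" ;;  # failed to connect to host
        28) echo "timeout" ;;          # operation timed out
        *)  echo "error($1)" ;;
    esac
}

probe_sites() {
    # Usage: probe_sites URL...
    local url status
    for url in "$@"; do
        status=0
        # --max-time bounds the whole transfer; --fail makes HTTP errors
        # (status 400 and above) produce a non-zero exit status.
        curl --silent --fail --max-time 10 --output /dev/null "$url" || status=$?
        printf '%s\t%s\n' "$url" "$(classify_curl_exit "$status")"
    done
}
```

Calling `probe_sites` with one external address and each internal site prints one line per site, which makes it easy to see whether the failures cluster on a single server.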
With the problem isolated to websites running on a specific server while other systems and Internet connectivity work correctly, the next step involves investigating what is occurring on that server.
The server runs Linux, so the investigation starts by connecting to it over SSH. Running the top command shows the overall state of the machine and the processes consuming the most CPU. The output shows that the machine is severely overloaded.
```shell
# Connect to the server
ssh server.example.com

# Check system performance
top
```
The load average on the first line of the output is 40. On Linux, the load average is the average number of processes that are running, waiting for a CPU, or blocked in uninterruptible sleep (typically waiting on disk I/O), reported over the last 1, 5, and 15 minutes. A value of 1 roughly corresponds to one processor being continuously busy. As a rule of thumb, the number should not stay above the number of processors in the machine; a higher value indicates overload. With four cores, a load average of 40 represents severe overload.
| Metric | Value | Normal Range | Status |
|---|---|---|---|
| Load Average | 40 | ≤ 4 (number of cores) | Severely overloaded |
| CPU time | Mostly I/O wait | Mostly useful work, little waiting | Processes stuck waiting on I/O |
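The comparison between load average and core count can be sketched as a small script. It reads the 1-minute figure from `/proc/loadavg` (Linux-specific) and the core count from `nproc`. The function names and the simple "load greater than cores" threshold are assumptions that mirror the rule of thumb in the text, not a definitive health check.

```shell
# Sketch: compare the 1-minute load average with the number of CPU cores.

load_status() {
    # $1: load average (may be fractional), $2: core count.
    # awk does the floating-point comparison that shell arithmetic cannot.
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load + 0 > cores + 0) print "overloaded"; else print "ok"
    }'
}

current_load_status() {
    # /proc/loadavg starts with the 1-, 5-, and 15-minute averages.
    local load cores
    read -r load _ < /proc/loadavg
    cores=$(nproc)
    printf 'load=%s cores=%s status=%s\n' \
        "$load" "$cores" "$(load_status "$load" "$cores")"
}
```

With the figures from the example, `load_status 40 4` reports `overloaded`.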
The output also shows that most CPU time is spent waiting on I/O. This means processes are blocked waiting for system calls to return, which typically happens when they are reading or writing data on the hard drive or the network. The process list shows that the backup system is currently running on the server and consuming significant processing time.
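One way to see which processes are blocked is to list those in uninterruptible sleep, which `ps` reports with a state code starting with `D`. The sketch below is an illustration under that assumption; the function names are made up for this example, and a process showing up here is a hint to investigate, not proof that it is the cause.

```shell
# Sketch: list processes in uninterruptible sleep (state "D"),
# which usually means they are blocked waiting on I/O.

filter_uninterruptible() {
    # Reads "PID STAT COMMAND" lines on stdin and
    # prints only the rows whose STAT field starts with D.
    awk '$2 ~ /^D/ { print }'
}

list_blocked_processes() {
    # A trailing "=" on every column suppresses the header line.
    # Separate -o options are used because combining renamed
    # columns in one -o argument is not portable.
    ps -e -o pid= -o stat= -o comm= | filter_uninterruptible
}
```

Running `list_blocked_processes` a few times in a row helps distinguish processes that are briefly waiting from ones that stay blocked.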
Backing up system data is critically important, but in its current state the entire system is unusable. The backup is therefore suspended with the kill -STOP command, which pauses the process until it is explicitly continued (with kill -CONT) or terminated.
```shell
# Suspend the backup process
kill -STOP <backup_process_pid>
```
After suspending the backup process, running top again confirms the load is decreasing and processes are no longer stuck waiting for I/O operations. Testing website login now successfully loads the landing page.
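Because a suspended process stays paused until it is resumed, it helps to keep the counterpart command at hand. The sketch below wraps both signals and adds a state check; the function names are illustrative, and the PID passed in stands for whatever process is being paused.

```shell
# Sketch: suspend a process with SIGSTOP and resume it later with SIGCONT.

suspend_process() {
    # $1: PID to suspend. SIGSTOP cannot be caught or ignored by the target.
    kill -STOP "$1"
}

resume_process() {
    # $1: PID to resume; execution continues from where it stopped.
    kill -CONT "$1"
}

process_state() {
    # Prints the ps state code for a PID; a leading "T" means stopped.
    ps -o stat= -p "$1" | tr -d ' '
}
```

Once the load has dropped and a quieter time is available, `resume_process` with the same PID lets the backup finish instead of leaving it paused indefinitely.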
The user receives notification that the website is accessible again. At this point, immediate remediation has been applied.
Note
Immediate remediation addresses urgent symptoms to restore service. Long-term remediation to prevent recurrence requires additional analysis and planning.
Now consider the following week, when another user reports that the sales website doesn’t work. Remembering the previous incident, the immediate response is to connect to the server over SSH to find and stop the backup process. However, the backup process is not running.
This reveals a critical oversight: failing to ask the user what they meant by “doesn’t work.” When calling back to clarify, the user explains they are attempting to generate a monthly sales report and receive an error stating “the product category column doesn’t exist.” This represents an entirely different problem requiring completely different actions.
| Assumption | Reality | Lesson |
|---|---|---|
| Same problem as before (loading issue) | Database schema error | Never assume based on previous incidents |
| Backup process causing overload | Missing database column | Always gather specific information first |
Caution
Never assume the nature of a problem based on previous incidents. Always obtain a clear, detailed picture of the current issue before attempting solutions.
Effective troubleshooting begins with gathering comprehensive information through systematic questioning. The four essential questions—what was attempted, what steps were followed, what was expected, and what actually occurred—transform vague reports into actionable problem descriptions. Applying the principle of simplicity and systematic elimination isolates root causes efficiently. Server diagnostics using tools like top provide critical insights into performance issues. Most importantly, maintaining clear communication and avoiding assumptions based on past incidents ensures appropriate solutions are applied to current problems. Every problem deserves fresh analysis to understand its unique characteristics before implementing remediation strategies.