Course 4 - Troubleshooting and Debugging Techniques

This course teaches you how to troubleshoot and debug common IT issues using Python.

In this section

  • Module 1
    • Troubleshooting Concepts
      This document introduces essential debugging techniques and problem-solving processes for IT specialists. It covers systematic approaches to troubleshooting technical issues, understanding root causes, and applying methods like binary search to resolve problems efficiently.
    • Debugging and Troubleshooting
      This document explains the distinction between debugging and troubleshooting, explores essential tools for analyzing system behavior and application code, and demonstrates techniques for identifying and resolving technical problems in IT environments.
    • Problem Solving Steps
      This document outlines a systematic three-step approach to solving technical problems: gathering information, finding root causes, and performing remediation. It emphasizes documentation practices and demonstrates these steps through a practical computer overheating scenario.
    • Troubleshooting Example
      This document demonstrates a practical troubleshooting workflow using strace to diagnose an application failure. It walks through information gathering, system call analysis, root cause identification, and implementation of both immediate and long-term remediation strategies.
    • Why Things Do Not Work
      This document explores effective information gathering techniques for troubleshooting, demonstrates systematic problem isolation through elimination, and illustrates server performance diagnosis using Linux tools. It emphasizes asking critical questions and avoiding assumptions when diagnosing issues.
    • Reproducing the Problem
      This document explains how to create clear reproduction cases for debugging tricky issues, explores system logs across different operating systems, and demonstrates techniques for isolating problem conditions through systematic testing and environmental analysis.
    • Finding the Root Cause
      This document explains the iterative hypothesis-testing cycle for identifying root causes, demonstrates using test environments for safe troubleshooting, and explores diagnostic tools like iotop, iftop, and resource limiting commands to investigate and resolve server performance issues.
    • Intermittent Issues
      This document addresses the challenges of debugging intermittent problems that occur sporadically. It covers logging strategies, debugging modes, environmental monitoring, Heisenbugs, resource management issues, and the underlying causes of problems resolved by system restarts.
    • Intermittent Failing Script
      This document presents a real-world case study of debugging a date formatting issue in a meeting reminder application. It demonstrates reproducing problems, isolating faulty parameters, adding debug output, identifying root causes, and implementing fixes that work across different locale settings.
    • Binary Search
      This document explains linear and binary search algorithms for finding elements in lists, compares their efficiency using time complexity analysis, and demonstrates how binary search cuts the number of comparisons from thousands to a logarithmic count when working with sorted data structures.
    • Applying Binary Search in Troubleshooting
      This document explains how to apply the binary search algorithm to troubleshooting scenarios by bisecting problem spaces, reducing potential causes by half with each iteration, and efficiently identifying root causes in configuration files, code commits, browser extensions, and system components through systematic elimination.
    • Finding Invalid Data
      This document demonstrates a practical bisecting troubleshooting example where a CSV import script fails due to corrupt data. It shows how to use Unix command-line tools like head, tail, and wc to systematically divide a 100-line file and identify the specific malformed record causing import errors.
  • Module 2
    • Understanding System Slowness
      This document explores the concept of system slowness in IT environments, examining why computers, scripts, and complex systems experience performance degradation. It covers resource limitations, the relative nature of speed expectations, and introduces strategies for identifying and addressing common causes of slowness through systematic resource management and optimization techniques.
    • Reasons for Slowness
      This document examines the fundamental causes of system slowness, including CPU time constraints, resource bottlenecks, and hardware limitations. It covers systematic approaches to diagnosing performance issues through resource monitoring tools on Linux, macOS, and Windows, identifying exhausted resources, and determining whether solutions require process management, hardware upgrades, or software optimization.
    • How Computer Uses Resources
      This document explains how computers utilize different resources like CPU, RAM, disk, and network, including data access speeds, caching strategies, and memory management techniques such as swapping.
    • Causes of System Slowness
      This document explores common causes of computer slowness, including startup issues, memory leaks, large files, network file systems, hardware failures, and malicious software, with diagnostic strategies and solutions.
    • Troubleshooting Slow Web Server
      This document demonstrates practical troubleshooting of a slow web server using benchmarking tools, process monitoring, priority adjustment, and script optimization to identify and resolve CPU overload caused by parallel video transcoding processes.
    • System Performance Monitoring Tools
      This document provides a comprehensive overview of performance monitoring tools across Windows, Linux, and macOS platforms, including Process Monitor, Activity Monitor, Performance Monitor, and specialized methodologies like the USE Method.
    • Writing Efficient Code
      This document explores principles of code efficiency, including when to optimize, cost-benefit analysis of performance improvements, profiling tools, and strategies for reducing expensive operations through caching and proper data structures.
    • Choosing the Right Data Structure
      This document examines how choosing appropriate data structures impacts performance, comparing lists and dictionaries in Python and their equivalents across programming languages, with guidance on when to use each structure and avoiding expensive operations.
    • Optimizing Expensive Loops
      This document covers strategies for optimizing loop performance, including moving expensive operations outside loops, limiting iteration scope, using early break statements, and scaling optimization efforts appropriately based on data size.
    • Keeping Local Results and Caching
      This document explores caching strategies for performance optimization including when to create caches, managing cache freshness, validation techniques, appropriate cache lifetimes, and implementing simple to complex caching patterns to avoid expensive repeated operations.
    • Profiling and Optimizing Slow Scripts
      This document demonstrates practical profiling and optimization techniques using a real-world email reminder script. It covers measuring execution time with the time command, using pprofile and kcachegrind for performance analysis, identifying expensive operations in loops, and optimizing code by replacing repeated file operations with dictionary-based caching.
    • Parallelizing Operations for Performance
      This document explores concurrency and parallel execution techniques to improve script performance. It covers operating system process management, splitting work across processes and threads, understanding I/O-bound versus CPU-bound operations, and finding the optimal balance of parallel tasks to maximize resource utilization without system degradation.
    • Evolving Solutions for Growing Systems
      This document examines how solutions must evolve as systems grow from simple scripts to complex distributed applications. It demonstrates technology progression through a Secret Santa example, starting with CSV files, advancing through SQLite and database servers, adding caching layers, and ultimately scaling to cloud-based distributed architectures with load balancing.
    • Dealing with Complex Growing Systems
      This document examines performance troubleshooting in large-scale distributed systems with multiple interconnected components. It covers identifying bottlenecks through monitoring infrastructure, optimizing database operations with proper indexing, implementing caching and distribution strategies, addressing CPU saturation, and simplifying unnecessarily complex architectures.
    • Using Threads to Improve Performance
      This document demonstrates practical implementation of threading and multiprocessing in Python to optimize image processing performance. It walks through converting a sequential thumbnail generation script to use ThreadPoolExecutor and ProcessPoolExecutor, comparing their performance characteristics and explaining the differences caused by Python's Global Interpreter Lock.
    • Concurrency and Parallelism in Python
      This document explores concurrency and parallelism strategies in Python for optimizing complex systems. It covers threading and asyncio for I/O-bound tasks, multiprocessing for CPU-bound operations, and techniques for combining both approaches to create efficient, responsive applications with optimal resource utilization.
  • Module 3
    • Crashing Programs
      Learn how to troubleshoot and debug crashing programs effectively, including monitoring strategies, bug reporting, and long-term fixes.
    • System Crash
      This document describes steps to diagnose and resolve system crashes, covering hardware checks, OS and application troubleshooting, and remediation planning. Focus is on isolating root causes and selecting efficient fixes.
    • Understanding Crash Application
      This document summarizes techniques to analyze application crashes using logs, tracing tools, change analysis, and minimal reproduction cases. Emphasis is on isolating root causes and collecting evidence for remediation or reporting.
    • Fixing Program
      This document outlines practical workarounds for fixing crashing applications when source code cannot be modified, including data pre-processing, compatibility wrappers, isolation, and watchdog strategies. Focus is on restoring service and producing high-quality bug reports.
    • Internal Server Error
      This document demonstrates debugging a web server returning HTTP 500 errors by investigating logs, configuration files, process information, and file permissions. Focus is on systematic investigation and root cause identification.
    • Resources For Understanding Crashes
      This document provides resources and tools for understanding computer crashes including hardware failures, OS errors, and software deficiencies. Coverage includes BSoD, system logs, Process Monitor, strace, and system call tracing across platforms.
    • Invalid Memory
      This document explains invalid memory access errors, including segmentation faults, memory management in operating systems, debugging techniques with symbols, and tools like valgrind for detection. Coverage includes common programming errors and remediation strategies.
    • Unhandled Errors
      This document explains unhandled errors and exceptions in high-level languages like Python, covering error types, tracebacks, debugging techniques, logging strategies, and making programs resilient. Focus is on proper error handling and user-friendly failure modes.
    • Working with Someone Else's Code
      This document covers strategies for understanding and fixing problems in code written by others, including reading comments and tests, navigating large codebases, and practicing with open-source projects. Essential skills for maintaining unfamiliar code.
    • Debugging Segmentation Faults
      This document demonstrates debugging segmentation faults using core files and GDB, covering commands like backtrace, up, list, and print to analyze crashes and identify off-by-one errors. Practical walkthrough of C program debugging.
  • Debugging A Python Crash
    • Python Crash Debugging
      This document demonstrates debugging Python exceptions using the PDB debugger, covering traceback analysis, KeyError investigation, and fixing UTF-8 BOM encoding issues in CSV files. Practical case study of database import script debugging.
    • Debug With Print
      This document covers debugging Python programs using print statements, including strategies for variable inspection, execution flow tracking, formatted output techniques, and best practices for effective printf debugging. Simple yet powerful debugging technique.
    • Debug With Assert
      This document covers debugging Python programs using assert statements including assertion syntax, sanity checks, precondition validation, and best practices for catching bugs early in development. Proactive bug detection technique.
    • Debug With Try-Except
      This document covers debugging Python programs using try-except blocks for exception handling, including catching specific exceptions, custom exceptions, finally clauses, and best practices for graceful error handling. Essential exception handling technique.
    • Debug With Logging Module
      This document covers debugging Python programs using the logging module including log levels, configuration, file output, custom formatters, and best practices for production-grade logging. Professional debugging and monitoring technique.
    • Debug With PDB
      This document covers debugging Python programs using the PDB interactive debugger, including setting breakpoints, stepping through code, inspecting and modifying variables, and post-mortem debugging. Python's built-in interactive debugger.
    • Other Debugging Techniques
      This document covers additional debugging techniques, including IDE breakpoints, Visual Studio Code debugger features, conditional breakpoints, variable inspection, and comparing IDE debugging with command-line approaches. IDE-based debugging strategies.
    • AI-Infused Debugging
      This document covers AI-infused debugging and paired programming techniques, including AI copilot tools like Google Gemini, GitHub Copilot, and ChatGPT, collaborative debugging workflows, paired programming practices, and best practices for using AI assistants. AI-powered development assistance.
    • Debugging Complex Systems
      This document covers debugging techniques for complex multi-service systems including log analysis across distributed services, identifying service dependencies, rollback strategies, load balancer troubleshooting, and infrastructure management for cloud-based applications. Distributed system debugging strategies.
    • Communication and Documentation
      This document covers communication and documentation strategies during incident response, including tracking troubleshooting activities, communicating with affected users, coordinating team roles like incident commander and communications lead, and creating effective post-incident summaries. Incident management best practices.
    • Postmortems
      This document covers postmortem documentation for incident response, including purpose, structure, essential components like root cause and prevention measures, focusing on learning rather than blame, and practicing postmortem writing for continuous improvement. Learning from incidents through documentation.
  • Module 5
    • Managing Resources
      This document introduces resource management in IT systems, covering how to identify and optimize memory, disk, and network usage. It explores strategies for decluttering system resources and prioritizing work to maximize efficiency and prevent future problems through proactive troubleshooting approaches.
    • How to Prevent Memory Leaks
      This document examines memory leaks in applications, covering how unreleased memory chunks cause system performance issues. It explores memory management in C/C++ versus garbage-collected languages, profiling tools like Valgrind for detecting leaks, and strategies for identifying and resolving memory consumption problems before they exhaust system resources.
    • Managing Disk Space
      This document addresses disk space management challenges in IT systems, covering common causes of disk exhaustion from logs to temporary files. It explores diagnostic techniques for identifying space usage patterns, handling deleted but open files, and implementing preventive strategies to avoid disk-related performance degradation and data loss.
    • Network Saturation
      This document explores network performance issues through latency and bandwidth concepts, covering how physical distance and connection capacity affect data transmission. It examines traffic prioritization using traffic shaping, connection limits, and diagnostic tools like iftop for identifying bandwidth consumption patterns across network services.
    • Dealing With Memory Leaks
      This document demonstrates practical memory leak diagnosis and resolution through real-world examples using Python memory profilers. It covers identifying memory consumption patterns in applications, analyzing memory usage with tools like memory_profiler, and fixing code that unnecessarily retains data in memory causing resource exhaustion.
    • Important Tasks
      This document presents the Eisenhower Decision Matrix framework for prioritizing IT tasks by urgency and importance, optimizing time allocation between immediate incidents and long-term planning. It covers managing technical debt, handling interruptions strategically, and ensuring focus time for complex problem-solving and infrastructure improvements that prevent future issues.
    • Prioritizing Tasks
      This document outlines practical task prioritization strategies for IT professionals managing overwhelming workloads, including creating comprehensive task lists, assessing urgency and importance, sizing work effort, timing complex tasks around interruption patterns, and communicating capacity limits when workload exceeds available time through team collaboration or expectation management.
    • Estimating Time
      This document addresses the challenge of accurate time estimation for IT projects and tasks, covering common optimistic biases, comparison-based estimation techniques, task decomposition strategies, integration overhead factors, experience-based multipliers, and documentation practices to improve future estimates through retrospective analysis and stakeholder communication.
    • Communicating With Users
      This document covers effective user communication strategies during incident response, managing expectations, prioritizing work, using ticket tracking systems, and implementing practical time-saving measures.
    • Dealing With Hard Problems
      This document covers strategies for approaching difficult debugging challenges, managing complexity through simplicity, staying calm when stuck, leveraging collaboration techniques like rubber duck debugging, and balancing short-term fixes with long-term solutions.
    • Proactive Practices
      This document explains proactive practices for preventing incidents: testing, canary deployments, centralized logging, monitoring, ticket automation, documentation, and capacity planning.
    • Planning Future Resources Usage
      This document explains how to forecast, plan, and provision compute, storage, and network resources, and when to consider cloud migration or cleanup strategies.
    • Monitoring and Long-Term Solutions
      This document covers the importance of monitoring systems, alerting strategies, bug reporting best practices, and long-term solution design to prevent recurring issues and maintain system reliability.