This document examines concurrency and parallel execution as performance optimization strategies. It explains how operating systems manage processes, demonstrates techniques for splitting work across multiple processes and threads, distinguishes between I/O-bound and CPU-bound operations, and provides guidance on balancing parallelism to maximize throughput without overwhelming system resources.
Reading information from disk or transferring data over the network is a slow operation. In typical scripts, while this operation is ongoing, nothing else happens. The script is blocked, waiting for input or output while the CPU sits idle.
| Operation Type | Typical Speed | CPU State During Operation | Opportunity |
|---|---|---|---|
| Disk I/O | Milliseconds to seconds | Idle, waiting | Can perform other work |
| Network I/O | Milliseconds to seconds | Idle, waiting | Can perform other work |
| CPU computation | Microseconds to milliseconds | Active, processing | Limited parallelism opportunity |
One way to improve performance is to do operations in parallel. That way, while the computer is waiting for slow I/O, other work can take place.
The tricky part is dividing up the tasks so that the same result is achieved in the end. Concurrency, the study of how to write programs that perform operations in parallel, is an entire field of computer science.
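A minimal sketch of this idea, using `time.sleep` as a stand-in for a slow disk or network call. The function names and the delay value are illustrative assumptions, not part of any real workload. Both versions return the same results; the concurrent one simply overlaps the waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_io(item, delay=0.05):
    """Pretend to wait on a disk or network call, then return a result."""
    time.sleep(delay)  # The CPU is idle while this "I/O" is in flight
    return item * 2


def run_sequential(items):
    # Each call waits for the previous one to finish
    return [slow_io(item) for item in items]


def run_concurrent(items, workers=4):
    # The waits overlap because the calls run in separate threads;
    # executor.map still returns results in input order
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(slow_io, items))


if __name__ == "__main__":
    items = [1, 2, 3, 4]
    for runner in (run_sequential, run_concurrent):
        start = time.perf_counter()
        results = runner(items)
        elapsed = time.perf_counter() - start
        print(f"{runner.__name__}: {results} in {elapsed:.2f}s")
```

On most machines the concurrent run should finish in roughly the time of one simulated call rather than four, although exact timings vary.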
Note
Concurrency is a vast and complex field in computer science. This document provides a practical overview of techniques applicable to everyday scripting and automation tasks, rather than comprehensive theoretical coverage.
The operating system handles the many processes that run on a computer, providing fundamental concurrency capabilities that can be leveraged for parallel execution.
OS Process Management Capabilities:
| Capability | Function | Benefit |
|---|---|---|
| Multi-core scheduling | Assigns processes to different CPU cores | True parallel execution |
| Time-slicing | Switches between processes on same core | Concurrent execution appearance |
| Memory isolation | Each process has own memory allocation | Process independence and safety |
| I/O management | Handles independent I/O calls per process | Non-blocking I/O across processes |
If a computer has more than one core, the operating system decides which processes get executed on which core. Processes on different cores run truly in parallel, while processes that share a core are interleaved through time-slicing, so all of them make progress at what appears to be the same time.
A very easy way to run operations in parallel is to split them across different processes, calling a script many times each with a different input set, and letting the operating system handle the concurrency.
Consider a scenario where statistics need to be collected on the current load and memory usage for all computers in a network.
Sequential approach:
| Approach | Execution Pattern | Total Time |
|---|---|---|
| Single process | Connect to computer 1, wait → connect to computer 2, wait → … | Sum of all connection times |
| Serial execution | No parallelism | Maximum delay |
A script can be written that connects to each computer in a list and gets the stats. Each connection takes a while to complete, so the total runtime of the script would be the sum of the time taken by all those connections.
Parallel approach:
| Approach | Execution Pattern | Total Time |
|---|---|---|
| Multiple processes | Split into groups, run simultaneously | Approximately time for one group |
| Parallel execution | OS manages concurrency | Minimized idle time |
Instead, the list of computers could be split into smaller groups and the script launched once per group, leaving the OS to run the instances side by side. That way, the connections to the different computers can be started in parallel, which minimizes the time when the CPU isn’t doing anything.
Implementation comparison:
    # Sequential execution - slow
    python3 get_stats.py computer1 computer2 computer3 ... computer100

    # Parallel execution - fast
    # (computer1-computer25 is shorthand for the host names in that group)
    python3 get_stats.py computer1-computer25 &
    python3 get_stats.py computer26-computer50 &
    python3 get_stats.py computer51-computer75 &
    python3 get_stats.py computer76-computer100 &
    wait
Important
This approach is easy to implement, and for many scripts it will be the right choice. Process-level parallelism requires minimal code changes while providing significant performance benefits.
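The same splitting can be driven from Python rather than the shell. The sketch below is one possible approach, not a prescribed one: the helper names `split_into_groups` and `run_in_groups` are hypothetical, and the demo uses a stand-in command instead of a real get_stats.py so that it runs anywhere.

```python
import subprocess
import sys


def split_into_groups(items, group_count):
    """Split items into at most group_count contiguous, near-equal groups."""
    size, remainder = divmod(len(items), group_count)
    groups, start = [], 0
    for index in range(group_count):
        end = start + size + (1 if index < remainder else 0)
        if end > start:  # Skip empty groups when there are few items
            groups.append(items[start:end])
        start = end
    return groups


def run_in_groups(command, hosts, group_count):
    """Start one process per group, then wait for all of them."""
    processes = [
        subprocess.Popen(command + group)  # Returns immediately, like `&`
        for group in split_into_groups(hosts, group_count)
    ]
    return [process.wait() for process in processes]  # Like shell `wait`


if __name__ == "__main__":
    hosts = [f"computer{n}" for n in range(1, 101)]
    # Stand-in command that only prints how many hosts it was given;
    # a real run would use [sys.executable, "get_stats.py"] instead.
    command = [sys.executable, "-c", "import sys; print(len(sys.argv) - 1)"]
    print(run_in_groups(command, hosts, group_count=4))
```

`run_in_groups` returns the exit code of each process, which gives the caller a simple way to notice that one group failed.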
Another easy optimization is to have a good balance of different workloads running on a computer.
Workload balance strategy:
| Process Type | Primary Resource | Impact on Others |
|---|---|---|
| CPU-intensive | CPU utilization | Low impact on I/O processes |
| Network I/O-intensive | Network bandwidth | Low impact on CPU/disk processes |
| Disk I/O-intensive | Disk I/O | Low impact on CPU/network processes |
If one process is using a lot of CPU while a different process is using a lot of network I/O and another process is using a lot of disk I/O, these can all run in parallel without interfering with each other.
Optimal workload mix:
    Process A: Heavy CPU computation (image processing)
    Process B: Network data transfer (downloading files)
    Process C: Disk operations (database indexing)

    Result: All three processes run efficiently in parallel
    Each uses different resources, minimizing contention
When using the OS to split work into processes, these processes don’t share any memory. Sometimes there might be a need to have some shared data between parallel tasks.
In that case, threads can be used. Threads are parallel tasks that run inside a single process, so they can share memory with the other threads in that process.
Process vs Thread comparison:
| Aspect | Process | Thread |
|---|---|---|
| Memory | Isolated, no sharing | Shared within same process |
| Management | OS-managed | Application-managed |
| Overhead | Higher (separate memory space) | Lower (shared memory space) |
| Communication | IPC mechanisms required | Direct memory access |
| Isolation | Complete isolation | Partial isolation |
| Creation cost | Expensive | Relatively cheap |
Unlike separate processes, which the OS can launch and schedule without any change to the script, threads must be created and coordinated by the code itself.
To implement threading, check what the programming language in use provides. In Python, the threading module creates and manages threads, while the asyncio module offers an alternative that interleaves many I/O-bound tasks on a single thread.
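To illustrate threads sharing memory inside one process, here is a small sketch built on Python's `threading` module. The word-counting task and the function names are made up for the example. The `Lock` is there because several threads update the same dictionary.

```python
import threading


def count_words(lines, totals, lock):
    """Count the words in lines and add them to the shared totals dict."""
    local_count = sum(len(line.split()) for line in lines)
    with lock:  # Only one thread may update the shared dict at a time
        totals["words"] += local_count


def count_in_threads(chunks):
    totals = {"words": 0}  # Shared by every thread in this process
    lock = threading.Lock()
    threads = [
        threading.Thread(target=count_words, args=(chunk, totals, lock))
        for chunk in chunks
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # Wait until every thread has finished
    return totals["words"]


if __name__ == "__main__":
    chunks = [["a b c", "d e"], ["f g"], ["h"]]
    print(count_in_threads(chunks))  # prints 8
```

Each thread does its counting on local data and takes the lock only for the brief shared update, which keeps contention low.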
Python concurrency modules:
| Module | Approach | Best For | Complexity |
|---|---|---|---|
| threading | Thread-based concurrency | I/O-bound tasks | Moderate |
| asyncio | Single-threaded async/await event loop | I/O-bound tasks, many connections | Moderate to High |
| multiprocessing | Process-based parallelism | CPU-bound tasks | Moderate |
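As a rough illustration of the async/await style, the sketch below checks several hosts concurrently on a single thread. `asyncio.sleep` stands in for real network I/O, and the names `check_host` and `check_all` are hypothetical.

```python
import asyncio


async def check_host(name, delay):
    """Pretend to query a host; asyncio.sleep stands in for network I/O."""
    await asyncio.sleep(delay)  # Yields control so other tasks can run
    return f"{name}: ok"


async def check_all(hosts, delay=0.05):
    # gather runs every coroutine on one event loop (a single thread)
    # and returns the results in the order the coroutines were passed in
    return await asyncio.gather(*(check_host(h, delay) for h in hosts))


if __name__ == "__main__":
    print(asyncio.run(check_all(["computer1", "computer2", "computer3"])))
```

Because nothing here blocks the event loop, the three simulated checks overlap and the whole run takes about as long as one of them.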
One thing to watch out for: depending on the threading implementation of the language in use, all threads might end up executing on the same CPU core.
Caution
In CPython, the standard Python implementation, the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time, so threading doesn’t provide true parallelism for CPU-bound tasks. If true multi-processor execution is needed, the work must be split into fully separate processes, for example with the multiprocessing module.
When to use each approach:
| Task Type | Recommended Approach | Reason |
|---|---|---|
| I/O-bound (network, disk) | threading or asyncio | Tasks spend most of their time waiting, and threads release the GIL while blocked on I/O |
| CPU-bound (calculations) | Multiprocessing | True parallel execution on multiple cores |
| Mixed workload | Combination of both | Optimize each component appropriately |
If a script is mostly just waiting on input or output (also known as I/O bound), it might not matter if it’s executed on one processor or eight.
I/O-bound characteristics:
| Aspect | Behavior | Optimization Strategy |
|---|---|---|
| CPU usage | Low, mostly idle | Threading sufficient |
| Waiting time | High, blocked on I/O | Parallelize I/O operations |
| Scaling | Limited by I/O bandwidth | Increase concurrent connections |
Parallelization may also be needed for the opposite reason: the script is already using all of the CPU time available to it. In other words, the script is CPU-bound.
CPU-bound characteristics:
| Aspect | Behavior | Optimization Strategy |
|---|---|---|
| CPU usage | High, constantly processing | Split across processors |
| Waiting time | Minimal, active computation | Use multiprocessing |
| Scaling | Limited by CPU cores | Add more cores or optimize algorithm |
In the case of CPU-bound operations, execution definitely needs to be split across processors to see performance benefits.
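    # I/O-bound example - threading works well
    import threading
    import requests  # Third-party package: pip install requests

    urls = ["https://example.com/a", "https://example.com/b"]  # Placeholder URLs
    results = {}

    def fetch_data(url):
        # Network I/O - the CPU is idle while waiting for the response
        response = requests.get(url, timeout=10)
        results[url] = response.text  # A thread target's return value is discarded

    threads = [threading.Thread(target=fetch_data, args=(url,))
               for url in urls]
    for thread in threads:
        thread.start()  # Begin all downloads concurrently
    for thread in threads:
        thread.join()   # Wait for every download to finish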
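    # CPU-bound example - multiprocessing needed
    import multiprocessing

    def perform_complex_calculation(data):
        # Placeholder for the real computation
        return sum(x * x for x in data)

    def compute_intensive_task(data):
        # Heavy computation - CPU constantly busy
        return perform_complex_calculation(data)

    if __name__ == "__main__":
        data_chunks = [range(1_000_000)] * 4  # Placeholder input
        # The with-block terminates the worker processes when the work is done
        with multiprocessing.Pool(processes=4) as pool:
            results = pool.map(compute_intensive_task, data_chunks)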
There is a point where adding more parallel processes means things become even slower, not faster.
Over-parallelization consequences:
| Scenario | Problem | Impact |
|---|---|---|
| Too many disk operations | Excessive seeking | More time repositioning than reading |
| Too many CPU operations | Excessive context switching | More time switching than computing |
| Too many network operations | Bandwidth saturation | Network congestion and timeouts |
If too many file read operations are attempted in parallel from disk, the disk might end up spending more time going from one position to another than actually retrieving the data.
Disk seek overhead:
| Parallel Operations | Seek Time | Read Time | Efficiency |
|---|---|---|---|
| 1-2 operations | Low | High | Good |
| 5-10 operations | Moderate | Moderate | Acceptable |
| 50+ operations | Very High | Low | Poor |
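One common way to keep disk reads from piling up is to cap how many run at once. The sketch below does this with the `max_workers` argument of `ThreadPoolExecutor`. The default of 4 and the function names are illustrative assumptions; a suitable cap depends on the storage device.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_file_lines(path):
    """Read one file and return its line count."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def count_lines(paths, max_parallel_reads=4):
    # max_workers caps how many reads are in flight at once; remaining
    # files wait in the executor's queue instead of competing for the disk
    with ThreadPoolExecutor(max_workers=max_parallel_reads) as executor:
        return dict(zip(paths, executor.map(count_file_lines, paths)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        paths = []
        for index in range(10):
            path = Path(folder) / f"sample{index}.txt"
            path.write_text("line\n" * (index + 1), encoding="utf-8")
            paths.append(path)
        print(sorted(count_lines(paths).values()))
```

Raising or lowering `max_parallel_reads` changes only the degree of parallelism, so the same code can be benchmarked at several levels.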
If far more CPU-heavy operations are run at once than there are cores, the OS can spend more time switching between them than making progress on the calculations themselves.
Context switching overhead:
| Process Count | Switching Overhead | Useful Work | Performance |
|---|---|---|---|
| Equal to cores | Minimal | Maximum | Optimal |
| 2× cores | Low | High | Good |
| 10× cores | High | Moderate | Degraded |
| 100× cores | Very High | Low | Poor |
When doing operations in parallel, the right balance of simultaneous actions must be found that lets computers stay busy without starving the system for resources.
Important
The optimal level of parallelism depends on the specific hardware, workload type, and resource availability. Profiling and benchmarking with different parallelism levels helps identify the sweet spot for each scenario.
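A simple benchmark loop along these lines can make the sweet spot visible. This is only a sketch: `simulated_request` is a stand-in for a real unit of work, and the worker counts tried are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_request(_):
    time.sleep(0.01)  # Stand-in for one I/O-bound unit of work


def benchmark(task, items, worker_counts):
    """Return {worker_count: elapsed_seconds} for each level tried."""
    timings = {}
    for workers in worker_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as executor:
            list(executor.map(task, items))  # Consume results so all tasks run
        timings[workers] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    timings = benchmark(simulated_request, range(32), [1, 2, 4, 8, 16])
    for workers, elapsed in timings.items():
        print(f"{workers:>2} workers: {elapsed:.3f}s")
    print("fastest:", min(timings, key=timings.get))
```

With a real workload, the timings usually stop improving at some level and then get worse, and that turning point is the value to keep.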
Finding the optimal balance:
| Resource Type | Recommended Approach | Starting Point |
|---|---|---|
| Disk I/O | Match to disk characteristics | 2-4 concurrent operations per disk |
| Network I/O | Match to bandwidth and latency | 10-50 concurrent connections |
| CPU-bound | Match to core count | Number of CPU cores or cores - 1 |
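The starting points in the table above could be encoded in a small helper like the one below. The function name and the choice of the lower end of each I/O range are assumptions made for illustration; the returned values are first guesses to benchmark from, not tuned settings.

```python
import os

# Lower end of the ranges in the table above (per disk for "disk")
IO_STARTING_POINTS = {"disk": 2, "network": 10}


def starting_parallelism(workload, cores=None):
    """Suggest an initial number of parallel tasks for a workload type."""
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() may return None
    if workload == "cpu":
        return max(1, cores - 1)  # Leave one core for the rest of the system
    if workload in IO_STARTING_POINTS:
        return IO_STARTING_POINTS[workload]
    raise ValueError(f"unknown workload type: {workload!r}")


if __name__ == "__main__":
    for kind in ("disk", "network", "cpu"):
        print(kind, starting_parallelism(kind))
```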
A data migration project required converting data stored in one format to a different format. There were many gigabytes of data that needed migrating, making manual processing impossible.
The first version of the script was taking an average of one hour per gigabyte migrated. This was much slower than expected, prompting investigation into optimization opportunities.
Initial performance:
| Metric | Value | Assessment |
|---|---|---|
| Processing rate | 1 hour per GB | Too slow |
| Approach | Sequential processing | Not utilizing available resources |
| Parallelism | None | Major opportunity for improvement |
The logic was reorganized to use a separate thread per file, which decreased the total time to work through the files because they were no longer processed one after another.
First optimization results:
| Approach | Processing Pattern | Improvement |
|---|---|---|
| Sequential | One file at a time | Baseline (1 hour/GB) |
| Threaded | Multiple files concurrently | Significant improvement |
To make the process even faster, the work was split across different machines, each running several threads.
Final architecture:
| Component | Implementation | Contribution |
|---|---|---|
| Multiple machines | Distributed processing | Linear scaling with machines |
| Threads per machine | Concurrent file processing | Utilized I/O wait time |
| Combined approach | Horizontal + vertical scaling | Maximum resource utilization |
After all this rearranging to use available resources, the processing time was brought down to three minutes per gigabyte.
Performance improvement summary:
| Version | Time per GB | Improvement Factor | Total Time for 100 GB |
|---|---|---|---|
| Initial | 60 minutes | 1× (baseline) | 6,000 minutes (100 hours) |
| Final | 3 minutes | 20× faster | 300 minutes (5 hours) |
This represents a dramatic 20× performance improvement through proper application of concurrency and parallelism techniques.
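The figures in the summary table can be checked with a few lines of arithmetic. The helper names are illustrative only.

```python
def speedup(before_minutes_per_gb, after_minutes_per_gb):
    """How many times faster the new rate is compared with the old one."""
    return before_minutes_per_gb / after_minutes_per_gb


def total_hours(minutes_per_gb, gigabytes):
    """Total processing time in hours for a given data volume."""
    return minutes_per_gb * gigabytes / 60


if __name__ == "__main__":
    print(speedup(60, 3))        # 20.0
    print(total_hours(60, 100))  # 100.0
    print(total_hours(3, 100))   # 5.0
```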
Note
This real-world example demonstrates that significant performance improvements are achievable through parallelization. The combination of threading for I/O-bound operations and distribution across machines provided multiplicative benefits.
| Question | I/O-Bound Answer | CPU-Bound Answer |
|---|---|---|
| What’s the bottleneck? | Waiting on I/O | CPU computation |
| Best parallelism approach? | threading or asyncio | multiprocessing |
| Optimal parallel tasks? | 10-100+ depending on I/O | Equal to CPU cores |
| Shared data needed? | Use threads | Use processes with IPC |
| Primary optimization? | Maximize concurrent I/O | Minimize context switching |
| Step | Action | Purpose |
|---|---|---|
| 1 | Profile the application | Identify if I/O-bound or CPU-bound |
| 2 | Choose parallelism strategy | Threads or asyncio for I/O, processes for CPU |
| 3 | Start with conservative parallelism | 2-4× parallelism factor |
| 4 | Benchmark performance | Measure actual improvement |
| 5 | Incrementally increase parallelism | Find optimal balance |
| 6 | Monitor resource utilization | Avoid over-parallelization |
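For step 1, one rough heuristic is to compare the CPU time a task consumes with the wall-clock time it takes. A ratio near 1 suggests CPU-bound work, and a ratio near 0 suggests the task mostly waits. The sketch below assumes a single-threaded task; the 0.5 threshold and the two sample tasks are arbitrary choices for illustration.

```python
import time


def cpu_ratio(task):
    """Return CPU time divided by wall-clock time for one call of task."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time consumed by this process
    task()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


def classify(task, threshold=0.5):
    # The 0.5 cut-off is an arbitrary illustrative choice
    return "cpu-bound" if cpu_ratio(task) >= threshold else "io-bound"


def waiting_task():
    time.sleep(0.1)  # Stand-in for blocking I/O


def computing_task():
    sum(i * i for i in range(500_000))  # Pure computation


if __name__ == "__main__":
    print("waiting_task:", classify(waiting_task))
    print("computing_task:", classify(computing_task))
```

A dedicated profiler gives a more detailed picture, but this ratio is often enough to pick between the thread-based and process-based strategies.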
| Practice | Description | Benefit |
|---|---|---|
| Start simple | Use OS-level process parallelism first | Minimal code changes |
| Profile before optimizing | Measure actual bottlenecks | Avoid premature optimization |
| Test incrementally | Increase parallelism gradually | Find optimal level |
| Monitor resources | Track CPU, memory, I/O usage | Prevent resource starvation |
| Consider workload mix | Balance CPU and I/O tasks | Maximize overall throughput |
Parallelizing operations is a powerful technique for improving script performance, particularly for I/O-bound operations where the CPU sits idle during disk reads, network transfers, or other slow operations. The operating system provides fundamental concurrency capabilities through process management, allowing easy parallelization by splitting work across multiple script invocations with different input sets. For scenarios requiring shared data, threads enable parallel execution within a single process, though language-specific threading implementations may limit true multi-processor execution. Python offers the threading and asyncio modules for I/O-bound tasks and the multiprocessing module for CPU-bound operations that need true parallel execution across multiple cores.

The distinction between I/O-bound and CPU-bound operations is critical: I/O-bound scripts benefit from threading or async patterns regardless of processor count, while CPU-bound scripts require splitting across multiple processors for performance gains. However, excessive parallelization can degrade performance when disk operations spend more time seeking than reading, or when the OS spends more time context switching than performing useful work. Finding the right balance requires understanding the workload characteristics and available resources.

The data migration case study demonstrated a 20× performance improvement (from 60 minutes to 3 minutes per gigabyte) by reorganizing code to use a thread per file and distributing work across multiple machines. When implementing parallelism, the decision framework should consider whether the bottleneck is I/O or CPU, whether shared data is needed, and what level of parallelism optimizes resource utilization without overwhelming the system. Starting with simple OS-level process parallelism, profiling to identify bottlenecks, and incrementally increasing parallelism while monitoring resources provides the most reliable path to performance optimization through concurrency.