Parallelizing Operations for Performance

November 11, 2025

This document explores concurrency and parallel execution techniques to improve script performance. It covers operating system process management, splitting work across processes and threads, understanding I/O-bound versus CPU-bound operations, and finding the optimal balance of parallel tasks to maximize resource utilization without system degradation.

This document examines concurrency and parallel execution as performance optimization strategies. It explains how operating systems manage processes, demonstrates techniques for splitting work across multiple processes and threads, distinguishes between I/O-bound and CPU-bound operations, and provides guidance on balancing parallelism to maximize throughput without overwhelming system resources.

The Problem with Blocking I/O Operations

Understanding Blocked Execution

Reading information from disk or transferring data over the network is slow. In a typical script, nothing else happens while such an operation is in progress: the script is blocked, waiting for input or output, while the CPU sits idle.

Operation Type	Typical Speed	CPU State During Operation	Opportunity
Disk I/O	Milliseconds to seconds	Idle, waiting	Can perform other work
Network I/O	Milliseconds to seconds	Idle, waiting	Can perform other work
CPU computation	Microseconds to milliseconds	Active, processing	Gains only from additional cores

The Parallelization Solution

One way to improve performance is to do operations in parallel. That way, while the computer is waiting for slow I/O, other work can take place.

The tricky part is dividing up the tasks so that the same result is achieved in the end. There is a whole field of computer science, concurrency, dedicated to how to write programs that perform operations in parallel.
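As a minimal sketch of this idea, the following compares running five simulated I/O waits one after another with running them in threads. The function `slow_io` is a hypothetical stand-in that uses `time.sleep` in place of a real disk or network call, and the delay and task count are arbitrary assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.1  # assumed duration of one simulated I/O wait, in seconds


def slow_io(task_id):
    """Hypothetical stand-in for a disk read or network request."""
    time.sleep(DELAY)  # the CPU is idle while this call blocks
    return task_id * 2


def run_sequential(task_ids):
    """Run the tasks one after another; the waits add up."""
    return [slow_io(t) for t in task_ids]


def run_threaded(task_ids):
    """Run the tasks in threads; the waits overlap."""
    with ThreadPoolExecutor(max_workers=len(task_ids)) as pool:
        return list(pool.map(slow_io, task_ids))  # map preserves input order


def timed(func, *args):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    ids = [1, 2, 3, 4, 5]
    seq_result, seq_time = timed(run_sequential, ids)
    par_result, par_time = timed(run_threaded, ids)
    print(f"sequential: {seq_time:.2f}s, threaded: {par_time:.2f}s")
    print("same result:", seq_result == par_result)
```

Both versions return the same list, which is the "same result in the end" requirement; only the elapsed time differs.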

Note
Concurrency is a vast and complex field in computer science. This document provides a practical overview of techniques applicable to everyday scripting and automation tasks, rather than comprehensive theoretical coverage.

Operating System Process Management

What the Operating System Does

The operating system handles the many processes that run on a computer, providing fundamental concurrency capabilities that can be leveraged for parallel execution.

OS Process Management Capabilities:

Capability	Function	Benefit
Multi-core scheduling	Assigns processes to different CPU cores	True parallel execution
Time-slicing	Switches between processes on same core	Concurrent execution appearance
Memory isolation	Each process has own memory allocation	Process independence and safety
I/O management	Handles independent I/O calls per process	Non-blocking I/O across processes

Process Execution and CPU Distribution

If a computer has more than one core, the operating system can decide which processes get executed on which core. Processes on different cores run truly in parallel, while processes sharing a core take turns through time-slicing, so from the script's point of view all of them make progress at the same time.

Process characteristics:

Each process has its own memory allocation
Each process does its own I/O calls
The OS decides what fraction of CPU time each process gets
The OS switches between processes as needed
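To make the process boundary concrete, here is a small sketch that starts a second Python process and shows that the OS gives it its own process ID. The helper name `child_pid` and the one-line child program are assumptions made for illustration.

```python
import os
import subprocess
import sys

# Small program run in a separate process: it reports its own process ID.
CHILD_CODE = "import os; print(os.getpid())"


def child_pid():
    """Start a separate Python process and return the PID it reports."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(completed.stdout.strip())


if __name__ == "__main__":
    print("logical CPUs reported by the OS:", os.cpu_count())
    print("parent PID:", os.getpid())
    print("child PID: ", child_pid())  # a different process with its own memory
```

The child is a separate interpreter, so nothing the parent holds in memory is visible to it; data has to be passed explicitly, here through command-line arguments and standard output.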

Splitting Work Across Processes

Basic Approach

A very easy way to run operations in parallel is to split them across different processes, calling a script many times each with a different input set, and letting the operating system handle the concurrency.

Practical Example: Network Statistics Collection

Consider a scenario where statistics need to be collected on the current load and memory usage for all computers in a network.

Sequential approach:

Approach	Execution Pattern	Total Time
Single process	Connect to computer 1, wait → connect to computer 2, wait → …	Sum of all connection times
Serial execution	No parallelism	Maximum delay

A script can be written that connects to each computer in a list and gets the stats. Each connection takes a while to complete, so the total runtime of the script would be the sum of the time taken by all those connections.

Parallel approach:

Approach	Execution Pattern	Total Time
Multiple processes	Split into groups, run simultaneously	Approximately time for one group
Parallel execution	OS manages concurrency	Minimized idle time

Instead, the list of computers could be split into smaller groups and the OS used to call the script many times, once for each group. That way, the connections to the different computers can be started in parallel, which minimizes the time when the CPU isn’t doing anything.

Implementation comparison:

# Sequential execution - slow
python3 get_stats.py computer1 computer2 computer3 ... computer100

# Parallel execution - fast
python3 get_stats.py computer1-computer25 &
python3 get_stats.py computer26-computer50 &
python3 get_stats.py computer51-computer75 &
python3 get_stats.py computer76-computer100 &
wait

Important
This approach is super easy to implement and for many scripts, it will be the right choice. Process-level parallelism requires minimal code changes while providing significant performance benefits.
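The same pattern can also be driven from Python rather than the shell. The sketch below is one possible launcher, not the original tooling: `split_into_groups` and `run_groups` are hypothetical helpers, and because `get_stats.py` is not available here, a one-line child program that simply echoes its arguments stands in for it.

```python
import subprocess
import sys

# Stand-in for get_stats.py so the sketch runs anywhere: the child process
# just prints the host names it was given.
CHILD_CODE = "import sys; print(' '.join(sys.argv[1:]))"


def split_into_groups(items, group_count):
    """Split items into group_count contiguous groups of near-equal size."""
    size, extra = divmod(len(items), group_count)
    groups, start = [], 0
    for i in range(group_count):
        end = start + size + (1 if i < extra else 0)
        groups.append(items[start:end])
        start = end
    return groups


def run_groups(groups):
    """Start one process per group, then wait for all of them."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE, *group],
            stdout=subprocess.PIPE,
            text=True,
        )
        for group in groups  # every process is started before any is awaited
    ]
    # communicate() waits for the process to exit, like the shell's `wait`
    return [proc.communicate()[0].strip() for proc in procs]


if __name__ == "__main__":
    computers = [f"computer{n}" for n in range(1, 101)]
    for line in run_groups(split_into_groups(computers, 4)):
        hosts = line.split()
        print(hosts[0], "...", hosts[-1])
```

To launch a real script, the command list would become something like `[sys.executable, "get_stats.py", *group]`, assuming the script accepts host names as arguments the way the sequential command suggests.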

Balancing Workloads Across Processes

Complementary Resource Usage

Another easy optimization is to have a good balance of different workloads running on a computer.

Workload balance strategy:

Process Type	Primary Resource	Impact on Others
CPU-intensive	CPU utilization	Low impact on I/O processes
Network I/O-intensive	Network bandwidth	Low impact on CPU/disk processes
Disk I/O-intensive	Disk I/O	Low impact on CPU/network processes

If one process is using a lot of CPU while a different process is using a lot of network I/O and another process is using a lot of disk I/O, these can all run in parallel without interfering with each other.

Optimal workload mix:

Process A: Heavy CPU computation (image processing)
Process B: Network data transfer (downloading files)
Process C: Disk operations (database indexing)

Result: All three processes run efficiently in parallel
Each uses different resources, minimizing contention

Threads for Shared Memory

When Processes Aren’t Enough

When using the OS to split work into processes, these processes don’t share any memory. Sometimes there might be a need to have some shared data between parallel tasks.

Understanding Threads

In that case, threads are used. Threads let parallel tasks run inside a process. This allows threads to share some of the memory with other threads in the same process.

Process vs Thread comparison:

Aspect	Process	Thread
Memory	Isolated, no sharing	Shared within same process
Management	Started and scheduled by the OS	Created and coordinated by application code
Overhead	Higher (separate memory space)	Lower (shared memory space)
Communication	IPC mechanisms required	Direct memory access
Isolation	Complete isolation	Partial isolation
Creation cost	Expensive	Relatively cheap

Threads have to be created and coordinated from inside the program rather than by launching the script several times, so the code must be modified to create and handle them.
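A minimal sketch of threads sharing memory follows. Several threads update one counter that lives in the process, and a lock keeps the updates from interfering with each other. The names `add_many` and `run_threads` and the counts used are assumptions for illustration.

```python
import threading

counter = 0                      # shared by every thread in this process
counter_lock = threading.Lock()  # guards updates to the shared counter


def add_many(times):
    """Increment the shared counter `times` times."""
    global counter
    for _ in range(times):
        with counter_lock:       # only one thread updates at a time
            counter += 1


def run_threads(thread_count, times):
    """Reset the counter, run the threads, and return the final value."""
    global counter
    counter = 0
    threads = [
        threading.Thread(target=add_many, args=(times,))
        for _ in range(thread_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                 # wait until every thread has finished
    return counter


if __name__ == "__main__":
    # 4 threads x 10,000 increments each
    print(run_threads(thread_count=4, times=10_000))
```

Shared memory is what makes this convenient, and also what makes the lock necessary: without it, two threads could read the same old value and one update would be lost.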

Threading in Python

Available Modules

To implement threading, look at the capabilities the programming language provides. In Python, the threading or asyncio modules can be used to accomplish this.

Python concurrency modules:

Module	Approach	Best For	Complexity
threading	Thread-based concurrency	I/O-bound tasks	Moderate
asyncio	Async/await pattern	I/O-bound tasks, many connections	Moderate to High
multiprocessing	Process-based parallelism	CPU-bound tasks	Moderate

These modules allow specification of:

Which parts of code to run in separate threads or as separate asynchronous events
How results of each should be combined in the end
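As an illustration of both points with `asyncio`, the sketch below marks which work runs as separate asynchronous tasks and uses `asyncio.gather` to combine the results. The coroutine `fetch` is a hypothetical stand-in that sleeps instead of making a real request.

```python
import asyncio


async def fetch(name, delay):
    """Hypothetical stand-in for a network request."""
    await asyncio.sleep(delay)  # yields control while "waiting" on I/O
    return f"{name} done"


async def fetch_all(jobs):
    """Run every job concurrently and combine the results into one list."""
    # gather() returns results in the order the awaitables were passed in,
    # regardless of which one finishes first
    return await asyncio.gather(*(fetch(name, delay) for name, delay in jobs))


if __name__ == "__main__":
    jobs = [("a", 0.2), ("b", 0.1), ("c", 0.05)]
    print(asyncio.run(fetch_all(jobs)))
```

Everything here runs in a single thread; the concurrency comes from tasks handing control back to the event loop whenever they wait.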

Threading Implementation Consideration

One thing to watch out for is that, depending on the threading implementation of the language being used, all threads might end up being executed on the same CPU core.

Caution
In the standard CPython interpreter, the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time, so threading doesn't provide true parallelism for CPU-bound tasks. If true multi-processor execution is needed, the code must be split into fully separate processes, for example with the multiprocessing module.

When to use each approach:

Task Type	Recommended Approach	Reason
I/O-bound (network, disk)	Threading or AsyncIO	GIL released during I/O operations
CPU-bound (calculations)	Multiprocessing	True parallel execution on multiple cores
Mixed workload	Combination of both	Optimize each component appropriately

I/O-Bound vs CPU-Bound Operations

Understanding Operation Types

If a script is mostly just waiting on input or output (also known as I/O bound), it might not matter if it’s executed on one processor or eight.

I/O-bound characteristics:

Aspect	Behavior	Optimization Strategy
CPU usage	Low, mostly idle	Threading sufficient
Waiting time	High, blocked on I/O	Parallelize I/O operations
Scaling	Limited by I/O bandwidth	Increase concurrent connections

But the reason for parallelizing might instead be that the script is already using all of the available CPU time. In other words, the script is CPU-bound.

CPU-bound characteristics:

Aspect	Behavior	Optimization Strategy
CPU usage	High, constantly processing	Split across processors
Waiting time	Minimal, active computation	Use multiprocessing
Scaling	Limited by CPU cores	Add more cores or optimize algorithm

CPU-Bound Parallelization

In the case of CPU-bound operations, execution definitely needs to be split across processors to see performance benefits.

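# I/O-bound example - threading works well
import threading
import requests  # third-party HTTP library

urls = ["https://example.com"]  # placeholder list of URLs to fetch
results = {}

def fetch_data(url):
    # Network I/O - CPU idle during wait
    response = requests.get(url, timeout=10)
    results[url] = response.text  # a thread's return value is discarded, so store it

threads = [threading.Thread(target=fetch_data, args=(url,))
           for url in urls]
for t in threads:
    t.start()  # threads do nothing until they are started
for t in threads:
    t.join()   # wait for every download to finish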

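# CPU-bound example - multiprocessing needed
import multiprocessing

def compute_intensive_task(data):
    # Heavy computation - CPU constantly busy
    # (a sum of squares stands in for the real calculation)
    return sum(x * x for x in data)

if __name__ == "__main__":
    # Placeholder input: four chunks of numbers
    data_chunks = [range(i, i + 1_000_000) for i in range(4)]
    # The guard above is required on platforms that spawn worker processes
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(compute_intensive_task, data_chunks)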

Finding the Right Balance

The Diminishing Returns Problem

There is a point where adding more parallel processes means things become even slower, not faster.

Over-parallelization consequences:

Scenario	Problem	Impact
Too many disk operations	Excessive seeking	More time repositioning than reading
Too many CPU operations	Excessive context switching	More time switching than computing
Too many network operations	Bandwidth saturation	Network congestion and timeouts

Disk I/O Over-Parallelization

If too many file read operations are attempted in parallel from disk, the disk (particularly a spinning hard drive) might end up spending more time going from one position to another than actually retrieving the data.

Disk seek overhead:

Parallel Operations	Seek Time	Read Time	Efficiency
1-2 operations	Low	High	Good
5-10 operations	Moderate	Moderate	Acceptable
50+ operations	Very High	Low	Poor
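One way to keep disk reads from piling up is to cap how many run at once. The sketch below is a minimal example under assumptions: the cap of 2 is arbitrary, the files are small throwaway samples created by `make_sample_files`, and the counters exist only to show that the cap holds.

```python
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MAX_PARALLEL_READS = 2  # assumed cap; tune by measuring on the real disk

active_reads = 0        # reads currently in progress
peak_reads = 0          # highest number of simultaneous reads observed
stats_lock = threading.Lock()


def read_file(path):
    """Read one file while recording how many reads overlap."""
    global active_reads, peak_reads
    with stats_lock:
        active_reads += 1
        peak_reads = max(peak_reads, active_reads)
    try:
        return len(Path(path).read_bytes())
    finally:
        with stats_lock:
            active_reads -= 1


def read_all(paths, max_parallel=MAX_PARALLEL_READS):
    """Read every file, never running more than max_parallel reads at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(read_file, paths))


def peak_parallel_reads():
    """Return the largest number of overlapping reads seen so far."""
    return peak_reads


def make_sample_files(directory, count):
    """Create small throwaway files so the sketch is self-contained."""
    paths = []
    for i in range(count):
        path = Path(directory) / f"sample_{i}.dat"
        path.write_bytes(b"x" * 1024)
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sizes = read_all(make_sample_files(tmp, 20))
    print("files read:", len(sizes), "peak parallel reads:", peak_parallel_reads())
```

The pool size acts as the limit: extra files wait in the queue instead of competing for the disk.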

CPU Over-Parallelization

If a ton of operations that use a lot of CPU are executed, the OS could spend more time switching between them than actually making progress in the calculations being attempted.

Context switching overhead:

Process Count	Switching Overhead	Useful Work	Performance
Equal to cores	Minimal	Maximum	Optimal
2× cores	Low	High	Good
10× cores	High	Moderate	Degraded
100× cores	Very High	Low	Poor

Optimization Strategy

When doing operations in parallel, the right balance of simultaneous actions must be found that lets computers stay busy without starving the system for resources.

Important
The optimal level of parallelism depends on the specific hardware, workload type, and resource availability. Profiling and benchmarking with different parallelism levels helps identify the sweet spot for each scenario.

Finding the optimal balance:

Resource Type	Recommended Approach	Starting Point
Disk I/O	Match to disk characteristics	2-4 concurrent operations per disk
Network I/O	Match to bandwidth and latency	10-50 concurrent connections
CPU-bound	Match to core count	Number of CPU cores or cores - 1
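The starting points in the table can be turned into a small benchmarking loop. This is a sketch under assumptions: `starting_workers` encodes the low end of each suggested range, and `simulated_request` sleeps instead of performing real I/O, so the timings only illustrate the method.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def starting_workers(resource):
    """Return a conservative starting worker count for a resource type.

    The numbers mirror the low end of the starting points in the table
    above; they are starting points for benchmarking, not final answers.
    """
    cores = os.cpu_count() or 1  # cpu_count() can return None
    if resource == "cpu":
        return max(1, cores - 1)
    if resource == "disk":
        return 2
    if resource == "network":
        return 10
    raise ValueError(f"unknown resource type: {resource!r}")


def simulated_request(_):
    """Hypothetical I/O task; sleeps instead of touching a real network."""
    time.sleep(0.02)


def benchmark(worker_counts, task_count=40):
    """Time the same batch of tasks at several parallelism levels."""
    timings = {}
    for workers in worker_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(simulated_request, range(task_count)))
        timings[workers] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    base = starting_workers("network")
    for workers, seconds in benchmark([1, base, base * 2]).items():
        print(f"{workers:>3} workers: {seconds:.2f}s")
```

With a real workload, the level to keep is the one after which the measured time stops improving or starts getting worse.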

Real-World Case Study: Data Migration

The Problem

A data migration project required converting data stored in one format to a different format. There were many gigabytes of data that needed migrating, making manual processing impossible.

Initial Implementation

The first version of the script was taking an average of one hour per gigabyte migrated. This was much slower than expected, prompting investigation into optimization opportunities.

Initial performance:

Metric	Value	Assessment
Processing rate	1 hour per GB	Too slow
Approach	Sequential processing	Not utilizing available resources
Parallelism	None	Major opportunity for improvement

First Optimization: Thread-Based Parallelism

The logic was reorganized to have a separate thread per file, which decreased the total time to work through the files since it now wasn’t a linear process.
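The migration script itself is not shown, so the following is only a sketch of what a thread-per-file layout can look like. The conversion is a hypothetical placeholder (upper-casing text), and the file names and helper functions are assumptions.

```python
import tempfile
import threading
from pathlib import Path


def convert_file(source, target):
    """Hypothetical conversion: upper-casing stands in for the real format change."""
    target.write_text(source.read_text().upper())


def migrate(sources, out_dir):
    """Convert every source file, using one thread per file."""
    targets = [Path(out_dir) / (src.stem + ".converted") for src in sources]
    threads = [
        threading.Thread(target=convert_file, args=(src, dst))
        for src, dst in zip(sources, targets)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every conversion before returning
    return targets


def make_sources(directory, count):
    """Create small sample input files for the demonstration."""
    sources = []
    for i in range(count):
        path = Path(directory) / f"record_{i}.txt"
        path.write_text(f"row {i}\n")
        sources.append(path)
    return sources


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        converted = migrate(make_sources(tmp, 5), tmp)
        print([p.read_text().strip() for p in converted])
```

One thread per file works for a modest number of files; with very many files, a bounded thread pool avoids the over-parallelization problems described earlier.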

First optimization results:

Approach	Processing Pattern	Improvement
Sequential	One file at a time	Baseline (1 hour/GB)
Threaded	Multiple files concurrently	Significant improvement

Second Optimization: Multi-Machine Distribution

To make the process even faster, the work was split onto different machines, each running a bunch of threads.

Final architecture:

Component	Implementation	Contribution
Multiple machines	Distributed processing	Linear scaling with machines
Threads per machine	Concurrent file processing	Utilized I/O wait time
Combined approach	Horizontal + vertical scaling	Maximum resource utilization
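How the work was divided between machines is not described, so the sketch below shows just one simple possibility: dealing files out round-robin. The file names and host names are hypothetical.

```python
def assign_to_machines(files, machines):
    """Deal files out to machines round-robin so each gets a similar share."""
    assignment = {machine: [] for machine in machines}
    for index, name in enumerate(files):
        machine = machines[index % len(machines)]
        assignment[machine].append(name)
    return assignment


if __name__ == "__main__":
    files = [f"export_{n:03d}.dat" for n in range(10)]  # hypothetical names
    machines = ["worker-a", "worker-b", "worker-c"]     # hypothetical hosts
    for machine, share in assign_to_machines(files, machines).items():
        print(machine, share)
```

Round-robin balances the number of files, not their size; if file sizes vary widely, assigning by total bytes would give a more even split.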

Final Results

After all this rearranging to use available resources, the processing time was brought down to three minutes per gigabyte.

Performance improvement summary:

Version	Time per GB	Improvement Factor	Total Time for 100 GB
Initial	60 minutes	1× (baseline)	6,000 minutes (100 hours)
Final	3 minutes	20× faster	300 minutes (5 hours)

This represents a dramatic 20× performance improvement through proper application of concurrency and parallelism techniques.
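The figures in the summary table can be checked with a few lines of arithmetic. The 100 GB data size is the illustrative value used in the table, and `total_minutes` is a helper introduced only for this check.

```python
def total_minutes(gigabytes, minutes_per_gb):
    """Total processing time for a data set at a fixed per-gigabyte rate."""
    return gigabytes * minutes_per_gb


initial_rate = 60   # minutes per GB, first version
final_rate = 3      # minutes per GB, after parallelizing
data_size = 100     # GB, the illustrative size used in the table

speedup = initial_rate / final_rate
initial_total = total_minutes(data_size, initial_rate)
final_total = total_minutes(data_size, final_rate)

print(f"speedup: {speedup:.0f}x")                                     # 20x
print(f"initial: {initial_total} min = {initial_total / 60:.0f} h")  # 6000 min = 100 h
print(f"final:   {final_total} min = {final_total / 60:.0f} h")      # 300 min = 5 h
```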

Note
This real-world example demonstrates that significant performance improvements are achievable through parallelization. The combination of threading for I/O-bound operations and distribution across machines provided multiplicative benefits.

Concurrency Implementation Guidelines

Decision Framework

Question	I/O-Bound Answer	CPU-Bound Answer
What’s the bottleneck?	Waiting on I/O	CPU computation
Best parallelism approach?	Threading or AsyncIO	Multiprocessing
Optimal parallel tasks?	10-100+ depending on I/O	Equal to CPU cores
Shared data needed?	Use threads	Use processes with IPC
Primary optimization?	Maximize concurrent I/O	Minimize context switching

Implementation Steps

Step	Action	Purpose
1	Profile the application	Identify if I/O-bound or CPU-bound
2	Choose parallelism strategy	Threads/AsyncIO for I/O, processes for CPU
3	Start with conservative parallelism	2-4× parallelism factor
4	Benchmark performance	Measure actual improvement
5	Incrementally increase parallelism	Find optimal balance
6	Monitor resource utilization	Avoid over-parallelization
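For step 1, a rough first check is to compare CPU time with elapsed time: a task that spends most of its elapsed time off the CPU is waiting on something. The sketch below assumes a 0.5 threshold, which is arbitrary, and uses two hypothetical tasks to show both outcomes.

```python
import time


def cpu_fraction(func):
    """Return the share of wall-clock time the process spent on the CPU."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu / wall if wall > 0 else 0.0


def classify(func, threshold=0.5):
    """Label a task by comparing CPU time with elapsed time.

    The 0.5 threshold is an arbitrary assumption for this sketch.
    """
    return "CPU-bound" if cpu_fraction(func) >= threshold else "I/O-bound"


def waiting_task():
    time.sleep(0.2)  # stands in for a slow disk or network call


def computing_task():
    sum(i * i for i in range(2_000_000))  # keeps the CPU busy


if __name__ == "__main__":
    print("waiting_task:  ", classify(waiting_task))
    print("computing_task:", classify(computing_task))
```

This only gives a coarse answer for the whole task; a profiler is still needed to find which part of the code is responsible.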

Best Practices

Practice	Description	Benefit
Start simple	Use OS-level process parallelism first	Minimal code changes
Profile before optimizing	Measure actual bottlenecks	Avoid premature optimization
Test incrementally	Increase parallelism gradually	Find optimal level
Monitor resources	Track CPU, memory, I/O usage	Prevent resource starvation
Consider workload mix	Balance CPU and I/O tasks	Maximize overall throughput

Conclusion

Parallelizing operations is a powerful technique for improving script performance, particularly when dealing with I/O-bound operations where the CPU sits idle during disk reads, network transfers, or other slow operations. The operating system provides fundamental concurrency capabilities through process management, allowing easy parallelization by splitting work across multiple script invocations with different input sets.

For scenarios requiring shared data, threads enable parallel execution within a single process, though language-specific threading implementations may limit true multi-processor execution. Python offers the threading and asyncio modules for I/O-bound tasks and the multiprocessing module for CPU-bound operations that need true parallel execution across multiple cores. The distinction between I/O-bound and CPU-bound operations is critical: I/O-bound scripts benefit from threading or async patterns regardless of processor count, while CPU-bound scripts require splitting across multiple processors for performance gains.

However, excessive parallelization can degrade performance when disk operations spend more time seeking than reading, or when the OS spends more time context switching than performing useful work. Finding the right balance requires understanding the workload characteristics and available resources.

The data migration case study demonstrated a 20× performance improvement (from 60 minutes to 3 minutes per gigabyte) by reorganizing code to use a thread per file and distributing work across multiple machines. When implementing parallelism, the decision framework should consider whether the bottleneck is I/O or CPU, whether shared data is needed, and what level of parallelism optimizes resource utilization without overwhelming the system. Starting with simple OS-level process parallelism, profiling to identify bottlenecks, and incrementally increasing parallelism while monitoring resources provides the most reliable path to performance optimization through concurrency.
