Parallelizing Operations for Performance

This document examines concurrency and parallel execution as performance optimization strategies. It explains how operating systems manage processes, demonstrates techniques for splitting work across multiple processes and threads, distinguishes between I/O-bound and CPU-bound operations, and provides guidance on balancing parallelism to maximize throughput without overwhelming system resources.


The Problem with Blocking I/O Operations

Understanding Blocked Execution

Reading information from disk or transferring data over the network is a slow operation. In typical scripts, while this operation is ongoing, nothing else happens. The script is blocked, waiting for input or output while the CPU sits idle.

Operation Type | Typical Speed | CPU State During Operation | Opportunity
Disk I/O | Milliseconds to seconds | Idle, waiting | Can perform other work
Network I/O | Milliseconds to seconds | Idle, waiting | Can perform other work
CPU computation | Microseconds to milliseconds | Active, processing | Limited parallelism opportunity

The Parallelization Solution

One way to improve performance is to do operations in parallel. That way, while the computer is waiting for slow I/O, other work can take place.

The tricky part is dividing up the tasks so that the same result is achieved in the end. There is actually a whole field of computer science called concurrency, dedicated to how programs are written that do operations in parallel.
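
As a rough illustration of the idea, the sketch below simulates four slow I/O operations and compares running them one after another with running them in threads. It is a minimal example under stated assumptions: time.sleep stands in for a real disk or network call, and the function names and the 0.05-second delay are invented for the example.

```python
import threading
import time


def slow_io(task_id, results, delay=0.05):
    """Simulate a blocking I/O call; the thread sleeps while 'waiting'."""
    time.sleep(delay)
    results[task_id] = f"data-{task_id}"


def run_sequential(task_ids):
    """Run every task one after another and return (results, seconds)."""
    results = {}
    start = time.perf_counter()
    for task_id in task_ids:
        slow_io(task_id, results)
    return results, time.perf_counter() - start


def run_threaded(task_ids):
    """Run every task in its own thread and return (results, seconds)."""
    results = {}
    start = time.perf_counter()
    threads = [threading.Thread(target=slow_io, args=(task_id, results))
               for task_id in task_ids]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, time.perf_counter() - start


if __name__ == "__main__":
    seq_results, seq_time = run_sequential(range(4))
    thr_results, thr_time = run_threaded(range(4))
    print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
    print("same results:", seq_results == thr_results)
```

Both versions produce the same dictionary of results. The threaded version should finish in roughly the time of one wait rather than four, because the waits overlap.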


Operating System Process Management

What the Operating System Does

The operating system handles the many processes that run on a computer, providing fundamental concurrency capabilities that can be leveraged for parallel execution.

OS Process Management Capabilities:

Capability | Function | Benefit
Multi-core scheduling | Assigns processes to different CPU cores | True parallel execution
Time-slicing | Switches between processes on same core | Concurrent execution appearance
Memory isolation | Each process has own memory allocation | Process independence and safety
I/O management | Handles independent I/O calls per process | Non-blocking I/O across processes

Process Execution and CPU Distribution

If a computer has more than one core, the operating system decides which processes run on which core. Processes on different cores execute truly in parallel. Processes that share a core take turns, and the OS switches between them quickly enough that they appear to run at the same time.

Process characteristics:

  • Each process has its own memory allocation
  • Each process does its own I/O calls
  • The OS decides what fraction of CPU time each process gets
  • The OS switches between processes as needed

Splitting Work Across Processes

Basic Approach

A very easy way to run operations in parallel is to split them across different processes, calling a script many times each with a different input set, and letting the operating system handle the concurrency.

Practical Example: Network Statistics Collection

Consider a scenario where statistics need to be collected on the current load and memory usage for all computers in a network.

Sequential approach:

Approach | Execution Pattern | Total Time
Single process | Connect to computer 1, wait → connect to computer 2, wait → … | Sum of all connection times
Serial execution | No parallelism | Maximum delay

A script can be written that connects to each computer in a list and gets the stats. Each connection takes a while to complete, so the total runtime of the script would be the sum of the time taken by all those connections.

Parallel approach:

Approach | Execution Pattern | Total Time
Multiple processes | Split into groups, run simultaneously | Approximately time for one group
Parallel execution | OS manages concurrency | Minimized idle time

Instead, the list of computers could be split into smaller groups and the OS used to call the script many times, once for each group. That way, the connections to the different computers can be started in parallel, which minimizes the time when the CPU isn’t doing anything.

Implementation comparison:

# Sequential execution - slow
python3 get_stats.py computer1 computer2 computer3 ... computer100

# Parallel execution - fast
# Each line ending in & starts a background job for one group of hosts;
# wait blocks until every background job has finished.
python3 get_stats.py computer1-computer25 &
python3 get_stats.py computer26-computer50 &
python3 get_stats.py computer51-computer75 &
python3 get_stats.py computer76-computer100 &
wait
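
The same pattern can also be driven from Python. The sketch below splits a host list into groups, starts one child process per group with subprocess.Popen, and then waits for all of them, mirroring the & and wait in the shell version. Because get_stats.py is not shown in this document, the child command here is a stand-in that only prints the hosts it received. The helper names and host names are assumptions made for illustration.

```python
import subprocess
import sys


def split_into_groups(items, group_count):
    """Split items into group_count contiguous groups of near-equal size."""
    size, extra = divmod(len(items), group_count)
    groups, start = [], 0
    for index in range(group_count):
        end = start + size + (1 if index < extra else 0)
        groups.append(items[start:end])
        start = end
    return groups


def run_groups_in_parallel(groups):
    """Start one child process per group, then wait for all of them."""
    # Stand-in for "python3 get_stats.py <hosts...>": prints its arguments.
    child_code = "import sys; print(' '.join(sys.argv[1:]))"
    children = [
        subprocess.Popen([sys.executable, "-c", child_code, *group],
                         stdout=subprocess.PIPE, text=True)
        for group in groups
    ]
    # All children are already running; communicate() waits for each to exit.
    return [child.communicate()[0].strip() for child in children]


if __name__ == "__main__":
    hosts = [f"computer{n}" for n in range(1, 101)]
    for line in run_groups_in_parallel(split_into_groups(hosts, 4)):
        names = line.split()
        print(names[0], "...", names[-1])
```

Every child is started before any of them is waited on, which is what lets the OS run them concurrently.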

Balancing Workloads Across Processes

Complementary Resource Usage

Another easy optimization is to have a good balance of different workloads running on a computer.

Workload balance strategy:

Process Type | Primary Resource | Impact on Others
CPU-intensive | CPU utilization | Low impact on I/O processes
Network I/O-intensive | Network bandwidth | Low impact on CPU/disk processes
Disk I/O-intensive | Disk I/O | Low impact on CPU/network processes

If one process is using a lot of CPU while a different process is using a lot of network I/O and another process is using a lot of disk I/O, these can all run in parallel with little interference, because each is competing for a different resource.

Optimal workload mix:

Process A: Heavy CPU computation (image processing)
Process B: Network data transfer (downloading files)
Process C: Disk operations (database indexing)

Result: All three processes run efficiently in parallel
Each uses different resources, minimizing contention

Threads for Shared Memory

When Processes Aren’t Enough

When using the OS to split work into processes, these processes don’t share any memory. Sometimes there might be a need to have some shared data between parallel tasks.

Understanding Threads

In that case, threads are used. Threads let parallel tasks run inside a process. This allows threads to share some of the memory with other threads in the same process.

Process vs Thread comparison:

Aspect | Process | Thread
Memory | Isolated, no sharing | Shared within same process
Management | OS-managed | Application-managed
Overhead | Higher (separate memory space) | Lower (shared memory space)
Communication | IPC mechanisms required | Direct memory access
Isolation | Complete isolation | Partial isolation
Creation cost | Expensive | Relatively cheap

Unlike separate processes, which the OS can run without any change to the script, threads have to be created and coordinated by the code itself, so the script must be modified to use them.
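
To make the shared-memory point concrete, here is a minimal sketch in which several threads update one dictionary that lives in the process they all belong to. The function name and sample data are hypothetical. The lock is there because two threads updating the same value at once could otherwise lose an update.

```python
import threading


def count_words_shared(lines, thread_count=4):
    """Count words across lines using threads that update one shared dict."""
    totals = {"words": 0}      # shared by every thread in this process
    lock = threading.Lock()    # protects the read-modify-write below

    def worker(chunk):
        local = sum(len(line.split()) for line in chunk)
        with lock:
            totals["words"] += local

    # Give each thread every thread_count-th line.
    chunks = [lines[i::thread_count] for i in range(thread_count)]
    threads = [threading.Thread(target=worker, args=(chunk,))
               for chunk in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return totals["words"]


if __name__ == "__main__":
    sample = ["one two three", "four five", "six"] * 10
    print(count_words_shared(sample))
```

Separate processes could not update totals this way, because each would have its own copy of the dictionary.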


Threading in Python

Available Modules

To use threads, look at what the programming language offers. In Python, the threading module provides threads, and the asyncio module provides a related approach based on asynchronous events.

Python concurrency modules:

Module | Approach | Best For | Complexity
threading | Thread-based parallelism | I/O-bound tasks | Moderate
asyncio | Async/await pattern | I/O-bound tasks, many connections | Moderate to High
multiprocessing | Process-based parallelism | CPU-bound tasks | Moderate

These modules allow specification of:

  • Which parts of code to run in separate threads or as separate asynchronous events
  • How results of each should be combined in the end
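
As a sketch of both points using asyncio, the example below marks which work runs as separate asynchronous events (one fetch_stats call per host) and how the results are combined (asyncio.gather, then a dictionary). The coroutine names, host names, and the returned load value are placeholders. asyncio.sleep stands in for a real network request.

```python
import asyncio


async def fetch_stats(host, delay=0.01):
    """Pretend to query a host; asyncio.sleep stands in for network I/O."""
    await asyncio.sleep(delay)
    return host, {"load": 0.0}  # placeholder value, not a real measurement


async def collect_all(hosts):
    """Query every host concurrently and combine the answers into a dict."""
    # gather() runs the coroutines concurrently and returns their results
    # in the same order as its arguments.
    pairs = await asyncio.gather(*(fetch_stats(host) for host in hosts))
    return dict(pairs)


if __name__ == "__main__":
    print(asyncio.run(collect_all(["computer1", "computer2", "computer3"])))
```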

Threading Implementation Consideration

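One thing to watch out for is that, depending on the threading implementation of the language being used, all threads may end up executing on the same CPU core. In the standard CPython interpreter, for example, the global interpreter lock (GIL) lets only one thread run Python code at a time.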

When to use each approach:

Task Type | Recommended Approach | Reason
I/O-bound (network, disk) | Threading or AsyncIO | GIL released during I/O operations
CPU-bound (calculations) | Multiprocessing | True parallel execution on multiple cores
Mixed workload | Combination of both | Optimize each component appropriately
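
One way to express this choice in code is sketched below. It uses the standard library's concurrent.futures module, which this document does not otherwise cover, because its thread pool and process pool share one interface. The make_executor helper, the task type labels, and the default of 16 I/O workers are assumptions for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import os


def make_executor(task_type, io_workers=16):
    """Return a thread pool for I/O-bound work, a process pool for CPU-bound."""
    if task_type == "io":
        return ThreadPoolExecutor(max_workers=io_workers)
    if task_type == "cpu":
        return ProcessPoolExecutor(max_workers=os.cpu_count() or 1)
    raise ValueError(f"unknown task type: {task_type!r}")


if __name__ == "__main__":
    # Both executors expose the same map() interface.
    with make_executor("io") as pool:
        print(list(pool.map(len, ["a", "bb", "ccc"])))
```

A mixed workload could use one executor of each kind, sending each part of the job to the pool that suits it.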

I/O-Bound vs CPU-Bound Operations

Understanding Operation Types

If a script is mostly just waiting on input or output (also known as I/O bound), it might not matter if it’s executed on one processor or eight.

I/O-bound characteristics:

Aspect | Behavior | Optimization Strategy
CPU usage | Low, mostly idle | Threading sufficient
Waiting time | High, blocked on I/O | Parallelize I/O operations
Scaling | Limited by I/O bandwidth | Increase concurrent connections

Parallelization may also be needed for the opposite reason: the script is already using all of the CPU time available to it. In other words, the script is CPU-bound.

CPU-bound characteristics:

Aspect | Behavior | Optimization Strategy
CPU usage | High, constantly processing | Split across processors
Waiting time | Minimal, active computation | Use multiprocessing
Scaling | Limited by CPU cores | Add more cores or optimize algorithm

CPU-Bound Parallelization

For CPU-bound operations, the work needs to be split across multiple cores to see a performance benefit.

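# I/O-bound example - threading works well
import threading

import requests  # third-party package: pip install requests

urls = ["https://example.com"]  # placeholder list; replace with real URLs
pages = {}                      # shared dict: url -> response body

def fetch_data(url):
    # Network I/O - CPU idle during wait
    response = requests.get(url, timeout=10)
    pages[url] = response.text  # Thread ignores return values, so store it

threads = [threading.Thread(target=fetch_data, args=(url,))
           for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # wait for every download to finish

# CPU-bound example - multiprocessing needed
import multiprocessing

def perform_complex_calculation(data):
    # Placeholder for the real computation
    return sum(value * value for value in data)

def compute_intensive_task(data):
    # Heavy computation - CPU constantly busy
    return perform_complex_calculation(data)

if __name__ == "__main__":  # needed on platforms that spawn child processes
    data_chunks = [range(1_000_000)] * 4  # placeholder input
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(compute_intensive_task, data_chunks)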

Finding the Right Balance

The Diminishing Returns Problem

There is a point where adding more parallel processes means things become even slower, not faster.

Over-parallelization consequences:

Scenario | Problem | Impact
Too many disk operations | Excessive seeking | More time repositioning than reading
Too many CPU operations | Excessive context switching | More time switching than computing
Too many network operations | Bandwidth saturation | Network congestion and timeouts

Disk I/O Over-Parallelization

If too many file read operations are attempted in parallel from disk, the disk might end up spending more time going from one position to another than actually retrieving the data.

Disk seek overhead:

Parallel Operations | Seek Time | Read Time | Efficiency
1-2 operations | Low | High | Good
5-10 operations | Moderate | Moderate | Acceptable
50+ operations | Very High | Low | Poor

CPU Over-Parallelization

If a ton of operations that use a lot of CPU are executed, the OS could spend more time switching between them than actually making progress in the calculations being attempted.

Context switching overhead:

Process Count | Switching Overhead | Useful Work | Performance
Equal to cores | Minimal | Maximum | Optimal
2× cores | Low | High | Good
10× cores | High | Moderate | Degraded
100× cores | Very High | Low | Poor

Optimization Strategy

When doing operations in parallel, the right balance of simultaneous actions must be found that lets computers stay busy without starving the system for resources.

Finding the optimal balance:

Resource Type | Recommended Approach | Starting Point
Disk I/O | Match to disk characteristics | 2-4 concurrent operations per disk
Network I/O | Match to bandwidth and latency | 10-50 concurrent connections
CPU-bound | Match to core count | Number of CPU cores or cores - 1
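
These starting points could be encoded as a small helper, sketched below. The function name and resource labels are hypothetical, and the numbers are simply the low end of each range in the table. They are starting values to benchmark from, not tuned recommendations.

```python
import os


def suggested_workers(resource_type, disks=1):
    """Return a starting worker count; tune it by benchmarking."""
    cores = os.cpu_count() or 1
    if resource_type == "cpu":
        return max(1, cores - 1)   # leave one core for the rest of the system
    if resource_type == "disk":
        return 2 * disks           # low end of 2-4 operations per disk
    if resource_type == "network":
        return 10                  # low end of 10-50 connections
    raise ValueError(f"unknown resource type: {resource_type!r}")


if __name__ == "__main__":
    for kind in ("cpu", "disk", "network"):
        print(kind, suggested_workers(kind))
```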

Real-World Case Study: Data Migration

The Problem

A data migration project required converting data stored in one format to a different format. There were many gigabytes of data that needed migrating, making manual processing impossible.

Initial Implementation

The first version of the script was taking an average of one hour per gigabyte migrated. This was much slower than expected, prompting investigation into optimization opportunities.

Initial performance:

Metric | Value | Assessment
Processing rate | 1 hour per GB | Too slow
Approach | Sequential processing | Not utilizing available resources
Parallelism | None | Major opportunity for improvement

First Optimization: Thread-Based Parallelism

The logic was reorganized to use a separate thread per file. Because files no longer had to be processed strictly one after another, the total time to work through them decreased.
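
The migration script itself is not shown, so the following is only a sketch of the general shape such a reorganization might take. It uses a bounded thread pool from concurrent.futures rather than literally one thread per file, and convert_file is a hypothetical stand-in that upper-cases text in place of the real format conversion.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def convert_file(source: Path) -> Path:
    """Hypothetical conversion: write an upper-cased copy next to the source."""
    target = source.with_suffix(".converted")
    target.write_text(source.read_text().upper())
    return target


def migrate(paths, max_threads=8):
    """Convert files concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(convert_file, paths))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        sources = []
        for n in range(3):
            path = Path(folder) / f"record{n}.txt"
            path.write_text(f"row {n}")
            sources.append(path)
        print([p.read_text() for p in migrate(sources)])
```

Capping the pool size keeps the number of simultaneous file operations under control, which matters for the disk seek problem described earlier.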

First optimization results:

Approach | Processing Pattern | Improvement
Sequential | One file at a time | Baseline (1 hour/GB)
Threaded | Multiple files concurrently | Significant improvement

Second Optimization: Multi-Machine Distribution

To make the process even faster, the work was split onto different machines, each running a bunch of threads.

Final architecture:

Component | Implementation | Contribution
Multiple machines | Distributed processing | Linear scaling with machines
Threads per machine | Concurrent file processing | Utilized I/O wait time
Combined approach | Horizontal + vertical scaling | Maximum resource utilization

Final Results

After all this rearranging to use available resources, the processing time was brought down to three minutes per gigabyte.

Performance improvement summary:

Version | Time per GB | Improvement Factor | Total Time for 100 GB
Initial | 60 minutes | 1× (baseline) | 6,000 minutes (100 hours)
Final | 3 minutes | 20× faster | 300 minutes (5 hours)

This represents a dramatic 20× performance improvement through proper application of concurrency and parallelism techniques.
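
The figures in the summary follow directly from the two per-gigabyte rates. The short sketch below reproduces the arithmetic. The function names are illustrative only.

```python
def improvement_factor(before_minutes_per_gb, after_minutes_per_gb):
    """How many times faster the new rate is than the old one."""
    return before_minutes_per_gb / after_minutes_per_gb


def total_hours(minutes_per_gb, gigabytes):
    """Total processing time in hours for a given amount of data."""
    return minutes_per_gb * gigabytes / 60


if __name__ == "__main__":
    print(improvement_factor(60, 3))   # 60 / 3 = 20.0
    print(total_hours(60, 100))        # 6,000 minutes = 100.0 hours
    print(total_hours(3, 100))         # 300 minutes = 5.0 hours
```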


Concurrency Implementation Guidelines

Decision Framework

Question | I/O-Bound Answer | CPU-Bound Answer
What’s the bottleneck? | Waiting on I/O | CPU computation
Best parallelism approach? | Threading or AsyncIO | Multiprocessing
Optimal parallel tasks? | 10-100+ depending on I/O | Equal to CPU cores
Shared data needed? | Use threads | Use processes with IPC
Primary optimization? | Maximize concurrent I/O | Minimize context switching

Implementation Steps

Step | Action | Purpose
1 | Profile the application | Identify if I/O-bound or CPU-bound
2 | Choose parallelism strategy | Threads/AsyncIO for I/O, processes for CPU
3 | Start with conservative parallelism | 2-4× parallelism factor
4 | Benchmark performance | Measure actual improvement
5 | Incrementally increase parallelism | Find optimal balance
6 | Monitor resource utilization | Avoid over-parallelization
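
Steps 3 to 5 can be approximated with a small benchmarking loop like the sketch below, which times the same batch of work at several levels of parallelism. The task is simulated with time.sleep, and the function names, worker counts, and task count are assumptions. A real run would substitute the actual workload.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def simulated_io_task(_):
    """Stand-in for a real I/O-bound operation."""
    time.sleep(0.01)


def benchmark(worker_counts, task_count=16):
    """Time the same batch of tasks at each level of parallelism."""
    timings = {}
    for workers in worker_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(simulated_io_task, range(task_count)))
        timings[workers] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    for workers, seconds in benchmark([1, 2, 4, 8]).items():
        print(f"{workers:>2} workers: {seconds:.3f}s")
```

With a real workload, the point at which the timings stop improving, or start getting worse, marks the level of parallelism to settle on.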

Best Practices

Practice | Description | Benefit
Start simple | Use OS-level process parallelism first | Minimal code changes
Profile before optimizing | Measure actual bottlenecks | Avoid premature optimization
Test incrementally | Increase parallelism gradually | Find optimal level
Monitor resources | Track CPU, memory, I/O usage | Prevent resource starvation
Consider workload mix | Balance CPU and I/O tasks | Maximize overall throughput

Conclusion

Parallelizing operations is a powerful technique for improving script performance, particularly when dealing with I/O-bound operations where the CPU sits idle during disk reads, network transfers, or other slow operations. The operating system provides fundamental concurrency capabilities through process management, allowing easy parallelization by splitting work across multiple script invocations with different input sets.

For scenarios requiring shared data, threads enable parallel execution within a single process, though language-specific threading implementations may limit true multi-core execution. Python offers the threading and asyncio modules for I/O-bound tasks and the multiprocessing module for CPU-bound operations that need true parallel execution across multiple cores. The distinction between I/O-bound and CPU-bound operations is critical: I/O-bound scripts benefit from threading or async patterns regardless of core count, while CPU-bound scripts must be split across multiple cores to see performance gains.

However, excessive parallelization can degrade performance when disk operations spend more time seeking than reading, or when the OS spends more time context switching than performing useful work. Finding the right balance requires understanding the workload characteristics and available resources. The data migration case study demonstrated a 20× performance improvement (from 60 minutes to 3 minutes per gigabyte) by reorganizing code to use a thread per file and distributing work across multiple machines.

When implementing parallelism, the decision framework should consider whether the bottleneck is I/O or CPU, whether shared data is needed, and what level of parallelism optimizes resource utilization without overwhelming the system. Starting with simple OS-level process parallelism, profiling to identify bottlenecks, and incrementally increasing parallelism while monitoring resources provides the most reliable path to performance optimization through concurrency.


FAQ