Using Threads to Improve Performance

This document demonstrates practical implementation of threading and multiprocessing in Python to optimize image processing performance. It walks through converting a sequential thumbnail generation script to use ThreadPoolExecutor and ProcessPoolExecutor, comparing their performance characteristics and explaining the differences caused by Python's Global Interpreter Lock.


The Business Problem: E-Commerce Image Rebranding

The Scenario

A company has an e-commerce website that includes numerous images of products that are available for sale. An upcoming rebranding effort requires that all of these images be replaced with new ones.

Scope of work:

| Component | Requirement |
| --- | --- |
| Full-size images | All must be replaced |
| Thumbnail images | All must be regenerated |
| Image count | Tens of thousands |
| Timeline | As fast as possible |

Existing Infrastructure

There is an existing script that creates thumbnails based on the full-size images. But there are a lot of files to process, and the script is taking a long time to finish.

Current situation:

| Aspect | Status |
| --- | --- |
| Script functionality | Works correctly |
| Performance | Too slow for large batches |
| Processing model | Sequential (one at a time) |
| Optimization potential | High |

Baseline Performance Measurement

Test Setup

The process begins by trying out the current script as-is using a set of 1,000 test images. There are more images to convert, but it will be easier to test the speed of the script with a smaller batch.

Test configuration:

| Parameter | Value | Rationale |
| --- | --- | --- |
| Test image count | 1,000 | Representative sample |
| Total images | Tens of thousands | Full production workload |
| Testing approach | Incremental | Start small, scale up |

Measuring with the Time Command

The program is executed using the time command to see how long it takes.

    time python3 thumbnail_generator.py

Baseline results:

| Metric | Value | Interpretation |
| --- | --- | --- |
| Real time | ~2 seconds | Wall-clock time for 1,000 images |
| User time | ~1.9 seconds | CPU time in user space |
| Sys time | ~0.1 seconds | CPU time in system calls |
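
The same kind of measurement can be approximated from inside Python. The sketch below is only an illustration, not part of the original script: timed and busy_work are hypothetical names, with busy_work standing in for the real thumbnail code. time.perf_counter gives wall-clock time (comparable to real), while time.process_time gives user plus system CPU time for the current process only, so work done in child processes is not counted.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock time, comparable to "real"
    cpu_start = time.process_time()    # user + sys CPU time of this process
    result = func(*args)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu


def busy_work(n):
    """Hypothetical stand-in for the real thumbnail work."""
    return sum(i * i for i in range(n))


result, wall, cpu = timed(busy_work, 100_000)
print(f"wall={wall:.4f}s cpu={cpu:.4f}s")
```

For a single-threaded run the two numbers should be close; a large gap usually means the program spent time waiting rather than computing.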

Initial Assessment

It took about two seconds for 1,000 images. This doesn’t seem too slow, but there are tens of thousands of images that need converting, and the goal is to ensure that the process is as fast as possible.

Scaling projection:

| Image Count | Estimated Time (Sequential) | Acceptability |
| --- | --- | --- |
| 1,000 | 2 seconds | Good |
| 10,000 | 20 seconds | Acceptable |
| 50,000 | 100 seconds (~1.7 minutes) | Slow |
| 100,000 | 200 seconds (~3.3 minutes) | Too slow |

The need for optimization is clear when considering the full scale of the task.
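
The projection above is a straight linear extrapolation from the 1,000-image test. A minimal sketch of that arithmetic follows, with estimate_seconds as a hypothetical helper and the assumption that the cost per image stays constant as the batch grows:

```python
def estimate_seconds(image_count, seconds_per_batch=2.0, batch_size=1000):
    """Linear extrapolation from one measured batch (assumes constant per-image cost)."""
    return image_count / batch_size * seconds_per_batch


for count in (1_000, 10_000, 50_000, 100_000):
    seconds = estimate_seconds(count)
    print(f"{count:>7,} images: {seconds:6.0f} s (~{seconds / 60:.1f} min)")
```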


Implementing Thread-Based Parallelism

Understanding Python Threading

To speed things up by processing the images in parallel, the implementation starts by importing the futures submodule, which is part of the standard library’s concurrent package.

Python concurrent.futures module:

| Component | Purpose |
| --- | --- |
| concurrent.futures | High-level interface for parallel execution |
| ThreadPoolExecutor | Thread-based parallel execution |
| ProcessPoolExecutor | Process-based parallel execution |
| Simple API | Easy to use without complex threading code |

This gives a very simple way of using Python threads.

Creating an Executor

To run things in parallel, an executor must be created. The executor is the object in charge of distributing the work among the different workers.

Executor role:

| Responsibility | Description |
| --- | --- |
| Task scheduling | Distributes work to workers |
| Worker management | Creates and manages thread/process pool |
| Result collection | Gathers results from completed tasks |
| Resource cleanup | Shuts down workers when done |

The futures module provides a couple of different executors: one for using threads and another for using processes.
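
Both executors implement the same interface, which is what makes them interchangeable. The sketch below illustrates this with a hypothetical describe function in place of real image work. It checks that both classes derive from concurrent.futures.Executor and uses map, which returns results in the order of the inputs.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor

# Both executor types share one abstract base class, so submit(), map()
# and shutdown() are available on either of them.
assert issubclass(ThreadPoolExecutor, Executor)
assert issubclass(ProcessPoolExecutor, Executor)


def describe(name):
    """Hypothetical stand-in for the real per-image work."""
    return f"thumbnail for {name}"


with ThreadPoolExecutor(max_workers=2) as executor:
    # map() yields results in input order, regardless of completion order.
    results = list(executor.map(describe, ["a.jpg", "b.jpg", "c.jpg"]))

print(results)
```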

Initial Implementation: ThreadPoolExecutor

The implementation will start with the ThreadPoolExecutor.

Code structure:

    from concurrent import futures

    # Create executor
    executor = futures.ThreadPoolExecutor()

ThreadPoolExecutor characteristics:

| Aspect | Specification |
| --- | --- |
| Worker type | Threads |
| Shared memory | Yes (same process) |
| GIL impact | Threads wait for GIL |
| Best for | I/O-bound operations |
| Overhead | Low |

Modifying the Processing Loop

Original Sequential Code

The function that does most of the work in the loop is process_file.

Sequential pattern:

    for image_file in image_files:
        process_file(image_file)  # Called directly - sequential

Sequential execution flow:

| Step | Action | Waiting |
| --- | --- | --- |
| 1 | Process image 1 | Images 2-1000 wait |
| 2 | Process image 2 | Images 3-1000 wait |
| … | … | … |
| 1000 | Process image 1000 | None remaining |

Parallel Implementation with Executor

Instead of calling the function directly in the loop, a new task is submitted to the executor with the name of the function and its parameters.

Parallel pattern:

    from concurrent import futures

    executor = futures.ThreadPoolExecutor()

    for image_file in image_files:
        executor.submit(process_file, image_file)

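Parallel execution flow (illustrative, assuming four worker threads; in practice each task goes to whichever thread becomes free next):

| Thread | Processing | State |
| --- | --- | --- |
| Thread 1 | Image 1, 5, 9, … | Active |
| Thread 2 | Image 2, 6, 10, … | Active |
| Thread 3 | Image 3, 7, 11, … | Active |
| Thread 4 | Image 4, 8, 12, … | Active |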

Understanding Task Scheduling

The for loop now creates a bunch of tasks that are all scheduled in the executor. The executor will run them in parallel using threads.

Task lifecycle:

| Stage | Status | Description |
| --- | --- | --- |
| Submit | Queued | Task added to executor queue |
| Schedule | Assigned | Executor assigns to available thread |
| Execute | Running | Thread processes the image |
| Complete | Done | Result available (or error) |
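
Each call to submit returns a Future object that tracks the task through these stages. The following sketch makes the stages observable. It is an illustration under stated assumptions: process_file is a hypothetical stand-in that waits on a threading.Event instead of resizing an image, and the pool is limited to one worker so the second task has to queue.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

release = threading.Event()


def process_file(image_file):
    """Hypothetical stand-in: waits for a signal instead of resizing an image."""
    release.wait(timeout=5)
    return f"thumb_{image_file}"


executor = ThreadPoolExecutor(max_workers=1)
first = executor.submit(process_file, "a.jpg")   # picked up by the only worker
second = executor.submit(process_file, "b.jpg")  # queued behind the first task

# Neither task can be finished yet, because the event has not been set.
states_before = (first.done(), second.done())

release.set()                                     # let the tasks finish
results = [first.result(), second.result()]       # result() blocks until done
states_after = (first.done(), second.done())
executor.shutdown()

print(states_before, results, states_after)
```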

Handling Executor Completion

The Asynchronous Nature of Executors

An interesting thing that happens when using threads is that the loop will finish as soon as all tasks are scheduled. But it will still take a while until the tasks complete.

Timing comparison:

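| Event | Sequential | Parallel (without wait) |
| --- | --- | --- |
| Loop completion | After all processing | Immediately after scheduling |
| Actual work completion | Same as loop | Continues after loop |
| Code after the loop | Runs after all processing | Runs while tasks are still in progress |
| Script exit | After all processing | After pending tasks finish (the interpreter waits for them at exit), but without an explicit wait the script cannot tell when that is |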
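
The gap between the loop finishing and the work finishing can be measured directly. The sketch below is a hedged illustration in which process_file is a hypothetical stand-in that sleeps for 0.1 seconds. It records the elapsed time right after the submit loop and again after waiting for the pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process_file(image_file):
    """Hypothetical stand-in: sleeps instead of resizing an image."""
    time.sleep(0.1)
    return image_file


image_files = [f"img_{i}.jpg" for i in range(4)]
executor = ThreadPoolExecutor(max_workers=2)

start = time.perf_counter()
pending = [executor.submit(process_file, name) for name in image_files]
loop_elapsed = time.perf_counter() - start         # scheduling only

unfinished_after_loop = sum(not future.done() for future in pending)

executor.shutdown()                                 # wait=True by default
total_elapsed = time.perf_counter() - start         # scheduling + actual work

print(f"loop finished after {loop_elapsed:.4f} s "
      f"with {unfinished_after_loop} tasks unfinished")
print(f"all work finished after {total_elapsed:.4f} s")
```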

Adding Proper Shutdown

A message is added indicating that the script is waiting for all threads to finish, and then the shutdown function is called on the executor.

    from concurrent import futures

    executor = futures.ThreadPoolExecutor()

    for image_file in image_files:
        executor.submit(process_file, image_file)

    print("Waiting for all threads to finish...")
    executor.shutdown()

Shutdown function behavior:

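| Aspect | Behavior |
| --- | --- |
| Wait for completion | Blocks until all tasks are done (the default, wait=True) |
| New task submission | Raises RuntimeError after shutdown is called |
| Resource cleanup | Releases the worker threads |
| Required | Strongly recommended: it makes the wait explicit, so code after it runs only once all tasks are done |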

This function waits until all the workers in the pool are done, and only then shuts down the executor.
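
The sketch below illustrates the shutdown behavior described above, using a hypothetical process_file stand-in. It shows that a submit call after shutdown is rejected with RuntimeError, and that a with block provides the same waiting behavior because leaving the block calls shutdown(wait=True).

```python
from concurrent.futures import ThreadPoolExecutor


def process_file(image_file):
    """Hypothetical stand-in for the real thumbnail work."""
    return f"thumb_{image_file}"


# Form 1: explicit shutdown, as in the script above.
executor = ThreadPoolExecutor()
future = executor.submit(process_file, "a.jpg")
executor.shutdown()                 # blocks until pending tasks finish
finished_after_shutdown = future.done()

rejected = None
try:
    executor.submit(process_file, "b.jpg")
except RuntimeError as error:
    rejected = str(error)           # new work is refused after shutdown

# Form 2: a with block calls shutdown(wait=True) when it exits.
with ThreadPoolExecutor() as executor:
    future = executor.submit(process_file, "c.jpg")

print(finished_after_shutdown, rejected is not None, future.result())
```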


Thread-Based Performance Results

Running the Optimized Script

After making the change, the script is saved and tested.

    time python3 thumbnail_generator_threaded.py

Threading results:

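| Metric | Sequential | Threaded | Improvement |
| --- | --- | --- | --- |
| Real time | 2.0 seconds | 1.2 seconds | 40% less time (about 1.67× faster) |
| User time | 1.9 seconds | 2.3 seconds | Higher (expected) |
| Sys time | 0.1 seconds | 0.2 seconds | Slightly higher |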

The script now takes 1.2 seconds. That’s a nice improvement over the two seconds observed before.

Understanding User Time vs Real Time

Notice how the user time is higher than the real time when using threads.

Time metric interpretation:

| Metric | Threading Value | Meaning |
| --- | --- | --- |
| Real (wall-clock) | 1.2 seconds | Actual elapsed time |
| User | 2.3 seconds | Total CPU time across all cores |
| Sys | 0.2 seconds | Kernel CPU time |

By using multiple threads, the script is making use of the different processors available in the computer. The user time value shows the time used on all processors combined.
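
One way to read these numbers is to divide total CPU time (user plus sys) by real time, which gives the average number of cores that were busy during the run. The sketch below applies that arithmetic to the figures reported above; average_busy_cores is a hypothetical helper name.

```python
def average_busy_cores(real, user, sys_time):
    """Total CPU seconds divided by elapsed seconds."""
    return (user + sys_time) / real


# (real, user, sys) in seconds, taken from the measurements above.
measurements = {
    "sequential": (2.0, 1.9, 0.1),
    "threaded": (1.2, 2.3, 0.2),
}

for name, (real, user, sys_time) in measurements.items():
    cores = average_busy_cores(real, user, sys_time)
    print(f"{name}: about {cores:.2f} cores busy on average")
```

A value near 1 means the run was effectively single-core; the threaded run works out to roughly two cores kept busy.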


Implementing Process-Based Parallelism

Switching to ProcessPoolExecutor

What will happen if processes are used instead of threads? This can be tested by changing the executor being used.

Simple code change:

    # Before: threads
    executor = futures.ThreadPoolExecutor()

    # After: processes
    executor = futures.ProcessPoolExecutor()

By changing the executor to the ProcessPoolExecutor, the futures module is instructed to use processes instead of threads for the parallel operations.
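
Because the two classes share an interface, the choice of executor can be passed in as a parameter. The sketch below is an illustration rather than the original script: run_batch and checksum are hypothetical names, with checksum as a small CPU-bound stand-in for resizing an image. The function handed to a process pool is defined at module level so it can be pickled, and the process-based run sits under a __main__ guard, which is needed on platforms that start worker processes by importing the script.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def checksum(n):
    """Hypothetical CPU-bound stand-in for resizing one image."""
    return sum(i * i for i in range(n)) % 1_000_003


def run_batch(executor_cls, func, items):
    """Run func over items with the given executor class; results keep input order."""
    with executor_cls() as executor:
        return list(executor.map(func, items))


if __name__ == "__main__":
    workloads = [20_000] * 8
    threaded = run_batch(ThreadPoolExecutor, checksum, workloads)
    multiprocess = run_batch(ProcessPoolExecutor, checksum, workloads)
    print("same results:", threaded == multiprocess)
```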

Executor comparison:

| Aspect | ThreadPoolExecutor | ProcessPoolExecutor |
| --- | --- | --- |
| Worker type | Threads | Processes |
| Memory | Shared | Isolated |
| GIL impact | Affected | Not affected |
| Communication | Direct | IPC required |
| Startup overhead | Low | Higher |

Process-Based Performance Results

Running with Processes

After saving the change, the script is tested again.

    time python3 thumbnail_generator_multiprocess.py

Process-based results:

| Metric | Sequential | Threaded | Multiprocess | Best Improvement |
| --- | --- | --- | --- | --- |
| Real time | 2.0 s | 1.2 s | <1.0 s | 2× faster |
| User time | 1.9 s | 2.3 s | 3.5+ s | Higher still |
| Sys time | 0.1 s | 0.2 s | 0.3 s | Increased overhead |

This is now taking less than a second to finish, and the user time has gone up even more.

Why Processes Are Faster

Processes are faster here because they keep more of the CPU busy: each worker process runs independently, so the work spreads across the available cores.

CPU utilization comparison:

| Approach | CPU Utilization | Parallelism Level |
| --- | --- | --- |
| Sequential | ~25% (1 core) | None |
| Threaded | ~60-70% | Partial (GIL limited) |
| Multiprocess | ~95%+ (all cores) | Full |

Understanding the Thread vs Process Difference

Python’s Global Interpreter Lock (GIL)

The difference is caused by the way threads and processes work in Python.

The GIL mechanism:

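| Aspect | Description | Impact |
| --- | --- | --- |
| Purpose | Protects the interpreter’s internal state (such as reference counts) from concurrent modification | Thread safety of the interpreter itself |
| Mechanism | Only one thread executes Python bytecode at a time | Serialization |
| Scope | Per-process | Each process has own GIL |
| When released | During blocking I/O, and in native extension code that releases it explicitly | Allows I/O concurrency |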

Thread Synchronization Overhead

Threads share a single interpreter, so Python relies on a set of safety mechanisms to keep two threads from modifying the same data at the same time.

Thread safety mechanisms:

| Mechanism | Purpose | Cost |
| --- | --- | --- |
| GIL acquisition | Prevent concurrent bytecode execution | Context switches |
| Lock waiting | Serialize access to shared data | Waiting time |
| Variable synchronization | Ensure data consistency | Memory barriers |

This means that threads may spend a few milliseconds at a time waiting for their turn to run Python code, and those waits add up to the small difference between the two approaches.
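
The effect of the GIL is easiest to see with work that stays entirely in Python bytecode. The sketch below is an illustration under stated assumptions: count_down is a hypothetical pure-Python loop, and the job sizes are arbitrary. On a standard CPython build the threaded timing is typically no better than the sequential one, because only one thread runs bytecode at a time. Exact numbers vary by machine and interpreter build, so none are claimed here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """Pure-Python CPU-bound loop; it holds the GIL while it runs."""
    while n > 0:
        n -= 1
    return n


def time_sequential(jobs):
    """Run every job one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for n in jobs:
        count_down(n)
    return time.perf_counter() - start


def time_threaded(jobs, workers=4):
    """Run the jobs on a thread pool and return the elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(count_down, jobs))
    return time.perf_counter() - start


jobs = [200_000] * 4
print(f"sequential: {time_sequential(jobs):.3f} s")
print(f"threaded:   {time_threaded(jobs):.3f} s")
```

Image libraries often do their heavy lifting in native code, which is one reason the thumbnail script still gained from threads.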

Process Independence

Process advantages:

| Aspect | Benefit | Trade-off |
| --- | --- | --- |
| Separate GIL | Each process has own GIL | Higher memory usage |
| True parallelism | No GIL contention | Process creation overhead |
| CPU-bound tasks | Full CPU utilization | IPC complexity for shared data |

When each approach excels:

| Workload Type | Best Choice | Reason |
| --- | --- | --- |
| I/O-bound (network, disk) | ThreadPoolExecutor | GIL released during I/O, lower overhead |
| CPU-bound (computation) | ProcessPoolExecutor | No GIL contention, true parallelism |
| Mixed workload | Depends on bottleneck | Profile to determine |
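
The I/O-bound row can be illustrated with a task that only waits. In the sketch below, fake_download is a hypothetical stand-in that sleeps for 0.05 seconds; sleeping releases the GIL in the same way as waiting on disk or network, so the threads overlap their waits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(name):
    """Hypothetical I/O-bound stand-in: waiting releases the GIL."""
    time.sleep(0.05)
    return name


names = [f"img_{i}.jpg" for i in range(8)]

start = time.perf_counter()
sequential = [fake_download(name) for name in names]
sequential_elapsed = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    threaded = list(executor.map(fake_download, names))
threaded_elapsed = time.perf_counter() - start

print(f"sequential: {sequential_elapsed:.2f} s, threaded: {threaded_elapsed:.2f} s")
```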

Performance Comparison Summary

All Approaches Compared

Complete performance analysis:

| Approach | Real Time | User Time | Speedup | Best For |
| --- | --- | --- | --- | --- |
| Sequential | 2.0 s | 1.9 s | 1× (baseline) | Simple scripts |
| ThreadPoolExecutor | 1.2 s | 2.3 s | 1.67× | I/O-bound tasks |
| ProcessPoolExecutor | <1.0 s | 3.5+ s | 2+× | CPU-bound tasks |

Scaling to Full Workload

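Projected performance for the full image set (linear extrapolation; the multiprocess column assumes 0.9 s per 1,000 images, and time saved compares multiprocess with sequential):

| Image Count | Sequential | Threaded | Multiprocess | Time Saved |
| --- | --- | --- | --- | --- |
| 1,000 | 2.0 s | 1.2 s | 0.9 s | 1.1 s |
| 10,000 | 20 s | 12 s | 9 s | 11 s |
| 50,000 | 100 s | 60 s | 45 s | 55 s |
| 100,000 | 200 s | 120 s | 90 s | 110 s (~1.8 min) |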

The multiprocess approach saves significant time at scale.


Additional Optimization Opportunities

Further Improvements

There are still more improvements that can be made to the script.

Potential enhancements:

| Enhancement | Benefit | Complexity |
| --- | --- | --- |
| Check if thumbnail exists | Skip unnecessary work | Low |
| Check if thumbnail is up to date | Avoid regenerating current thumbnails | Low |
| Add progress bar during processing | Better user feedback | Moderate |
| Second progress bar while waiting | Show completion progress | Moderate |
| Batch processing | Memory efficiency for huge datasets | Moderate |
| Error handling per image | Resilience to corrupted files | Moderate |
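
The batch processing row is the least self-explanatory of these, so here is one possible reading of it, offered as a sketch rather than the script’s actual design. The image list is consumed in fixed-size batches so that only one batch of pending tasks exists at a time. The names batched, process_in_batches and the stand-in process_file are hypothetical (Python 3.12 and later also ship a similar itertools.batched).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def process_file(image_file):
    """Hypothetical stand-in for the real thumbnail work."""
    return f"thumb_{image_file}"


def process_in_batches(image_files, batch_size=1000):
    """Process files batch by batch and return how many were handled."""
    done = 0
    with ThreadPoolExecutor() as executor:
        for batch in batched(image_files, batch_size):
            # Only one batch of tasks and results is held in memory at a time.
            for _ in executor.map(process_file, batch):
                done += 1
    return done


print(process_in_batches((f"img_{i}.jpg" for i in range(25)), batch_size=10))
```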

Checking Thumbnail Status

Adding a check to see if the thumbnail exists and is up to date before doing the conversion could significantly reduce work.

Smart processing logic:

    import os

    def should_process(image_file, thumbnail_file):
        if not os.path.exists(thumbnail_file):
            return True  # Thumbnail doesn't exist

        image_mtime = os.path.getmtime(image_file)
        thumb_mtime = os.path.getmtime(thumbnail_file)

        return image_mtime > thumb_mtime  # Source newer than thumbnail
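
The freshness check can be exercised without real images. The sketch below repeats the should_process logic in compact form and runs it against empty temporary files whose modification times are set with os.utime. The file names and timestamps are arbitrary values chosen for illustration.

```python
import os
import tempfile


def should_process(image_file, thumbnail_file):
    """Return True when the thumbnail is missing or older than its source."""
    if not os.path.exists(thumbnail_file):
        return True
    return os.path.getmtime(image_file) > os.path.getmtime(thumbnail_file)


with tempfile.TemporaryDirectory() as folder:
    image = os.path.join(folder, "photo.jpg")
    thumb = os.path.join(folder, "photo_thumb.jpg")

    open(image, "wb").close()
    missing = should_process(image, thumb)    # no thumbnail yet

    open(thumb, "wb").close()
    os.utime(image, (1_000, 1_000))           # source is older ...
    os.utime(thumb, (2_000, 2_000))           # ... than the thumbnail
    current = should_process(image, thumb)

    os.utime(image, (3_000, 3_000))           # source modified afterwards
    stale = should_process(image, thumb)

print(missing, current, stale)
```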

Impact of smart checking:

| Scenario | Images to Process | Time Saved |
| --- | --- | --- |
| Fresh run (no thumbnails) | 100% | None |
| Re-run (all current) | 0% | 100% |
| Partial update (10% changed) | 10% | 90% |

Progress Feedback

Adding a progress bar while waiting for tasks to finish makes it clear that the script is doing its job.

Progress bar benefits:

| Benefit | User Impact |
| --- | --- |
| Visual feedback | Reduces perceived wait time |
| Completion estimate | Sets expectations |
| Debugging aid | Shows if script is stuck |
| Professional appearance | Improved user experience |
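
A simple text counter is enough to provide this feedback without extra dependencies. The sketch below is one possible approach, not the script’s actual implementation: run_with_progress and the stand-in process_file are hypothetical names, and as_completed is used so the counter advances as each task finishes. The example writes to an in-memory buffer so the output can be inspected; in a real script the default of sys.stderr would be used.

```python
import io
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_file(image_file):
    """Hypothetical stand-in for the real thumbnail work."""
    return f"thumb_{image_file}"


def run_with_progress(image_files, stream=sys.stderr):
    """Process all files and rewrite a 'done/total' counter as tasks finish."""
    image_files = list(image_files)
    total = len(image_files)
    results = []
    with ThreadPoolExecutor() as executor:
        pending = [executor.submit(process_file, name) for name in image_files]
        for done, future in enumerate(as_completed(pending), start=1):
            results.append(future.result())
            stream.write(f"\r{done}/{total} thumbnails done")
            stream.flush()
    stream.write("\n")
    return results


buffer = io.StringIO()
results = run_with_progress(["a.jpg", "b.jpg", "c.jpg"], stream=buffer)
print(buffer.getvalue().split("\r")[-1].strip())
```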

Implementation Best Practices

Executor Pattern Guidelines

| Practice | Recommendation | Reason |
| --- | --- | --- |
| Always use context manager | with executor: instead of manual shutdown | Automatic cleanup |
| Set max_workers explicitly | Match to available cores | Control resource usage |
| Handle exceptions | Wrap tasks in try-except | Prevent silent failures |
| Monitor memory | Especially with ProcessPoolExecutor | Avoid OOM errors |
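
The handle exceptions row matters because an exception raised inside a submitted task does not surface on its own. It is stored on the Future and raised only when result() is called, so a script that never looks at its futures fails silently. A minimal sketch, with a hypothetical process_file that rejects non-image files:

```python
from concurrent.futures import ThreadPoolExecutor


def process_file(image_file):
    """Hypothetical stand-in that fails for non-image files."""
    if image_file.endswith(".txt"):
        raise ValueError(f"not an image: {image_file}")
    return f"thumb_{image_file}"


with ThreadPoolExecutor() as executor:
    good = executor.submit(process_file, "a.jpg")
    bad = executor.submit(process_file, "notes.txt")

# Nothing has been raised so far: the error is stored on the Future.
error = bad.exception()

message = None
try:
    bad.result()                    # result() re-raises the stored exception
except ValueError as raised:
    message = str(raised)

print(good.result(), "|", type(error).__name__, "|", message)
```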

Code Example with Best Practices

    from concurrent import futures
    import os

    # get_thumbnail_path, is_thumbnail_current, create_thumbnail and image_files
    # are assumed to be defined elsewhere in the script.

    def process_file(image_file):
        """Process a single image file to create a thumbnail."""
        try:
            thumbnail_file = get_thumbnail_path(image_file)

            # Skip if already up to date
            if is_thumbnail_current(image_file, thumbnail_file):
                return f"Skipped: {image_file}"

            # Generate thumbnail
            create_thumbnail(image_file, thumbnail_file)
            return f"Processed: {image_file}"
        except Exception as e:
            # Report the error as text so one bad file does not stop the batch
            return f"Error processing {image_file}: {e}"

    # The __main__ guard keeps worker processes from re-running this block
    # on platforms that start them by importing the script.
    if __name__ == "__main__":
        # Use context manager for automatic cleanup
        with futures.ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
            # Submit all tasks
            future_to_file = {
                executor.submit(process_file, img): img
                for img in image_files
            }

            # Process results as they complete
            for future in futures.as_completed(future_to_file):
                print(future.result())

Key Takeaways

Threading vs Multiprocessing Decision Matrix

| Consider | ThreadPoolExecutor | ProcessPoolExecutor |
| --- | --- | --- |
| Task is I/O-bound | ✓ Best choice | Overkill |
| Task is CPU-bound | Limited benefit | ✓ Best choice |
| Need shared memory | ✓ Easy | Complex (requires IPC) |
| Low overhead desired | ✓ Low overhead | Higher overhead |
| Maximum CPU usage | Partial (GIL limited) | ✓ Full utilization |

Performance Optimization Workflow

| Step | Action | Goal |
| --- | --- | --- |
| 1 | Measure baseline | Know starting point |
| 2 | Identify bottleneck | CPU-bound vs I/O-bound |
| 3 | Choose parallelism strategy | Threads or processes |
| 4 | Implement executor pattern | Add concurrency |
| 5 | Measure improvement | Verify benefits |
| 6 | Add smart features | Skip unnecessary work |
| 7 | Monitor at scale | Ensure stability |

Conclusion

This hands-on example demonstrated a practical implementation of threading and multiprocessing in Python to speed up an image thumbnail generation script. The baseline sequential processing took 2 seconds for 1,000 images, which grows to minutes for the full workload of tens of thousands of images.

Implementing ThreadPoolExecutor (importing the concurrent.futures module, creating an executor, and submitting tasks instead of calling the function directly) reduced the run to 1.2 seconds, 40% less time, with the executor.shutdown() call ensuring that all tasks complete before the script continues. Switching to ProcessPoolExecutor, a one-line change, brought the run to under 1 second (more than 2× faster than baseline), showing the advantage of processes for CPU-bound tasks.

The performance difference between threads and processes stems from Python’s Global Interpreter Lock (GIL). Threads wait for their turn to execute bytecode, and those millisecond delays accumulate across many operations, while each process has its own GIL and achieves true parallelism. User time exceeding real time indicates successful multi-core utilization, since user time is the combined CPU time across all cores.

For CPU-bound tasks like image processing, ProcessPoolExecutor is the better choice, while ThreadPoolExecutor works well for I/O-bound operations where the GIL is released during I/O waits. Additional optimization opportunities include checking whether thumbnails already exist and are current before processing, adding progress bars for user feedback, implementing error handling per image, and using context managers for automatic executor cleanup.

The demonstrated executor pattern (submit tasks in a loop, then shut down to wait for completion) provides a simple yet powerful approach to parallelism that can substantially improve performance for batch processing tasks with minimal code changes.


FAQ