This document is a hands-on guide to using threading and multiprocessing in Python to speed up image processing. Using a real-world thumbnail generation scenario, it walks through converting a sequential script to use ThreadPoolExecutor and ProcessPoolExecutor, measures the performance of each, and explains the differences caused by Python's Global Interpreter Lock.
A company has an e-commerce website that includes numerous images of products that are available for sale. An upcoming rebranding effort requires that all of these images be replaced with new ones.
Scope of work:
| Component | Requirement |
|---|---|
| Full-size images | All must be replaced |
| Thumbnail images | All must be regenerated |
| Image count | Tens of thousands |
| Timeline | As fast as possible |
There is an existing script that creates thumbnails based on the full-size images. But there are a lot of files to process, and the script is taking a long time to finish.
Current situation:
| Aspect | Status |
|---|---|
| Script functionality | Works correctly |
| Performance | Too slow for large batches |
| Processing model | Sequential (one at a time) |
| Optimization potential | High |
The process begins by trying out the current script as-is using a set of 1,000 test images. There are more images to convert, but it will be easier to test the speed of the script with a smaller batch.
Test configuration:
| Parameter | Value | Rationale |
|---|---|---|
| Test image count | 1,000 | Representative sample |
| Total images | Tens of thousands | Full production workload |
| Testing approach | Incremental | Start small, scale up |
The program is executed using the time command to see how long it takes.
```bash
time python3 thumbnail_generator.py
```
Baseline results:
| Metric | Value | Interpretation |
|---|---|---|
| Real time | ~2 seconds | Wall-clock time for 1,000 images |
| User time | ~1.9 seconds | CPU time in user space |
| Sys time | ~0.1 seconds | CPU time in system calls |
It took about two seconds for 1,000 images. This doesn’t seem too slow, but there are tens of thousands of images that need converting, and the goal is to ensure that the process is as fast as possible.
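The distinction between wall-clock time and CPU time can also be observed from inside Python. The following is a minimal sketch and not part of the original script: `measure` and `busy_sum` are hypothetical helpers, with `time.perf_counter()` standing in for real time and `time.process_time()` for the user plus sys CPU time of the current process.

```python
import time


def measure(func, *args):
    """Run func(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed (real) time
    cpu_start = time.process_time()    # user + sys CPU time of this process
    result = func(*args)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu


def busy_sum(n):
    """CPU-bound stand-in for real work: sum of squares below n."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    result, wall, cpu = measure(busy_sum, 200_000)
    print(f"result={result} wall={wall:.3f}s cpu={cpu:.3f}s")
```

Note that `time.process_time()` does not include CPU time spent in child processes, so it would underreport work done by a process pool; the `time` command does not have that limitation.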
Scaling projection:
| Image Count | Estimated Time (Sequential) | Acceptability |
|---|---|---|
| 1,000 | 2 seconds | Good |
| 10,000 | 20 seconds | Acceptable |
| 50,000 | 100 seconds (~1.7 minutes) | Slow |
| 100,000 | 200 seconds (~3.3 minutes) | Too slow |
The need for optimization is clear when considering the full scale of the task.
To speed things up by processing the images in parallel, the implementation starts by importing the futures submodule, which is part of the concurrent package.
Python concurrent.futures module:
| Component | Purpose |
|---|---|
| concurrent.futures | High-level interface for parallel execution |
| ThreadPoolExecutor | Thread-based parallel execution |
| ProcessPoolExecutor | Process-based parallel execution |
| Simple API | Easy to use without complex threading code |
This gives a simple way of using threads and processes in Python.
To be able to run things in parallel, an executor must be created. This is the object in charge of distributing the work among the different workers.
Executor role:
| Responsibility | Description |
|---|---|
| Task scheduling | Distributes work to workers |
| Worker management | Creates and manages thread/process pool |
| Result collection | Gathers results from completed tasks |
| Resource cleanup | Shuts down workers when done |
The futures module provides a couple of different executors: one for using threads and another for using processes.
The implementation will start with the ThreadPoolExecutor.
Code structure:
```python
from concurrent import futures

# Create executor
executor = futures.ThreadPoolExecutor()
```
ThreadPoolExecutor characteristics:
| Aspect | Specification |
|---|---|
| Worker type | Threads |
| Shared memory | Yes (same process) |
| GIL impact | Threads wait for GIL |
| Best for | I/O-bound operations |
| Overhead | Low |
Now the function that does most of the work in the loop is process_file.
Sequential pattern:
```python
for image_file in image_files:
    process_file(image_file)  # Called directly - sequential
```
Sequential execution flow:
| Step | Action | Waiting |
|---|---|---|
| 1 | Process image 1 | Images 2-1000 wait |
| 2 | Process image 2 | Images 3-1000 wait |
| … | … | … |
| 1000 | Process image 1000 | None remaining |
Instead of calling the function directly in the loop, a new task is submitted to the executor with the name of the function and its parameters.
Parallel pattern:
```python
from concurrent import futures

executor = futures.ThreadPoolExecutor()

for image_file in image_files:
    executor.submit(process_file, image_file)
```
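Parallel execution flow (illustrative, assuming four worker threads; in practice each task goes to whichever thread is free):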
| Thread | Processing | State |
|---|---|---|
| Thread 1 | Image 1, 5, 9, … | Active |
| Thread 2 | Image 2, 6, 10, … | Active |
| Thread 3 | Image 3, 7, 11, … | Active |
| Thread 4 | Image 4, 8, 12, … | Active |
The for loop now creates tasks that are all scheduled in the executor. The executor runs them in parallel using threads.
Task lifecycle:
| Stage | Status | Description |
|---|---|---|
| Submit | Queued | Task added to executor queue |
| Schedule | Assigned | Executor assigns to available thread |
| Execute | Running | Thread processes the image |
| Complete | Done | Result available (or error) |
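The lifecycle stages above can be observed through the `Future` object that `submit()` returns. This is a minimal sketch under stated assumptions: the `process_file` shown here is a stand-in that waits on a `threading.Event` (a hypothetical device used only to hold the task in the running state), not the real thumbnail function.

```python
from concurrent import futures
import threading


def process_file(name, gate):
    """Stand-in task: wait until the gate opens, then report the file name."""
    gate.wait(timeout=5)
    return f"Processed: {name}"


gate = threading.Event()
with futures.ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(process_file, "photo.jpg", gate)
    state_before = future.done()      # False: the task is still waiting on the gate
    gate.set()                        # let the task finish
    outcome = future.result()         # blocks until the result is available
    state_after = future.done()       # True once result() has returned

print(state_before, outcome, state_after)
# → False Processed: photo.jpg True
```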
An interesting thing that happens when using threads is that the loop will finish as soon as all tasks are scheduled. But it will still take a while until the tasks complete.
Timing comparison:
| Event | Sequential | Parallel (without wait) |
|---|---|---|
| Loop completion | After all processing | Immediately after scheduling |
| Actual work completion | Same as loop | Continues after loop |
| Code after the loop | Runs after all processing | Runs while tasks are still in progress |
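The gap between loop completion and work completion can be measured directly. The sketch below is illustrative only: `slow_task` is a hypothetical stand-in that sleeps instead of processing an image, and the 0.2 second duration and four workers are arbitrary choices.

```python
from concurrent import futures
import time


def slow_task(seconds):
    """Stand-in for process_file: pretend to work by sleeping."""
    time.sleep(seconds)
    return seconds


executor = futures.ThreadPoolExecutor(max_workers=4)
start = time.perf_counter()

pending = [executor.submit(slow_task, 0.2) for _ in range(4)]
loop_done = time.perf_counter() - start      # submit() returns immediately

finished_at_loop_end = sum(f.done() for f in pending)

results = [f.result() for f in pending]      # block until each task completes
all_done = time.perf_counter() - start
executor.shutdown()

print(f"loop finished after {loop_done:.3f}s with {finished_at_loop_end} tasks done")
print(f"all tasks finished after {all_done:.3f}s")
```

The first timestamp is taken when the submitting loop ends, the second only after every future has produced its result, so the difference is the time the work kept running after the loop was over.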
A message is added indicating that the script is waiting for all threads to finish, and then the shutdown function is called on the executor.
```python
from concurrent import futures

executor = futures.ThreadPoolExecutor()

for image_file in image_files:
    executor.submit(process_file, image_file)

print("Waiting for all threads to finish...")
executor.shutdown()
```
Shutdown function behavior:
| Aspect | Behavior |
|---|---|
| Wait for completion | Blocks until all tasks done |
| New task submission | Raises RuntimeError after shutdown is called |
| Resource cleanup | Releases the worker threads |
| Why call it | Ensures all work is done before the script continues |
This function waits until all the workers in the pool are done, and only then shuts down the executor.
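Both behaviors, blocking until the work is done and rejecting new tasks afterwards, can be checked with a short experiment. This is a sketch with a hypothetical `record` task that sleeps briefly and notes its completion; it is not the thumbnail script itself.

```python
from concurrent import futures
import time

completed = []


def record(n):
    """Stand-in task: sleep briefly, then note that task n finished."""
    time.sleep(0.05)
    completed.append(n)    # list.append is thread-safe in CPython


executor = futures.ThreadPoolExecutor(max_workers=2)
for n in range(6):
    executor.submit(record, n)

executor.shutdown()                  # wait=True by default: blocks until all six finish
count_after_shutdown = len(completed)

try:
    executor.submit(record, 99)      # the pool no longer accepts work
    rejected = False
except RuntimeError:
    rejected = True

print(count_after_shutdown, rejected)
# → 6 True
```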
After making the change, the script is saved and tested.
```bash
time python3 thumbnail_generator_threaded.py
```
Threading results:
| Metric | Sequential | Threaded | Improvement |
|---|---|---|---|
| Real time | 2.0 seconds | 1.2 seconds | 40% less time (1.67× speedup) |
| User time | 1.9 seconds | 2.3 seconds | Higher (expected) |
| Sys time | 0.1 seconds | 0.2 seconds | Slightly higher |
The script now takes 1.2 seconds. That’s a nice improvement over the two seconds observed before.
Notice how the user time is higher than the real time when using threads.
Time metric interpretation:
| Metric | Threading Value | Meaning |
|---|---|---|
| Real (wall-clock) | 1.2 seconds | Actual elapsed time |
| User | 2.3 seconds | Total CPU time across all cores |
| Sys | 0.2 seconds | Kernel CPU time |
By using multiple threads, the script is making use of the different processors available in the computer. The user time value shows the time used on all processors combined.
Note
When user time exceeds real time, it indicates successful parallel execution across multiple CPU cores. The user time represents the sum of CPU time used by all threads, while real time is the actual elapsed wall-clock time.
What will happen if processes are used instead of threads? This can be tested by changing the executor being used.
Simple code change:
```python
# Before: threads
executor = futures.ThreadPoolExecutor()

# After: processes
executor = futures.ProcessPoolExecutor()
```
By changing the executor to the ProcessPoolExecutor, the futures module is instructed to use processes instead of threads for the parallel operations.
Executor comparison:
| Aspect | ThreadPoolExecutor | ProcessPoolExecutor |
|---|---|---|
| Worker type | Threads | Processes |
| Memory | Shared | Isolated |
| GIL impact | Affected | Not affected |
| Communication | Direct | IPC required |
| Startup overhead | Low | Higher |
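Because both executors share the same interface, the choice can be passed in as a parameter. The following is a minimal sketch: `run_all` and `make_label` are hypothetical names introduced for illustration, and `make_label` merely builds a string instead of generating a thumbnail.

```python
from concurrent import futures


def make_label(image_file):
    """Stand-in for process_file: derive a thumbnail name from a file name."""
    return f"thumb_{image_file}"


def run_all(executor_class, image_files, max_workers=None):
    """Run make_label over image_files with whichever executor class is given."""
    with executor_class(max_workers=max_workers) as executor:
        # map() yields results in the same order as the inputs
        return list(executor.map(make_label, image_files))


files = ["a.jpg", "b.jpg", "c.jpg"]
threaded = run_all(futures.ThreadPoolExecutor, files)
print(threaded)
# → ['thumb_a.jpg', 'thumb_b.jpg', 'thumb_c.jpg']
```

Calling `run_all(futures.ProcessPoolExecutor, files)` would use processes instead. In that case the work function must be defined at module level so it can be pickled, and on platforms that start workers with spawn (such as Windows and macOS) the call should sit under an `if __name__ == "__main__":` guard.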
After saving the change, the script is tested again.
```bash
time python3 thumbnail_generator_multiprocess.py
```
Process-based results:
| Metric | Sequential | Threaded | Multiprocess | Best Improvement |
|---|---|---|---|---|
| Real time | 2.0 s | 1.2 s | <1.0 s | 2× faster |
| User time | 1.9 s | 2.3 s | 3.5+ s | Higher still |
| Sys time | 0.1 s | 0.2 s | 0.3 s | Increased overhead |
This is now taking less than a second to finish, and the user time has gone up even more.
This is because, by using processes, even more use is being made of the CPU.
CPU utilization comparison (approximate, on a four-core machine):
| Approach | CPU Utilization | Parallelism Level |
|---|---|---|
| Sequential | ~25% (1 core) | None |
| Threaded | ~60-70% | Partial (GIL limited) |
| Multiprocess | ~95%+ (all cores) | Full |
The difference is caused by the way threads and processes work in Python.
The GIL mechanism:
| Aspect | Description | Impact |
|---|---|---|
| Purpose | Protects interpreter internals from concurrent modification | Thread safety |
| Mechanism | Only one thread executes Python bytecode at a time | Serialization |
| Scope | Per-process | Each process has own GIL |
| When released | During blocking I/O, and by C extensions that release it explicitly | Allows I/O concurrency |
Threads share memory, so Python relies on several safety mechanisms to keep two threads from modifying the same data at the same time.
Thread safety mechanisms:
| Mechanism | Purpose | Cost |
|---|---|---|
| GIL acquisition | Prevent concurrent bytecode execution | Context switches |
| Lock waiting | Serialize access to shared data | Waiting time |
| Variable synchronization | Ensure data consistency | Memory barriers |
As a result, threads can spend a few milliseconds at a time waiting for their turn, and these waits add up to the small difference between the two approaches.
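The effect of the GIL is easiest to see with work written entirely in Python. The sketch below is an assumption-laden illustration, not a benchmark from the original script: `count_primes` is a hypothetical CPU-bound function, and the limit of 20,000 and four workers are arbitrary. On a standard CPython build with the GIL enabled, the threaded run is typically no faster than the sequential one, although exact timings vary by machine.

```python
from concurrent import futures
import time


def count_primes(limit):
    """Pure-Python CPU-bound work: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed(label, func):
    """Call func(), print how long it took, and return its result."""
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result


limits = [20_000] * 4

sequential = timed("sequential", lambda: [count_primes(n) for n in limits])

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    threaded = timed("threads", lambda: list(executor.map(count_primes, limits)))
```

Code that spends most of its time inside C extensions which release the GIL can behave differently, which is one possible reason a threaded image script can still show user time above real time.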
Process advantages:
| Aspect | Benefit | Trade-off |
|---|---|---|
| Separate GIL | Each process has own GIL | Higher memory usage |
| True parallelism | No GIL contention | Process creation overhead |
| CPU-bound tasks | Full CPU utilization | IPC complexity for shared data |
When each approach excels:
| Workload Type | Best Choice | Reason |
|---|---|---|
| I/O-bound (network, disk) | ThreadPoolExecutor | GIL released during I/O, lower overhead |
| CPU-bound (computation) | ProcessPoolExecutor | No GIL contention, true parallelism |
| Mixed workload | Depends on bottleneck | Profile to determine |
Important
For CPU-bound tasks like image processing, ProcessPoolExecutor provides true parallel execution and generally better performance than ThreadPoolExecutor. For I/O-bound tasks, ThreadPoolExecutor is usually sufficient and has lower overhead.
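The I/O-bound case can be illustrated with a simulated wait. This is a sketch only: `fake_download` is a hypothetical stand-in that uses `time.sleep()` in place of a real network or disk operation, relying on the fact that a sleeping thread does not hold the GIL.

```python
from concurrent import futures
import time


def fake_download(item):
    """Simulated I/O-bound task: sleeping releases the GIL, like a blocking read."""
    time.sleep(0.1)
    return item * 2


items = list(range(8))

start = time.perf_counter()
sequential = [fake_download(i) for i in items]
sequential_time = time.perf_counter() - start

start = time.perf_counter()
with futures.ThreadPoolExecutor(max_workers=8) as executor:
    threaded = list(executor.map(fake_download, items))
threaded_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

With eight workers, the eight waits overlap, so the threaded run should take roughly as long as a single wait rather than the sum of all of them.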
Complete performance analysis:
| Approach | Real Time | User Time | Speedup | Best For |
|---|---|---|---|---|
| Sequential | 2.0 s | 1.9 s | 1× (baseline) | Simple scripts |
| ThreadPoolExecutor | 1.2 s | 2.3 s | 1.67× | I/O-bound tasks |
| ProcessPoolExecutor | <1.0 s | 3.5+ s | 2+× | CPU-bound tasks |
Projected performance for full image set:
| Image Count | Sequential | Threaded | Multiprocess | Time Saved |
|---|---|---|---|---|
| 1,000 | 2.0 s | 1.2 s | 0.9 s | 1.1 s |
| 10,000 | 20 s | 12 s | 9 s | 11 s |
| 50,000 | 100 s | 60 s | 45 s | 55 s |
| 100,000 | 200 s | 120 s | 90 s | 110 s (~2 min) |
The multiprocess approach saves significant time at scale.
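The projection and speedup figures follow from simple arithmetic on the 1,000-image measurements. The sketch below reproduces them under the assumption that processing time scales linearly with image count; the function and constant names are hypothetical, and the multiprocess figure uses the 0.9 s value from the projection table.

```python
# Measured times for the 1,000-image test batch, in seconds (from the tables above).
# The multiprocess figure uses 0.9 s, the value assumed in the projection table.
MEASURED = {"sequential": 2.0, "threaded": 1.2, "multiprocess": 0.9}
BATCH_SIZE = 1_000


def projected_seconds(approach, image_count):
    """Scale the measured batch time linearly to image_count images."""
    return MEASURED[approach] * image_count / BATCH_SIZE


def speedup(approach):
    """Speedup relative to the sequential baseline."""
    return MEASURED["sequential"] / MEASURED[approach]


for count in (1_000, 10_000, 50_000, 100_000):
    row = {name: round(projected_seconds(name, count), 1) for name in MEASURED}
    saved = round(row["sequential"] - row["multiprocess"], 1)
    print(count, row, f"saved={saved}s")

print({name: round(speedup(name), 2) for name in MEASURED})
```

Linear scaling is itself an approximation: process startup cost is paid once, so larger batches may do slightly better than projected, while disk or memory limits may make them worse.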
There are still more improvements that can be made to the script.
Potential enhancements:
| Enhancement | Benefit | Complexity |
|---|---|---|
| Check if thumbnail exists | Skip unnecessary work | Low |
| Check if thumbnail is up to date | Avoid regenerating current thumbnails | Low |
| Add progress bar during processing | Better user feedback | Moderate |
| Second progress bar while waiting | Show completion progress | Moderate |
| Batch processing | Memory efficiency for huge datasets | Moderate |
| Error handling per image | Resilience to corrupted files | Moderate |
Adding a check to see if the thumbnail exists and is up to date before doing the conversion could significantly reduce work.
Smart processing logic:
```python
import os


def should_process(image_file, thumbnail_file):
    """Return True when the thumbnail is missing or older than its source."""
    if not os.path.exists(thumbnail_file):
        return True  # Thumbnail doesn't exist

    image_mtime = os.path.getmtime(image_file)
    thumb_mtime = os.path.getmtime(thumbnail_file)

    return image_mtime > thumb_mtime  # Source newer than thumbnail
```
Impact of smart checking:
| Scenario | Images to Process | Time Saved |
|---|---|---|
| Fresh run (no thumbnails) | 100% | None |
| Re-run (all current) | 0% | 100% |
| Partial update (10% changed) | 10% | 90% |
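The three scenarios can be exercised against temporary files. This is a minimal sketch: it repeats the `should_process` rule so the block is self-contained, and `touch` is a hypothetical helper that sets modification times explicitly with `os.utime()` so the outcome does not depend on the clock.

```python
import os
import tempfile


def should_process(image_file, thumbnail_file):
    """Same rule as above: process when the thumbnail is missing or stale."""
    if not os.path.exists(thumbnail_file):
        return True
    return os.path.getmtime(image_file) > os.path.getmtime(thumbnail_file)


def touch(path, mtime):
    """Create an empty file and set its modification time explicitly."""
    with open(path, "w"):
        pass
    os.utime(path, (mtime, mtime))


with tempfile.TemporaryDirectory() as folder:
    image = os.path.join(folder, "product.jpg")
    thumb = os.path.join(folder, "product_thumb.jpg")

    touch(image, 1_000)
    missing = should_process(image, thumb)      # no thumbnail yet

    touch(thumb, 2_000)
    current = should_process(image, thumb)      # thumbnail newer than source

    touch(image, 3_000)
    stale = should_process(image, thumb)        # source changed afterwards

print(missing, current, stale)
# → True False True
```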
Adding a progress bar while waiting for tasks to finish makes it clear that the script is doing its job.
Progress bar benefits:
| Benefit | User Impact |
|---|---|
| Visual feedback | Reduces perceived wait time |
| Completion estimate | Sets expectations |
| Debugging aid | Shows if script is stuck |
| Professional appearance | Improved user experience |
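A basic progress indicator needs nothing beyond the standard library. The sketch below is one possible approach, not the original script's: it counts futures as `futures.as_completed()` yields them and rewrites a single status line. The `process_file` stand-in and the 20 file names are hypothetical.

```python
from concurrent import futures
import sys
import time


def process_file(image_file):
    """Stand-in for the real thumbnail work."""
    time.sleep(0.01)
    return image_file


image_files = [f"image_{n}.jpg" for n in range(20)]
completed = 0

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    pending = [executor.submit(process_file, name) for name in image_files]
    for future in futures.as_completed(pending):
        future.result()                  # re-raises if the task failed
        completed += 1
        percent = 100 * completed // len(image_files)
        # "\r" returns to the start of the line so the counter updates in place
        sys.stdout.write(f"\r{completed}/{len(image_files)} ({percent}%)")
        sys.stdout.flush()

sys.stdout.write("\n")
```

Updating the counter in the main thread, rather than inside the workers, avoids any need for locking around the shared count.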
Best practices:
| Practice | Recommendation | Reason |
|---|---|---|
| Always use context manager | with executor: instead of manual shutdown | Automatic cleanup |
| Set max_workers explicitly | Match to available cores | Control resource usage |
| Handle exceptions | Wrap tasks in try-except | Prevent silent failures |
| Monitor memory | Especially with ProcessPoolExecutor | Avoid OOM errors |
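The "silent failure" risk arises because an exception raised inside a task is stored on its future and only re-raised when `result()` is called. The following sketch shows one way to collect failures per file; the `process_file` stand-in and the file names, including the deliberately failing `corrupt.jpg`, are hypothetical.

```python
from concurrent import futures


def process_file(image_file):
    """Stand-in task that fails for one hypothetical corrupted file."""
    if image_file == "corrupt.jpg":
        raise ValueError(f"cannot decode {image_file}")
    return f"Processed: {image_file}"


image_files = ["a.jpg", "corrupt.jpg", "b.jpg"]
processed, failed = [], {}

with futures.ThreadPoolExecutor(max_workers=2) as executor:
    future_to_file = {executor.submit(process_file, f): f for f in image_files}
    for future in futures.as_completed(future_to_file):
        name = future_to_file[future]
        try:
            processed.append(future.result())   # re-raises the task's exception
        except ValueError as error:
            failed[name] = str(error)

print(sorted(processed), failed)
# → ['Processed: a.jpg', 'Processed: b.jpg'] {'corrupt.jpg': 'cannot decode corrupt.jpg'}
```

If `result()` were never called, the `ValueError` would go unnoticed and the run would appear to succeed.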
Complete example:
```python
from concurrent import futures
import os

# get_thumbnail_path, is_thumbnail_current, create_thumbnail and image_files
# are assumed to be defined elsewhere in the script.


def process_file(image_file):
    """Process a single image file to create a thumbnail."""
    try:
        thumbnail_file = get_thumbnail_path(image_file)

        # Skip if already up to date
        if is_thumbnail_current(image_file, thumbnail_file):
            return f"Skipped: {image_file}"

        # Generate thumbnail
        create_thumbnail(image_file, thumbnail_file)
        return f"Processed: {image_file}"
    except Exception as e:
        return f"Error processing {image_file}: {e}"


# The guard keeps worker processes from re-running this block when they
# import the module (needed where workers are started with spawn).
if __name__ == "__main__":
    # Use context manager for automatic cleanup
    with futures.ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        # Submit all tasks
        future_to_file = {
            executor.submit(process_file, img): img
            for img in image_files
        }

        # Process results as they complete
        for future in futures.as_completed(future_to_file):
            result = future.result()
            print(result)
```
Choosing an executor:
| Consider | ThreadPoolExecutor | ProcessPoolExecutor |
|---|---|---|
| Task is I/O-bound | ✓ Best choice | Overkill |
| Task is CPU-bound | Limited benefit | ✓ Best choice |
| Need shared memory | ✓ Easy | Complex (requires IPC) |
| Low overhead desired | ✓ Low overhead | Higher overhead |
| Maximum CPU usage | Partial (GIL limited) | ✓ Full utilization |
Optimization workflow:
| Step | Action | Goal |
|---|---|---|
| 1 | Measure baseline | Know starting point |
| 2 | Identify bottleneck | CPU-bound vs I/O-bound |
| 3 | Choose parallelism strategy | Threads or processes |
| 4 | Implement executor pattern | Add concurrency |
| 5 | Measure improvement | Verify benefits |
| 6 | Add smart features | Skip unnecessary work |
| 7 | Monitor at scale | Ensure stability |
This hands-on example applied threading and multiprocessing in Python to speed up an image thumbnail generation script. The baseline sequential run took 2 seconds for 1,000 images, which scales poorly to the full workload of tens of thousands of images.

Switching to ThreadPoolExecutor (importing concurrent.futures, creating an executor, and submitting tasks instead of calling the function directly) cut the run to 1.2 seconds, 40% less time, with executor.shutdown() waiting for all tasks to complete. Replacing ThreadPoolExecutor with ProcessPoolExecutor, a one-line change, brought the run to under 1 second, more than 2× faster than the baseline.

The difference between threads and processes stems from Python’s Global Interpreter Lock (GIL): threads take turns executing bytecode, and the resulting short waits accumulate across many operations, while each process has its own GIL and runs truly in parallel. User time exceeding real time indicates that several cores were in use, since user time is the combined CPU time across all cores.

For CPU-bound tasks like image processing, ProcessPoolExecutor is usually the better choice, while ThreadPoolExecutor works well for I/O-bound operations where the GIL is released during I/O waits. Further improvements include skipping thumbnails that already exist and are current, adding progress bars for user feedback, handling errors per image, and using context managers for automatic executor cleanup.

The executor pattern shown here (submit tasks in a loop, then call shutdown to wait for completion) is a simple way to add parallelism that can substantially improve batch-processing performance with minimal code changes.