Using Threads to Improve Performance

November 12, 2025 · 12 min read

This document demonstrates practical implementation of threading and multiprocessing in Python to optimize image processing performance. It walks through converting a sequential thumbnail generation script to use ThreadPoolExecutor and ProcessPoolExecutor, comparing their performance characteristics and explaining the differences caused by Python's Global Interpreter Lock.

This document provides a hands-on guide to implementing threading and multiprocessing in Python for performance optimization. Through a real-world image thumbnail generation scenario, it demonstrates converting sequential processing to parallel execution using ThreadPoolExecutor and ProcessPoolExecutor, measuring performance improvements, and understanding the differences between threads and processes in Python's execution model.

The Business Problem: E-Commerce Image Rebranding

The Scenario

A company has an e-commerce website that includes numerous images of products that are available for sale. An upcoming rebranding effort requires that all of these images be replaced with new ones.

Scope of work:

Component	Requirement
Full-size images	All must be replaced
Thumbnail images	All must be regenerated
Image count	Tens of thousands
Timeline	As fast as possible

Existing Infrastructure

There is an existing script that creates thumbnails based on the full-size images. But there are a lot of files to process, and the script is taking a long time to finish.

Current situation:

Aspect	Status
Script functionality	Works correctly
Performance	Too slow for large batches
Processing model	Sequential (one at a time)
Optimization potential	High

Baseline Performance Measurement

Test Setup

The process begins by trying out the current script as-is using a set of 1,000 test images. There are more images to convert, but it will be easier to test the speed of the script with a smaller batch.

Test configuration:

Parameter	Value	Rationale
Test image count	1,000	Representative sample
Total images	Tens of thousands	Full production workload
Testing approach	Incremental	Start small, scale up

Measuring with the Time Command

The program is executed using the time command to see how long it takes.

time python3 thumbnail_generator.py

Baseline results:

Metric	Value	Interpretation
Real time	~2 seconds	Wall-clock time for 1,000 images
User time	~1.9 seconds	CPU time in user space
Sys time	~0.1 seconds	CPU time in system calls
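The time command measures the whole process from the outside. As a rough in-script counterpart, and assuming only the standard library, time.perf_counter() approximates the real time and time.process_time() approximates user plus sys time for the current process. The busy_work function below is a hypothetical stand-in for the thumbnail loop, not part of the original script.

```python
import time


def measure(func, *args):
    """Return (result, wall_seconds, cpu_seconds) for one call to func."""
    wall_start = time.perf_counter()   # wall-clock, comparable to "real"
    cpu_start = time.process_time()    # user + sys CPU time of this process
    result = func(*args)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu


def busy_work(n):
    """Hypothetical stand-in for the real image-processing loop."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    result, wall, cpu = measure(busy_work, 200_000)
    print(f"real ~{wall:.3f}s, user+sys ~{cpu:.3f}s")
```

Note that process_time() covers only the current process, so it does not include CPU time spent in child processes; for process pools, the external time command remains the simpler tool.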

Initial Assessment

It took about two seconds for 1,000 images. This doesn’t seem too slow, but there are tens of thousands of images that need converting, and the goal is to ensure that the process is as fast as possible.

Scaling projection:

Image Count	Estimated Time (Sequential)	Acceptability
1,000	2 seconds	Good
10,000	20 seconds	Acceptable
50,000	100 seconds (~1.7 minutes)	Slow
100,000	200 seconds (~3.3 minutes)	Too slow

The need for optimization is clear when considering the full scale of the task.
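The projection above is a simple linear extrapolation from the 1,000-image sample. A minimal sketch of that arithmetic, assuming run time grows in direct proportion to the number of images (which ignores effects such as disk caching):

```python
def projected_seconds(image_count, sample_count=1_000, sample_seconds=2.0):
    """Linearly extrapolate run time from a timed sample."""
    return image_count * sample_seconds / sample_count


for count in (1_000, 10_000, 50_000, 100_000):
    print(f"{count:>7,} images -> {projected_seconds(count):.0f} s")
```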

Implementing Thread-Based Parallelism

Understanding Python Threading

To speed things up by processing the images in parallel, the implementation starts by importing the futures submodule, which is part of the concurrent package.

Python concurrent.futures module:

Component	Purpose
`concurrent.futures`	High-level interface for parallel execution
`ThreadPoolExecutor`	Thread-based parallel execution
`ProcessPoolExecutor`	Process-based parallel execution
Simple API	Easy to use without complex threading code

This gives a very simple way of using Python threads.

Creating an Executor

To run things in parallel, an executor must be created. The executor is the object in charge of distributing the work among the different workers.

Executor role:

Responsibility	Description
Task scheduling	Distributes work to workers
Worker management	Creates and manages thread/process pool
Result collection	Gathers results from completed tasks
Resource cleanup	Shuts down workers when done

The futures module provides a couple of different executors: one for using threads and another for using processes.
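As a minimal sketch of what an executor does, the snippet below submits one call and reads its result. The built-in pow is used purely as a placeholder task; it stands in for whatever function does the real work.

```python
from concurrent import futures

with futures.ThreadPoolExecutor(max_workers=2) as executor:
    # submit() schedules the call and immediately returns a Future.
    future = executor.submit(pow, 2, 10)
    # result() blocks until the call has finished, then returns its value.
    value = future.result()

print(value)  # 1024
```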

Initial Implementation: ThreadPoolExecutor

The implementation will start with the ThreadPoolExecutor.

Code structure:

from concurrent import futures

# Create executor
executor = futures.ThreadPoolExecutor()

ThreadPoolExecutor characteristics:

Aspect	Specification
Worker type	Threads
Shared memory	Yes (same process)
GIL impact	Threads wait for GIL
Best for	I/O-bound operations
Overhead	Low

Modifying the Processing Loop

Original Sequential Code

Now the function that does most of the work in the loop is process_file.

Sequential pattern:

for image_file in image_files:
    process_file(image_file)  # Called directly - sequential

Sequential execution flow:

Step	Action	Waiting
1	Process image 1	Images 2-1000 wait
2	Process image 2	Images 3-1000 wait
…	…	…
1000	Process image 1000	None remaining

Parallel Implementation with Executor

Instead of calling the function directly in the loop, a new task is submitted to the executor with the name of the function and its parameters.

Parallel pattern:

from concurrent import futures

executor = futures.ThreadPoolExecutor()

for image_file in image_files:
    executor.submit(process_file, image_file)

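Parallel execution flow (illustrative, assuming four worker threads; the executor hands each task to whichever thread is free, so the real assignment varies):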

Thread	Processing	State
Thread 1	Image 1, 5, 9, …	Active
Thread 2	Image 2, 6, 10, …	Active
Thread 3	Image 3, 7, 11, …	Active
Thread 4	Image 4, 8, 12, …	Active
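To see how tasks are actually spread across workers, each task can report the thread it ran on. This is a sketch under stated assumptions: process_file here is a hypothetical stand-in that does no image work, the file names are placeholders, and four workers are requested explicitly.

```python
import threading
from concurrent import futures


def process_file(image_file):
    """Hypothetical stand-in: report which worker thread handled the file."""
    return image_file, threading.current_thread().name


image_files = [f"image_{i}.jpg" for i in range(12)]  # placeholder names

with futures.ThreadPoolExecutor(max_workers=4, thread_name_prefix="worker") as executor:
    # map() returns results in the same order as the inputs.
    assignments = list(executor.map(process_file, image_files))

for image_file, worker in assignments:
    print(f"{image_file} -> {worker}")
```

The mapping printed will usually differ between runs, because a task goes to whichever worker becomes free first.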

Understanding Task Scheduling

The for loop now creates a bunch of tasks that are all scheduled in the executor. The executor will run them in parallel using threads.

Task lifecycle:

Stage	Status	Description
Submit	Queued	Task added to executor queue
Schedule	Assigned	Executor assigns to available thread
Execute	Running	Thread processes the image
Complete	Done	Result available (or error)
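The lifecycle can be observed through the Future object that submit() returns. In this sketch a threading.Event holds the task in the running stage so that its state can be checked before and after completion; wait_then_finish is a hypothetical task, not part of the thumbnail script.

```python
import threading
from concurrent import futures

release = threading.Event()


def wait_then_finish():
    """Block until the main thread sets the event, then return a value."""
    release.wait()
    return "thumbnail written"


with futures.ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(wait_then_finish)
    done_before = future.done()      # False: the task is blocked on the event
    release.set()                    # let the task finish
    outcome = future.result()        # blocks until the task completes
    done_after = future.done()       # True: the result is available

print(done_before, outcome, done_after)
```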

Handling Executor Completion

The Asynchronous Nature of Executors

An interesting thing that happens when using threads is that the loop will finish as soon as all tasks are scheduled. But it will still take a while until the tasks complete.

Timing comparison:

Event	Sequential	Parallel (without wait)
Loop completion	After all processing	Immediately after scheduling
Actual work completion	Same as loop	Continues after loop
Code after the loop	Runs once all images are done	Runs while images are still being processed
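The gap between scheduling and completion can be measured directly. A minimal sketch, using time.sleep as a stand-in for the real work (slow_task is hypothetical):

```python
import time
from concurrent import futures


def slow_task(seconds):
    """Hypothetical stand-in for process_file: just sleeps."""
    time.sleep(seconds)
    return seconds


start = time.perf_counter()
executor = futures.ThreadPoolExecutor(max_workers=4)
pending = [executor.submit(slow_task, 0.2) for _ in range(4)]
after_loop = time.perf_counter() - start      # only scheduling has happened

executor.shutdown(wait=True)                  # block until every task is done
after_shutdown = time.perf_counter() - start

print(f"loop finished after {after_loop:.3f}s, work finished after {after_shutdown:.3f}s")
```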

Adding Proper Shutdown

A message is added indicating that the script is waiting for all threads to finish, and then the shutdown function is called on the executor.

from concurrent import futures

executor = futures.ThreadPoolExecutor()

for image_file in image_files:
    executor.submit(process_file, image_file)

print("Waiting for all threads to finish...")
executor.shutdown()

Shutdown function behavior:

Aspect	Behavior
Wait for completion	Blocks until all tasks done
New task submission	Raises error after shutdown called
Resource cleanup	Closes threads properly
Why call it	Guarantees all work is finished before the next statement runs

This function waits until all the workers in the pool are done, and only then shuts down the executor.
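A small sketch of both behaviors: shutdown() returning only after pending work has finished, and later submissions being rejected. The built-in sum is just a placeholder task.

```python
from concurrent import futures

executor = futures.ThreadPoolExecutor(max_workers=2)
future = executor.submit(sum, [1, 2, 3])

executor.shutdown(wait=True)   # returns only after pending tasks have finished
finished = future.done()       # True once shutdown(wait=True) has returned

try:
    executor.submit(sum, [4, 5, 6])
    rejected = False
except RuntimeError:
    # Submitting to an executor that has been shut down is an error.
    rejected = True

print(finished, future.result(), rejected)
```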

Thread-Based Performance Results

Running the Optimized Script

After making the change, the script is saved and tested.

time python3 thumbnail_generator_threaded.py

Threading results:

Metric	Sequential	Threaded	Improvement
Real time	2.0 seconds	1.2 seconds	40% faster
User time	1.9 seconds	2.3 seconds	Higher (expected)
Sys time	0.1 seconds	0.2 seconds	Slightly higher

The script now takes 1.2 seconds. That’s a nice improvement over the two seconds observed before.

Understanding User Time vs Real Time

Notice how the user time is higher than the real time when using threads.

Time metric interpretation:

Metric	Threading Value	Meaning
Real (wall-clock)	1.2 seconds	Actual elapsed time
User	2.3 seconds	Total CPU time across all cores
Sys	0.2 seconds	Kernel CPU time

By using multiple threads, the script is making use of the different processors available in the computer. The user time value shows the time used on all processors combined.

Note
When user time exceeds real time, it indicates successful parallel execution across multiple CPU cores. The user time represents the sum of CPU time used by all threads, while real time is the actual elapsed wall-clock time.
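One way to read these numbers is to divide total CPU time (user plus sys) by real time, which gives the average number of cores kept busy. A short worked calculation using the figures reported above; the function name is an illustrative choice:

```python
def average_cores_busy(real, user, sys_time):
    """CPU seconds consumed per second of wall-clock time."""
    return (user + sys_time) / real


sequential = average_cores_busy(real=2.0, user=1.9, sys_time=0.1)
threaded = average_cores_busy(real=1.2, user=2.3, sys_time=0.2)

print(f"sequential: {sequential:.2f} cores, threaded: {threaded:.2f} cores")
```

By this measure the sequential run keeps about one core busy, while the threaded run averages a little over two.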

Implementing Process-Based Parallelism

Switching to ProcessPoolExecutor

What will happen if processes are used instead of threads? This can be tested by changing the executor being used.

Simple code change:

# Before: threads
executor = futures.ThreadPoolExecutor()

# After: processes
executor = futures.ProcessPoolExecutor()

By changing the executor to the ProcessPoolExecutor, the futures module is instructed to use processes instead of threads for the parallel operations.

Executor comparison:

Aspect	ThreadPoolExecutor	ProcessPoolExecutor
Worker type	Threads	Processes
Memory	Shared	Isolated
GIL impact	Affected	Not affected
Communication	Direct	IPC required
Startup overhead	Low	Higher
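Because both executors share the same interface, the choice can be passed in as a parameter. The sketch below assumes a hypothetical run_all helper and uses math.factorial as a stand-in for process_file; functions sent to a process pool must be picklable, which module-level functions are.

```python
import math
from concurrent import futures


def run_all(executor_cls, func, items):
    """Run func over items with the given executor class; results keep input order."""
    with executor_cls() as executor:
        return list(executor.map(func, items))


if __name__ == "__main__":
    # The guard matters for process pools on platforms that start workers
    # by importing this module again.
    numbers = [5, 10, 15]
    print(run_all(futures.ThreadPoolExecutor, math.factorial, numbers))
    print(run_all(futures.ProcessPoolExecutor, math.factorial, numbers))
```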

Process-Based Performance Results

Running with Processes

After saving the change, the script is tested again.

time python3 thumbnail_generator_multiprocess.py

Process-based results:

Metric	Sequential	Threaded	Multiprocess	Best Improvement
Real time	2.0 s	1.2 s	<1.0 s	2× faster
User time	1.9 s	2.3 s	3.5+ s	Higher still
Sys time	0.1 s	0.2 s	0.3 s	Increased overhead

This is now taking less than a second to finish, and the user time has gone up even more.

Why Processes Are Faster

Processes are faster here because they keep more of the CPU busy: each worker process can run on its own core without waiting for the others.

CPU utilization comparison (illustrative figures, assuming a four-core machine):

Approach	CPU Utilization	Parallelism Level
Sequential	~25% (1 core)	None
Threaded	~60-70%	Partial (GIL limited)
Multiprocess	~95%+ (all cores)	Full

Understanding the Thread vs Process Difference

Python’s Global Interpreter Lock (GIL)

The difference is caused by the way threads and processes work in Python.

The GIL mechanism:

Aspect	Description	Impact
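Purpose	Protects interpreter internals (such as reference counts) from concurrent modification	Thread safety of the interpreter itself
Mechanism	Only one thread executes Python bytecode at a time	Serialization
Scope	Per-process	Each process has own GIL
When released	During blocking I/O, and inside C extensions that release it explicitly	Allows I/O concurrency and some parallel native work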

Thread Synchronization Overhead

Threads share memory, so Python relies on several safety mechanisms to keep two threads from writing to the same variable at the same time.

Thread safety mechanisms:

Mechanism	Purpose	Cost
GIL acquisition	Prevent concurrent bytecode execution	Context switches
Lock waiting	Serialize access to shared data	Waiting time
Variable synchronization	Ensure data consistency	Memory barriers

As a result, threads may spend a few milliseconds at a time waiting for their turn to write to variables, and those waits add up to the small difference between the two approaches.
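The same kind of waiting appears in application code whenever threads share data. As an illustration only (the Counter class is hypothetical and not part of the thumbnail script), a threading.Lock serializes updates to a shared value, so each thread has to wait its turn:

```python
import threading
from concurrent import futures


class Counter:
    """A shared counter whose updates are serialized by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:        # only one thread may update at a time
            self.value += 1


def add_many(counter, times):
    """Increment the shared counter the given number of times."""
    for _ in range(times):
        counter.increment()


counter = Counter()
with futures.ThreadPoolExecutor(max_workers=4) as executor:
    for _ in range(4):
        executor.submit(add_many, counter, 10_000)

print(counter.value)  # 40000
```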

Process Independence

Process advantages:

Aspect	Benefit	Trade-off
Separate GIL	Each process has own GIL	Higher memory usage
True parallelism	No GIL contention	Process creation overhead
CPU-bound tasks	Full CPU utilization	IPC complexity for shared data

When each approach excels:

Workload Type	Best Choice	Reason
I/O-bound (network, disk)	ThreadPoolExecutor	GIL released during I/O, lower overhead
CPU-bound (computation)	ProcessPoolExecutor	No GIL contention, true parallelism
Mixed workload	Depends on bottleneck	Profile to determine

Important
For CPU-bound tasks like image processing, ProcessPoolExecutor provides true parallel execution and significantly better performance than ThreadPoolExecutor. For I/O-bound tasks, ThreadPoolExecutor is usually sufficient and has lower overhead.
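The I/O-bound case can be sketched with time.sleep, which releases the GIL while waiting, much like a blocking network or disk read. The fake_download function and the delay values are assumptions chosen for illustration.

```python
import time
from concurrent import futures


def fake_download(seconds):
    """Hypothetical I/O-bound task: sleeping releases the GIL while it waits."""
    time.sleep(seconds)
    return seconds


delays = [0.1] * 8

start = time.perf_counter()
with futures.ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(fake_download, delays))
elapsed = time.perf_counter() - start

print(f"8 waits of 0.1s finished in {elapsed:.2f}s "
      f"(about {sum(delays):.1f}s if run one by one)")
```

With eight threads the waits overlap, so the total is close to a single wait rather than the sum of all of them.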

Performance Comparison Summary

All Approaches Compared

Complete performance analysis:

Approach	Real Time	User Time	Speedup	Best For
Sequential	2.0 s	1.9 s	1× (baseline)	Simple scripts
ThreadPoolExecutor	1.2 s	2.3 s	1.67×	I/O-bound tasks
ProcessPoolExecutor	<1.0 s	3.5+ s	2+×	CPU-bound tasks
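The speedup column and the "40% faster" figure quoted earlier are two views of the same measurement. A short worked calculation, with helper names chosen for illustration:

```python
def speedup(baseline_seconds, new_seconds):
    """How many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


def percent_time_saved(baseline_seconds, new_seconds):
    """Share of the baseline wall-clock time that was eliminated."""
    return 100 * (baseline_seconds - new_seconds) / baseline_seconds


print(f"threads: {speedup(2.0, 1.2):.2f}x, "
      f"{percent_time_saved(2.0, 1.2):.0f}% less time")
```

For the threaded run, 2.0 s down to 1.2 s is a 1.67× speedup, which corresponds to 40% less wall-clock time.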

Scaling to Full Workload

Projected performance for full image set:

Image Count	Sequential	Threaded	Multiprocess	Time Saved
1,000	2.0 s	1.2 s	0.9 s	1.1 s
10,000	20 s	12 s	9 s	11 s
50,000	100 s	60 s	45 s	55 s
100,000	200 s	120 s	90 s	110 s (~2 min)

The multiprocess approach saves significant time at scale.

Additional Optimization Opportunities

Further Improvements

There are still more improvements that can be made to the script.

Potential enhancements:

Enhancement	Benefit	Complexity
Check if thumbnail exists	Skip unnecessary work	Low
Check if thumbnail is up to date	Avoid regenerating current thumbnails	Low
Add progress bar during processing	Better user feedback	Moderate
Second progress bar while waiting	Show completion progress	Moderate
Batch processing	Memory efficiency for huge datasets	Moderate
Error handling per image	Resilience to corrupted files	Moderate

Checking Thumbnail Status

Adding a check to see if the thumbnail exists and is up to date before doing the conversion could significantly reduce work.

Smart processing logic:

import os

def should_process(image_file, thumbnail_file):
    """Return True if the thumbnail is missing or older than its source image."""
    if not os.path.exists(thumbnail_file):
        return True  # Thumbnail doesn't exist

    image_mtime = os.path.getmtime(image_file)
    thumb_mtime = os.path.getmtime(thumbnail_file)

    return image_mtime > thumb_mtime  # Source newer than thumbnail
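One possible way to apply such a check is to filter the work list before anything is submitted, so that up-to-date files never occupy a worker. The sketch below is an assumption about how this could be wired up: needs_update and pending_pairs are hypothetical names, and the demo uses temporary files with explicit modification times.

```python
import os
import tempfile
from pathlib import Path


def needs_update(source: Path, thumbnail: Path) -> bool:
    """True when the thumbnail is missing or older than its source image."""
    if not thumbnail.exists():
        return True
    return source.stat().st_mtime > thumbnail.stat().st_mtime


def pending_pairs(pairs):
    """Keep only the (source, thumbnail) pairs that still need work."""
    return [(src, thumb) for src, thumb in pairs if needs_update(src, thumb)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src, thumb = Path(tmp, "a.jpg"), Path(tmp, "a_thumb.jpg")
        src.write_bytes(b"source")
        thumb.write_bytes(b"thumb")
        os.utime(src, (1_000, 1_000))          # source older than thumbnail
        os.utime(thumb, (2_000, 2_000))
        print(len(pending_pairs([(src, thumb)])))   # thumbnail is current
        os.utime(src, (3_000, 3_000))          # source modified afterwards
        print(len(pending_pairs([(src, thumb)])))   # now needs regenerating
```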

Impact of smart checking:

Scenario	Images to Process	Time Saved
Fresh run (no thumbnails)	100%	None
Re-run (all current)	0%	100%
Partial update (10% changed)	10%	90%

Progress Feedback

Adding a progress bar while waiting for tasks to finish makes it clear that the script is doing its job.

Progress bar benefits:

Benefit	User Impact
Visual feedback	Reduces perceived wait time
Completion estimate	Sets expectations
Debugging aid	Shows if script is stuck
Professional appearance	Improved user experience
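A plain-text progress counter can be built with the standard library alone by counting futures as they complete. This is a sketch rather than a full progress bar: run_with_progress and print_progress are hypothetical helpers, and abs stands in for the real processing function.

```python
from concurrent import futures


def run_with_progress(func, items, report):
    """Run func over items in threads, calling report(done, total) per completion."""
    results = []
    with futures.ThreadPoolExecutor() as executor:
        pending = [executor.submit(func, item) for item in items]
        # as_completed() yields each future as soon as it finishes.
        for done, future in enumerate(futures.as_completed(pending), start=1):
            results.append(future.result())
            report(done, len(pending))
    return results


def print_progress(done, total):
    """Plain-text progress line; a progress-bar library could replace this."""
    print(f"\rProcessed {done}/{total}", end="", flush=True)


if __name__ == "__main__":
    run_with_progress(abs, range(-50, 0), print_progress)
    print()
```

Results are collected in completion order, which may differ from the order of the inputs.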

Implementation Best Practices

Executor Pattern Guidelines

Practice	Recommendation	Reason
Always use context manager	`with executor:` instead of manual shutdown	Automatic cleanup
Set max_workers explicitly	Match to available cores	Control resource usage
Handle exceptions	Wrap tasks in try-except	Prevent silent failures
Monitor memory	Especially with ProcessPoolExecutor	Avoid OOM errors

Code Example with Best Practices

from concurrent import futures
import os

# get_thumbnail_path, is_thumbnail_current, create_thumbnail and image_files
# are placeholders assumed to be defined elsewhere in the script.

def process_file(image_file):
    """Process a single image file to create its thumbnail."""
    try:
        thumbnail_file = get_thumbnail_path(image_file)

        # Skip if already up to date
        if is_thumbnail_current(image_file, thumbnail_file):
            return f"Skipped: {image_file}"

        # Generate thumbnail
        create_thumbnail(image_file, thumbnail_file)
        return f"Processed: {image_file}"
    except Exception as e:
        # Report the failure instead of letting one bad file stop the batch
        return f"Error processing {image_file}: {e}"

# The __main__ guard keeps worker processes from re-running this block on
# platforms that start workers by importing the script.
if __name__ == "__main__":
    # Use context manager for automatic cleanup
    with futures.ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        # Submit all tasks
        future_to_file = {
            executor.submit(process_file, img): img
            for img in image_files
        }

        # Process results as they complete
        for future in futures.as_completed(future_to_file):
            print(future.result())
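The example above converts errors to strings inside the worker. An alternative, sketched here under the assumption that failures should be collected separately from successes, is to let the task raise and inspect each future with exception(). The process_file body and the file names are hypothetical stand-ins.

```python
from concurrent import futures


def process_file(image_file):
    """Hypothetical stand-in that fails for files marked as corrupt."""
    if "corrupt" in image_file:
        raise ValueError(f"cannot decode {image_file}")
    return f"Processed: {image_file}"


image_files = ["a.jpg", "corrupt_b.jpg", "c.jpg"]  # placeholder names
succeeded, failed = [], {}

with futures.ThreadPoolExecutor() as executor:
    future_to_file = {executor.submit(process_file, f): f for f in image_files}
    for future in futures.as_completed(future_to_file):
        error = future.exception()          # None if the task succeeded
        if error is None:
            succeeded.append(future.result())
        else:
            failed[future_to_file[future]] = error

print(sorted(succeeded), failed)
```

Keeping the exception object, rather than a formatted string, preserves its type and traceback for later logging or retry logic.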

Key Takeaways

Threading vs Multiprocessing Decision Matrix

Consider	ThreadPoolExecutor	ProcessPoolExecutor
Task is I/O-bound	✓ Best choice	Overkill
Task is CPU-bound	Limited benefit	✓ Best choice
Need shared memory	✓ Easy	Complex (requires IPC)
Low overhead desired	✓ Low overhead	Higher overhead
Maximum CPU usage	Partial (GIL limited)	✓ Full utilization

Performance Optimization Workflow

Step	Action	Goal
1	Measure baseline	Know starting point
2	Identify bottleneck	CPU-bound vs I/O-bound
3	Choose parallelism strategy	Threads or processes
4	Implement executor pattern	Add concurrency
5	Measure improvement	Verify benefits
6	Add smart features	Skip unnecessary work
7	Monitor at scale	Ensure stability

Conclusion

This hands-on example demonstrated a practical implementation of threading and multiprocessing in Python to speed up an image thumbnail generation script. The baseline sequential script took 2 seconds for 1,000 images, which would scale poorly to the full workload of tens of thousands of images.

Switching to ThreadPoolExecutor (importing the concurrent.futures module, creating an executor, and submitting tasks instead of calling the function directly) brought the run down to 1.2 seconds, 40% less wall-clock time, with the executor.shutdown() call ensuring that all tasks have completed before the script moves on. Switching to ProcessPoolExecutor with a single-line change brought it under 1 second, more than 2× faster than the baseline, showing the advantage of processes for CPU-bound tasks.

The performance difference between threads and processes stems from Python’s Global Interpreter Lock (GIL), which makes threads wait for their turn to execute bytecode. Those millisecond delays accumulate across many operations, while processes each have their own GIL and achieve true parallelism. User time exceeding real time indicates successful multi-core utilization, because user time is the combined CPU time across all cores.

For CPU-bound tasks like image processing, ProcessPoolExecutor is the better choice, while ThreadPoolExecutor works well for I/O-bound operations where the GIL is released during I/O waits. Further optimization opportunities include checking whether thumbnails already exist and are current before processing, adding progress bars for user feedback, handling errors per image, and using context managers for automatic executor cleanup.

The demonstrated executor pattern (submit tasks in a loop, then shut down to wait for completion) provides a simple yet powerful approach to parallelism that can dramatically improve performance for batch processing tasks with minimal code changes.
