This document examines disk space as a critical system resource. It covers how programs consume storage (binaries, libraries, data, caches, logs, temporary files, and backups), how to diagnose space usage patterns, how to handle deleted-but-open files, and how to prevent disk exhaustion, which degrades performance and can cause application crashes and data loss.
Another resource that might need attention is the disk usage of computers. Programs may need disk space for lots of different reasons.
Common disk space consumers:
| Type | Purpose | Growth Pattern | Cleanup Frequency |
|---|---|---|---|
| Installed binaries | Application executables | Stable | On uninstall |
| Libraries | Shared code dependencies | Stable | On uninstall |
| Application data | User and system data | Growing | User-driven |
| Cache information | Performance optimization | Growing/stable | Periodic |
| Logs | System and application events | Continuously growing | Rotation-based |
| Temporary files | Intermediate processing | Varies | Should be automatic |
| Backups | Data redundancy | Growing | Retention policy |
If a computer is running out of space, it may simply be that too much data is being stored in too little space.
Space exhaustion scenarios:
| Scenario | Cause | Likelihood | Solution Type |
|---|---|---|---|
| Legitimate growth | Too many applications or large files | Common | Add storage capacity |
| Program misbehavior | Temporary files not cleaned | Very common | Fix cleanup logic |
| Log overflow | Excessive logging without rotation | Common | Configure log rotation |
| Cache accumulation | No cache eviction policy | Moderate | Implement cache limits |
| Backup retention | Old backups never deleted | Moderate | Set retention policy |
Maybe there are too many applications installed, or too many large files stored on the drive.
But it’s also possible that programs are misusing the space allotted to them, like by keeping temporary files or caching information that doesn’t get cleaned up quickly enough or at all.
Misuse patterns:
| Misuse Type | Behavior | Impact Timeline | Detection Method |
|---|---|---|---|
| Temporary file retention | Never deleting temp files | Days to weeks | Directory size monitoring |
| Unbounded caching | Cache grows indefinitely | Weeks to months | Cache directory analysis |
| Excessive logging | High-frequency log writes | Hours to days | Log file growth rate |
| Failed cleanup | Crash prevents deletion | Per crash | Orphaned file detection |
It’s common for the overall performance of the system to decrease as the available disk space gets smaller.
Performance degradation stages:
| Disk Usage | Available Space | Performance Impact | User Experience |
|---|---|---|---|
| 0-50% | Plenty free | Normal | Fast operations |
| 50-80% | Moderate free | Slight slowdown | Barely noticeable |
| 80-95% | Low free | Noticeable slowdown | Delays apparent |
| 95-100% | Critical/none | Severe degradation | Very slow/crashes |
Data starts getting fragmented across the disk, and operations become slower.
Fragmentation effects:
| Aspect | Unfragmented Disk | Fragmented Disk | Impact |
|---|---|---|---|
| File location | Contiguous blocks | Scattered blocks | Read time increases |
| Seek operations | Minimal | Many | Head movement delays |
| Write efficiency | Sequential | Random | Slower writes |
| Free space | Large contiguous blocks | Many small gaps | Allocation overhead |
Fragmentation performance comparison:
| Operation | Contiguous File | Fragmented File (10 pieces) | Slowdown Factor |
|---|---|---|---|
| Sequential read | 1 seek + read | 10 seeks + reads | 5-10× slower |
| Random access | Direct access | Multiple seeks | 3-5× slower |
| File opening | Fast | Slow (map fragments) | 2-4× slower |
When a hard drive is full, programs may suddenly crash when they try to write something to disk and find out that they can’t.
Write failure scenarios:
| Operation | Expected Behavior | Full Disk Behavior | Result |
|---|---|---|---|
| Log write | Append to file | Write fails | Application crash |
| Save document | Update file | No space error | Work lost |
| Create temp file | Allocate space | Allocation fails | Process terminates |
| Database commit | Write transaction | Commit fails | Data inconsistency |
A full hard drive might even lead to data loss, as some programs might truncate a file before writing an updated version of it, and then fail to write the new content, losing all the data that was stored in it before.
Data loss patterns:
| Update Pattern | Step 1 | Step 2 | Full Disk Result |
|---|---|---|---|
| Safe (atomic) | Write to temp file | Rename over original | Temp write fails, original intact |
| Unsafe (truncate) | Truncate original | Write new data | Truncate succeeds, write fails, data lost |
| In-place | Seek to position | Overwrite data | Partial write, corrupted file |
Warning
A full disk can cause catastrophic data loss when applications truncate files before writing updates. If the truncation succeeds but the write fails due to no space, all original data is permanently lost. Always monitor disk space to prevent reaching this critical state.
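The safe (atomic) pattern from the table above can be sketched in Python as follows. This is a minimal sketch, assuming a POSIX filesystem where a rename within one directory is atomic; `atomic_write` is a hypothetical helper name, not part of any library.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the contents of path with data without ever truncating it.

    Hypothetical helper for illustration. The new content goes to a temporary
    file in the same directory and is renamed over the original only after
    the write has fully succeeded.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp_")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        # The write failed (for example, the disk is full): drop the partial
        # temp file and leave the original untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the disk fills up during the write, the exception propagates and the original file still holds its previous content. One trade-off of this sketch is that the replaced file takes on the permissions of the temporary file, so a real implementation may need to copy the original mode first.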
If it gets to this point, errors like “No space left on device” will probably appear when running applications or in the logs.
Common disk full errors:
| Error Message | Context | Severity |
|---|---|---|
| “No space left on device” | Linux/Unix systems | Critical |
| “Disk full” | General error | Critical |
| “ENOSPC: no space left on device” | Node.js/JavaScript | Critical |
| “OSError: [Errno 28] No space left on device” | Python | Critical |
| “Insufficient disk space” | Windows | Critical |
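An application can recognize this condition instead of crashing on it. The following is a minimal sketch, assuming Python 3, where a full disk surfaces as an `OSError` whose `errno` is `ENOSPC` (28); `append_line` is a hypothetical helper name.

```python
import errno


def append_line(path, line):
    """Append a line to path; return False instead of crashing on a full disk.

    Hypothetical helper for illustration only.
    """
    try:
        with open(path, "a") as f:
            f.write(line + "\n")
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            # Errno 28: "No space left on device"
            print(f"Cannot write to {path}: disk is full")
            return False
        raise  # any other I/O error is still unexpected
    return True
```

The write is kept inside the `with` block so that an error raised while flushing on close is caught as well. What the caller does with a `False` result (retry later, alert, drop the message) is a design decision this sketch leaves open.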
So what should be done if a computer runs out of disk space? If it’s a user machine, it might be easily fixed by uninstalling applications that aren’t used, or cleaning up old data that isn’t needed anymore.
User machine cleanup approaches:
| Cleanup Type | Target | Impact | Difficulty |
|---|---|---|---|
| Uninstall apps | Unused applications | High (GB) | Easy |
| Delete downloads | Old download files | Moderate (GB) | Easy |
| Clear caches | Browser, app caches | Moderate (MB-GB) | Easy |
| Remove duplicates | Duplicate files | Varies | Moderate |
| Archive old files | Old documents, photos | High (GB) | Moderate |
But if it’s a server, a closer look at what’s going on might be needed. Is the issue that an extra drive needs to be added to the server to have more available space, or is it that some application is misbehaving and filling the disk with useless data?
Server diagnostic questions:
| Question | Indicates | Action Required |
|---|---|---|
| Is growth expected? | Legitimate data increase | Add storage capacity |
| Is one directory dominant? | Concentrated issue | Investigate specific application |
| Are files temporary/logs? | Cleanup problem | Fix cleanup processes |
| Is growth rate abnormal? | Application misbehavior | Debug application |
| Are backups accumulating? | Retention issue | Adjust backup policy |
To figure this out, examine how the space is being used and which directories take up the most space, then drill down until it is clear whether the large chunks of space hold valid information or files that should be purged.
Analysis workflow:
| Step | Command Example | Purpose | Output |
|---|---|---|---|
| 1. Top-level overview | `df -h` | Show filesystem usage | Total/used/available per mount |
| 2. Directory breakdown | `du -sh /*` | Identify large directories | Size of top-level dirs |
| 3. Drill down | `du -sh /var/*` | Investigate suspect dir | Subdirectory sizes |
| 4. Find large files | `find / -size +1G` | Locate specific culprits | Files over threshold |
| 5. Sort by size | `du -h \| sort -rh \| head -20` | Rank consumers | Top 20 space users |
Common disk usage commands:

    # Check overall disk usage
    df -h

    # Find directories using most space
    du -sh /* | sort -rh | head -10

    # Find large files over 100MB
    find / -type f -size +100M -exec ls -lh {} \;

    # Check disk usage by directory, sorted
    du -h /var | sort -rh | head -20

    # Find files modified in last 7 days
    find /var/log -mtime -7 -type f -exec du -sh {} \; | sort -rh
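The same drill-down can be scripted when the results need to feed a report or an alert. This is a rough sketch of a `du`-style ranking in Python, not a replacement for `du`: it sums apparent file sizes (`st_size`) rather than allocated blocks, so its totals can differ from what `du` reports. `directory_sizes` is a hypothetical helper name.

```python
import os


def directory_sizes(root):
    """Return (size_in_bytes, path) for each immediate child of root, largest first."""
    results = []
    for entry in os.scandir(root):
        total = 0
        if entry.is_dir(follow_symlinks=False):
            # os.walk does not follow symlinks by default
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        pass  # file vanished or is unreadable; skip it
        else:
            try:
                total = entry.stat(follow_symlinks=False).st_size
            except OSError:
                pass
        results.append((total, entry.path))
    return sorted(results, reverse=True)
```

Calling `directory_sizes("/var")` and printing the first few entries gives roughly the same ranking as `du -sh /var/* | sort -rh`, and the function can be called again on the largest entry to drill down further.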
For example, on a database server, it’s expected that the bulk of the disk space is going to be used by the data stored in the database. On a mail server, it’s going to be the mailboxes of the users of that service.
Expected space usage by server type:
| Server Type | Expected Primary Consumer | Typical Size | Anomaly Threshold |
|---|---|---|---|
| Database | Database files (/var/lib/mysql) | 50-90% of disk | Logs >10% |
| Mail | User mailboxes (/var/mail) | 60-90% of disk | Temp files >5% |
| Web | Static content (/var/www) | 30-60% of disk | Logs >20% |
| File | Shared files (/shares) | 70-95% of disk | System >5% |
| Application | Application data | 40-70% of disk | Logs >15% |
But if most of the data is found to be stored in logs or in temporary files, something has gone wrong.
Anomalous usage indicators:
| Directory | Normal Size | Anomalous Size | Likely Issue |
|---|---|---|---|
| /var/log | <5% disk | >20% disk | Log rotation failure |
| /tmp | <2% disk | >10% disk | Temp file cleanup failure |
| /var/cache | <10% disk | >30% disk | Cache eviction not working |
| /var/spool | <5% disk | >15% disk | Queue processing stuck |
One common pattern of misbehavior is a program that keeps logging error messages to the system log over and over. This can happen for lots of different reasons.
Excessive logging scenarios:
| Cause | Frequency | Growth Rate | Example |
|---|---|---|---|
| Configuration error | Continuous retries | MB to GB per hour | Service fails to start |
| Network timeout | Per request | GB per day | API endpoint down |
| Permission denied | Per access attempt | MB per hour | File access failure |
| Dependency failure | Per health check | GB per day | Database unreachable |
For example, the OS might keep trying to start a program that fails because of a configuration problem. This will generate a new log entry with every retry and can take up a lot of space if there are several retries per second.
Retry pattern impact:
| Retry Rate | Log Entry Size | Space Used Per Hour | Space Used Per Day |
|---|---|---|---|
| 1 per second | 200 bytes | ~720 KB | ~17 MB |
| 10 per second | 200 bytes | ~7.2 MB | ~173 MB |
| 100 per second | 200 bytes | ~72 MB | ~1.7 GB |
| 1000 per second | 200 bytes | ~720 MB | ~17 GB |
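The figures in this table follow from multiplying the retry rate by the entry size. A minimal sketch of that arithmetic, assuming one 200-byte log entry per retry as in the table; `log_growth` is a hypothetical helper name.

```python
def log_growth(retries_per_second, bytes_per_entry=200):
    """Return (bytes_per_hour, bytes_per_day) written by a retry loop."""
    per_hour = retries_per_second * bytes_per_entry * 3600
    per_day = per_hour * 24
    return per_hour, per_day
```

For example, `log_growth(100)` returns `(72000000, 1728000000)`, which is about 72 MB per hour and 1.7 GB per day.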
Example error log loop:

    Nov 11 10:15:01 server systemd[1]: Starting myapp.service...
    Nov 11 10:15:01 server myapp[1234]: Configuration file not found: /etc/myapp/config.yml
    Nov 11 10:15:01 server systemd[1]: myapp.service: Main process exited, code=exited, status=1/FAILURE
    Nov 11 10:15:01 server systemd[1]: myapp.service: Failed with result 'exit-code'.
    Nov 11 10:15:02 server systemd[1]: Starting myapp.service...
    Nov 11 10:15:02 server myapp[1235]: Configuration file not found: /etc/myapp/config.yml
    # ... repeats thousands of times ...
Or it could be that the server has a lot of activity and the logs are real, but there are just too many of them.
High-activity logging management:
| Activity Level | Logs Per Day | Rotation Strategy | Retention Period |
|---|---|---|---|
| Low | <100 MB | Weekly rotation | 30 days |
| Moderate | 100 MB - 1 GB | Daily rotation | 7-14 days |
| High | 1-10 GB | Hourly rotation | 3-7 days |
| Very high | >10 GB | Continuous/size-based | 1-3 days |
In that case, the tools that rotate the logs may need to be configured to rotate more frequently, so that only what’s needed is kept.
Log rotation configuration strategies:
| Strategy | Configuration | Benefit | Trade-off |
|---|---|---|---|
| Size-based | Rotate when >100MB | Predictable disk usage | Uneven time periods |
| Time-based | Rotate daily at midnight | Regular schedule | Variable file sizes |
| Compression | Gzip old logs | Save 80-90% space | CPU overhead |
| Remote shipping | Send to log server | Local disk protected | Network dependency |
| Reduced verbosity | Lower log level | Less data written | Less debugging info |
Example logrotate configuration:

    /var/log/myapp/*.log {
        # Rotate daily and keep 7 rotated logs
        daily
        rotate 7
        # Compress old logs, but not the most recent one
        compress
        delaycompress
        # Don't error if the log is missing; don't rotate if it is empty
        missingok
        notifempty
        # Create the new log file with these permissions
        create 0644 root root
        # Also rotate early if the log exceeds 100 MB
        maxsize 100M
        postrotate
            systemctl reload myapp
        endscript
    }
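Rotation can also be handled inside the application rather than by an external tool. The following sketch swaps in a different technique from logrotate: Python's standard `logging.handlers.RotatingFileHandler`, which rotates by size. The logger name, path, and limits are assumptions chosen to mirror the configuration above.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(name, log_path, max_bytes=100 * 1024 * 1024, backups=7):
    """Return a logger whose file rolls over at max_bytes, keeping `backups` old files."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(log_path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With these settings the log directory is bounded at roughly `(backups + 1) * max_bytes`, however busy the application gets. Calling `build_logger` twice with the same name attaches a second handler, so it should be called once at startup.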
In other cases, the disk might get full due to a program generating large temporary files and then failing to clean those up.
Temporary file problems:
| Problem Type | Cause | Accumulation Rate | Detection |
|---|---|---|---|
| Crash cleanup failure | Process killed unexpectedly | Per crash | Growing /tmp directory |
| Programming error | Missing cleanup code | Continuous | Temp files with old timestamps |
| Failed cleanup logic | Error in cleanup routine | Varies | Files matching temp pattern |
| Partial processing | Job interrupted | Per failed job | Incomplete file sets |
For example, an application might clean up temporary files when shutting down cleanly, but leave them behind if it crashes.
Cleanup behavior comparison:
| Exit Type | Cleanup Trigger | Temp Files | Result |
|---|---|---|---|
| Normal shutdown | Exit handler called | Deleted | Clean /tmp |
| Graceful signal (SIGTERM) | Signal handler | Deleted | Clean /tmp |
| Kill signal (SIGKILL) | None | Remain | /tmp accumulates |
| Crash/exception | May not execute | Remain | /tmp accumulates |
| Power loss | None | Remain | /tmp accumulates |
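The difference between the SIGTERM and SIGKILL rows comes down to whether the process gets a chance to run its cleanup code. A minimal sketch in Python, assuming cleanup is registered with `atexit`; by default, a Python process killed by SIGTERM does not run `atexit` handlers, so the signal is converted into a normal exit. All function names here are hypothetical.

```python
import atexit
import os
import signal
import sys
import tempfile


def create_scratch_file():
    """Create a temp file that is removed when the interpreter exits normally."""
    fd, path = tempfile.mkstemp(prefix="processing_")
    os.close(fd)

    def remove():
        if os.path.exists(path):
            os.remove(path)

    atexit.register(remove)
    return path


def exit_on_sigterm(signum, frame):
    """Turn SIGTERM into a normal exit so that atexit handlers still run."""
    sys.exit(0)


def install_sigterm_handler():
    # Must be called from the main thread. SIGKILL cannot be handled this way.
    signal.signal(signal.SIGTERM, exit_on_sigterm)
```

A program would call `install_sigterm_handler()` once at startup. Nothing in the process can intercept SIGKILL or a power loss, which is why those rows in the table always leave files behind.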
Or it could simply be a programming error of creating temporary files and never cleaning them up.
Programming error patterns:

    import os
    import tempfile
    import time


    def process(path):
        """Placeholder for the real processing step: returns the file size."""
        return os.path.getsize(path)


    # Bad: temporary file is never cleaned up
    def process_data(processed_data):
        temp_file = "/tmp/processing_" + str(time.time())
        with open(temp_file, "w") as f:
            f.write(processed_data)
        result = process(temp_file)
        # File is left behind forever
        return result


    # Good: explicit cleanup, even if processing raises
    def process_data_better(processed_data):
        temp_file = "/tmp/processing_" + str(time.time())
        try:
            with open(temp_file, "w") as f:
                f.write(processed_data)
            result = process(temp_file)
        finally:
            if os.path.exists(temp_file):
                os.remove(temp_file)
        return result


    # Best: let the tempfile module handle naming and cleanup
    def process_data_best(processed_data):
        with tempfile.NamedTemporaryFile(mode="w", prefix="processing_") as temp_file:
            temp_file.write(processed_data)
            temp_file.flush()
            result = process(temp_file.name)
        # File is automatically deleted when the context exits
        return result
In a case like this, the ideal fix is to correct the program so that it deletes those files properly.
Temporary file solutions:
| Solution | Approach | Permanence | Effort |
|---|---|---|---|
| Fix application | Correct cleanup code | Permanent | High |
| Add signal handlers | Catch SIGTERM | Permanent | Medium |
| Use proper temp APIs | tempfile module/mktemp | Permanent | Medium |
| Cleanup script | Scheduled deletion | Workaround | Low |
| tmpfs mount | RAM-based /tmp | System reboot cleans | Medium |
But if that’s not possible, a custom script that gets rid of them might be needed.
Cleanup script example:

    #!/bin/bash
    # cleanup_temp_files.sh - Remove old temporary files

    # Find and delete temp files older than 7 days
    find /tmp -type f -name "processing_*" -mtime +7 -delete

    # Find and delete empty directories older than 7 days (never /tmp itself)
    find /tmp -mindepth 1 -type d -empty -mtime +7 -delete

    # Log cleanup action
    echo "$(date): Cleaned up old temporary files" >> /var/log/temp_cleanup.log
Cron job for automated cleanup:

    # Run cleanup script daily at 2 AM
    0 2 * * * /usr/local/bin/cleanup_temp_files.sh
A situation that might be tricky to debug is when the files taking up the space are deleted files.
How deleted files consume space:
| File State | Visible in Listing | Disk Space Used | Process Access |
|---|---|---|---|
| Normal open file | Yes | Yes | Yes |
| Deleted but open | No | Yes (still allocated) | Yes (via file descriptor) |
| Closed & deleted | No | No (freed) | No |
If a program opens a file, the OS lets that program read and write in the file regardless of whether the file is marked as deleted or not.
So lots of programs delete the temporary files they create right after opening to avoid issues with failing to clean them up later.
Temporary file lifecycle with immediate deletion:
| Step | Action | File Visible | Disk Space | Process Access |
|---|---|---|---|---|
| 1. Create | fd = open('/tmp/work', 'w+') | Yes | Allocated | Yes |
| 2. Delete | os.unlink('/tmp/work') | No | Still allocated | Yes (via fd) |
| 3. Use | Read/write via file descriptor | No | Grows as written | Yes |
| 4. Close | close(fd) or process exits | No | Freed | No |
That way, the process can read from and write to the file while the file is open. Then when the process finishes, the file gets closed and actually deleted.
Benefits of immediate deletion pattern:
| Benefit | Description | Protection Against |
|---|---|---|
| Guaranteed cleanup | File auto-deleted when closed | Orphaned files |
| Crash resilience | OS cleans up on process death | Failed cleanup code |
| No manual deletion | No cleanup code needed | Programming errors |
| Namespace freed | Filename immediately reusable | Name conflicts |
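The lifecycle above can be observed directly. This is a minimal sketch for POSIX systems only (Windows does not allow unlinking an open file this way); `scratch_roundtrip` is a hypothetical name. Python's `tempfile.TemporaryFile` applies the same idea on POSIX platforms.

```python
import os
import tempfile


def scratch_roundtrip(data):
    """Write data to an already-deleted file and read it back."""
    fd, path = tempfile.mkstemp(prefix="work_")
    with os.fdopen(fd, "w+b") as f:
        os.unlink(path)                 # the name disappears immediately
        visible = os.path.exists(path)  # False: no longer in any listing
        f.write(data)                   # space is still allocated as we write
        f.seek(0)
        content = f.read()              # the descriptor still works
    # Closing the last descriptor is what actually frees the space.
    return visible, content
```

The function returns `False` for visibility together with the data that was written, which shows that the file stayed fully usable after its name was removed.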
This technique is widely used and works fine for most processes. But if for some reason one of these deleted-but-open files grows very large, it can end up taking all the available disk space.
Deleted-but-open file problems:
| Scenario | File Size | Impact | Visibility |
|---|---|---|---|
| Normal temp usage | <100 MB | None | Not listed |
| Large processing | 1-10 GB | Disk space reduced | Not listed |
| Runaway process | >50 GB | Disk exhaustion | Not listed, hard to debug |
| Multiple processes | Many large files | System-wide impact | Very confusing |
If that happens, it will be confusing to figure out where the space went, since these deleted files don’t show up in any directory listing.
Important
Deleted but open files consume disk space without appearing in directory listings, making them extremely difficult to diagnose. The df command shows disk usage, but du cannot account for the space because the files don’t exist in the filesystem namespace anymore.
To check for this specific condition, list the currently open files and look for the ones that are marked as deleted.
Detection commands:

    # Linux: List open files marked as deleted
    lsof | grep deleted

    # Alternative: Check /proc for deleted files
    find /proc/*/fd -ls 2>/dev/null | grep '(deleted)'

    # Show processes with deleted files and their sizes
    lsof -nP | grep '(deleted)' | awk '{print $2, $9, $7}' | \
        while read pid name size; do
            echo "PID $pid: $name ($size bytes)"
        done

    # Find large deleted files (>100MB)
    lsof -nP | grep '(deleted)' | \
        awk '$7 > 104857600 {print $2, $9, $7/1048576 " MB"}'
lsof output interpretation:

    COMMAND  PID USER  FD  TYPE DEVICE    SIZE/OFF NODE NAME
    myapp   1234 root  3w   REG  253,0 10737418240 1234 /tmp/bigfile (deleted)
| Column | Value | Meaning |
|---|---|---|
| COMMAND | myapp | Process name |
| PID | 1234 | Process ID |
| FD | 3w | File descriptor 3, open for writing |
| SIZE/OFF | 10737418240 | File size: ~10 GB |
| NAME | /tmp/bigfile (deleted) | File path and deleted status |
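On Linux the same information can be gathered without `lsof` by reading `/proc` directly. The sketch below is Linux-only and assumes the usual `/proc/<pid>/fd` layout, where the symlink target of a deleted file ends in ` (deleted)`. `deleted_open_files` is a hypothetical name, and seeing other users' processes requires root.

```python
import os


def deleted_open_files(proc_root="/proc"):
    """Return (pid, fd, size_in_bytes, target) for deleted-but-open files, largest first."""
    found = []
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            link = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(link)
                if not target.endswith(" (deleted)"):
                    continue
                size = os.stat(link).st_size  # follows the link to the open file
            except OSError:
                continue
            found.append((int(pid), int(fd), size, target))
    return sorted(found, key=lambda item: item[2], reverse=True)
```

Printing the first few results points at the process and descriptor holding the most hidden space, which are the `PID` and `FD` values needed for the resolution strategies.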
Resolution strategies:
| Strategy | Command | Effect | Risk |
|---|---|---|---|
| Truncate file | > /proc/PID/fd/FD | Free space, process continues | May crash process |
| Kill process | kill -KILL PID | File freed on exit | Process terminated without cleanup |
| Graceful shutdown | kill -TERM PID | Clean shutdown | May take time |
| Wait for completion | Monitor process | Natural cleanup | May be too slow |
Of course, there are all kinds of other reasons why a disk may be getting too full. Whatever the cause, the troubleshooting process remains the same.
Universal disk troubleshooting workflow:
| Phase | Activities | Tools/Commands | Goal |
|---|---|---|---|
| 1. Investigation | Check disk usage patterns | df, du, lsof | Identify what’s using space |
| 2. Classification | Determine expected vs anomaly | Server role knowledge | Legitimate vs problem |
| 3. Resolution | Fix the issue | Various | Reclaim space |
| 4. Prevention | Implement safeguards | Monitoring, automation | Avoid recurrence |
First, spend time looking into what’s using the disk.
Investigation checklist:
| Check | Command | What to Look For |
|---|---|---|
| Overall usage | `df -h` | Which filesystems are full |
| Directory sizes | `du -sh /*` | Top-level space consumers |
| Large files | `find / -size +1G` | Individual large files |
| Recent growth | `find / -mtime -1 -size +100M` | Recently created large files |
| Open deleted files | `lsof \| grep deleted` | Hidden space consumers |
| Log files | `du -sh /var/log/*` | Log accumulation |
| Temp directories | `du -sh /tmp /var/tmp` | Temporary file buildup |
Check to see if it’s expected or an anomaly.
Expected vs anomaly decision tree:
| Observation | Expected? | Action |
|---|---|---|
| Database files large | Yes (DB server) | Monitor growth rate |
| User files large | Yes (file server) | Confirm within quotas |
| Logs very large | Maybe | Check rotation settings |
| Temp files old | No | Clean up |
| Deleted files open | No | Investigate processes |
| Cache unbounded | No | Implement limits |
Figure out how to solve it.
Resolution strategies by problem type:
| Problem Type | Immediate Action | Long-Term Fix |
|---|---|---|
| Legitimate growth | Add storage | Capacity planning |
| Log overflow | Compress/delete old logs | Configure rotation |
| Temp file accumulation | Delete old temps | Fix cleanup code |
| Cache bloat | Clear cache | Implement eviction |
| Deleted open files | Kill/truncate | Fix application |
| Backup retention | Delete old backups | Set retention policy |
Most important of all, work out how to prevent it from happening again.
Prevention strategies:
| Strategy | Implementation | Monitoring | Alerting |
|---|---|---|---|
| Disk monitoring | Prometheus, Nagios | Check every 5 min | Alert at 80% |
| Log rotation | logrotate configuration | Daily checks | Alert on rotation failure |
| Cleanup automation | Cron jobs for temp files | Verify execution | Alert on missed runs |
| Quota enforcement | Filesystem quotas | Track per-user usage | Alert on approach |
| Capacity planning | Track growth trends | Weekly reports | Forecast exhaustion |
| Application fixes | Code review, testing | Monitor temp dirs | Alert on anomalies |
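The capacity planning row relies on projecting current growth forward. A minimal sketch of such a forecast, assuming linear growth between the first and last usage samples; `days_until_full` and the sample format are assumptions made for illustration.

```python
def days_until_full(samples, capacity_bytes):
    """Estimate days until the disk fills, from (day, used_bytes) samples.

    Uses the average growth between the first and last sample and returns
    None if usage is flat or shrinking.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    elapsed = last_day - first_day
    if elapsed <= 0:
        raise ValueError("samples must span a positive number of days")
    growth_per_day = (last_used - first_used) / elapsed
    if growth_per_day <= 0:
        return None
    return (capacity_bytes - last_used) / growth_per_day
```

For example, a 1000-byte disk that went from 500 to 600 bytes used over 10 days grows by 10 bytes per day, so `days_until_full([(0, 500), (10, 600)], 1000)` returns `40.0`. Real growth is rarely linear, so a forecast like this is best treated as an early warning rather than a precise date.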
Monitoring script example:

    #!/bin/bash
    # disk_monitor.sh - Alert when disk usage exceeds threshold

    THRESHOLD=80
    EMAIL="admin@example.com"

    # -P keeps each filesystem on a single line, even with long device names
    df -P | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{ print $5 " " $1 }' | \
    while read -r percent partition; do
        # Strip the trailing % sign to get a plain number
        usage=${percent%\%}

        if [ "$usage" -ge "$THRESHOLD" ]; then
            echo "ALERT: Partition $partition is ${usage}% full" | \
                mail -s "Disk Space Alert on $(hostname)" "$EMAIL"
        fi
    done
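The same threshold check can be written in Python with the standard library's `shutil.disk_usage`. This is a minimal sketch; the function names and the 80% default are assumptions mirroring the script above, and the alerting step is left out.

```python
import shutil


def usage_percent(path="/"):
    """Return the percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def over_threshold(paths, threshold=80):
    """Return (path, percent) for every path at or above the threshold."""
    flagged = []
    for path in paths:
        percent = usage_percent(path)
        if percent >= threshold:
            flagged.append((path, round(percent, 1)))
    return flagged
```

Calling `over_threshold(["/", "/var", "/home"])` returns only the mount points that need attention. The percentage can differ slightly from the `df` figure, because `df` accounts for blocks reserved for root when it computes its percentage.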
Disk space is a critical resource. Programs consume it through installed binaries and libraries, application data, caches, logs, temporary files, and backups. Exhaustion can come from legitimate data growth, which calls for more capacity, or from program misbehavior, such as temporary files and logs that are never cleaned up.

As available disk space shrinks, overall system performance degrades because data becomes fragmented and operations slow down. A full disk leads to application crashes when writes fail, typically with “No space left on device” errors, and can cause catastrophic data loss when a program truncates a file and then fails to write the new content.

User machines can often be fixed with simple cleanup, such as uninstalling unused applications and deleting old data. Servers need a closer look to determine whether more storage is required or whether an application is filling the disk with useless data. Commands like df and du show whether the large chunks of space hold legitimate data, such as databases and mailboxes, or anomalies, such as excessive logs and temporary files. Common misbehavior patterns include programs that log the same error repeatedly while failing to start (potentially generating gigabytes per day), legitimate high-activity logging that needs more frequent rotation, and temporary files left behind after a crash or because the deletion logic was never written.

Deleted-but-open files are a particularly tricky case. Programs often delete a temporary file right after opening it so that cleanup is guaranteed, but the space stays allocated until the file is closed. Such files never appear in directory listings, so tools like lsof are needed to find them.

The troubleshooting approach is always the same: investigate what is using the disk, decide whether that usage is expected for the server's role or is an anomaly, solve the immediate issue through cleanup or added capacity, and, most importantly, prevent a recurrence with disk monitoring that alerts at 80% usage, proper log rotation, automated cleanup of temporary files, and capacity planning.