Managing Disk Space

This document covers disk space management in IT systems: how programs consume storage through binaries, data, caches, logs, and temporary files, and the common causes of disk exhaustion. It explores diagnostic techniques for identifying space usage patterns, explains how performance degrades as disks fill up, shows how to handle deleted but open files, and describes preventive strategies for avoiding the application crashes and data loss that a full disk can cause.


Understanding Disk Space Usage

Why Programs Need Disk Space

Another resource that might need attention is the disk usage of computers. Programs may need disk space for lots of different reasons.

Common disk space consumers:

Type | Purpose | Growth Pattern | Cleanup Frequency
Installed binaries | Application executables | Stable | On uninstall
Libraries | Shared code dependencies | Stable | On uninstall
Application data | User and system data | Growing | User-driven
Cache information | Performance optimization | Growing/stable | Periodic
Logs | System and application events | Continuously growing | Rotation-based
Temporary files | Intermediate processing | Varies | Should be automatic
Backups | Data redundancy | Growing | Retention policy

Potential Causes of Space Exhaustion

If a computer is running out of space, it may simply be trying to store too much data in too little space.

Space exhaustion scenarios:

Scenario | Cause | Likelihood | Solution Type
Legitimate growth | Too many applications or large files | Common | Add storage capacity
Program misbehavior | Temporary files not cleaned | Very common | Fix cleanup logic
Log overflow | Excessive logging without rotation | Common | Configure log rotation
Cache accumulation | No cache eviction policy | Moderate | Implement cache limits
Backup retention | Old backups never deleted | Moderate | Set retention policy

Maybe too many applications are installed, or too many large files are being stored on the drive.

Program Misuse of Disk Space

But it’s also possible that programs are misusing the space allotted to them, for example by keeping temporary files or cached information that isn’t cleaned up quickly enough, or at all.

Misuse patterns:

Misuse Type | Behavior | Impact Timeline | Detection Method
Temporary file retention | Never deleting temp files | Days to weeks | Directory size monitoring
Unbounded caching | Cache grows indefinitely | Weeks to months | Cache directory analysis
Excessive logging | High-frequency log writes | Hours to days | Log file growth rate
Failed cleanup | Crash prevents deletion | Per crash | Orphaned file detection

Performance Impact of Disk Exhaustion

System-Wide Performance Degradation

It’s common for the overall performance of the system to decrease as the available disk space gets smaller.

Performance degradation stages:

Disk Usage | Available Space | Performance Impact | User Experience
0-50% | Plenty free | Normal | Fast operations
50-80% | Moderate free | Slight slowdown | Barely noticeable
80-95% | Low free | Noticeable slowdown | Delays apparent
95-100% | Critical/none | Severe degradation | Very slow/crashes

Data Fragmentation

Data starts getting fragmented across the disk, and operations become slower.

Fragmentation effects:

Aspect | Unfragmented Disk | Fragmented Disk | Impact
File location | Contiguous blocks | Scattered blocks | Read time increases
Seek operations | Minimal | Many | Head movement delays
Write efficiency | Sequential | Random | Slower writes
Free space | Large contiguous blocks | Many small gaps | Allocation overhead

Fragmentation performance comparison:

Operation | Contiguous File | Fragmented File (10 pieces) | Slowdown Factor
Sequential read | 1 seek + read | 10 seeks + reads | 5-10× slower
Random access | Direct access | Multiple seeks | 3-5× slower
File opening | Fast | Slow (map fragments) | 2-4× slower

Application Crashes

When a hard drive is full, programs may suddenly crash when they try to write to disk and find that they can’t.

Write failure scenarios:

Operation | Expected Behavior | Full Disk Behavior | Result
Log write | Append to file | Write fails | Application crash
Save document | Update file | No space error | Work lost
Create temp file | Allocate space | Allocation fails | Process terminates
Database commit | Write transaction | Commit fails | Data inconsistency

Risk of Data Loss

A full hard drive might even lead to data loss, as some programs might truncate a file before writing an updated version of it, and then fail to write the new content, losing all the data that was stored in it before.

Data loss patterns:

Update Pattern | Step 1 | Step 2 | Full Disk Result
Safe (atomic) | Write to temp file | Rename over original | Temp write fails, original intact
Unsafe (truncate) | Truncate original | Write new data | Truncate succeeds, write fails, data lost
In-place | Seek to position | Overwrite data | Partial write, corrupted file
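
The safe (atomic) pattern from the table can be sketched in Python roughly as follows. This is a minimal illustration, not a complete implementation: the function name atomic_write and the ".tmp-" prefix are assumptions made for the example.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data`, never truncating the original first."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename stays
    # on one filesystem (a rename across filesystems would not be atomic).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # surface write errors such as a full disk here
        os.replace(tmp_path, path)  # swap the new file in over the original
    except BaseException:
        # The write failed: drop the partial temp file and leave the
        # original untouched, then let the caller see the error.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the disk fills up while the temp file is being written, the exception propagates and the original file still holds its old contents. Note that mkstemp creates the file with owner-only permissions, so a real implementation may want to copy the original file's mode before the rename.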

Error Messages

If it gets to this point, errors like “No space left on device” will probably appear when running applications or in the logs.

Common disk full errors:

Error Message | Context | Severity
“No space left on device” | Linux/Unix systems | Critical
“Disk full” | General error | Critical
“ENOSPC: no space left” | Node.js/JavaScript | Critical
“OSError: [Errno 28] No space left on device” | Python (IOError in Python 2) | Critical
“Insufficient disk space” | Windows | Critical
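
An application can recognize the disk-full condition instead of crashing on it. The sketch below assumes Python on a POSIX-like system, where the error arrives as an OSError whose errno is ENOSPC. The helper names is_disk_full and append_line are illustrative only.

```python
import errno


def is_disk_full(exc):
    """Return True if `exc` is the OS-level "no space left on device" error."""
    return isinstance(exc, OSError) and exc.errno == errno.ENOSPC


def append_line(path, line):
    """Append one line; return False instead of crashing when the disk is full."""
    try:
        with open(path, "a") as f:
            f.write(line + "\n")
        return True
    except OSError as exc:
        if is_disk_full(exc):
            return False  # caller can decide to alert, retry, or drop the line
        raise  # any other I/O error is still reported
```

The write error may only surface when the file is flushed on close, which is why the whole with-block sits inside the try.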

Diagnosing Disk Space Issues

User Machine Solutions

So what should be done if a computer runs out of disk space? If it’s a user machine, it might be easily fixed by uninstalling applications that aren’t used, or cleaning up old data that isn’t needed anymore.

User machine cleanup approaches:

Cleanup Type | Target | Impact | Difficulty
Uninstall apps | Unused applications | High (GB) | Easy
Delete downloads | Old download files | Moderate (GB) | Easy
Clear caches | Browser, app caches | Moderate (MB-GB) | Easy
Remove duplicates | Duplicate files | Varies | Moderate
Archive old files | Old documents, photos | High (GB) | Moderate

Server Investigations

But if it’s a server, a closer look at what’s going on might be needed. Is the issue that an extra drive needs to be added to the server to have more available space, or is it that some application is misbehaving and filling the disk with useless data?

Server diagnostic questions:

Question | Indicates | Action Required
Is growth expected? | Legitimate data increase | Add storage capacity
Is one directory dominant? | Concentrated issue | Investigate specific application
Are files temporary/logs? | Cleanup problem | Fix cleanup processes
Is growth rate abnormal? | Application misbehavior | Debug application
Are backups accumulating? | Retention issue | Adjust backup policy

Space Usage Analysis

To figure this out, examine how the space is being used and which directories take up the most space, then drill down until it’s clear whether the large chunks of space hold valid information or files that should be purged.

Analysis workflow:

Step | Command Example | Purpose | Output
1. Top-level overview | `df -h` | Show filesystem usage | Total/used/available per mount
2. Directory breakdown | `du -sh /*` | Identify large directories | Size of top-level dirs
3. Drill down | `du -sh /var/*` | Investigate suspect dir | Subdirectory sizes
4. Find large files | `find / -size +1G` | Locate specific culprits | Files over threshold
5. Sort by size | `du -h | sort -rh | head -20` | Rank consumers | Top 20 space users

Common disk usage commands:

# Check overall disk usage
df -h

# Find the top-level directories using the most space
# (errors from unreadable paths are discarded)
du -sh /* 2>/dev/null | sort -rh | head -10

# Find large files over 100MB, staying on the root filesystem
find / -xdev -type f -size +100M -exec ls -lh {} + 2>/dev/null

# Check disk usage by directory under /var, largest first
du -h /var 2>/dev/null | sort -rh | head -20

# Find log files modified in the last 7 days, largest first
find /var/log -mtime -7 -type f -exec du -sh {} + | sort -rh
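
The same drill-down can be scripted. The following sketch ranks the immediate subdirectories of a directory by the total size of the files beneath them. The function name directory_sizes is an assumption for the example, and it sums apparent file sizes rather than allocated disk blocks, so its numbers can differ from what du reports.

```python
import os


def directory_sizes(root):
    """Return (size_in_bytes, path) for each immediate subdirectory of root, largest first."""
    results = []
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            total = 0
            # os.walk does not descend into symlinked directories by default
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        # lstat counts a symlink itself, not its target
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        pass  # file vanished or is unreadable; skip it
            results.append((total, entry.path))
    return sorted(results, reverse=True)
```

Calling directory_sizes("/var") and printing the first few entries gives a rough equivalent of the sorted du output.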

Expected vs Anomalous Usage

For example, on a database server, it’s expected that the bulk of the disk space is going to be used by the data stored in the database. On a mail server, it’s going to be the mailboxes of the users of that service.

Expected space usage by server type:

Server Type | Expected Primary Consumer | Typical Size | Anomaly Threshold
Database | Database files (/var/lib/mysql) | 50-90% of disk | Logs >10%
Mail | User mailboxes (/var/mail) | 60-90% of disk | Temp files >5%
Web | Static content (/var/www) | 30-60% of disk | Logs >20%
File | Shared files (/shares) | 70-95% of disk | System >5%
Application | Application data | 40-70% of disk | Logs >15%

But if most of the data is found to be stored in logs or in temporary files, something has gone wrong.

Anomalous usage indicators:

Directory | Normal Size | Anomalous Size | Likely Issue
/var/log | <5% disk | >20% disk | Log rotation failure
/tmp | <2% disk | >10% disk | Temp file cleanup failure
/var/cache | <10% disk | >30% disk | Cache eviction not working
/var/spool | <5% disk | >15% disk | Queue processing stuck
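
The thresholds in the table above can be turned into a simple check. This is only a sketch: the percentages are the rule-of-thumb values from the table, and the names ANOMALY_THRESHOLDS and flag_anomalies are made up for the example. It expects directory sizes that were measured separately, for instance with du.

```python
# Percent of the disk above which each directory looks anomalous
# (values taken from the table above; tune them per server role).
ANOMALY_THRESHOLDS = {"/var/log": 20, "/tmp": 10, "/var/cache": 30, "/var/spool": 15}


def flag_anomalies(dir_sizes, disk_total, thresholds=ANOMALY_THRESHOLDS):
    """Return {directory: percent_of_disk} for directories over their threshold.

    dir_sizes maps directory paths to sizes in bytes; disk_total is the
    filesystem size in bytes. Directories without a threshold are ignored.
    """
    flagged = {}
    for directory, size in dir_sizes.items():
        limit = thresholds.get(directory)
        if limit is None:
            continue
        share = size * 100 / disk_total
        if share > limit:
            flagged[directory] = round(share, 1)
    return flagged
```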

Common Misbehavior Patterns

Excessive Error Logging

One common pattern of misbehavior is a program that keeps logging error messages to the system log over and over. This can happen for lots of different reasons.

Excessive logging scenarios:

Cause | Frequency | Growth Rate | Example
Configuration error | Continuous retries | MB to GB per hour | Service fails to start
Network timeout | Per request | GB per day | API endpoint down
Permission denied | Per access attempt | MB per hour | File access failure
Dependency failure | Per health check | GB per day | Database unreachable

OS Retry Loops

For example, the OS might keep trying to start a program that fails because of a configuration problem. This will generate a new log entry with every retry and can take up a lot of space if there are several retries per second.

Retry pattern impact:

Retry Rate | Log Entry Size | Space Used Per Hour | Space Used Per Day
1 per second | 200 bytes | ~720 KB | ~17 MB
10 per second | 200 bytes | ~7.2 MB | ~173 MB
100 per second | 200 bytes | ~72 MB | ~1.7 GB
1000 per second | 200 bytes | ~720 MB | ~17 GB
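
The arithmetic behind these estimates is simple enough to script: bytes per second is the retry rate times the entry size, multiplied by 3,600 for an hour and 86,400 for a day. The sketch below assumes the same 200-byte entries; the function name log_growth is illustrative.

```python
def log_growth(entries_per_second, bytes_per_entry=200):
    """Return (bytes_per_hour, bytes_per_day) for a steady stream of log entries."""
    per_second = entries_per_second * bytes_per_entry
    return per_second * 3600, per_second * 86400


for rate in (1, 10, 100, 1000):
    per_hour, per_day = log_growth(rate)
    print(f"{rate:>5}/s: {per_hour / 1e6:,.1f} MB/hour, {per_day / 1e9:,.2f} GB/day")
```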

Example error log loop:

Nov 11 10:15:01 server systemd[1]: Starting myapp.service...
Nov 11 10:15:01 server myapp[1234]: Configuration file not found: /etc/myapp/config.yml
Nov 11 10:15:01 server systemd[1]: myapp.service: Main process exited, code=exited, status=1/FAILURE
Nov 11 10:15:01 server systemd[1]: myapp.service: Failed with result 'exit-code'.
Nov 11 10:15:02 server systemd[1]: Starting myapp.service...
Nov 11 10:15:02 server myapp[1235]: Configuration file not found: /etc/myapp/config.yml
# ... repeats thousands of times ...

High-Volume Legitimate Logging

Or it could be that the server has a lot of activity and the logs are real, but there are just too many of them.

High-activity logging management:

Activity Level | Logs Per Day | Rotation Strategy | Retention Period
Low | <100 MB | Weekly rotation | 30 days
Moderate | 100 MB - 1 GB | Daily rotation | 7-14 days
High | 1-10 GB | Hourly rotation | 3-7 days
Very high | >10 GB | Continuous/size-based | 1-3 days

In that case, the tools that rotate the logs might need to be reconfigured to rotate more frequently, so that only what’s needed is kept.

Log rotation configuration strategies:

Strategy | Configuration | Benefit | Trade-off
Size-based | Rotate when >100MB | Predictable disk usage | Uneven time periods
Time-based | Rotate daily at midnight | Regular schedule | Variable file sizes
Compression | Gzip old logs | Save 80-90% space | CPU overhead
Remote shipping | Send to log server | Local disk protected | Network dependency
Reduced verbosity | Lower log level | Less data written | Less debugging info

Example logrotate configuration:

/var/log/myapp/*.log {
    # Rotate daily
    daily
    # Keep 7 rotated logs
    rotate 7
    # Compress old logs
    compress
    # Don't compress the most recent rotated log
    delaycompress
    # Don't error if the log is missing
    missingok
    # Don't rotate if empty
    notifempty
    # Create the new file with these permissions and owner
    create 0644 root root
    # Also rotate early if the log exceeds 100MB (checked each time logrotate runs)
    maxsize 100M
    postrotate
        systemctl reload myapp
    endscript
}
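
Size-based rotation can also be applied inside the application itself, which bounds the space a log can take even if the external rotation tool is misconfigured. The sketch below uses Python's standard RotatingFileHandler. The logger name "myapp", the function name build_logger, and the default limits are assumptions chosen to mirror the configuration above.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(path, name="myapp", max_bytes=100 * 1024 * 1024, backups=7):
    """Return a logger whose file is capped near max_bytes, keeping `backups` old files."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # When the current file would grow past max_bytes it is renamed to
    # path.1, older files shift to path.2 and so on, and anything beyond
    # `backups` is deleted.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With these defaults the log can occupy at most about 800 MB: the active file plus seven backups of up to 100 MB each. The function should be called once per logger name, since each call attaches another handler.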

Temporary File Issues

Uncleaned Temporary Files

In other cases, the disk might get full due to a program generating large temporary files and then failing to clean those up.

Temporary file problems:

Problem Type | Cause | Accumulation Rate | Detection
Crash cleanup failure | Process killed unexpectedly | Per crash | Growing /tmp directory
Programming error | Missing cleanup code | Continuous | Temp files with old timestamps
Failed cleanup logic | Error in cleanup routine | Varies | Files matching temp pattern
Partial processing | Job interrupted | Per failed job | Incomplete file sets

Cleanup on Normal vs Abnormal Exit

For example, an application might clean up temporary files when shutting down cleanly, but leave them behind if it crashes.

Cleanup behavior comparison:

Exit Type | Cleanup Trigger | Temp Files | Result
Normal shutdown | Exit handler called | Deleted | Clean /tmp
Graceful signal (SIGTERM) | Signal handler | Deleted | Clean /tmp
Kill signal (SIGKILL) | None | Remain | /tmp accumulates
Crash/exception | May not execute | Remain | /tmp accumulates
Power loss | None | Remain | /tmp accumulates

Programming Errors

Or it could simply be a programming error of creating temporary files and never cleaning them up.

Programming error patterns:

import os
import tempfile
import time


def transform(input_file):
    # Placeholder processing step: read the input and upper-case it
    with open(input_file) as f:
        return f.read().upper()


def process(path):
    # Placeholder for work done on the temp file: report its size
    return os.path.getsize(path)


# Bad: Temporary file never cleaned up
def process_data(input_file):
    temp_file = "/tmp/processing_" + str(time.time())
    with open(temp_file, 'w') as f:
        f.write(transform(input_file))
    result = process(temp_file)
    # File left behind forever
    return result


# Good: Explicit cleanup
def process_data_better(input_file):
    temp_file = "/tmp/processing_" + str(time.time())
    try:
        with open(temp_file, 'w') as f:
            f.write(transform(input_file))
        # Process temp file
        result = process(temp_file)
    finally:
        # Runs on success and on exceptions, but not if the process is killed
        if os.path.exists(temp_file):
            os.remove(temp_file)
    return result


# Best: Use tempfile module
def process_data_best(input_file):
    with tempfile.NamedTemporaryFile(mode='w', delete=True) as temp_file:
        temp_file.write(transform(input_file))
        temp_file.flush()
        result = process(temp_file.name)
    # File automatically deleted when context exits
    return result

Solutions for Temporary File Problems

In a case like this, the ideal fix is to correct the program so that it deletes those files itself.

Temporary file solutions:

Solution | Approach | Permanence | Effort
Fix application | Correct cleanup code | Permanent | High
Add signal handlers | Catch SIGTERM | Permanent | Medium
Use proper temp APIs | tempfile module/mktemp | Permanent | Medium
Cleanup script | Scheduled deletion | Workaround | Low
tmpfs mount | RAM-based /tmp | System reboot cleans | Medium

But if that’s not possible, writing a custom script that gets rid of them might be needed.

Cleanup script example:

#!/bin/bash
# cleanup_temp_files.sh - Remove old temporary files

# Find and delete temp files older than 7 days
find /tmp -type f -name "processing_*" -mtime +7 -delete

# Find and delete old empty directories, but never /tmp itself
find /tmp -mindepth 1 -type d -empty -mtime +7 -delete

# Log cleanup action
echo "$(date): Cleaned up old temporary files" >> /var/log/temp_cleanup.log

Cron job for automated cleanup:

# Run cleanup script daily at 2 AM
0 2 * * * /usr/local/bin/cleanup_temp_files.sh
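
Where a shell script is not convenient, the same housekeeping can be sketched in Python. The function name remove_old_files and its parameters are assumptions for the example. Unlike find, it only looks at the top level of the directory, and it treats "older than 7 days" as exactly 7 × 24 hours.

```python
import time
from pathlib import Path


def remove_old_files(directory, pattern="processing_*", max_age_days=7, now=None):
    """Delete regular files matching pattern that are older than max_age_days.

    Returns the list of removed paths. Symlinks and directories are skipped.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in Path(directory).glob(pattern):
        try:
            if path.is_file() and not path.is_symlink() and path.stat().st_mtime < cutoff:
                path.unlink()
                removed.append(path)
        except FileNotFoundError:
            pass  # another process removed it first
    return removed
```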

Deleted But Open Files

The Tricky Debugging Situation

A situation that might be tricky to debug is when the files taking up the space are deleted files.

How deleted files consume space:

File State | Visible in Listing | Disk Space Used | Process Access
Normal open file | Yes | Yes | Yes
Deleted but open | No | Yes (still allocated) | Yes (via file descriptor)
Closed & deleted | No | No (freed) | No

If a program opens a file, the OS lets that program read and write in the file regardless of whether the file is marked as deleted or not.

Intentional Deletion Pattern

So lots of programs delete the temporary files they create right after opening to avoid issues with failing to clean them up later.

Temporary file lifecycle with immediate deletion:

Step | Action | File Visible | Disk Space | Process Access
1. Create | fd = open('/tmp/work', 'w+') | Yes | Allocated | Yes
2. Delete | os.unlink('/tmp/work') | No | Still allocated | Yes (via fd)
3. Use | Read/write via file descriptor | No | Grows as written | Yes
4. Close | close(fd) or process exits | No | Freed | No

That way, the process can read from and write to the file while the file is open. Then when the process finishes, the file gets closed and actually deleted.
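
The lifecycle above can be reproduced in a few lines of Python. This sketch assumes a POSIX system such as Linux, where a file can be unlinked while it is still open; the "work_" prefix and variable names are illustrative.

```python
import os
import tempfile

# Step 1: create the file (mkstemp returns an open descriptor and the path)
fd, path = tempfile.mkstemp(prefix="work_")

# Step 2: remove the name; the data stays allocated while fd is open
os.unlink(path)
visible_after_unlink = os.path.exists(path)  # False: nothing left to list

# Step 3: keep using the file through the descriptor
with os.fdopen(fd, "w+") as scratch:
    scratch.write("intermediate results")
    scratch.seek(0)
    contents = scratch.read()
# Step 4: leaving the with-block closes the file and the OS frees the space
```

On Unix-like systems, Python's tempfile.TemporaryFile applies this same pattern automatically.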

Benefits of immediate deletion pattern:

Benefit | Description | Protection Against
Guaranteed cleanup | File auto-deleted when closed | Orphaned files
Crash resilience | OS cleans up on process death | Failed cleanup code
No manual deletion | No cleanup code needed | Programming errors
Namespace freed | Filename immediately reusable | Name conflicts

When Things Go Wrong

Now, this technique is widely used and works fine for most processes. But if for some reason one of these deleted-but-open files grows very large, it can end up taking all the available disk space.

Deleted-but-open file problems:

Scenario | File Size | Impact | Visibility
Normal temp usage | <100 MB | None | Not listed
Large processing | 1-10 GB | Disk space reduced | Not listed
Runaway process | >50 GB | Disk exhaustion | Not listed, hard to debug
Multiple processes | Many large files | System-wide impact | Very confusing

If that happens, it can be confusing to work out where the space went, since these deleted files don’t show up in directory listings.

Detecting Deleted Open Files

To check for this specific condition, list the currently open files and look for the ones marked as deleted.

Detection commands:

# Linux: List open files marked as deleted
lsof | grep deleted

# Equivalent: ask lsof for open files whose link count is below 1
lsof +L1

# Alternative: Check /proc for deleted files
find /proc/*/fd -ls 2>/dev/null | grep '(deleted)'

# Show processes with deleted files and their sizes
# (assumes the default column layout: $2=PID, $7=SIZE/OFF, $9=NAME)
lsof -nP | grep '(deleted)' | awk '{print "PID " $2 ": " $9 " (" $7 " bytes)"}'

# Find large deleted files (>100MB)
lsof -nP | grep '(deleted)' | \
  awk '$7 > 104857600 {print $2, $9, $7/1048576 " MB"}'

lsof output interpretation:

COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
myapp    1234 root   3w   REG  253,0 10737418240 1234 /tmp/bigfile (deleted)

Column | Value | Meaning
COMMAND | myapp | Process name
PID | 1234 | Process ID
FD | 3w | File descriptor 3, open for writing
SIZE/OFF | 10737418240 | File size: ~10 GB
NAME | /tmp/bigfile (deleted) | File path and deleted status
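
When lsof is not installed, the same information can be read from /proc directly. The sketch below is Linux-only and relies on the kernel appending " (deleted)" to the link target of an unlinked file; the function name deleted_open_files is an assumption for the example. Inspecting other users' processes requires root.

```python
import os


def deleted_open_files(pid="self"):
    """Yield (fd, target, size_bytes) for unlinked files a process still holds open."""
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)
            if not target.endswith(" (deleted)"):
                continue
            # stat() follows the /proc link to the still-open file
            size = os.stat(link).st_size
        except OSError:
            continue  # descriptor was closed while we were looking
        yield int(name), target, size
```

Passing a numeric PID instead of "self" inspects another process, for example list(deleted_open_files(1234)).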

Resolution strategies:

Strategy | Command | Effect | Risk
Truncate file | `> /proc/PID/fd/FD` | Free space, process continues | May crash process
Kill process | `kill PID` | File freed on exit | Process terminated
Graceful shutdown | `kill -TERM PID` | Clean shutdown | May take time
Wait for completion | Monitor process | Natural cleanup | May be too slow

General Troubleshooting Approach

Consistent Problem-Solving Process

Of course, there are all kinds of other reasons why the disk may be getting too full. Just remember that whenever this happens, the troubleshooting process remains the same.

Universal disk troubleshooting workflow:

Phase | Activities | Tools/Commands | Goal
1. Investigation | Check disk usage patterns | df, du, lsof | Identify what’s using space
2. Classification | Determine expected vs anomaly | Server role knowledge | Legitimate vs problem
3. Resolution | Fix the issue | Various | Reclaim space
4. Prevention | Implement safeguards | Monitoring, automation | Avoid recurrence

Investigation Phase

Start by spending time looking into what’s using the disk.

Investigation checklist:

Check | Command | What to Look For
Overall usage | `df -h` | Which filesystems are full
Directory sizes | `du -sh /*` | Top-level space consumers
Large files | `find / -size +1G` | Individual large files
Recent growth | `find / -mtime -1 -size +100M` | Recently created large files
Open deleted files | `lsof | grep deleted` | Hidden space consumers
Log files | `du -sh /var/log/*` | Log accumulation
Temp directories | `du -sh /tmp /var/tmp` | Temporary file buildup

Classification Phase

Check to see if it’s expected or an anomaly.

Expected vs anomaly decision tree:

Observation | Expected? | Action
Database files large | Yes (DB server) | Monitor growth rate
User files large | Yes (file server) | Confirm within quotas
Logs very large | Maybe | Check rotation settings
Temp files old | No | Clean up
Deleted files open | No | Investigate processes
Cache unbounded | No | Implement limits

Resolution Phase

Figure out how to solve it.

Resolution strategies by problem type:

Problem Type | Immediate Action | Long-Term Fix
Legitimate growth | Add storage | Capacity planning
Log overflow | Compress/delete old logs | Configure rotation
Temp file accumulation | Delete old temps | Fix cleanup code
Cache bloat | Clear cache | Implement eviction
Deleted open files | Kill/truncate | Fix application
Backup retention | Delete old backups | Set retention policy

Prevention Phase

Most important of all, work out how to prevent it from happening again.

Prevention strategies:

Strategy | Implementation | Monitoring | Alerting
Disk monitoring | Prometheus, Nagios | Check every 5 min | Alert at 80%
Log rotation | logrotate configuration | Daily checks | Alert on rotation failure
Cleanup automation | Cron jobs for temp files | Verify execution | Alert on missed runs
Quota enforcement | Filesystem quotas | Track per-user usage | Alert on approach
Capacity planning | Track growth trends | Weekly reports | Forecast exhaustion
Application fixes | Code review, testing | Monitor temp dirs | Alert on anomalies

Monitoring script example:

#!/bin/bash
# disk_monitor.sh - Alert when disk usage exceeds threshold

THRESHOLD=80
EMAIL="admin@example.com"

# -P prints one line per filesystem, so long device names don't wrap
df -P | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{ print $5 " " $1 }' | \
while read -r usage partition; do
  # Strip the trailing percent sign so the value can be compared as a number
  usage=${usage%\%}

  if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "ALERT: Partition $partition is ${usage}% full" | \
      mail -s "Disk Space Alert on $(hostname)" "$EMAIL"
  fi
done
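
The same threshold check can be written in Python with the standard shutil.disk_usage call, which is convenient when the alert has to feed an existing monitoring agent rather than send mail. The function names usage_percent and check are illustrative. The percentage is computed as used/total, which can differ slightly from df, since df leaves filesystem-reserved blocks out of its calculation.

```python
import shutil


def usage_percent(path="/"):
    """Return used space on the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def check(paths, threshold=80):
    """Return (path, percent) for every path at or above the threshold."""
    over = []
    for path in paths:
        percent = usage_percent(path)
        if percent >= threshold:
            over.append((path, percent))
    return over
```

A caller might run check(["/", "/var"]) every few minutes and raise an alert for each entry returned.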

Conclusion

Disk space is a critical resource. Programs consume storage through installed binaries and libraries, application data, caches, logs, temporary files, and backups. Exhaustion can be caused by legitimate data growth that requires more capacity, or by programs that misbehave by failing to clean up temporary files and logs.

As available disk space decreases, overall system performance degrades because data becomes fragmented and operations get slower. A full disk leads to application crashes when write operations fail, with “no space left on device” errors, and can cause data loss when a program truncates a file before writing an update that then fails.

User machines can often be fixed through simple cleanup, such as uninstalling unused applications and deleting old data. Servers require a closer investigation to determine whether more storage is needed or an application is filling the disk with useless data. Commands like df and du show how space is being used and whether the large chunks are legitimate data, such as databases and mailboxes, or anomalies, such as excessive logs and temporary files.

Common misbehavior patterns include programs that log the same error repeatedly when they keep failing to start because of a configuration problem (potentially generating gigabytes per day), legitimate high-activity logging that needs more frequent rotation, and temporary files left behind either because a crash prevented cleanup or because the deletion logic was never written.

Deleted but open files are particularly tricky. Programs often delete temporary files right after opening them to guarantee cleanup, and the OS keeps the data accessible through the open file descriptor. The space stays in use without appearing in directory listings, so tools like lsof are needed to find these hidden consumers.

The troubleshooting approach is always the same: investigate what is using the disk, classify the usage as expected for the server’s role or anomalous, resolve the immediate issue through cleanup or added capacity, and, most importantly, prevent a recurrence with disk monitoring and alerts (for example at 80% usage), proper log rotation, automated cleanup of temporary files, and capacity planning.

