Essential Linux commands for text file manipulation, including sort, uniq, grep, cut, and paste, with practical examples for sorting data, filtering content, pattern matching, and combining files.
This document explores powerful Linux commands for manipulating and processing text files. It covers sorting lines with sort, removing duplicates with uniq, pattern matching with grep, extracting specific content with cut, and combining files with paste. These utilities provide a robust toolkit for text data processing, enabling efficient transformation and analysis of text-based information in Linux systems.
The sort command is a versatile utility that arranges the lines of text files in alphanumeric order. This is particularly useful for organizing data, preparing files for further processing, and making content more readable.
The basic syntax of the sort command is:
sort [options] [file]
When executed without options, sort arranges lines alphabetically:
$ cat pets.txt
cat
dog
cat
dog
cat
cat
cat

$ sort pets.txt
cat
cat
cat
cat
cat
dog
dog
To reverse the sorting order, use the -r option:
$ sort -r pets.txt
dog
dog
cat
cat
cat
cat
cat
The sort command offers several options for specialized sorting operations:
| Option | Description | Example Use Case |
|---|---|---|
| -n | Numeric sort | Sorting files containing numerical values |
| -f | Case-insensitive sort | Sorting text regardless of uppercase/lowercase |
| -u | Unique sort (removes duplicates) | Sorting and deduplicating in one step |
| -k | Sort by specific field | Sorting CSV or tabular data by column |
| -t | Specify field separator | Defining custom field delimiters for -k option |
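The difference between a text sort and a numeric sort on a chosen field can be sketched as follows. The file name scores.csv and its contents are hypothetical sample data, and LC_ALL=C is set only as an assumption to keep the ordering reproducible across systems.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C   # assumption: fixed locale for reproducible ordering

# Hypothetical sample data in the form name,score
printf '%s\n' 'dana,9' 'alex,100' 'cody,25' > scores.csv

# -t ',' sets the separator; -k 2,2 restricts the sort key to field 2.
# Without -n the key is compared as text, so "100" sorts before "25".
lexical=$(sort -t ',' -k 2,2 scores.csv)

# Appending n to the key compares field 2 as a number instead.
numeric=$(sort -t ',' -k 2,2n scores.csv)

printf 'text order:\n%s\n\nnumeric order:\n%s\n' "$lexical" "$numeric"
```

Writing the key as -k 2,2 rather than -k 2 limits the comparison to the second field alone; with -k 2 the key would run from field 2 to the end of the line.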
The uniq command filters out repeated lines in a text file, but it only works on adjacent duplicate lines. This means the input typically needs to be sorted first for uniq to effectively remove all duplicates.
The basic syntax of the uniq command is:
uniq [options] [input_file] [output_file]
When executed on a file with adjacent repeated lines:
$ cat pets.txt
cat
dog
cat
dog
cat
cat
cat

$ uniq pets.txt
cat
dog
cat
dog
cat
Notice that uniq only removes consecutive duplicate lines. In this example, “cat” appears multiple times because the repeated lines are not all adjacent.
The uniq command offers several useful options:
| Option | Description | Example Use Case |
|---|---|---|
| -c | Count occurrences | Counting frequency of each line |
| -d | Show only duplicate lines | Identifying repeated content |
| -u | Show only unique lines | Filtering out any repeated content |
| -i | Case-insensitive comparison | Ignoring case when identifying duplicates |
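As a rough illustration of how these options behave on unsorted input, the sketch below recreates the pets.txt contents shown earlier. The awk step is an addition made here only to normalize the padding that uniq -c puts in front of each count, since that padding can differ between implementations.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same seven lines as the pets.txt example.
printf '%s\n' cat dog cat dog cat cat cat > pets.txt

# -c prefixes each run of identical adjacent lines with its length.
counts=$(uniq -c pets.txt | awk '{print $1, $2}')

# -d prints one copy of each run that occurs more than once in a row.
repeated=$(uniq -d pets.txt)

# -u prints only lines whose run has length one.
single=$(uniq -u pets.txt)

printf 'counts:\n%s\n\nrepeated:\n%s\n\nsingle:\n%s\n' "$counts" "$repeated" "$single"
```

Because the file is not sorted, every option works per run of adjacent lines: only the final three "cat" lines count as a repeat.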
To effectively remove all duplicate lines regardless of their position in the file, combine sort and uniq:
$ sort pets.txt | uniq
cat
dog
This pipeline first sorts all lines alphabetically, bringing identical lines together, then removes the duplicates.
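The pipeline can be compared with the -u option of sort, which deduplicates in one step, and extended with uniq -c when frequencies are needed. This is a minimal sketch on the same pets.txt contents; LC_ALL=C and the awk padding cleanup are assumptions added for reproducible output.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C   # assumption: fixed locale for reproducible ordering

printf '%s\n' cat dog cat dog cat cat cat > pets.txt

# Two ways to get each distinct line once.
piped=$(sort pets.txt | uniq)
one_step=$(sort -u pets.txt)

# sort -u cannot report frequencies, so counting still needs uniq -c.
tally=$(sort pets.txt | uniq -c | awk '{print $1, $2}')

printf '%s\n' "$tally"
```

For plain deduplication either form works; the pipeline becomes necessary once uniq options such as -c or -d are involved.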
The grep (Global Regular Expression Print) command searches for specific patterns in text files and displays the matching lines. It’s one of the most powerful and frequently used commands for text processing in Linux.
The basic syntax of the grep command is:
grep [options] pattern [file]
For example, to find all lines containing the characters “ch” in a file:
$ cat people.txt
Alan Turing
Charles Babbage
Dennis Ritchie
Erwin Schrodinger

$ grep "ch" people.txt
Dennis Ritchie
Erwin Schrodinger
To perform case-insensitive searches, use the -i option:
$ grep -i "ch" people.txt
Charles Babbage
Dennis Ritchie
Erwin Schrodinger
This matches “ch” regardless of case, finding “Charles Babbage” which contains “Ch”.
The grep command offers numerous options for refined searching:
| Option | Description | Example Use Case |
|---|---|---|
| -v | Invert match (show non-matching lines) | Excluding lines with specific content |
| -n | Show line numbers | Identifying the position of matches |
| -c | Count matching lines | Quantifying pattern occurrences |
| -l | List only filenames with matches | Finding files containing a pattern |
| -r | Recursive search through directories | Searching across multiple files/folders |
| -E | Extended regular expressions | Using advanced pattern matching |
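A few of these options can be tried on the people.txt contents from the example above. The second file, other.txt, is a hypothetical addition used only to show what -l reports when several files are searched.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'Alan Turing' 'Charles Babbage' 'Dennis Ritchie' 'Erwin Schrodinger' > people.txt
printf '%s\n' 'nothing relevant' > other.txt   # hypothetical file without a match

# -n prefixes each matching line with its line number.
numbered=$(grep -n 'ch' people.txt)

# -v keeps only the lines that do NOT match.
excluded=$(grep -v 'ch' people.txt)

# -c prints the number of matching lines rather than the lines themselves.
match_count=$(grep -c 'ch' people.txt)

# -l prints only the names of files that contain at least one match.
files=$(grep -l 'ch' people.txt other.txt)

printf '%s\n' "$numbered" "$excluded" "$match_count" "$files"
```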
The cut command extracts specific sections from each line of a file. It can extract content based on character positions, field separators, or bytes.
To extract specific characters from each line based on their position:
$ cat people.txt
Alan Turing
Charles Babbage
Dennis Ritchie
Erwin Schrodinger

$ cut -c 2-9 people.txt
lan Turi
harles B
ennis Ri
rwin Sch
This command extracts characters from position 2 through 9 from each line.
The cut command can also extract specific fields from lines that contain delimited data:
$ cut -d " " -f 2 people.txt
Turing
Babbage
Ritchie
Schrodinger
In this example:
- -d " " specifies the space character as the field delimiter
- -f 2 extracts the second field (last name) from each line

The cut command offers several useful options:
| Option | Description | Example Use Case |
|---|---|---|
| -c | Extract by character position | Parsing fixed-width fields |
| -f | Extract by field number | Processing columnar data |
| -d | Specify field delimiter | Defining custom field separators |
| --complement | Invert the selection | Extracting everything except specified parts |
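Field lists and ranges can be sketched on a small comma-separated file. The file name data.csv and its two rows are hypothetical sample data, and the example assumes simple CSV without quoted fields, since cut splits on every delimiter character it sees.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: first name, last name, year of birth
printf '%s\n' 'Alan,Turing,1912' 'Charles,Babbage,1791' > data.csv

# A single field.
third=$(cut -d ',' -f 3 data.csv)

# A comma-separated list selects several fields at once.
first_and_third=$(cut -d ',' -f 1,3 data.csv)

# An open-ended range selects from field 2 to the end of the line.
from_second=$(cut -d ',' -f 2- data.csv)

# Character ranges ignore delimiters entirely.
first_three_chars=$(cut -c 1-3 data.csv)

printf '%s\n' "$third" "$first_and_third" "$from_second" "$first_three_chars"
```

Selected fields are always printed in their original order, joined by the input delimiter.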
The paste command merges lines from multiple files in parallel, creating a single output. This is particularly useful for joining related data stored in separate files.
The basic syntax of the paste command is:
paste [options] [file1] [file2] ...
For example, merging three files containing first names, last names, and birth years:
$ cat first.txt
Alan
Charles
Dennis
Erwin

$ cat last.txt
Turing
Babbage
Ritchie
Schrodinger

$ cat yob.txt
1912
1791
1941
1887

$ paste first.txt last.txt yob.txt
Alan Turing 1912
Charles Babbage 1791
Dennis Ritchie 1941
Erwin Schrodinger 1887
By default, paste uses tabs as delimiters between the fields from each file.
To specify a different delimiter, use the -d option:
$ paste -d "," first.txt last.txt yob.txt
Alan,Turing,1912
Charles,Babbage,1791
Dennis,Ritchie,1941
Erwin,Schrodinger,1887
This creates a comma-separated values (CSV) format, which is commonly used for tabular data.
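Once the files are joined, the result can be fed straight into the other commands covered here. The sketch below recreates the three files from the example, sorts the combined rows by birth year, and extracts the name from the first row. It also shows the -s option, which joins the lines of a single file into one line. The variable names and LC_ALL=C are assumptions made for illustration.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C   # assumption: fixed locale for reproducible ordering

printf '%s\n' Alan Charles Dennis Erwin > first.txt
printf '%s\n' Turing Babbage Ritchie Schrodinger > last.txt
printf '%s\n' 1912 1791 1941 1887 > yob.txt

# Join the three files with commas, then sort numerically on field 3.
by_year=$(paste -d ',' first.txt last.txt yob.txt | sort -t ',' -k 3,3n)

# Take the first row and keep only the two name fields.
earliest=$(printf '%s\n' "$by_year" | head -n 1 | cut -d ',' -f 1,2)

# -s processes one file at a time, joining its lines into a single line.
one_line=$(paste -s -d ',' first.txt)

printf '%s\n' "$by_year" "$earliest" "$one_line"
```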
Linux offers a powerful set of commands for manipulating text files, making it an excellent environment for data processing and analysis. The sort command organizes data alphabetically or numerically, while uniq removes adjacent duplicate entries. Pattern matching with grep enables targeted content extraction based on specific criteria. The cut command provides precision in extracting parts of each line, and paste combines related data from multiple files. Together, these commands form a versatile toolkit for text processing, enabling complex transformations through simple command pipelines.
(3) The sort command is typically used before uniq to make duplicate removal effective. This is because uniq only removes consecutive duplicate lines. By first sorting the file, identical lines are brought together, allowing uniq to remove all duplicates regardless of their original positions in the file. This sort | uniq pipeline is a common pattern in Linux text processing.
(3) The cut -d "," -f 3 data.csv command is the most appropriate choice. This command uses cut to extract specific fields from each line, with -d "," specifying the comma as the field delimiter and -f 3 selecting the third field. This is precisely what’s needed to extract the third column from a CSV (Comma-Separated Values) file. The other options either extract the wrong content or perform different operations entirely.
(4) The paste command can combine more than two files at once, not just a maximum of two. It can take multiple file arguments and merge them all together in parallel, with each file’s content becoming a column in the output. This is particularly useful for combining related data from multiple sources into a single tabular format.
True or false: The grep command can only search for exact word matches and cannot use regular expressions for pattern matching.
False. The grep command is specifically designed to work with regular expressions - in fact, its name stands for “Global Regular Expression Print.” It can search for complex patterns beyond simple text matches, including character classes, quantifiers, anchors, and other regex features. This capability makes grep an extremely powerful tool for text pattern matching and extraction in Linux.
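A few of the regex features mentioned here (anchors, character classes, alternation, quantifiers) can be tried on the people.txt contents used earlier. This is an illustrative sketch; LC_ALL=C is an assumption added so that the [A-C] range behaves the same on every system.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C   # assumption: fixed locale so character ranges are predictable

printf '%s\n' 'Alan Turing' 'Charles Babbage' 'Dennis Ritchie' 'Erwin Schrodinger' > people.txt

# ^ anchors the match at the start of the line; [A-C] is a character class.
starts_a_to_c=$(grep '^[A-C]' people.txt)

# -E enables alternation with |; $ anchors the match at the end of the line.
ends_ing_or_ie=$(grep -E '(ing|ie)$' people.txt)

# {2} is a quantifier: exactly two consecutive "b" characters.
double_b=$(grep -E 'b{2}' people.txt)

printf '%s\n' "$starts_a_to_c" "$ends_ing_or_ie" "$double_b"
```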
| Command | Function |
|---|---|
| A. sort | 1. Extracts specific sections from each line |
| B. uniq | 2. Searches for patterns in text |
| C. grep | 3. Arranges lines in alphanumeric order |
| D. cut | 4. Removes duplicate adjacent lines |
| E. paste | 5. Merges lines from multiple files |
A-3, B-4, C-2, D-1, E-5. The sort command arranges lines in alphanumeric order, uniq removes duplicate adjacent lines, grep searches for patterns in text, cut extracts specific sections from each line, and paste merges lines from multiple files.
(3) The fact that uniq only removes consecutive duplicate lines suggests it was designed to be used in combination with other commands, particularly sort. This design philosophy reflects the Unix/Linux approach of creating simple tools that do one thing well and can be combined through pipelines to perform more complex operations. The limitation of uniq actually encourages the modular use of commands working together.
cat file.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -nr. This command sequence first transforms the file so each word is on its own line, sorts all words alphabetically, counts the occurrences of each unique word with uniq -c, and finally sorts the results numerically in reverse order to show the most frequent words first. This technique leverages the power of command pipelines to perform complex text analysis with simple tools.
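The pipeline can be tried on a small input. In this sketch the two lines written to file.txt are hypothetical sample text, input redirection stands in for cat, and the head and awk steps are additions that keep only the two most frequent words and strip the padding uniq -c adds.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C   # assumption: fixed locale for reproducible ordering

# Hypothetical sample text.
printf '%s\n' 'the cat saw the dog' 'the dog ran' > file.txt

# One word per line, sort, count each word, then order by count, highest first.
top_two=$(tr -s '[:space:]' '\n' < file.txt \
  | sort | uniq -c | sort -nr \
  | head -n 2 | awk '{print $1, $2}')

printf '%s\n' "$top_two"
```

Words with equal counts may appear in either order depending on the sort implementation, which is why the sketch inspects only the top two entries.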