How To Remove Duplicate Lines While Maintaining Order in Linux
The most elegant solutions.
# Order not preserved (lines sorted)
sort file.txt | uniq
# Display first occurrence
awk '!v[$0]++' file.txt
# Display last occurrence
tac file.txt | awk '!v[$0]++' | tac
Without Preserving Order
If order doesn’t matter, these are two options for removing duplicate lines.
sort file.txt | uniq
sort -u file.txt
uniq only removes adjacent duplicate lines, which is why we sort the file first.
sort -u removes duplicate lines while sorting, so no separate uniq step is needed.
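To see why the sort step matters, here is a minimal sketch on a made-up four-line file (demo.txt and the variable names are illustrative assumptions, not part of the original commands). It compares uniq on unsorted input with uniq on sorted input.

```shell
# Hypothetical sample data written to a temporary directory.
tmpdir=$(mktemp -d)
printf 'b\na\na\nb\n' > "$tmpdir/demo.txt"

# uniq alone: only the adjacent pair of "a" lines is collapsed,
# so the second "b" survives.
adjacent_only=$(uniq "$tmpdir/demo.txt")

# sort first: equal lines become neighbours, so uniq removes every duplicate.
sorted_unique=$(sort "$tmpdir/demo.txt" | uniq)

printf 'uniq alone:\n%s\n' "$adjacent_only"
printf 'sort | uniq:\n%s\n' "$sorted_unique"

rm -r "$tmpdir"
```

The first result still contains b twice, while the second contains each line once but in sorted rather than original order.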
Given the following file:
111
222
222
111
We can either print the first or last occurrences of duplicates:
# First    # Last
111        222
222        111
Print First Occurrence of Duplicates
cat -n file.txt | sort -uk2 | sort -nk1 | cut -f2-
cat -n prefixes each line with its line number, which records the original order.
sort -uk2 sorts on the second column onward (-k2), which is the line content, and keeps only the first occurrence of each duplicate (-u).
sort -nk1 restores the original order by sorting on the line numbers in the first column (-k1) and treating the values as numbers (-n).
cut -f2- prints everything from the second column, or field, onward, which is the line itself.
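The four stages can be run one at a time to inspect what each contributes. This is a sketch under assumptions: demo.txt and the variable names are made up for illustration, and it relies on GNU sort keeping the first line of each run of equal keys when -u is given.

```shell
# Hypothetical sample matching the example file used in this article.
tmpdir=$(mktemp -d)
printf '111\n222\n222\n111\n' > "$tmpdir/demo.txt"

# Stage 1: number every line (padded number, tab, original line).
numbered=$(cat -n "$tmpdir/demo.txt")

# Stage 2: sort on the content and keep one line per distinct content.
deduped=$(printf '%s\n' "$numbered" | sort -uk2)

# Stage 3: sort numerically on the line number to restore input order.
restored=$(printf '%s\n' "$deduped" | sort -nk1)

# Stage 4: drop the line-number column, leaving only the original text.
result=$(printf '%s\n' "$restored" | cut -f2-)

printf '%s\n' "$numbered" "--" "$deduped" "--" "$restored" "--" "$result"

rm -r "$tmpdir"
```

Printing the intermediate variables makes it easy to check that the line numbers surviving stage 2 belong to the first occurrences.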
Another way to achieve this is to use awk:
awk '!v[$0]++' file.txt
This command uses a dictionary (a.k.a. map, associative array) v to store each line and its number of occurrences, or frequency, in the file so far.
!v[$0]++ is evaluated for every line in the file.
$0 holds the value of the current line being processed.
v[$0] looks up the number of occurrences of the current line so far.
!v[$0] is true when v[$0] == 0, that is, when the current line has not been seen before. This is when the line is printed (no print statement is needed, because awk prints the line by default when a pattern has no action).
v[$0]++ increments the frequency of the current line by one, after its old value has been tested.
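The one-liner can also be spelled out with an explicit test and print statement. The long form below is only an illustrative expansion (demo.txt and the variable names are assumptions), meant to show that both programs select the same lines.

```shell
# Hypothetical sample matching the example file used in this article.
tmpdir=$(mktemp -d)
printf '111\n222\n222\n111\n' > "$tmpdir/demo.txt"

# Compact form: the pattern is true only the first time a line is seen.
short_form=$(awk '!v[$0]++' "$tmpdir/demo.txt")

# Expanded form: the same logic with the test and the print written out.
long_form=$(awk '{
    if (v[$0] == 0) {   # count is still zero: first occurrence
        print $0
    }
    v[$0]++             # record that this line has now been seen
}' "$tmpdir/demo.txt")

printf '%s\n' "$short_form"
printf '%s\n' "$long_form"

rm -r "$tmpdir"
```

Both forms read the file once and keep one array entry per distinct line, so memory use grows with the number of distinct lines.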
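Print Last Occurrence of Duplicates
To print the last occurrence of each duplicate line, we can use tac, which prints the lines of the specified file in reverse order. Reversing the input, keeping the first occurrences, and reversing again leaves the last occurrences in their original order.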
tac file.txt > file1.txt
cat -n file1.txt | sort -uk2 | sort -nk1 | cut -f2- > file2.txt
tac file2.txt > file3.txt
cat file3.txt
tac file.txt | awk '!v[$0]++' | tac
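As a quick check, the sketch below runs both approaches on a made-up five-line file (demo.txt and the variable names are illustrative assumptions). The sort-based variant is written here as a single pipeline, which is one possible way to avoid the intermediate files; tac is assumed to be available, as it is with GNU coreutils.

```shell
# Hypothetical sample: "a" and "b" each appear twice.
tmpdir=$(mktemp -d)
printf 'a\nb\na\nc\nb\n' > "$tmpdir/demo.txt"

# Reverse, keep first occurrences with awk, reverse back.
with_awk=$(tac "$tmpdir/demo.txt" | awk '!v[$0]++' | tac)

# Same idea with the cat -n / sort / cut stages, chained without temp files.
with_sort=$(tac "$tmpdir/demo.txt" | cat -n | sort -uk2 | sort -nk1 | cut -f2- | tac)

printf '%s\n' "$with_awk"
printf '%s\n' "$with_sort"

rm -r "$tmpdir"
```

Both variables should hold the lines a, c, b: the last a (line 3), then c, then the last b (line 5).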
Useful Tricks To Know
# Display only unique/duplicate lines
sort file.txt | uniq -u # Unique
sort file.txt | uniq -d # Duplicate
# Display number of duplicates per line
sort file.txt | uniq -uc
sort file.txt | uniq -dc
# Skip first 10 characters
uniq -s 10 file.txt
# Compare first 10 characters
uniq -w 10 file.txt
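A small sketch of these options on made-up data may help; every file name and variable below is an assumption for illustration. Note that -w is a GNU uniq option, so this assumes GNU coreutils.

```shell
tmpdir=$(mktemp -d)

# -u and -d: lines that occur exactly once vs. lines that repeat.
printf 'a\nb\na\nc\nc\nc\n' > "$tmpdir/words.txt"
only_unique=$(sort "$tmpdir/words.txt" | uniq -u)
only_dups=$(sort "$tmpdir/words.txt" | uniq -d)

# -s 9: ignore the 9-character timestamp prefix when comparing lines.
printf '10:00:01 disk full\n10:00:02 disk full\n10:00:03 disk ok\n' > "$tmpdir/log.txt"
skip_prefix=$(uniq -s 9 "$tmpdir/log.txt")

# -w 5: compare only the first 5 characters of each line.
printf 'alpha-1\nalpha-2\nbeta-1\n' > "$tmpdir/ids.txt"
first_chars=$(uniq -w 5 "$tmpdir/ids.txt")

printf '%s\n' "$only_unique" "--" "$only_dups" "--" "$skip_prefix" "--" "$first_chars"

rm -r "$tmpdir"
```

In each case uniq keeps the first line of a run it considers equal, so the second "disk full" entry and alpha-2 are dropped.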