How to count unique occurrences of lines in Linux
3 months ago
Alex S ▴ 20

I have a file that looks like this:

C.Chr1:75500000-95000000:1029180-1029225
C.Chr1:75500000-95000000:1033800-1033847
C.Chr1:75500000-95000000:1035240-1035285
C.Chr1:75500000-95000000:1035460-1035505
C.Chr2:584000000-610000000:17911000-17911047
C.Chr2:584000000-610000000:17911000-17911047
C.Chr2:584000000-610000000:17911000-17911047
C.Chr3:30000000-130000000:21437320-21437367
C.Chr3:30000000-130000000:21437380-21437425
C.Chr3:30000000-130000000:21437700-21437747
C.Chr3:30000000-130000000:21438080-21438127


I need to count how many lines are unique, not considering the repeated lines.

I've tried uniq -c | sort -bgr, but the number of lines is way smaller than expected, and I think it may be a problem with uniq.

Does anyone know another command or function that would help?

uniq Linux ubuntu • 564 views
3 months ago
sort <file> | uniq -u | wc -l


(Nearly) always pass sorted input to uniq, then use uniq -u (to report the unique lines), then pipe to wc -l for the counting.

(Keep in mind this counts the lines that occur only once in your original file, NOT the number of lines left after the file has been made non-redundant.)
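
To make the difference between the two counts concrete, here is a minimal sketch using the eleven lines from the question. The temporary file and the variable names are only for illustration; the tr call strips the padding that some wc implementations add.

```shell
# Sample file built from the lines shown in the question (temp file is illustrative).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
C.Chr1:75500000-95000000:1029180-1029225
C.Chr1:75500000-95000000:1033800-1033847
C.Chr1:75500000-95000000:1035240-1035285
C.Chr1:75500000-95000000:1035460-1035505
C.Chr2:584000000-610000000:17911000-17911047
C.Chr2:584000000-610000000:17911000-17911047
C.Chr2:584000000-610000000:17911000-17911047
C.Chr3:30000000-130000000:21437320-21437367
C.Chr3:30000000-130000000:21437380-21437425
C.Chr3:30000000-130000000:21437700-21437747
C.Chr3:30000000-130000000:21438080-21438127
EOF

# Lines that occur exactly once (the repeated Chr2 line is dropped entirely).
unique_count=$(sort "$tmp" | uniq -u | wc -l | tr -d ' ')

# Distinct lines (the repeated Chr2 line is kept once).
distinct_count=$(sort -u "$tmp" | wc -l | tr -d ' ')

echo "unique: $unique_count, distinct: $distinct_count"
# prints "unique: 8, distinct: 9"

rm -f "$tmp"
```

Which of the two numbers is the right one depends on whether a repeated line should count once or not at all.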


I like sort -u followed by uniq. (I recently had a situation where uniq did not work on its own; it is probably redundant here.)


"I like sort -u followed by uniq."

sort -u does not need to be followed by uniq, as it already reduces the file to its set of distinct lines.

sort -u <file> | wc -l


sort -u will result in a non-redundant subset, not a unique one. For anything unique you will need to use uniq (with the -u option).

Yes, it's a bit of semantics, but it is crucial in certain circumstances.


I am trying to understand if this is a distinction without a difference, or something that can be important in practice. What would be an example of a multi-line file where sort -u <file> | wc -l and sort <file> | uniq -u | wc -l give different output?

In man uniq it says:

Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'.
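
A small sketch of the adjacency behaviour described in that note, using a made-up three-line input in which the two "a" lines are separated by a "b" line:

```shell
# Duplicates are not adjacent, so uniq alone keeps all three lines.
unsorted_count=$(printf 'a\nb\na\n' | uniq | wc -l | tr -d ' ')

# After sorting, the two "a" lines are adjacent and collapse into one.
sorted_count=$(printf 'a\nb\na\n' | sort | uniq | wc -l | tr -d ' ')

echo "without sort: $unsorted_count, with sort: $sorted_count"
# prints "without sort: 3, with sort: 2"
```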


well,

uniq -u prints only the unique lines in the file (== those that are present exactly once); it's the opposite behaviour of uniq -d (== print only the lines that are repeated in the input file)

sort -u makes the file non-redundant (== one representative of each repeated line is kept)

of course, and indeed, this all only applies when files are correctly sorted (though running uniq on unsorted files is sometimes pretty useful to get a desired result)

sort <file> | uniq will give the exact same output as sort -u <file> (and the same as sort -u <file> | uniq -u for that matter, but that's just a waste of option usage :) )
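
As a rough illustration of the three behaviours, take a made-up input with one "a", two "b" and three "c" lines (the variable names are just for this sketch):

```shell
# Made-up input: "a" once, "b" twice, "c" three times, in no particular order.
input=$(printf '%s\n' c b a c b c)

# uniq -u: only lines that occur exactly once.
only_once=$(printf '%s\n' "$input" | sort | uniq -u)

# uniq -d: one copy of each line that is repeated.
repeated=$(printf '%s\n' "$input" | sort | uniq -d)

# sort -u: one representative of every line, repeated or not.
distinct=$(printf '%s\n' "$input" | sort -u)

echo "uniq -u: $(echo $only_once)"   # prints "uniq -u: a"
echo "uniq -d: $(echo $repeated)"    # prints "uniq -d: b c"
echo "sort -u: $(echo $distinct)"    # prints "sort -u: a b c"
```

So the two pipelines give different counts as soon as the file contains any repeated line: piped to wc -l, sort | uniq -u gives 1 here and sort -u gives 3.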


It works!! Thanks a lot.