awk select duplicated lines
9.7 years ago
juncheng ▴ 220

Dear all,

I have a file like this:

chr1 1720466 1720593 - 0 0 2
chr1 1795684 1795915 - 0 0 2
chr1 2404708 2405331 - 0 0 0
chr1 2791001 2791133 + 0 0 1
chr1 2839789 2849732 + 2 0 2
chr1 3220352 3220462 - 0 0 4
chr1 3278790 3278835 + 0 0 4
chr1 3278790 3278835 + 0 0 4

I want to print all the duplicated lines with awk, where "duplicated" means that the first three columns are identical. Could anyone give me some example code?

Thanks
Jun

awk • 13k views

So in this case, the last two lines are duplicated. Do you want to print just one of them or both lines? Do you have lines where the first three columns are duplicated but the remaining columns differ?


I want to print all duplicates; this is just an example list.


I have lines like this:

chr1 7860756 7973866 - 1 0 0
chr1 7860756 7973866 + 2 1 0
chr1 7860756 7973866 + 2 1 0

I consider these three lines duplicates, so I want to print all of them.

Thanks.

9.7 years ago
komal.rathi ★ 4.1k

This is how you find duplicated lines based on first three columns:

awk '{if (x[$3]) { x_count[$3]++; print $0; if (x_count[$3] == 1) { print x[$3] } } x[$3] = $0}' sample.txt

output:

chr1 3278790 3278835 + 0 0 4
chr1 3278790 3278835 + 0 0 4

Source: simple google search


I'm sorry, but this script has a bug.

Test it with this input:

chr1 11904306 11904806 - 0 0 0
chr1 11904306 11904806 - 0 0 0
chr1 11904306 11904806 - 0 0 0
chr1 11904389 11904841 - 0 0 0
chr1 11904574 11904806 - 0 0 1
chr1 11904686 11904806 - 0 0 0
chr1 11904687 11904806 - 0 0 0
chr1 11904687 11904840 - 0 0 0

Output:

chr1 11904306 11904806 - 0 0 0
chr1 11904306 11904806 - 0 0 0
chr1 11904306 11904806 - 0 0 0
chr1 11904574 11904806 - 0 0 1
chr1 11904686 11904806 - 0 0 0
chr1 11904687 11904806 - 0 0 0

The last three rows are not duplicates. Could you help me fix that?


Thanks for helping!


You are only tracking the third column. You can build a string by concatenating the first three columns and use it as the identifier.
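
A minimal sketch of that idea, assuming whitespace-separated input in a hypothetical file `sample.txt` (here filled with the test case posted above): read the file twice, count each three-column key on the first pass, and on the second pass print every line whose key was seen more than once. The subscript `c[$1,$2,$3]` joins the fields with awk's `SUBSEP`, so keys such as `chr1 12` and `chr11 2` cannot run together. The function name `print_dups` is made up for illustration.

```shell
# Hypothetical sample input, copied from the test case earlier in this thread.
cat > sample.txt <<'EOF'
chr1 11904306 11904806 - 0 0 0
chr1 11904306 11904806 - 0 0 0
chr1 11904306 11904806 - 0 0 0
chr1 11904389 11904841 - 0 0 0
chr1 11904574 11904806 - 0 0 1
chr1 11904686 11904806 - 0 0 0
chr1 11904687 11904806 - 0 0 0
chr1 11904687 11904840 - 0 0 0
EOF

# print_dups FILE: print every line whose first three columns occur more than once.
print_dups() {
    # Pass 1 (NR == FNR): count each (col1, col2, col3) key.
    # Pass 2: print the lines whose key was counted more than once.
    awk 'NR == FNR { c[$1,$2,$3]++; next } c[$1,$2,$3] > 1' "$1" "$1"
}

print_dups sample.txt
```

On this input only the three `chr1 11904306 11904806` lines should be printed, in their original order. The file is read twice, so this sketch does not work on piped input.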

9.7 years ago

A sloppy but easy way that also works when the file is not sorted (e.g. by chromosome). It assumes a tab-separated file.

awk '{print $1"\t"$2"\t"$3}' data.txt | sort | uniq -d|grep -F -f - data.txt
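
One reason this is sloppy: `grep -F` matches each pattern as a substring anywhere in the line, so a duplicated key ending in `200` would also select a line whose third column is `2000`. A possible tightening, still assuming a tab-separated file with more than three columns, is to end each pattern with a tab. The file `data.txt` below holds made-up data and `dup_lines` is a hypothetical helper name.

```shell
# Made-up, tab-separated example data (not from the original post).
printf 'chr1\t100\t200\t+\nchr1\t100\t200\t-\nchr1\t100\t2000\t+\n' > data.txt

# dup_lines FILE: each key ends with a tab so that 200 cannot match inside 2000.
dup_lines() {
    awk -F'\t' '{ print $1 "\t" $2 "\t" $3 "\t" }' "$1" | sort | uniq -d |
        grep -F -f - "$1"
}

dup_lines data.txt
```

The patterns are still not anchored to the start of the line, so this remains a quick check, not a strict comparison of the first three columns.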
9.7 years ago
juncheng ▴ 220

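The correct way, in case anyone needs it (the comma-separated subscript joins the three columns with awk's SUBSEP, so neighbouring fields cannot run together):

awk '{if (x[$1,$2,$3]) { x_count[$1,$2,$3]++; print $0; if (x_count[$1,$2,$3] == 1) { print x[$1,$2,$3] } } x[$1,$2,$3] = $0}'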