Remove rows with duplicate SNP identifiers in first column
21 months ago
bsp017 ▴ 50

I have a large database of SNPs (>53 million). Each SNP has an identifier in column 1. The remaining 483 columns contain genotype data (0, 1, -1). Some SNPs are duplicated in the database. I have tried to remove the duplicates, but I can't seem to weed them out with common Unix commands such as:

awk '{a[NR]=$0; a[NR,"k"]=$1; k[$1]++} END {for (i=1; i<=NR; i++) if (k[a[i,"k"]] > 1) print a[i]}'

or

awk '!a[$1]++'

The SNP identifiers look like this: Contig0_50, and the awk commands seem to find non-exact duplicates, e.g. Contig0_500.

Can someone suggest how to remove duplicate identifiers from the database? That is, if identifier 1 is exactly the same as identifier 2, remove the entire row that contains identifier 2, so that the resulting database only has unique identifiers.

Thanks,

James

grep unix awk

What file format is your database in?


It's a flat text file


This is odd. In particular, the second awk command is the textbook example of how to deduplicate a file on its first column, and I also can't reproduce this bug with the example SNP names you have given. So I don't think the similar identifiers are the problem.

Rather, I fear that the file might be too big for your memory, because what you are essentially doing here (in the first command) is writing the whole file into the array a. Given that you say it is a file of 53 million records and 484 columns, awk or your memory might just be unable to handle this.

You could try sort -u db.txt > outdb.txt, but that will only remove fully duplicated lines, not rows with merely duplicated IDs, and it might also run into memory issues.
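As an aside (untested, and assuming GNU sort): restricting the comparison to the first field should make -u deduplicate on the ID alone, although it still has to sort the whole file and keeps one (not necessarily the first) row per ID; the output file name below is just an example:

sort -u -k1,1 db.txt > db_uniq_ids.txt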

Other than that, you would need to find a way to remove duplicates without keeping the whole file in memory. This could work:

sort -k 1 db.txt > dbsorted.txt
awk '{a[NR]=$1; delete a[NR-2]}; {if (a[NR-1]!=$1){print $0}}' dbsorted.txt > dbdedup.txt

This will only compare the current identifier to the identifier on the previous line, keeping no more than two identifiers in memory at any given time, and it obviously requires properly sorted input.
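In case it is easier to read, here is the same comparison written out as a commented script (an untested sketch; the file name dedup_sorted.awk is just illustrative):

# dedup_sorted.awk -- keep the first row of each identifier in a key-sorted file
# usage: awk -f dedup_sorted.awk dbsorted.txt > dbdedup.txt
$1 != prev { print }   # identifier differs from the previous line: first occurrence, keep the whole row
{ prev = $1 }          # remember the current identifier for the next line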

I haven't thoroughly tested it, though, so please at least also run the cross-check with if (a[NR-1]==$1) to print only the duplicates...


That fixes it:

wc -l dbsorted.txt 
53650725 dbsorted.txt

wc -l dbdedup.txt
51456321 dbdedup.txt

awk '{a[NR]=$1; delete a[NR-2]}; {if (a[NR-1]==$1){print $0}}' dbsorted.txt > onlydedups.txt
wc -l onlydedups.txt 
2194404 onlydedups.txt

Thanks!


You're welcome!

PS: To print out duplicates, including the first occurrences:

sort -k 1 db.txt > dbsorted.txt
awk '{a[NR]=$1; b[NR]=$0; delete a[NR-2]; delete b[NR-2]}; {if (a[NR-1]==$1){print b[NR-1]"\n"$0}}' dbsorted.txt > allduplicatedentries.txt
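If an identifier can occur more than twice, the one-liner above will print the middle occurrences twice; an untested variant that remembers whether the previous row was already printed avoids this:

awk '{if ($1==prev) {if (!dup) print prevline; print; dup=1} else dup=0; prev=$1; prevline=$0}' dbsorted.txt > allduplicatedentries.txt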