Question: Remove duplicates in an extremely large text file
OAJn8634 wrote, 19 months ago:

I have a very large text file (9.2GB) that contains data in two columns (named All_SNPs.txt). I would like to use this file to convert my chr:pos to rs in my plink files. The file looks like this:

1:116342    rs1000277323
1:173516    rs1000447106
1:168592    rs1000479828
1:102498    rs1000493007

However, plink produced an error stating that All_SNPs.txt contains duplicates in column 1. I have tried different ways of removing the duplicates; specifically, I attempted the following commands:

awk '!x[$1]++ { print $1,$2 }' All_SNPs.txt > All_SNPs_nodup.txt

sort  All_SNPs.txt | uniq -u > All_SNPs_nodup.txt

cat All_SNPs.txt | sort | uniq -u > All_SNPs_nodup.txt

However, each time I faced one of the same two problems: 1) either the command made no difference and I got the same file back (this was for the pipelines using cat), or 2) I received an error that the procedure halted because a limit (presumably memory) was exceeded (this was for sort | uniq, and for awk).

I will be very grateful for any ideas of how I can make this work. Thank you very much.

PS, this file is far too large to open in R.

awk snp plink uniq • 1.9k views
modified 19 months ago by Santosh Anand • written 19 months ago by OAJn8634

You could try converting the input file to a valid VCF, then use bcftools sort and bcftools norm -N -d none to remove the duplicates. At the end you can convert back to the input format.

written 19 months ago by finswimmer

Using datamash:

 datamash -sg 1 unique 2  <test.txt

datamash is available in brew, conda, apt repos.

Using tsv-utils:

tsv-uniq --ignore-case -H -f 1 test.txt
modified 19 months ago • written 19 months ago by cpad0112

You could also try a hash table, like a Python dict. No idea how much memory it would require, but you can try it if nothing else works.

written 19 months ago by geek_y
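A minimal sketch of that idea (the function and file names are mine, not from the thread): stream the file once, keep a set of column-1 keys already seen, and write only the first occurrence of each.

```python
def dedup_first_column(src, dst):
    """Stream src, writing only the first line seen for each
    value in column 1 (whitespace-delimited). Memory grows with
    the number of distinct keys, not with the file size."""
    seen = set()
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            key = line.split(None, 1)[0]
            if key not in seen:
                seen.add(key)
                fout.write(line)

# e.g. dedup_first_column("All_SNPs.txt", "All_SNPs_nodup.txt")
```

Note this has the same fundamental limitation as the awk array: each distinct key still costs tens of bytes, so with hundreds of millions of SNPs the set itself may exhaust RAM.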
bruce.moran wrote, 19 months ago (Ireland):

Cut and uniq to find the duplicates, then grep -v them away. Relatively quick on a 389 MB dummy file.

time sort -V All_SNPs.txt | cut -f 1 | uniq -c | perl -ane 'if($F[0] ne "1"){print "$F[1]\t$F[0]\n";}' > All_SNPs.dup.chr-pos.txt

real    0m57.428s
user    3m19.091s
sys     0m3.152s

time cut -f 1 All_SNPs.dup.chr-pos.txt | grep -wvf - All_SNPs.txt  > All_SNPs.nodup.txt

real    0m2.516s
user    0m1.697s
sys     0m0.229s
written 19 months ago by bruce.moran
Shred wrote, 19 months ago:

Split the text file into smaller ones using the split command. You could split by size, for example:

split -b 200m filename

This will produce files named xaa, xab, xac, and so on. Now use awk, but with simpler syntax:

awk -F"\t" '!seen[$1]++' xa*

After that, join the files with a simple cat into the destination file.

written 19 months ago by Shred

How big is the chance that two dups end up in different files?

written 19 months ago by Benn

You can use awk to split your file according to the chromosomes:

awk '{ split($1, a, ":"); print $1"\t"$2 >> a[1]".txt"; }' All_SNPs.txt

That ensures you don't miss any duplicate pair. If you'd like to reduce the file size a bit, you can also strip the chr: prefix:

awk '{ split($1, a, ":"); print a[2]"\t"$2 >> a[1]".txt"; }' All_SNPs.txt

This will also allow you to keep a few more items in the hash.

written 19 months ago by michael.ante

Shit happens. But if a file is too large for RAM, there's no way in bash to map it into memory. Maybe a Python solution would work, as explained here

written 19 months ago by Shred
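One way to make the memory bound constant (a sketch of my own, not the linked code): sort the file on disk first — GNU sort does an external merge sort via temporary files rather than holding everything in RAM — and then a single streaming pass only has to compare each key with the previous one.

```python
def dedup_sorted(src, dst):
    """Assumes src is already sorted on column 1 (e.g. by
    `sort -k1,1 All_SNPs.txt > sorted.txt`). Duplicate keys are
    then adjacent, so keeping the first line of each run of equal
    keys deduplicates the file using O(1) memory."""
    prev = None
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            key = line.split(None, 1)[0]
            if key != prev:
                fout.write(line)
                prev = key
```

Which of the duplicate rs IDs survives depends on the sort order, so this keeps an arbitrary one per chr:pos rather than the first in the original file.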

Yeah, the link is worth a try, or here is another awk solution. I can't test it myself, though.

modified 19 months ago • written 19 months ago by Benn

If you sort first, I guess the chance is very small, no?

written 19 months ago by Gautier Richard

Sort loads file into memory too.

written 19 months ago by Shred

Do you actually have this as a PLINK dataset? Why not use PLINK's own functionality to update the map file? For example, --list-duplicate-vars lists duplicates, which can then be excluded.

written 19 months ago by Kevin Blighe

This has worked. Thank you very much for the suggestion!

written 19 months ago by OAJn8634
Benn wrote, 19 months ago (Netherlands):

Did you try:

awk '!seen[$1]++' All_SNPs.txt > All_SNPs_nodup.txt

If it doesn't work, I think you need better hardware...

modified 19 months ago • written 19 months ago by Benn

OP tried: awk '!x[$1]++ { print $1,$2 }' All_SNPs.txt > All_SNPs_nodup.txt @b.nota

written 19 months ago by cpad0112

Thank you for your suggestion. I have tried this command a few times, but unfortunately I get an error: Cannot allocate memory

written 19 months ago by OAJn8634
Santosh Anand wrote, 19 months ago:

Plink has a basic mechanism to deal with dups:

--list-duplicate-vars <require-same-ref> <ids-only> <suppress-first>

https://www.cog-genomics.org/plink/1.9/data#list_duplicate_vars

The duplicated vars can then be excluded using --exclude plink.dupvar

written 19 months ago by Santosh Anand
Powered by Biostar version 2.3.0