Question: Remove duplicates in an extremely large text file
Question asked 6 months ago by OAJn8634 (+2):

I have a very large text file (9.2GB) that contains data in two columns (named All_SNPs.txt). I would like to use this file to convert my chr:pos to rs in my plink files. The file looks like this:

1:116342    rs1000277323
1:173516    rs1000447106
1:168592    rs1000479828
1:102498    rs1000493007

However, plink produced an error stating that All_SNPs.txt contains duplicates in column 1. I have tried different ways of removing the duplicates; specifically, I attempted the following commands:

awk '!x[$1]++ { print $1,$2 }' All_SNPs.txt > All_SNPs_nodup.txt

sort  All_SNPs.txt | uniq -u > All_SNPs_nodup.txt

cat All_SNPs.txt | sort | uniq -u > All_SNPs_nodup.txt

However, each time I ran into one of the same two problems: 1) either the command made no difference and I received the same file back (this happened for the pipelines using cat), or 2) I received an error, "the procedures halted: have exceeded" (this happened for sort | uniq and awk).

I will be very grateful for any ideas of how I can make this work. Thank you very much.

PS: this file is far too large to open in R.

awk • snp • plink • uniq

You could try converting the input file to a valid vcf, then use bcftools sort and bcftools norm -N -d none to remove the duplicates. At the end you can convert back to the input format.
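A rough sketch of the conversion step (a toy sample stands in for the real file; the N reference allele and sites-only layout are placeholders, and real bcftools use will likely also want ##contig header lines):

```shell
# toy sample standing in for the 9.2 GB All_SNPs.txt (tab-separated)
printf '1:116342\trs1000277323\n1:116342\trs1000277324\n1:173516\trs1000447106\n' > All_SNPs.txt

# build a minimal sites-only VCF: split chr:pos on ":", keep the rsID,
# and fill REF/ALT/QUAL/FILTER/INFO with placeholders
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  awk -F'[:\t]' '{print $1"\t"$2"\t"$3"\tN\t.\t.\t.\t."}' All_SNPs.txt
} > All_SNPs.vcf

# then, roughly:
#   bcftools sort All_SNPs.vcf | bcftools norm -N -d none -o All_SNPs.dedup.vcf
```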

Reply by finswimmer, 6 months ago

Using datamash:

datamash -sg 1 unique 2 < test.txt

datamash is available in brew, conda, apt repos.

Using tsv-utils:

tsv-uniq --ignore-case -H -f 1 test.txt
Reply by cpad0112, 6 months ago

Also try a hashtable, like a Python dict. No idea how much memory it would require, but you can try it if nothing else works.
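A minimal sketch of that idea (a toy input stands in for All_SNPs.txt; only the chr:pos keys are held in memory, not the whole file):

```shell
# toy sample standing in for All_SNPs.txt
printf '1:116342\trsA\n1:116342\trsB\n1:173516\trsC\n' > All_SNPs.txt

# stream the file once, keeping only the first occurrence of each
# chr:pos key; the set holds keys only, so memory stays bounded
python3 - All_SNPs.txt > All_SNPs_nodup.txt <<'EOF'
import sys

seen = set()
with open(sys.argv[1]) as fh:
    for line in fh:
        key = line.split(None, 1)[0]  # first column = chr:pos
        if key not in seen:
            seen.add(key)
            sys.stdout.write(line)
EOF
```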

Reply by geek_y, 6 months ago
Answer (+4), written 6 months ago by bruce.moran (Ireland):

Use cut and uniq to find the duplicate keys, then grep -v them away. Relatively quick on a 389MB dummy file.

time sort -V All_SNPs.txt | cut -f 1 | uniq -c | perl -ane 'if($F[0] ne "1"){print "$F[1]\t$F[0]\n";}' > All_SNPs.dup.chr-pos.txt

real    0m57.428s
user    3m19.091s
sys     0m3.152s

time cut -f 1 All_SNPs.dup.chr-pos.txt | grep -wvf - All_SNPs.txt  > All_SNPs.nodup.txt

real    0m2.516s
user    0m1.697s
sys     0m0.229s
Answer (+2), written 6 months ago by Shred:

Split the text file into smaller ones using the split command. You could split by size, for example:

split -b 200m filename

This will produce files named xaa, xab, xac, and so on. Now use awk, but with a simpler syntax:

awk -F"\t" '!seen[$1]++' xa*

After that, join the files back together with a simple cat into the destination file.


How big is the chance that two dups end up in different files?

Reply by Benn, 6 months ago

You can use awk to split your file according to the chromosomes:

awk '{ split($1, a, ":"); print $1"\t"$2 >> a[1]".txt"; }' All_SNPs.txt

That ensures you don't miss any dups. If you'd like to reduce the file size a bit, you can remove the chr: prefix:

awk '{ split($1, a, ":"); print a[2]"\t"$2 >> a[1]".txt"; }' All_SNPs.txt

This will also allow you to keep a bit more items in the hash.
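Putting the split and the dedupe together on a toy example (the by_chr directory and file names here are illustrative, not what the awk above literally produces):

```shell
# toy per-chromosome files standing in for the output of the split above
mkdir -p by_chr
printf '1:116342\trsA\n1:116342\trsB\n' > by_chr/1.txt
printf '2:55716\trsC\n' > by_chr/2.txt

# each awk invocation is a fresh process, so seen[] only ever holds one
# chromosome's keys; a chr:pos key can never span two chromosome files
for f in by_chr/*.txt; do
    awk '!seen[$1]++' "$f"
done > All_SNPs_nodup.txt
```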

Reply by michael.ante, 6 months ago

Shit happens. But if a file is too large for RAM, in bash there's no way to map it into memory. Maybe a solution would be in Python, as explained here

Reply by Shred, 6 months ago

Yeah, the link is worth a try, or here is another awk solution. Can't test it myself, though.

Reply by Benn, 6 months ago

If you sort first, I guess the chance is very small, no?

Reply by Gautier Richard, 6 months ago

Sort loads file into memory too.

Reply by Shred, 6 months ago

Do you actually have this as a PLINK dataset? Why not try to use PLINK functionality to update the map file? For example, --list-duplicate-vars lists duplicates, which can then be excluded.

Reply by Kevin Blighe, 6 months ago

This has worked. Thank you very much for the suggestion!

Reply by OAJn8634, 6 months ago
Answer (+0), written 6 months ago by Benn (Netherlands):

Did you try:

awk '!seen[$1]++' All_SNPs.txt > All_SNPs_nodup.txt

If it doesn't work I think you need better hardware...
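If awk's in-memory array is the bottleneck, a disk-backed alternative is GNU sort, which merges through temporary files rather than holding everything in RAM (a sketch on a toy input; note that -u with a -k key keeps an arbitrary line per key, not necessarily the first):

```shell
# toy sample standing in for the 9.2 GB file
printf '1:116342\trsA\n1:116342\trsB\n1:173516\trsC\n' > All_SNPs.txt

# -k1,1 -u: compare only field 1, keep exactly one line per key;
# tune -S (buffer size) and -T (temp dir) for a file this large
sort -k1,1 -u All_SNPs.txt > All_SNPs_nodup.txt
```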


OP tried: awk '!x[$1]++ { print $1,$2 }' All_SNPs.txt > All_SNPs_nodup.txt @ b.nota

Reply by cpad0112, 6 months ago

Thank you for your suggestion. I have tried this command a few times, but unfortunately I get an error: Cannot allocate memory

Reply by OAJn8634, 6 months ago
Answer (+0), written 6 months ago by Santosh Anand:

PLINK has a basic mechanism to deal with dups:

--list-duplicate-vars <require-same-ref> <ids-only> <suppress-first>

https://www.cog-genomics.org/plink/1.9/data#list_duplicate_vars

The duplicated vars can then be excluded using --exclude plink.dupvar.
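A sketch of the two calls, assuming a binary fileset with the hypothetical prefix mydata (flags as in the linked PLINK 1.9 docs):

```shell
# write the IDs of all but the first variant at each duplicated position
plink --bfile mydata --list-duplicate-vars ids-only suppress-first --out dups

# drop those IDs from the dataset
plink --bfile mydata --exclude dups.dupvar --make-bed --out mydata_nodup
```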

Powered by Biostar version 2.3.0