Question: Remove duplicates in an extremely large text file

OAJn8634 wrote, 15 days ago:

I have a very large text file, All_SNPs.txt (9.2 GB), that contains data in two columns. I would like to use this file to convert the chr:pos IDs in my plink files to rs IDs. The file looks like this:

1:116342    rs1000277323
1:173516    rs1000447106
1:168592    rs1000479828
1:102498    rs1000493007

However, plink produced an error stating that All_SNPs.txt contains duplicates in column 1. I have tried different ways of removing the duplicates; specifically, I attempted the following commands:

awk '!x[$1]++ { print $1,$2 }' All_SNPs.txt > All_SNPs_nodup.txt

sort  All_SNPs.txt | uniq -u > All_SNPs_nodup.txt

cat All_SNPs.txt | sort | uniq -u > All_SNPs_nodup.txt

However, each time I ran into one of the same two problems: 1) the command made no difference and I got the same file back (this happened with the commands that use cat), or 2) I received an error: "the procedures halted: have exceeded" (this happened with sort | uniq and with awk).

I would be very grateful for any ideas on how I can make this work. Thank you very much.

PS: this file is far too large to open in R.

Tags: awk • snp • plink • uniq

You could try converting the input file to a valid VCF, then use bcftools sort and bcftools norm -N -d none to remove the duplicates. At the end, you can convert back to the input format.

Reply written 15 days ago by finswimmer

Using datamash:

datamash -sg 1 unique 2 < test.txt

datamash is available in the brew, conda, and apt repos.

Using tsv-utils:

tsv-uniq --ignore-case -H -f 1 test.txt
Reply written 15 days ago by cpad0112

Also try a hash table, like a Python dict. No idea how much memory it would require, but you can try it if nothing else works.

Reply written 15 days ago by geek_y

bruce.moran wrote, 15 days ago:

Use cut and uniq to find the duplicates, then grep -v them away. This is relatively quick on a 389 MB dummy file.

time sort -V All_SNPs.txt | cut -f 1 | uniq -c | perl -ane 'if($F[0] ne "1"){print "$F[1]\t$F[0]\n";}' > All_SNPs.dup.chr-pos.txt

real    0m57.428s
user    3m19.091s
sys     0m3.152s

time cut -f 1 All_SNPs.dup.chr-pos.txt | grep -wvf - All_SNPs.txt  > All_SNPs.nodup.txt

real    0m2.516s
user    0m1.697s
sys     0m0.229s

Shred wrote, 15 days ago:

Split the text file into smaller ones using the split command. You could split by size (keeping lines intact), for example:

split -C 200M filename

This will produce files named xaa, xab, xac, and so on. Now use awk, but with a simpler syntax:

awk -F"\t" '!seen[$1]++' xa*

After that, join the per-chunk outputs into the destination file with a simple cat.
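
A minimal sketch of the whole split/dedup/merge pass (the chunk_ prefix and output names here are placeholders; note the caveat raised in the comments below: duplicates that end up in different chunks will survive this pass):

split -C 200M All_SNPs.txt chunk_               # ~200 MB chunks, lines kept intact
for f in chunk_*; do
    awk -F"\t" '!seen[$1]++' "$f" > "$f.nodup"  # the hash only has to hold one chunk's keys
done
cat chunk_*.nodup > All_SNPs_nodup.txt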


How big is the chance that two dups end up in different files?

Reply written 15 days ago by b.nota

You can use awk to split your file according to the chromosomes:

awk '{ split($1, a, ":"); print $1"\t"$2 >> a[1]".txt"; }' All_SNPs.txt

That ensures you don't miss any duplicates. If you'd like to reduce the file size a bit, you can drop the chr: prefix:

awk '{ split($1, a, ":"); print a[2]"\t"$2 >> a[1]".txt"; }' All_SNPs.txt

This will also allow you to keep a few more items in the hash.
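
As a follow-up sketch, assuming the first awk variant above (the one that keeps chr:pos in column 1) was run from inside an otherwise empty directory, say chr_split/, reading ../All_SNPs.txt, so that the per-chromosome files 1.txt, 2.txt, ... are the only .txt files there:

cd chr_split
for f in *.txt; do
    awk '!seen[$1]++' "$f"    # the hash only has to hold one chromosome at a time
done > ../All_SNPs_nodup.txt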

Reply written 15 days ago by michael.ante

Shit happens. But if a file is too large for RAM, there's no way to map it into memory in bash. Maybe a Python solution would work, as explained here.

Reply written 15 days ago by Shred

Yeah, the link is worth a try, or here is another awk solution. I can't test it myself, though.

Reply written 15 days ago by b.nota

If you sort first, I guess the chance is very small, no?

Reply written 15 days ago by Gautier Richard

Sort loads the file into memory too.

Reply written 15 days ago by Shred

Do you actually have this as a PLINK dataset? Why not try using PLINK's own functionality to update the map file? For example, --list-duplicate-vars lists duplicates, which can then be excluded.

Reply written 15 days ago by Kevin Blighe

This has worked. Thank you very much for the suggestion!

Reply written 15 days ago by OAJn8634

b.nota wrote, 15 days ago:

Did you try:

awk '!seen[$1]++' All_SNPs.txt > All_SNPs_nodup.txt

If it doesn't work, I think you need better hardware...


@b.nota: the OP already tried awk '!x[$1]++ { print $1,$2 }' All_SNPs.txt > All_SNPs_nodup.txt

Reply written 15 days ago by cpad0112

Thank you for your suggestion. I have tried this command a few times, but unfortunately I get an error: Cannot allocate memory.

Reply written 15 days ago by OAJn8634

Santosh Anand wrote, 15 days ago:

PLINK has a basic mechanism for dealing with duplicates:

--list-duplicate-vars <require-same-ref> <ids-only> <suppress-first>

https://www.cog-genomics.org/plink/1.9/data#list_duplicate_vars

The duplicated variants can then be excluded using --exclude plink.dupvar.
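
A sketch of the two-step workflow, assuming a binary fileset named mydata (mydata.bed/.bim/.fam); the file names here are placeholders:

plink --bfile mydata --list-duplicate-vars ids-only suppress-first --out dups
plink --bfile mydata --exclude dups.dupvar --make-bed --out mydata_nodup

With ids-only and suppress-first, dups.dupvar should contain only the IDs of all but the first variant in each duplicate group, which is the format --exclude expects.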
