I am trying to find the fastest (most efficient) tool for searching a very large (~100 GB) file. The input that is searched for is file1, which is just a single column of rs#'s (one per line); there may be several hundred of them. File2 is a sorted list of hg19 HGVS nomenclature for each SNP from dbSNP. I have tried a variety of grep, awk, and ack commands and they all seem to work, but maybe there is a better approach. The desired result is simply the lines of file2 that match the IDs in file1. I read that parallel may help, so I tried that too, but maybe there is another solution. The awk is closest to what I need, as it searches for every ID in file1 rather than just a couple. Thank you :).
ack
ack "(^2307492)|(^7349185)" file2
grep
grep -P "\t(7349185|2307492)$" file2
awk
awk 'NR == FNR { sub(/^rs/, ""); query[$0] = 1; next } $2 in query' file1 file2
awk
BEGIN { FS=OFS="\t" }
NR==FNR {                           # first file (file1): index the rs IDs
    sub(/^rs/, "")                  # drop the "rs" prefix so keys match file2's second column
    c = ++num[$1]
    beg[$1][c] = $1
    val[$1][c] = $NF
    next
}
$2 in num {                         # second file (file2): its ID column is in the lookup
    for (c=1; c<=num[$2]; c++) {
        if ( beg[$2][c] == $2 ) {   # "==" comparison, not "=" assignment
            print $0, val[$2][c]
            break
        }
    }
}
awk -f script.awk file1 file2    # needs GNU awk 4+ for the arrays of arrays
file1
rs7349185
rs2307492
file2
NC_000001.10:g.26131654G>A 7349185
NC_000001.11:g.25805163G>A 7349185
NG_009930.1:g.9988G>A 7349185
NM_020451.2:c.425G>A 7349185
NM_206926.1:c.323G>A 7349185
NP_065184.2:p.Cys142Tyr 7349185
NP_996809.1:p.Cys108Tyr 7349185
NC_000001.10:g.171168545T>C 2307492
NC_000001.11:g.171199406T>C 2307492
NM_001301347.1:c.-34+2595T>C 2307492
NM_001460.4:c.545T>C 2307492
NP_001451.2:p.Phe182Ser 2307492
parallel
parallel --pipe --block 2M grep -e 2307492 -e 7349185 < file2
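With several hundred IDs, the patterns can also come from file1 instead of being typed inline. A minimal sketch of that idea, assuming file1 carries the "rs" prefix shown above (the intermediate file name ids.txt is just illustrative):
sed 's/^rs//' file1 > ids.txt                      # bare rs numbers, one per line
parallel --pipe --block 2M grep -Ff ids.txt < file2    # fixed-string grep on each 2M block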
You can also use
grep -f file1 file2
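One caveat with that as written: the file1 shown above has an "rs" prefix that never appears in file2, and plain -f treats each line as a regex. A hedged variant that strips the prefix and matches fixed whole words (process substitution assumes bash; the C locale usually speeds grep up noticeably on files this size):
LC_ALL=C grep -Fwf <(sed 's/^rs//' file1) file2    # -F fixed strings, -w whole-word matches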
grep -f file1 file2 and awk -F'\t' 'NR==FNR{A[$1];next}$2 in A' file1 file2 also work, but they are pretty slow on these large files. Thank you :).
This isn't strictly a bioinformatics question, though it uses bioinformatics data. You may have better luck searching/asking on stackoverflow/stackexchange. Here are example1, example2 of similar questions.
Please sort both files on the relevant columns and also test join.
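A rough sketch of that join approach, assuming tab-separated fields, bash (for the $'\t' syntax), and enough temp space to sort the 100 GB file; the -o option just restores file2's original column order:
sort -t$'\t' -k2,2 file2 > file2.sorted                      # sort file2 on its rs-number column
sed 's/^rs//' file1 | sort > ids.sorted                      # bare, sorted rs numbers from file1
join -t$'\t' -1 1 -2 2 -o 2.1,2.2 ids.sorted file2.sorted    # print the matching file2 lines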