Question: Most efficient way to search a large dbSNP file using a list of rs numbers from another file
Written 3.3 years ago by bioguy24 (Chicago):

I am trying to find the fastest (most efficient) tool for searching a very large (~100 GB) file. The search terms come from file1, which is a single-column list of rs numbers (one per line); there may be several hundred. File2 is a sorted list of hg19 HGVS nomenclature for each SNP from dbSNP. The desired output is simply the lines of file2 that match an entry in file1. I have tried several grep, awk, and ack commands and they all seem to work, but maybe there is a better approach. I read that parallel may help, so I tried that too, but maybe there is another solution. The awk version is closest to what I need because it searches for every line of file1. Thank you :).

ack

ack '\t(2307492|7349185)$' file2

grep

grep -P '\t(7349185|2307492)$' file2.txt

awk

awk -F'\t' 'NR == FNR { sub(/^rs/, ""); query[$0] = 1; next } $2 in query' file1 file2

awk

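# script.awk (requires gawk 4+ for arrays of arrays)
BEGIN { FS = OFS = "\t" }
# First file: record every value seen for each key.
NR == FNR {
    c = ++num[$1]
    beg[$1][c] = $1
    val[$1][c] = $NF
    next
}
# Second file: look up column 2 and print the line with the first stored value.
$2 in val {
    for (c = 1; c <= num[$2]; c++) {
        if (beg[$2][c] == $2) {    # comparison, not assignment
            print $0, val[$2][c]
            break
        }
    }
}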

awk -f script.awk file1 file2

file1

rs7349185
rs2307492

file2

NC_000001.10:g.26131654G>A  7349185
NC_000001.11:g.25805163G>A  7349185
NG_009930.1:g.9988G>A   7349185
NM_020451.2:c.425G>A    7349185
NM_206926.1:c.323G>A    7349185
NP_065184.2:p.Cys142Tyr 7349185
NP_996809.1:p.Cys108Tyr 7349185
NC_000001.10:g.171168545T>C 2307492
NC_000001.11:g.171199406T>C 2307492
NM_001301347.1:c.-34+2595T>C    2307492
NM_001460.4:c.545T>C    2307492
NP_001451.2:p.Phe182Ser 2307492

parallel

parallel --pipe --block 2M 'grep -E "2307492|7349185"' < file2

You can also use grep -f file1 file2.

written 3.3 years ago by WouterDeCoster
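A minimal sketch of a faster variant of the grep -f idea, assuming file2 is tab-separated with the bare rs number in column 2 and that file1 carries an "rs" prefix, as in the samples from the question. The temp directory, the file names ids.txt and hits.txt, and the last row of the sample file2 are made up for illustration. Treating the patterns as fixed strings (-F) and forcing the C locale is often noticeably faster than regex matching, but timings on a 100 GB file were not measured here.

```shell
#!/bin/sh
# Sketch: fixed-string lookup of rs numbers in a tab-separated HGVS table.
# The temp directory and the names ids.txt / hits.txt are illustrative only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data taken from the question (file2 is "HGVS<TAB>rs number").
printf 'rs7349185\nrs2307492\n' > file1
printf '%s\t%s\n' \
    'NC_000001.10:g.26131654G>A' 7349185 \
    'NM_020451.2:c.425G>A' 7349185 \
    'NC_000001.10:g.171168545T>C' 2307492 \
    'NM_001460.4:c.545T>C' 2307492 \
    'NC_000001.10:g.100A>T' 999 > file2   # last row is made-up filler

# Strip the "rs" prefix so the patterns match the bare numbers in file2.
sed 's/^rs//' file1 > ids.txt

# -F: patterns are literal strings; -w: whole-word matches only;
# LC_ALL=C: byte-wise comparison instead of locale-aware matching.
LC_ALL=C grep -F -w -f ids.txt file2 > hits.txt
cat hits.txt
```

Note that -w matches the number anywhere on the line as a whole word, not only in column 2, so a stray hit inside an HGVS string is possible in principle; an awk test on column 2 avoids that.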

Both grep -f file1 file2 and awk -F'\t' 'NR==FNR{A[$1];next}$2 in A' file1 file2 also work, but they are pretty slow on files this large. Thank you :).

written 3.3 years ago by bioguy24
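One possible way to shorten the scan, sketched under a strong assumption: all lines for a given rs number sit next to each other in file2, as they do in the sample. If that holds, awk can stop reading as soon as every requested ID has been seen and a non-matching line follows. The variable names (want, seen, found, n), the temp directory, and the two filler rows at the end of the sample file2 are made up for illustration.

```shell
#!/bin/sh
# Sketch: hash lookup on column 2 with an early exit.
# ASSUMPTION: rows for one rs number are contiguous in file2.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data taken from the question, plus two made-up filler rows at the end.
printf 'rs7349185\nrs2307492\n' > file1
printf '%s\t%s\n' \
    'NC_000001.10:g.26131654G>A' 7349185 \
    'NM_020451.2:c.425G>A' 7349185 \
    'NC_000001.10:g.171168545T>C' 2307492 \
    'NM_001460.4:c.545T>C' 2307492 \
    'NC_000001.10:g.100A>T' 999 \
    'NC_000001.10:g.200A>T' 998 > file2

awk -F'\t' '
    # First file: remember each distinct ID without its "rs" prefix.
    NR == FNR {
        id = $1
        sub(/^rs/, "", id)
        if (id != "" && !(id in want)) { want[id]; n++ }
        next
    }
    # Second file: print matches and count each ID the first time it appears.
    $2 in want {
        print
        if (!($2 in seen)) { seen[$2]; found++ }
        next
    }
    # A non-matching line after every ID has been seen: nothing left to find.
    found == n { exit }
' file1 file2 > hits.txt
cat hits.txt
```

The saving depends entirely on where the requested IDs fall in file2; if one of them is near the end, or is missing altogether, the whole file is still read.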

Strictly speaking, this isn't a bioinformatics question, though it uses informatics data. You may have better luck searching or asking on Stack Overflow/Stack Exchange. Here are example1, example2 of similar questions.

written 3.3 years ago by genomax

Please sort your files on the desired columns and add a test using join.

written 3.3 years ago by Pierre Lindenbaum
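The sort-then-join suggestion could look roughly like the sketch below. It assumes file2 is tab-separated with the rs number in column 2 and that file1 carries an "rs" prefix, as in the samples from the question. The temp directory, the intermediate file names (ids.sorted, file2.by_id, hits.txt), and the last row of the sample file2 are made up for illustration.

```shell
#!/bin/sh
# Sketch: merge-join a sorted ID list against file2 sorted on its ID column.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data taken from the question (last file2 row is made-up filler).
printf 'rs7349185\nrs2307492\n' > file1
printf '%s\t%s\n' \
    'NC_000001.10:g.26131654G>A' 7349185 \
    'NM_020451.2:c.425G>A' 7349185 \
    'NC_000001.10:g.171168545T>C' 2307492 \
    'NM_001460.4:c.545T>C' 2307492 \
    'NC_000001.10:g.100A>T' 999 > file2

# sort and join must use the same collation, so pin the locale for both.
export LC_ALL=C
tab=$(printf '\t')

# Strip the "rs" prefix, then sort and de-duplicate the query IDs.
sed 's/^rs//' file1 | sort -u > ids.sorted

# Sort file2 on column 2 (the rs number), compared as text to match join.
sort -t "$tab" -k2,2 file2 > file2.by_id

# Join column 1 of the ID list with column 2 of file2;
# -o 2.1,2.2 prints the matching file2 rows in their original column order.
join -t "$tab" -1 1 -2 2 -o 2.1,2.2 ids.sorted file2.by_id > hits.txt
cat hits.txt
```

Sorting a 100 GB file is expensive in itself, so this is only likely to pay off if the sorted copy is kept and reused for many queries, or if file2 is already sorted on the rs number column in the C locale.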