I am trying to find the fastest (most efficient) tool for searching a very large (~100 GB) file. The input that is searched for is file1, which is just a single column of rs#'s (one per line); there may be several hundred of them. File2 is a sorted list of hg19 HGVS nomenclature for each SNP from dbSNP. I have tried a variety of grep, awk, and ack commands and they all seem to work, but maybe there is a better approach. The desired result is simply the lines of file2 that match the IDs in file1. I read that parallel may help, so I tried that too, but maybe there is another solution. The awk is closest to what I need, as it searches for every ID in file1 rather than just a couple. Thank you :).
ack
ack "(^2307492)|(^7349185)" file2
grep
grep -P "\t(7349185|2307492)$" file2
awk
awk 'NR == FNR { sub(/^rs/, ""); query[$0] = 1; next } $2 in query' file1 file2
awk
BEGIN { FS=OFS="\t" }
NR==FNR {                           # first file (file1): index the rs IDs
    sub(/^rs/, "")                  # drop the "rs" prefix so keys match file2's second column
    c = ++num[$1]
    beg[$1][c] = $1
    val[$1][c] = $NF
    next
}
$2 in num {                         # second file (file2): its ID column is in the lookup
    for (c=1; c<=num[$2]; c++) {
        if ( beg[$2][c] == $2 ) {   # "==" comparison, not "=" assignment
            print $0, val[$2][c]
            break
        }
    }
}
awk -f script.awk file1 file2    # needs GNU awk 4+ for the arrays of arrays
file1
rs7349185
rs2307492
file2
NC_000001.10:g.26131654G>A 7349185
NC_000001.11:g.25805163G>A 7349185
NG_009930.1:g.9988G>A 7349185
NM_020451.2:c.425G>A 7349185
NM_206926.1:c.323G>A 7349185
NP_065184.2:p.Cys142Tyr 7349185
NP_996809.1:p.Cys108Tyr 7349185
NC_000001.10:g.171168545T>C 2307492
NC_000001.11:g.171199406T>C 2307492
NM_001301347.1:c.-34+2595T>C 2307492
NM_001460.4:c.545T>C 2307492
NP_001451.2:p.Phe182Ser 2307492
parallel
parallel --pipe --block 2M grep -e 2307492 -e 7349185 < file2
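With several hundred IDs, the patterns can also come from file1 instead of being typed inline. A minimal sketch of that idea, assuming file1 carries the "rs" prefix shown above (the intermediate file name ids.txt is just illustrative):
sed 's/^rs//' file1 > ids.txt                      # bare rs numbers, one per line
parallel --pipe --block 2M grep -Ff ids.txt < file2    # fixed-string grep on each 2M block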
You can also use
grep -f file1 file2
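One caveat with that as written: the file1 shown above has an "rs" prefix that never appears in file2, and plain -f treats each line as a regex. A hedged variant that strips the prefix and matches fixed whole words (process substitution assumes bash; the C locale usually speeds grep up noticeably on files this size):
LC_ALL=C grep -Fwf <(sed 's/^rs//' file1) file2    # -F fixed strings, -w whole-word matches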
grep -f file1 file2 and awk -F'\t' 'NR==FNR{A[$1];next}$2 in A' file1 file2 also work, but they are pretty slow on these large files. Thank you :).
This isn't strictly a bioinformatics question, though it uses bioinformatics data. You may have better luck searching/asking on stackoverflow/stackexchange. Here are example1, example2 of similar questions.
Please sort both files on the relevant columns and also test join.
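A rough sketch of that join approach, assuming tab-separated fields, bash (for the $'\t' syntax), and enough temp space to sort the 100 GB file; the -o option just restores file2's original column order:
sort -t$'\t' -k2,2 file2 > file2.sorted                      # sort file2 on its rs-number column
sed 's/^rs//' file1 | sort > ids.sorted                      # bare, sorted rs numbers from file1
join -t$'\t' -1 1 -2 2 -o 2.1,2.2 ids.sorted file2.sorted    # print the matching file2 lines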