I have a bgzip vcf file and build tabix index already, but I am wondering if tabix can query VCF file by ID, just like:
tabix my.vcf.gz id='rs527352437'
or is there another way to query dbSNP by ID quickly?
I have a bgzip vcf file and build tabix index already, but I am wondering if tabix can query VCF file by ID, just like:
tabix my.vcf.gz id='rs527352437'
or is there another way to query dbSNP by ID quickly?
Here's an approach that uses hg38 SNP data from UCSC, an rs* ID of interest, and pts-line-bisect to do a binary search.
Your VCF-formatted data will be in different columns, but the principle here is the same — you permute the ID into the first column of a text file. You sort that text file on the first column. You do a binary search on the sorted text file.
Binary searches on sorted data are fast — log n fast — resulting in searches that take fractions of a second where they would otherwise take minutes or longer.
1) Get SNPs and write them into a text file sorted by SNP ID. For hg38, for instance:
$ LC_ALL=C
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz \
   | gunzip -c \
   | awk -v OFS="\t" '{ print $5,$2,$3,($3+1) }' \
   | sort -k1,1 \
   > hg38.snp147.sortedByName.txt
2) Install pts-line-bisect:
$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..
3) Run a query:
$ rs_of_interest=rs2814778
$ ./pts-line-bisect/pts_lbsearch -p hg38.snp147.sortedByName.txt ${rs_of_interest} \
   | head -1 \
   | cut -f2- \
   > result.bed
In your case, you could modify your VCF file to permute the columns, so that the ID is in the first column of all columns. You permute and then sort as in step 1. This is the file you query with pts-line-bisect. Whatever result you get — if any — you permute back into VCF field ordering, likely using awk or similar.
With some modifications, it should be possible to do a binary search on a sorted, compressed text file. An important consideration is that the sort order is stable, so that compressed blocks provide a consistent or "deterministic" direction to go in finding the right compressed block for the data being queried.
It can work, but not well. The time cost of pts-line-bisect is O(s * log n), where s is average length of each line, and n is line number, according to pts_line_bisect' page. If there's some specified vcf index tool just like tabix and can query by ID column, with time cost O(1).
I use this approach for a web frontend, which translates the user's SNP ID of interest into a genomic position. The web frontend redirects the user to a custom data browser centered at that SNP's position within milliseconds. The SNP file in question — the same dbSNP build 147 dataset in my answer — is a 5.5 GB text file, uncompressed. Searches against this file take a fraction of a second. For this application, I'm pretty happy with that result.
$ zgrep -w rs34875155 clinvar.vcf.gz
No need to use tabix for system function, IMHO if you want only records. If you want headers and do it with a vcf filtering tool, then you can use snpsift. Example script to filter a vcf by a single rs ID is furnished below:(output is not pasted here as header lines in clinvar are huge in number):
$ java -jar /opt/snpEff/SnpSift.jar filter "(ID =~ 'rs34875155')" clinvarvcf.gz
if you have multiple rsids in a file, follow example provided by snpsift manual.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Hello, did you find good solution? I need to find position for some SNP in
dbSNP.vcfbyrsID. I am currently usingawkcommand, but its really slow, because dbSNP database is huge.you can use snpsift or bcftools: MatthewP
bgzip dbsnp vcf and index bgzipped file.