Question: VCF tabix query by ID column
1
gravatar for liangzebin5566
3.3 years ago by
liangzebin556630 wrote:

I have a bgzip vcf file and build tabix index already, but I am wondering if tabix can query VCF file by ID, just like:

tabix my.vcf.gz id='rs527352437'

or is there another way to query dbSNP by ID quickly?

tabix vcf • 2.5k views
ADD COMMENTlink modified 3.3 years ago by Alex Reynolds31k • written 3.3 years ago by liangzebin556630

Hello, did you find good solution? I need to find position for some SNP in dbSNP.vcf by rsID . I am currently using awk command, but its really slow, because dbSNP database is huge.

ADD REPLYlink written 23 months ago by MatthewP850
1

you can use snpsift or bcftools: MatthewP

java -jar /opt/snpEff/SnpSift.jar filter "(ID =~ 'rs34875155')" clinvar.vcf.gz
bcftools view -i '%ID=="rs5771206"' clinvar.vcf,gz

bgzip dbsnp vcf and index bgzipped file.

ADD REPLYlink modified 23 months ago • written 23 months ago by cpad011214k
5
gravatar for Alex Reynolds
3.3 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

Here's an approach that uses hg38 SNP data from UCSC, an rs* ID of interest, and pts-line-bisect to do a binary search.

Your VCF-formatted data will be in different columns, but the principle here is the same — you permute the ID into the first column of a text file. You sort that text file on the first column. You do a binary search on the sorted text file.

Binary searches on sorted data are fast — log n fast — resulting in searches that take fractions of a second where they would otherwise take minutes or longer.

1) Get SNPs and write them into a text file sorted by SNP ID. For hg38, for instance:

$ LC_ALL=C
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz \
   | gunzip -c \
   | awk -v OFS="\t" '{ print $5,$2,$3,($3+1) }' \
   | sort -k1,1 \
   > hg38.snp147.sortedByName.txt

2) Install pts-line-bisect:

$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..

3) Run a query:

$ rs_of_interest=rs2814778
$ ./pts-line-bisect/pts_lbsearch -p hg38.snp147.sortedByName.txt ${rs_of_interest} \
   | head -1 \
   | cut -f2- \
   > result.bed

In your case, you could modify your VCF file to permute the columns, so that the ID is in the first column of all columns. You permute and then sort as in step 1. This is the file you query with pts-line-bisect. Whatever result you get — if any — you permute back into VCF field ordering, likely using awk or similar.

With some modifications, it should be possible to do a binary search on a sorted, compressed text file. An important consideration is that the sort order is stable, so that compressed blocks provide a consistent or "deterministic" direction to go in finding the right compressed block for the data being queried.

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Alex Reynolds31k

It can work, but not well. The time cost of pts-line-bisect is O(s * log n), where s is average length of each line, and n is line number, according to pts_line_bisect' page. If there's some specified vcf index tool just like tabix and can query by ID column, with time cost O(1).

ADD REPLYlink written 3.3 years ago by liangzebin556630

As s is a constant, a binary search is O(log n). Tabix is not constant time. If you want O(1), you'll need to read the SNP database into a hash table, using the SNP ID as the key, and the rest of the SNP record as the value, in the hash table's key-value pairing.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Alex Reynolds31k

I use this approach for a web frontend, which translates the user's SNP ID of interest into a genomic position. The web frontend redirects the user to a custom data browser centered at that SNP's position within milliseconds. The SNP file in question — the same dbSNP build 147 dataset in my answer — is a 5.5 GB text file, uncompressed. Searches against this file take a fraction of a second. For this application, I'm pretty happy with that result.

ADD REPLYlink written 3.3 years ago by Alex Reynolds31k

You're right, I make a misunderstood of O(log n), it is quite small. Thank you very much!

ADD REPLYlink written 3.3 years ago by liangzebin556630
0
gravatar for cpad0112
3.3 years ago by
cpad011214k
Hyderabad India
cpad011214k wrote:
$ zgrep -w rs34875155 clinvar.vcf.gz

No need to use tabix for system function, IMHO if you want only records. If you want headers and do it with a vcf filtering tool, then you can use snpsift. Example script to filter a vcf by a single rs ID is furnished below:(output is not pasted here as header lines in clinvar are huge in number):

$ java -jar /opt/snpEff/SnpSift.jar filter "(ID =~ 'rs34875155')" clinvarvcf.gz

if you have multiple rsids in a file, follow example provided by snpsift manual.

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by cpad011214k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1480 users visited in the last hour
_