Fast way to return genomic regions for 10,000 SNPs in rsID
3
0
Entering edit mode
3.3 years ago
Shicheng Guo ★ 9.0k

Hi All,

Suppose I have 10,000 SNPs with rsID, I want to return the genomic regions for these rsID as a bed file. What's the fast way? I ask this question because if I use vcftools --snps, it is very very slow. Furthermore, I can submit the rsID to SNPnexus, but I just want to some Linux command which I don't need to copy, upload download, cut and reformat. Done! bcftools is a little faster, however, still very slow, even I split the job to 35 PBS jobs.

bcftools view -i 'ID=@InnateDB.UTR3.rsid.txt' ~/hpc/db/hg19/All_20180423.vcf | grep -v '#' | awk '{print $1,$2,$3,$4,$5}' OFS="\t"

Thanks.

Test result 1: SNPnexus used 5 hours to complete the job for 20,000 SNPs with only genomic position and East Asian Allele frequency selection. the problem is even it is completed, the excel is 108M, I cannot open such big excel with win7.

SNP rsid bed • 1.9k views
ADD COMMENT
0
Entering edit mode

Does this mean this question is solved? Please separate your post into question and answer, post the answer below.

ADD REPLY
0
Entering edit mode

Still very low. Wait for some other faster solution.

ADD REPLY
0
Entering edit mode

Obviously, all the index method is target to genomic region, rather than rsID, therefore, if you search by rsID, it is always very very slow from gnomad.genomes.r2.1.sites.vcf.bgz or All_20180423.vcf.bgz. I think the only solution maybe build an index version to rsID.

ADD REPLY
4
Entering edit mode
3.3 years ago

It's a hack, but if you use e.g. awk to add a fake sample + genotype column (could be all "./.") to your rsID VCF file, you can then use

plink2 --vcf All_20180423.vcf --extract my_rsids.txt --export vcf

to generate a VCF with only the rsIDs you care about, and it's straightforward to finish the job from there. I'd expect a better solution to exist, but plink2's variant ID hash table is quite efficient.

Edit: turns out you can do it without adding the fake column by replacing "--export vcf" with "--make-just-pvar". plink2 usually requires genotype data, but there are still a few things you can do without it.

ADD COMMENT
0
Entering edit mode

It is really great idea!!!! awesome!

ADD REPLY
0
Entering edit mode

How about to let @Alex Reynolds to provide a method to build this fake vcf?

ADD REPLY
4
Entering edit mode
3.3 years ago

If you want to do things locally, you can make a searchable resource you can query as often you like.

1) Get SNPs and write them into a text file sorted by SNP ID.

For hg19, for instance, here is a way to get all SNPs into a form queryable with a binary search:

$ LC_ALL=C
$ wget -qO- https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz \
   | gunzip -c \
   | vcf2bed --sort-tmpdir=${PWD} - \
   | awk -v FS="\t" -v OFS="\t" '{ print $4, $1, $2, $3 }' \
   | sort -k1,1 \
   > hg19.dbSNPAllV151.sortedByName.txt

This will take a while to build and will take considerable disk space, compared with the compressed option.

You may also want additional columns, rather than the position and ID alone. Use head to inspect the output of awk and decide what you want to keep and discard. Read the documentation for vcf2bed to be sure that you understand how one-indexed VCF records are turned into zero-indexed BED positions.

However you do this step, you should ultimately have the rsID in the first column, and you want the variants sorted lexicographically by rsID.

2) Install pts-line-bisect:

$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..

3) Run a query:

$ rs_of_interest=rs2814778
$ ./pts-line-bisect/pts_lbsearch -p hg19.dbSNPAllV151.sortedByName.txt ${rs_of_interest} \
   | head -1 \
   | awk -v FS="\t" -v OFS="\t" '{ print $2, $3, $4, $1 }'

Step 3 can be put into a script so that you can re-run it with your SNP of interest on demand.

For instance:

#!/bin/bash
pts_lbsearch_bin=./pts-line-bisect/pts_lbsearch
sorted_snp_txt=hg19.dbSNPAllV151.sortedByName.txt
${pts_lbsearch_bin} -p ${sorted_snp_txt} $1 | head -1 | awk -v FS="\t" -v OFS="\t" '{ print $2, $3, $4, $1 }'

Then:

$ ./search_snps.sh rs2814778 > rs2814778.bed

Use scripting to loop through a list of SNPs, to get however many positions are needed. Pipe all results to sort-bed to get a sorted BED file.

ADD COMMENT
0
Entering edit mode

Thanks so much Alex. I will test it now. Meanwhile, I will test this method at the same time. Meanwhile, thanks for the reminder that "one-indexed VCF records are turned into zero-indexed BED positions". Finally, can you explain a little more about pts_lbsearch in the 3th step. Thanks.

bcftools view gnomad.genomes.r2.1.sites.vcf.bgz | grep -v '#' | awk 'print $1,$2-1,$2,$3,$4,$5' OFS="\t" > gnomad.hg19.bed
grep -F rs554008981 gnomad.hg19.bed
grep -Ff rsIDs.txt gnomad.hg19.bed > rsIDs.hg19.bed
ADD REPLY
4
Entering edit mode
3.3 years ago

Another option is to use vcf2bed to turn dbSNP VCF into sorted BED, and then grep -wFf to use a literal word match search from a file of rsIDs, as suggested in my answer here:

A: Search dbSNP for hg19 based coordinates

I haven't timed how this compares with a binary search, but I imagine that the binary search will be quite faster overall for multiple searches, in that it isn't necessary to read through the entire presorted file with a binary search method, whereas grep would have to look through everything. If you're doing a one-off thing, using grep might take the least time, overall.

The binary search approach would be great if you plan to query this frequently, as the time cost in sorting is amortized over the query time speedup you get if you hit up this thing a lot.

Still another option, while spitballing, is to use an in-memory caching database like Redis to make a table of key-value pairs, where the key is the rsID and the value is the position, e.g., using HSET and HGET to build and query, respectively. This would be useful for building a web service, for instance, or other resource you want to hit up frequently.

Using the constant-time lookup properties of a hash table will be even faster than a binary search, at the cost of using more memory. It is perhaps measuring the balance of disk space (which is cheap and relatively slow) against RAM (which is generally expensive, but fast), against utility (are you doing one-off or repeated searches, etc.), to decide which way to go.

Experiment and report back what works for you — or what doesn't!

ADD COMMENT

Login before adding your answer.

Traffic: 2622 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6