Question

Retrieve Single Vcf From List Of Snp Rs#

2

11.2 years ago

Peixe ▴ 660

Is there a straightforward way to retrieve the data for a specified set of SNPs into a single VCF file? In other words, I know how to slice 1000 Genomes by genomic coordinates with tabix, but I'd like to do the same just by specifying the rs# names of the SNP(s), and end up with all the data in a single VCF file.

Any hint?

Thanks!!

tabix vcf 1000genomes • 11k views

ADD COMMENT • link updated 7.8 years ago by Scott ▴ 110 • written 11.2 years ago by Peixe ▴ 660

score 6 · Answer 1 · 2013-01-21

If not using vcftools, the simplest and reasonably fast way is to use awk:

awk 'BEGIN{while((getline<"list.txt")>0)l[$1]=1}/^#/||l[$3]' 1000g.vcf > output.vcf

If you really care about speed, you can use Perl/Python to split only the first few fields instead of wastefully splitting every genotype field. If you do not want to create a temporary file, you can run the following (not recommended, though):

tabix ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/integrated_call_sets/ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz .|awk 'BEGIN{while((getline<"tmp.list")>0)l[$1]=1}/^#/||l[$3]'

You still download the whole file, but you do not create a temporary one.

EDIT:

The right way to use grep is:

grep -wFf list.txt output.vcf

We should NOT use just grep -wf. With -F, the patterns are interpreted as fixed strings instead of regular expressions. This triggers the Aho–Corasick algorithm, which is much faster than multi-regex matching. Here is an experiment: tmp.vcf is the 1000g site-only VCF on chr20, and tmp.list contains 8582 rs#.

$ time awk 'BEGIN{while((getline<"tmp.list")>0)l[$1]=1}l[$3]' tmp.vcf > /dev/null

real    0m5.318s
user    0m4.365s
sys     0m0.937s

$ time grep -wFf tmp.list tmp.vcf > /dev/null

real    0m2.740s
user    0m1.879s
sys     0m0.768s

$ time grep -wf tmp.list tmp.vcf > /dev/null

(Unfinished. Already 6m50s and counting...)

EDIT 2:

Although grep -wFf is faster, it may give you false matches. For example, if the INFO field contains FOO=rs1234, a list containing rs1234 will match that line even when the ID column holds a different rs#. This scenario rarely happens, of course, but you should be aware of it.
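One way to keep most of grep's speed while avoiding such false matches is to use grep only as a prefilter and let awk do an exact comparison on the ID column afterwards. The following is a minimal sketch under stated assumptions, not a benchmarked recipe: the file names, the scratch directory and the five-line demo VCF are all made up for illustration.

```shell
# Build a tiny demo VCF and ID list in a scratch directory (hypothetical data).
workdir=$(mktemp -d)
printf 'rs1234\n' > "$workdir/list.txt"
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '20\t100\trs1234\tA\tG\t.\tPASS\tAC=1\n'
  printf '20\t200\trs5678\tC\tT\t.\tPASS\tFOO=rs1234\n'
  printf '20\t300\trs123456\tG\tA\t.\tPASS\tAC=2\n'
} > "$workdir/demo.vcf"

# Write the header first.
grep '^#' "$workdir/demo.vcf" > "$workdir/slice.vcf"

# grep narrows the candidates quickly (this still lets the FOO=rs1234 line through);
# awk then keeps only data lines whose ID column (field 3) is in the list.
grep -wFf "$workdir/list.txt" "$workdir/demo.vcf" |
  awk -F'\t' -v list="$workdir/list.txt" '
    BEGIN { while ((getline id < list) > 0) keep[id] = 1 }
    !/^#/ && ($3 in keep)
  ' >> "$workdir/slice.vcf"

cat "$workdir/slice.vcf"
```

Only the rs1234 record should survive alongside the two header lines. Note that this sketch compares the whole ID field, so a record whose ID column lists several semicolon-separated identifiers would be dropped; handling that case would need an extra split on ";".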

score 4 · Answer 2 · 2013-01-20

4

11.2 years ago

Matt Shirley 10k

First, make a file containing the rsid #'s (rsid_list) that you want. Then you can feed this into grep as a list of patterns to match:

grep '^#' 1000genomes.vcf > slice.vcf
grep -wF -f rsid_list 1000genomes.vcf >> slice.vcf

The first command copies the header lines from the VCF file. The second command reads the patterns from a file (-f), treats them as fixed strings (-F), and matches them only as whole words (-w), i.e. the match must be bounded on both sides by a non-word character such as whitespace or by the start or end of the line. We need the last part because you don't want to match rs123456 when you really want rs1234.
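The effect of -w is easy to check on a few made-up lines. This is only an illustrative sketch; the three tab-separated lines and the variable name demo_hits are invented for the demonstration.

```shell
# Three hypothetical lines: only the first contains rs1234 as a whole word.
# rs123456 and ars1234 contain the string but have word characters next to it.
demo_hits=$(printf 'x\trs1234\ty\nx\trs123456\ty\nx\tars1234\ty\n' | grep -wF 'rs1234')
printf '%s\n' "$demo_hits"
```

Dropping -w from the grep call would let all three lines through, which is the over-matching the answer warns about.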

ADD COMMENT • link 11.2 years ago by Matt Shirley 10k

2

To use grep -f, you must apply -F as well; otherwise it will be extremely slow. See my answer.

ADD REPLY • link 11.2 years ago by lh3 33k

1

Yes, you are correct to point that out. I've edited my answer. Thanks.

ADD REPLY • link 11.2 years ago by Matt Shirley 10k

0

Yes, but that requires first downloading the 1000genomes.vcf file using genomic coordinates (an approach I don't want in my case, because it might turn into a huge number of VCF files) and then applying grep. I was asking how to retrieve the final slice.vcf directly with just the SNPs... Thanks anyway! ;)

ADD REPLY • link 11.2 years ago by Peixe ▴ 660

2

You're going to have to download the entire file anyway, since there is no indexing method that takes variant names and info strings into account. You would need such an index to compute the byte offsets for requesting just a part of the file, the way tabix does. If you're worried about creating a large number of files, consider that you can download the entire 1000 Genomes variant calls in one VCF, parse it with grep, and then extract the individual genotypes you need. If you're worried about downloading the entire variant call set, you may consider curlftpfs to mount the FTP site as a folder and then operate on the remote files as if they were local.

ADD REPLY • link 11.2 years ago by Matt Shirley 10k

h.mon · Answer 3 · 2013-01-21

2

11.2 years ago

Adam ★ 1.0k

Also, you could use:

vcftools --vcf file.vcf --snps rsID.filename --recode --recode-INFO-all

You still need to download the file, though.

ADD COMMENT • link updated 6.8 years ago by h.mon 35k • written 11.2 years ago by Adam ★ 1.0k

0

Yes, I already knew about this. ;)

ADD REPLY • link 11.2 years ago by Peixe ▴ 660

score 0 · Answer 4 · 2016-06-30

0

7.8 years ago

Scott ▴ 110

You can now also extract such data from the UCSC Genome Browser's Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables

Group: Variation
Track: 1000G Ph3 Vars

Specify a list of variants in rsID format under identifiers.

ADD COMMENT • link 7.8 years ago by Scott ▴ 110