Question: Best way to query VCF for specific variants
1
gravatar for Sean
4.4 years ago by
Sean210
United States
Sean210 wrote:

Question:

What is the best and fastest way to query a VCF file for specific variants?

Background:

I have a tab-separated list of variants where the columns are CHROM, POS, REF, ALT. I would like to query a VCF file and get the records associated with only these specific variants. I know BCFtools, VCFtools, and tabix all allow you to supply a regions/positions file to search on CHROM and POS only, but I am interested in searching on CHROM, POS, REF, and ALT.

I also know this is very easy to do with grep, but grep doesn't take advantage of the VCF index file like the other tools do. As a result it is much slower, especially when searching very large VCF files.

vcftools bcftools annotation vcf • 5.6k views
ADD COMMENTlink modified 10 months ago by bari.ballew220 • written 4.4 years ago by Sean210
variants='variants.txt'
vcf_in='my.vcf.gz'
vcf_out='variants_of_interest.vcf'

bcftools view -O v -R "$variants" "$vcf_in" \
 | grep -Ef <(awk 'BEGIN{FS=OFS="\t";print "#"};{print "^"$1,$2,"[^\t]+",$3,$4"\t"}' "$variants") \
 > "$vcf_out"

hello, I was trying to use the above code but there are so many grep errors and, $3,$4"\t"}' "$variants")- for not recognizing the variant file(i used my file with the correct format) etc..errors are coming up, cld u pls share some code which works and I would be great if you could share some code having loops to extract a big list. Thanks

ADD REPLYlink modified 10 months ago by genomax83k • written 10 months ago by lkirola0

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

ADD REPLYlink written 10 months ago by genomax83k
2
gravatar for Sean
4.4 years ago by
Sean210
United States
Sean210 wrote:

Here's a solution that uses a combination of BCFtools, grep, and awk.

First, put your variants in tab-separated file (e.g. variants.txt) with columns CHROM, POS, ALT, and REF. For example:

1   215802298   A   G
1   215844373   C   T
1   215848808   A   G
1   215901574   C   T

Then query your VCF file (e.g. my.vcf.gz) like so:

variants='variants.txt'
vcf_in='my.vcf.gz'
vcf_out='variants_of_interest.vcf'

bcftools view -O v -R "$variants" "$vcf_in" \
 | grep -Ef <(awk 'BEGIN{FS=OFS="\t";print "#"};{print "^"$1,$2,"[^\t]+",$3,$4"\t"}' "$variants") \
 > "$vcf_out"

The output will be a VCF file with just the variants from your original list.

Explanation:

  1. BCFtools quickly queries the VCF file based on CHROM and POS using the first two columns from variants.txt.
  2. The awk command preprocesses variants.txt on-the-fly by adding necessary regular expressions:
    • Add # to the variant list in order to preserve VCF header information in final results
    • Prepend ^ to each variant in order to ensure that grep only searches from the start of each VCF line
    • Add a placeholder regex for each variant's ID field
    • Append a tab to each variant to ensure that grep does not include results for other indels in the VCF
  3. Lastly grep filters the results from BCFtools based on CHROM, POS, REF and ALT to only include variants from the original variant list
ADD COMMENTlink modified 10 months ago by RamRS27k • written 4.4 years ago by Sean210
0
gravatar for bari.ballew
10 months ago by
bari.ballew220
USA/NIH
bari.ballew220 wrote:

This question is rather old and has an accepted answer, but I'll add another option since Ikirola has found the post and is having trouble.

Try bcftools isec. Make a minimal sites-only vcf containing your variants of interest, something like this:

##fileformat=VCFv4.3
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   123 .   A   G   .   .   .
2   345 .   G   T   .   .   .

Bgzip and index your VCFs. Now use bcftools isec to find the intersection of your VCF and your variants of interest: bcftools isec -c none -p dir -n=2 -w1 vars_of_interest.vcf original.vcf

Explanation:

-c none is the default so you don't really need to include it in the command, but it tells bcftools to consider two variants as identical only if their chromosome, pos, ref, and alt are all identical. Note that this means A>G and A>G,C are NOT identical.

-p dir creates a subdirectory called "dir" to write your output.

-n=2 tells bcftools isec to output variants present in exactly two files.

-w1 Without this option, bcftools will print two files, one with the results of A that overlap with B, and another with the results of B that overlap with A. Since those would be redundant, we use this option to just print the one file.

ADD COMMENTlink written 10 months ago by bari.ballew220
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1632 users visited in the last hour