Question: Best way to query VCF for specific variants
3.4 years ago by
Sean130
United States
Sean130 wrote:

Question:

What is the best and fastest way to query a VCF file for specific variants?

Background:

I have a tab-separated list of variants where the columns are CHROM, POS, REF, ALT. I would like to query a VCF file and get the records associated with only these specific variants. I know BCFtools, VCFtools, and tabix all allow you to supply a regions/positions file to search on CHROM and POS only, but I am interested in searching on CHROM, POS, REF, and ALT.

I also know this is very easy to do with grep, but grep doesn't take advantage of the VCF index file like the other tools do. As a result it is much slower, especially when searching very large VCF files.

vcftools bcftools annotation vcf • 3.5k views
3.4 years ago by
Sean130
United States
Sean130 wrote:

Here's a solution that uses a combination of BCFtools, grep, and awk.

First, put your variants in a tab-separated file (e.g. variants.txt) with columns CHROM, POS, REF, and ALT, in that order. For example:

1   215802298   A   G
1   215844373   C   T
1   215848808   A   G
1   215901574   C   T
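
Because the later filtering step depends on real tab characters and on the column order, it may be worth checking the list before using it. The following is a minimal sketch and not part of the original answer: the file name example_variants.txt and the helper check_variants are made up for illustration.

```shell
# Recreate the example list with real tab separators (printf expands \t).
printf '1\t215802298\tA\tG\n' >  example_variants.txt
printf '1\t215844373\tC\tT\n' >> example_variants.txt
printf '1\t215848808\tA\tG\n' >> example_variants.txt
printf '1\t215901574\tC\tT\n' >> example_variants.txt

# Hypothetical helper: report any line that does not have exactly four
# tab-separated fields or whose POS is not a whole number, and return
# a non-zero status if such a line is found.
check_variants() {
  awk -F'\t' '
    NF != 4 || $2 !~ /^[0-9]+$/ { printf "bad line %d: %s\n", NR, $0; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

check_variants example_variants.txt && echo "example_variants.txt looks OK"
```

A list that was pasted with spaces instead of tabs would be reported line by line, since each such line counts as a single field.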

Then query your VCF file (e.g. my.vcf.gz) like so:

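variants='variants.txt'             # tab-separated: CHROM, POS, REF, ALT
vcf_in='my.vcf.gz'                  # compressed, indexed VCF
vcf_out='variants_of_interest.vcf'

# bcftools pulls the records at the listed CHROM/POS via the index.
# awk turns each variant into an anchored regex; "^#" keeps header lines.
# grep then keeps only the records whose REF and ALT also match.
bcftools view -O v -R "$variants" "$vcf_in" \
 | grep -Ef <(awk 'BEGIN{FS=OFS="\t";print "^#"};{print "^"$1,$2,"[^\t]+",$3,$4"\t"}' "$variants") \
 > "$vcf_out"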

The output will be a VCF file with just the variants from your original list.

Explanation:

  1. BCFtools quickly queries the VCF file based on CHROM and POS, using the first two columns of variants.txt.
  2. The awk command preprocesses variants.txt on the fly, turning each variant into a regular expression:
    • Add a '#' pattern to the list so that the VCF header lines are kept in the final results.
    • Prepend '^' to each variant so that grep only matches from the start of each VCF line.
    • Add a placeholder regex for the ID field, which sits between POS and REF in a VCF record.
    • Append a tab to each variant so that ALT must match in full; otherwise an ALT of G would also match other alleles at the same position that begin with G, such as the indel GT.
  3. Lastly, grep filters the BCFtools output on CHROM, POS, REF, and ALT so that only variants from the original list remain.
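
Steps 2 and 3 can be tried without a real VCF. The sketch below applies the awk program to a single variant, prints the resulting patterns, and filters a few made-up VCF lines with them. The file names (toy_variants.txt, toy.vcf, toy_patterns.txt), the helper make_patterns, and the records are invented for illustration. Two assumptions differ from the pipeline as posted: the header pattern is anchored as '^#' so that only lines starting with '#' count as header, and the patterns go to a temporary file instead of a process substitution so the example does not depend on bash. BCFtools is left out so the example runs with just awk, grep, and sed.

```shell
# One variant of interest, 1:215802298 A>G, tab-separated.
printf '1\t215802298\tA\tG\n' > toy_variants.txt

# Made-up VCF lines: a header, the wanted record, the same position with
# a different ALT, and an indel whose ALT merely starts with G.
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t215802298\trs1\tA\tG\t.\tPASS\t.\n'
  printf '1\t215802298\t.\tA\tT\t.\tPASS\t.\n'
  printf '1\t215802298\t.\tA\tGT\t.\tPASS\t.\n'
} > toy.vcf

# Hypothetical helper wrapping the awk step: one regex per output line.
make_patterns() {
  awk 'BEGIN{FS=OFS="\t";print "^#"};{print "^"$1,$2,"[^\t]+",$3,$4"\t"}' "$1"
}

make_patterns toy_variants.txt > toy_patterns.txt

# Show the patterns with each tab made visible as <TAB>.
tab=$(printf '\t')
sed "s/$tab/<TAB>/g" toy_patterns.txt

# Keep header lines and records matching CHROM, POS, REF, and ALT.
grep -E -f toy_patterns.txt toy.vcf
```

Only the header line and the rs1 record pass the filter. The A>T record fails on ALT, and the A>GT indel fails because the pattern requires a tab directly after G.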