Best way to query VCF for specific variants
2
3
Entering edit mode
6.7 years ago
Sean ▴ 310

Question:

What is the best and fastest way to query a VCF file for specific variants?

Background:

I have a tab-separated list of variants where the columns are CHROM, POS, REF, ALT. I would like to query a VCF file and get the records associated with only these specific variants. I know BCFtools, VCFtools, and tabix all allow you to supply a regions/positions file to search on CHROM and POS only, but I am interested in searching on CHROM, POS, REF, and ALT.

I also know this is very easy to do with grep, but grep doesn't take advantage of the VCF index file like the other tools do. As a result it is much slower, especially when searching very large VCF files.

vcf vcftools annotation bcftools • 14k views
0
Entering edit mode
variants='variants.txt'
vcf_in='my.vcf.gz'
vcf_out='variants_of_interest.vcf'

bcftools view -O v -R "$variants" "$vcf_in" \
| grep -Ef <(awk 'BEGIN{FS=OFS="\t";print "#"};{print "^"$1,$2,"[^\t]+",$3,$4"\t"}' "$variants") \ > "$vcf_out"


hello, I was trying to use the above code but there are so many grep errors and, $3,$4"\t"}' "$variants")- for not recognizing the variant file(i used my file with the correct format) etc..errors are coming up, cld u pls share some code which works and I would be great if you could share some code having loops to extract a big list. Thanks ADD REPLY 0 Entering edit mode Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question. ADD REPLY 3 Entering edit mode 6.7 years ago Sean ▴ 310 Here's a solution that uses a combination of BCFtools, grep, and awk. First, put your variants in tab-separated file (e.g. variants.txt) with columns CHROM, POS, ALT, and REF. For example: 1 215802298 A G 1 215844373 C T 1 215848808 A G 1 215901574 C T  Then query your VCF file (e.g. my.vcf.gz) like so: variants='variants.txt' vcf_in='my.vcf.gz' vcf_out='variants_of_interest.vcf' bcftools view -O v -R "$variants" "$vcf_in" \ | grep -Ef <(awk 'BEGIN{FS=OFS="\t";print "#"};{print "^"$1,$2,"[^\t]+",$3,$4"\t"}' "$variants") \
> "$vcf_out"  The output will be a VCF file with just the variants from your original list. Explanation: 1. BCFtools quickly queries the VCF file based on CHROM and POS using the first two columns from variants.txt. 2. The awk command preprocesses variants.txt on-the-fly by adding necessary regular expressions: • Add # to the variant list in order to preserve VCF header information in final results • Prepend ^ to each variant in order to ensure that grep only searches from the start of each VCF line • Add a placeholder regex for each variant's ID field • Append a tab to each variant to ensure that grep does not include results for other indels in the VCF 3. Lastly grep filters the results from BCFtools based on CHROM, POS, REF and ALT to only include variants from the original variant list ADD COMMENT 1 Entering edit mode I wondered if anyone could help me please? (apologies I can't give screen shots - as I work on a secure server). I want to extract all the sample IDs that relate to a specific variant. I have a multisample VCF file. My variant text file (tab delimited) contains: Chr Pos Ref Alt I used to be able to extract this information using the code below and get all the IDs seperated by columns with the relevant VCF info (qual, filter, info, format, format_ouput). bcftools view -R variants.txt mymultisample.vcf > newoutput.vcf  However, now all I get is a few columns with lots of mixed mish mash of information that doesn't make sense. I also tried bcftools view -t, bcftools view -r, different formatting of the variant.txt file but no luck. I tried also the above code provided by Ram : bcftools view -O v -R "$variants" "$vcf_in" \ | grep -Ef <(awk 'BEGIN{FS=OFS="\t";print "#"};{print "^"$1,$2,"[^\t]+",$3,$4"\t"}' "$variants") \
> "\$vcf_out"


However, i only get a long line of IDs within one of the excel cells but no other information that I was previously getting.

Could it be the vcf fiel itself that is the problem?

Many thanks, Julia

0
Entering edit mode

Anyone could help at all with my above question? Julia

1
Entering edit mode
3.1 years ago
bari.ballew ▴ 360

This question is rather old and has an accepted answer, but I'll add another option since Ikirola has found the post and is having trouble.

Try bcftools isec. Make a minimal sites-only vcf containing your variants of interest, something like this:

##fileformat=VCFv4.3
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   123 .   A   G   .   .   .
2   345 .   G   T   .   .   .


Bgzip and index your VCFs. Now use bcftools isec to find the intersection of your VCF and your variants of interest: bcftools isec -c none -p dir -n=2 -w1 vars_of_interest.vcf original.vcf

Explanation:

-c none is the default so you don't really need to include it in the command, but it tells bcftools to consider two variants as identical only if their chromosome, pos, ref, and alt are all identical. Note that this means A>G and A>G,C are NOT identical.

-p dir creates a subdirectory called "dir" to write your output.

-n=2 tells bcftools isec to output variants present in exactly two files.

-w1 Without this option, bcftools will print two files, one with the results of A that overlap with B, and another with the results of B that overlap with A. Since those would be redundant, we use this option to just print the one file.