Best way to query VCF for specific variants
4
7
Entering edit mode
8.3 years ago
Sean ▴ 390

Question:

What is the best and fastest way to query a VCF file for specific variants?

Background:

I have a tab-separated list of variants where the columns are CHROM, POS, REF, ALT. I would like to query a VCF file and get the records associated with only these specific variants. I know BCFtools, VCFtools, and tabix all allow you to supply a regions/positions file to search on CHROM and POS only, but I am interested in searching on CHROM, POS, REF, and ALT.

I also know this is very easy to do with grep, but grep doesn't take advantage of the VCF index file like the other tools do. As a result it is much slower, especially when searching very large VCF files.

vcf vcftools annotation bcftools • 26k views
ADD COMMENT
0
Entering edit mode
variants='variants.txt'
vcf_in='my.vcf.gz'
vcf_out='variants_of_interest.vcf'

bcftools view -O v -R "$variants" "$vcf_in" \
 | grep -Ef <(awk 'BEGIN{FS=OFS="\t";print "#"};{print "^"$1,$2,"[^\t]+",$3,$4"\t"}' "$variants") \
 > "$vcf_out"

hello, I was trying to use the above code but there are so many grep errors and, $3,$4"\t"}' "$variants")- for not recognizing the variant file(i used my file with the correct format) etc..errors are coming up, cld u pls share some code which works and I would be great if you could share some code having loops to extract a big list. Thanks

ADD REPLY
1
Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

ADD REPLY
5
Entering edit mode
8.3 years ago
Sean ▴ 390

Here's a solution that uses a combination of BCFtools, grep, and awk.

First, put your variants in tab-separated file (e.g. variants.txt) with columns CHROM, POS, ALT, and REF. For example:

1   215802298   A   G
1   215844373   C   T
1   215848808   A   G
1   215901574   C   T

Then query your VCF file (e.g. my.vcf.gz) like so:

variants='variants.txt'
vcf_in='my.vcf.gz'
vcf_out='variants_of_interest.vcf'

bcftools view -O v -R "$variants" "$vcf_in" \
 | grep -Ef <(awk 'BEGIN{FS=OFS="\t";print "#"};{print "^"$1,$2,"[^\t]+",$3,$4"\t"}' "$variants") \
 > "$vcf_out"

The output will be a VCF file with just the variants from your original list.

Explanation:

  1. BCFtools quickly queries the VCF file based on CHROM and POS using the first two columns from variants.txt.
  2. The awk command preprocesses variants.txt on-the-fly by adding necessary regular expressions:
    • Add # to the variant list in order to preserve VCF header information in final results
    • Prepend ^ to each variant in order to ensure that grep only searches from the start of each VCF line
    • Add a placeholder regex for each variant's ID field
    • Append a tab to each variant to ensure that grep does not include results for other indels in the VCF
  3. Lastly grep filters the results from BCFtools based on CHROM, POS, REF and ALT to only include variants from the original variant list
ADD COMMENT
1
Entering edit mode

I wondered if anyone could help me please? (apologies I can't give screen shots - as I work on a secure server). I want to extract all the sample IDs that relate to a specific variant. I have a multisample VCF file. My variant text file (tab delimited) contains: Chr Pos Ref Alt

I used to be able to extract this information using the code below and get all the IDs seperated by columns with the relevant VCF info (qual, filter, info, format, format_ouput).

bcftools view -R variants.txt mymultisample.vcf > newoutput.vcf

However, now all I get is a few columns with lots of mixed mish mash of information that doesn't make sense. I also tried bcftools view -t, bcftools view -r, different formatting of the variant.txt file but no luck.

I tried also the above code provided by Ram :

bcftools view -O v -R "$variants" "$vcf_in" \
 | grep -Ef <(awk 'BEGIN{FS=OFS="\t";print "#"};{print "^"$1,$2,"[^\t]+",$3,$4"\t"}' "$variants") \
 > "$vcf_out"

However, i only get a long line of IDs within one of the excel cells but no other information that I was previously getting.

Could it be the vcf fiel itself that is the problem?

Many thanks, Julia

ADD REPLY
0
Entering edit mode

Anyone could help at all with my above question? Julia

ADD REPLY
4
Entering edit mode
4.7 years ago
bari.ballew ▴ 460

This question is rather old and has an accepted answer, but I'll add another option since Ikirola has found the post and is having trouble.

Try bcftools isec. Make a minimal sites-only vcf containing your variants of interest, something like this:

##fileformat=VCFv4.3
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   123 .   A   G   .   .   .
2   345 .   G   T   .   .   .

Bgzip and index your VCFs. Now use bcftools isec to find the intersection of your VCF and your variants of interest: bcftools isec -c none -p dir -n=2 -w1 vars_of_interest.vcf original.vcf

Explanation:

-c none is the default so you don't really need to include it in the command, but it tells bcftools to consider two variants as identical only if their chromosome, pos, ref, and alt are all identical. Note that this means A>G and A>G,C are NOT identical.

-p dir creates a subdirectory called "dir" to write your output.

-n=2 tells bcftools isec to output variants present in exactly two files.

-w1 Without this option, bcftools will print two files, one with the results of A that overlap with B, and another with the results of B that overlap with A. Since those would be redundant, we use this option to just print the one file.

ADD COMMENT
0
Entering edit mode

This is the fastest way to query a huge VCF file for specific variants, and it is convenient !

ADD REPLY
1
Entering edit mode
18 months ago

if you have ingested your VCF into a TileDB-VCF store, the query would look something like this (in Python)

#find samples that are heterozygous or homozygous alt at a given locus
import tiledbvcf
ds = tiledbvcf.Dataset(array_uri, mode = "r")
simpleQ = ds.read(
    attrs = ['alleles','fmt_GT','sample_name'],
    regions = ["chr7:144000358-144000358"],
)
simpleQ[simpleQ.fmt_GT.apply(tuple).apply(lambda x: x in [(0,1),(1,1)])]

this should be very fast - searching 4300 samples:

CPU times: user 49 µs, sys: 12 µs, total: 61 µs
Wall time: 120 µs
ADD COMMENT
0
Entering edit mode

But this method search on CHROM and POS only, we need search on CHROM, POS, REF, and ALT.

ADD REPLY
0
Entering edit mode
18 months ago
jon.klonowski ▴ 150

Know this is even older now but I would recommend makign binary files out of your vcf and using plink2. It is even faster and requires less computational resources. Processing large VCFs can take a long time and the commands can crash, leaving you with partial VCFs.

plink2 --bfile $VCF_BINARY --extract $VariantFile --make-bed  --out $VAR_FILTERED_BINARY_VCF_FILE
ADD COMMENT
1
Entering edit mode
  1. If the VCF has an INFO column worth keeping, you must use --pfile/--make-pgen instead of --bfile/--make-bed.
  2. If you want to keep read depths ("DP" FORMAT field) or similar per-genotype annotations around, you cannot use plink2 for VCF processing at all. Stick to bcftools in that case.
ADD REPLY

Login before adding your answer.

Traffic: 2005 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6