Problems left-normalizing variants with bcftools, GATK, combining with ANNOVAR annotations
6.0 years ago
steve ★ 3.5k

I am trying to normalize, filter, and annotate variants in .vcf format. Right now, my workflow looks like this:

  • left-normalize & filter .vcf (bcftools / GATK)

  • convert .vcf to .tsv (GATK)

  • recalculate values in .tsv (e.g. HaplotypeCaller frequency, etc.)

  • annotate .vcf (ANNOVAR)

  • merge annotation & .tsv

However, I am having issues with variants that are formatted like this:

chrX    66766356    .   TGGCGGCGGCGGC   T

When I try to 'normalize' them using bcftools norm and GATK LeftAlignAndTrimVariants, these variants are not changed.

But, when I pass these variants through ANNOVAR, the output looks like this:

chrX    66766357    .   GGCGGCGGCGGC    -

This is the preferred format for annotations. But it causes problems because I am now unable to merge values from the original .vcf back into the ANNOVAR output.
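To make the mismatch concrete, here is a minimal sketch of the coordinate change seen in the example above: for a pure indel, the shared leading base is dropped, the position shifts accordingly, and the empty allele becomes "-". The function name vcf_to_annovar is hypothetical, and this only mirrors the behaviour shown in this post; ANNOVAR's own conversion may treat other cases (multi-allelic records, block substitutions) differently, so those are passed through untouched here.

```python
def vcf_to_annovar(chrom, pos, ref, alt):
    """Return an ANNOVAR-style (chrom, start, end, ref, alt) key for one
    biallelic VCF record. Hypothetical helper, not part of ANNOVAR."""
    if ref != alt and ref.startswith(alt):
        # Deletion: drop the shared prefix and shift the start past it.
        n = len(alt)
        deleted = ref[n:]
        start = pos + n
        return (chrom, start, start + len(deleted) - 1, deleted, "-")
    if ref != alt and alt.startswith(ref):
        # Insertion: anchor on the last shared base, with ref written as "-".
        n = len(ref)
        anchor = pos + n - 1
        return (chrom, anchor, anchor, "-", alt[n:])
    # SNVs and anything else: left unchanged (assumption, see lead-in).
    return (chrom, pos, pos + len(ref) - 1, ref, alt)


print(vcf_to_annovar("chrX", 66766356, "TGGCGGCGGCGGC", "T"))
# ('chrX', 66766357, 66766368, 'GGCGGCGGCGGC', '-')
```

A key built this way from the original .vcf could in principle be matched against the ANNOVAR rows, provided the conversion really agrees with ANNOVAR for every variant type in the data.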

As per the comment on the bcftools issue posted here, the ANNOVAR output format is "not a valid VCF record". So it seems that maybe variant normalization tools would not be appropriate for producing this output?

Any ideas on how to fix this workflow and get both the custom selected & recalculated fields from the original .vcf combined with the ANNOVAR output in these cases?

annotation variant • 2.9k views

Yes, this is the annoying part about ANNOVAR, and it can result in inadvertent information loss if one is not aware of it. I have come up with a few personalised solutions that bypass this, but my situations were not the same as yours. For one, I never wanted to 'marry' the annotated data back to the VCF. The annotation CSV was the end of the line for me.

Why exactly do you need to 'marry' the annotated data back to the VCF? I think that the ANNOVAR function allows you to include various pieces of information from the VCF as extra columns, no?


In this case, I need to recalculate the allele frequencies from GATK HaplotypeCaller, since they are not listed as the empirical values, and I want to propagate that value through to the final annotation table.
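As an illustration of the kind of recalculation meant here, the sketch below derives an empirical alternate allele frequency from the per-sample AD (allelic depth) values. This is only one common approach, and it assumes the sample column carries an AD entry; the function name alt_allele_frequencies is hypothetical and the exact calculation used in the real workflow may differ.

```python
def alt_allele_frequencies(format_field, sample_field):
    """Compute alt-allele frequencies from AD, e.g. FORMAT 'GT:AD:DP'
    and sample '0/1:12,8:20'. Returns None when AD is missing or zero."""
    fields = dict(zip(format_field.split(":"), sample_field.split(":")))
    ad = fields.get("AD", ".")
    parts = ad.split(",")
    if "." in parts:
        return None  # AD missing for this sample
    depths = [int(x) for x in parts]
    total = sum(depths)
    if total == 0:
        return None  # no reads to base a frequency on
    # The first depth is the reference allele; the rest are alt alleles.
    return [d / total for d in depths[1:]]


print(alt_allele_frequencies("GT:AD:DP:GQ:PL", "0/1:12,8:20:99:200,0,300"))
# [0.4]
```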

6.0 years ago
steve ★ 3.5k

It looks like a lot of my problems were solved by using the --vcfinput option of ANNOVAR, among other things, like this:

table_annovar.pl "${sample_vcf}" "${annovar_db_dir}" \
        --buildver "${params.ANNOVAR_BUILD_VERSION}" \
        --remove \
        --protocol "${params.ANNOVAR_PROTOCOL}" \
        --operation "${params.ANNOVAR_OPERATION}" \
        --nastring . \
        --vcfinput \
        --otherinfo \
        --onetranscript \
        --outfile "${sampleID}"

This produces an .avinput file that lists every original line from the input VCF alongside its ANNOVAR counterpart. The command also carries the original VCF data through to the annotation output and writes an extra .vcf-formatted annotation file. So, a lot of extra data to play with for custom processing.
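Because the original VCF fields travel along with each annotated row, the merge can be keyed on the original VCF coordinates rather than on ANNOVAR's rewritten ones. The following is a rough sketch of that join on already-parsed rows. All column names (vcf_chrom, vcf_pos, CHROM, POS, FREQ and so on) are hypothetical placeholders, since the actual header names depend on the ANNOVAR version and on how the .tsv was built.

```python
def merge_on_vcf_key(annovar_rows, tsv_rows, annovar_key, tsv_key):
    """Left-join tsv_rows onto annovar_rows using the original VCF key.
    Rows are dicts; annovar_key and tsv_key are column names in matching order."""
    lookup = {tuple(row[k] for k in tsv_key): row for row in tsv_rows}
    merged = []
    for row in annovar_rows:
        key = tuple(row[k] for k in annovar_key)
        extra = lookup.get(key, {})
        # Add only columns the annotation row does not already have.
        merged.append({**row, **{k: v for k, v in extra.items() if k not in row}})
    return merged


# Hypothetical column names, for illustration only.
annovar_rows = [{"Start": "66766357", "Ref": "GGCGGCGGCGGC", "Alt": "-",
                 "vcf_chrom": "chrX", "vcf_pos": "66766356",
                 "vcf_ref": "TGGCGGCGGCGGC", "vcf_alt": "T"}]
tsv_rows = [{"CHROM": "chrX", "POS": "66766356",
             "REF": "TGGCGGCGGCGGC", "ALT": "T", "FREQ": "0.4"}]
merged = merge_on_vcf_key(annovar_rows, tsv_rows,
                          ["vcf_chrom", "vcf_pos", "vcf_ref", "vcf_alt"],
                          ["CHROM", "POS", "REF", "ALT"])
print(merged[0]["FREQ"])
```

Annotation rows with no match in the .tsv are kept as they are, so nothing is silently dropped from the annotation table.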

For reference, the full workflow I am working on is here: https://github.com/stevekm/vcf-filter-annotate


Great - feel free to accept your own answer. The 'bug' to which I was referring was for when one was arriving via a non-VCF route, i.e., a custom variant list. Most people would not be aware that ANNOVAR requires this specific format for indels; thus, indels that are not encoded correctly end up in the .invalid_input file after annotation.
