Question: Problems left-normalizing variants with bcftools, GATK, combining with ANNOVAR annotations
10 months ago, steve (United States) wrote:

I am trying to normalize, filter, and annotate variants in .vcf format. Right now, my workflow looks like this:

  • left-normalize & filter .vcf (bcftools / GATK)

  • convert .vcf to .tsv (GATK)

  • recalculate values in .tsv (e.g. HaplotypeCaller frequency, etc.)

  • annotate .vcf (ANNOVAR)

  • merge annotation & .tsv

However, I am having issues with variants that are formatted like this:

chrX    66766356    .   TGGCGGCGGCGGC   T

When I try to 'normalize' them using bcftools norm and GATK LeftAlignAndTrimVariants, these variants are not changed.

But, when I pass these variants through ANNOVAR, the output looks like this:

chrX    66766357    .   GGCGGCGGCGGC    -

This is the preferred format for annotations. But it causes problems because I am now unable to merge values from the original .vcf back into the ANNOVAR output.
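To make the mismatch concrete, here is a rough sketch of the coordinate conversion that seems to be happening. VCF keeps an anchor base in front of an indel, while the ANNOVAR-style record drops the shared leading bases, shifts the start position, and writes the empty allele as "-". The function name vcf_to_annovar and its four-column input (CHROM, POS, REF, ALT) are made up for illustration. This approximates the convention for simple insertions and deletions only and is not ANNOVAR's own code, so complex block substitutions may be handled differently by ANNOVAR itself.

```shell
# Sketch only: vcf_to_annovar is a hypothetical helper, not part of ANNOVAR.
# Reads tab-separated CHROM, POS, REF, ALT on stdin and prints
# ANNOVAR-style Chr, Start, End, Ref, Alt for simple indels and SNVs.
vcf_to_annovar() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        chrom = $1; pos = $2 + 0; ref = $3; alt = $4
        if (length(ref) > length(alt) && substr(ref, 1, length(alt)) == alt) {
            # Deletion: drop the shared leading bases; the empty ALT becomes "-"
            start = pos + length(alt)
            stop  = pos + length(ref) - 1
            ref   = substr(ref, length(alt) + 1)
            alt   = "-"
        } else if (length(alt) > length(ref) && substr(alt, 1, length(ref)) == ref) {
            # Insertion: Start and End both point at the last shared base
            start = pos + length(ref) - 1
            stop  = start
            alt   = substr(alt, length(ref) + 1)
            ref   = "-"
        } else {
            # SNV or anything else: leave the alleles untouched
            start = pos
            stop  = pos + length(ref) - 1
        }
        print chrom, start, stop, ref, alt
    }'
}

# The record from this post:
printf 'chrX\t66766356\tTGGCGGCGGCGGC\tT\n' | vcf_to_annovar
# prints: chrX  66766357  66766368  GGCGGCGGCGGC  -   (tab-separated)
```

Applying the same conversion to the original .vcf would, in principle, give a key that matches the ANNOVAR rows, provided multiallelic sites have been split into one ALT per line first.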

As per the comment on the bcftools issue posted here, the ANNOVAR output format is "not a valid VCF record". So it seems that maybe variant normalization tools would not be appropriate for producing this output?

Any ideas on how to fix this workflow and get both the custom selected & recalculated fields from the original .vcf combined with the ANNOVAR output in these cases?


Yes, this is the annoying part about ANNOVAR, and it can result in inadvertent information loss if one is not aware of it. I have come up with a few personalised solutions that bypass this, but my situations were not the same as yours. For one, I never wanted to 'marry' the annotated data back to the VCF. The annotation CSV was the end of the line for me.

Why exactly do you need to 'marry' the annotated data back to the VCF? I think that the ANNOVAR function allows you to include various pieces of information from the VCF as extra columns, no?

written 10 months ago by Kevin Blighe

In this case, I need to recalculate the allele frequencies from GATK HaplotypeCaller, since they are not listed as the empirical values, and I want to propagate that value through to the final annotation table.
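As a rough illustration of that recalculation, the sketch below derives a frequency from the per-sample AD (allelic depth) field. It assumes that "empirical" means alternate-allele depth divided by total depth, and that each record has a single ALT allele (multiallelic sites split beforehand). The helper name af_from_ad is hypothetical.

```shell
# Sketch only: af_from_ad is a hypothetical helper.
# Takes an AD string such as "12,8" (ref depth, alt depth) and prints
# alt depth / total depth, or "NA" when the total depth is zero or missing.
af_from_ad() {
    printf '%s\n' "$1" | awk -F',' '{
        total = 0
        for (i = 1; i <= NF; i++) total += $i
        if (total == 0) { print "NA"; next }
        # $2 is the depth of the first (assumed only) ALT allele
        printf "%.4f\n", $2 / total
    }'
}

af_from_ad "12,8"
# prints: 0.4000
```

The AD string itself would still need to be pulled out of the sample column of the .vcf or .tsv, which depends on how the FORMAT field is laid out in the file at hand.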

written 10 months ago by steve
10 months ago, steve (United States) wrote:

It looks like a lot of my problems were solved by using the --vcfinput option of ANNOVAR, among other things, like this:

table_annovar.pl "${sample_vcf}" "${annovar_db_dir}" \
        --buildver "${params.ANNOVAR_BUILD_VERSION}" \
        --remove \
        --protocol "${params.ANNOVAR_PROTOCOL}" \
        --operation "${params.ANNOVAR_OPERATION}" \
        --nastring . \
        --vcfinput \
        --otherinfo \
        --onetranscript \
        --outfile "${sampleID}"

This produces an .avinput file that lists every original line from the input VCF alongside its ANNOVAR counterpart. The command also carries the original VCF data into the annotation output file and writes an extra .vcf-formatted annotation file. So, there is a lot of extra data to play with for custom processing.
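One way to use the .avinput file for merging is to build a lookup table from the original VCF coordinates to the ANNOVAR coordinates. The sketch below assumes the first five columns are the ANNOVAR Chr, Start, End, Ref, Alt, and that the original VCF columns begin at column 9 (after three extra value columns). That offset is an assumption that should be checked against the actual file, so it is exposed as an optional second argument. The function name avinput_keymap is hypothetical.

```shell
# Sketch only: avinput_keymap is a hypothetical helper.
# $1 = path to the .avinput file
# $2 = 1-based column where the original VCF CHROM starts (assumed 9 here;
#      verify against your own .avinput before relying on it)
# Prints: VCF key (CHROM:POS:REF:ALT) <tab> ANNOVAR key (Chr:Start:End:Ref:Alt)
avinput_keymap() {
    awk -v c="${2:-9}" 'BEGIN { FS = OFS = "\t" }
    {
        # Original VCF columns: CHROM, POS, ID, REF, ALT (ID is skipped)
        vcf_key = $c ":" $(c + 1) ":" $(c + 3) ":" $(c + 4)
        av_key  = $1 ":" $2 ":" $3 ":" $4 ":" $5
        print vcf_key, av_key
    }' "$1"
}
```

Calling avinput_keymap "${sampleID}.avinput" would then give a two-column table that can be joined against both the recalculated .tsv (on the VCF key) and the annotation table (on the ANNOVAR key).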

For reference, the full workflow I am working on is here: https://github.com/stevekm/vcf-filter-annotate


Great - feel free to accept your own answer. The 'bug' to which I was referring was for when one was arriving via a non-VCF route, i.e., a custom variant list. Most people would not be aware that ANNOVAR requires this specific format for indels; thus, the indels that are not encoded correctly end up in the .invalid_input file after annotation.

written 10 months ago by Kevin Blighe