Discrepencies between ANNOVAR and Gencode annotations of 1000 genomes
0
0
Entering edit mode
3.5 years ago
spiral01 ▴ 100

I have annotated 1000 genomes phase 3 data using Annovar, whilst also using the Gencode annotation (available here: http://www.internationalgenome.org/category/annotation/) and seem to have wildly different numbers of variants. Here are the steps I used to parse the data:

1. Filter out MNPs and Indels using bcftools.

2. Use vcftools to remove all variants that do not sit within exon co-ordinates withing bedfiles (obtained using biomaRt).

At this point I annotated using annovar, and then in both the annovar and gencode cases filtered out variants leaving only stop lost, stop gained, synonymous and non-synonymous variants, ensuring to use the correct terminology in each case (as Gencode use the ensembl codes (https://www.ensembl.org/info/genome/variation/predicted_data.html) whilst annovar does not).

At this point I would expect to have two annotated data sets of roughly the same size. However, the annovar annotated data is roughly 900,000 variants in size, whilst the Gencode data set is only 55,400 variants in size.

This is a huge discrepancy and I am not sure which is best to conduct my analysis on, if indeed any.

As an aside, for the Gencode dataset I filtered variants by the consequence determined in the first transcript.

SNP 1000genomes annotation • 1.0k views
0
Entering edit mode

How have you tabulated annotated variant numbers?

Can the discrepancy be explained by known splice-isoforms? Some genes have upward of 15 splice-isoforms because they are well studied. For many others, we've little idea how many isoforms there may be.

0
Entering edit mode

Hi Kevin, thank you for your reply. I have simply counted the number of remaining variants following filtering out all but missense variants etc. I am using Python for this so have the variants stored in a list. The list is roughly 900k long for the annovar annotation and 55,400 for the Gencode annotation.

In regards to splice-isoforms, how can I address this? Whilst Gencode provides the results for multiple transcripts, annovar provides only a refSeq ID. For my analysis I need the consequence of the SNP, as well as the coding sequence of the gene it belongs to. As the coding sequence depends on the transcript, how does one choose which to use? I initially chose the first transcript (canonical isoform) as it is longest.

Either way, I have two vastly different data sets with little idea of how to proceed from here?

0
Entering edit mode

It would help a lot to see examples from the top of the two datasets that you've got. For example, what has ANNOVAR got for rs1234, and what has Gencode got?

0
Entering edit mode

Here is a variant, without any annotation:

1   180271915   rs9425490   A   G   100 PASS    AC=596;AF=0.11901;AN=5008;NS=2504;DP=20846;EAS_AF=0.0159;AMR_AF=0.0605;AFR_AF=0.298;EUR_AF=0.0686;SAS_AF=0.0767;AA=A|||;VT=SNP


Here is the Gencode annotation:

1   180271915   rs9425490   A   G   100 PASS    AC=596;AF=0.11901;AN=5008;NS=2504;CSQ=G|ENSG00000135847|ENST00000367595|Transcript|intron_variant||||||rs9425490||||||-1||ACBD6|HGNC|||,G|ENSG00000135847|ENST00000475338|Transcript|intron_variant&nc_transcript_variant||||||rs9425490||||||-1||ACBD6|HGNC|||,G|ENSG00000135847|ENST00000496993|Transcript|intron_variant&nc_transcript_variant||||||rs9425490||||||-1||ACBD6|HGNC|||;GERP=0.09


And here is the Annovar annotation:

line251705  synonymous SNV  LHX4:NM_033343:exon5:c.G687G:p.K229K    1   180271915   180271915   A   G   1   180271915   rs9425490   A   G   100 PASS    AC=596;AF=0.11901;AN=5008;NS=2504;DP=20846;EAS_AF=0.0159;AMR_AF=0.0605;AFR_AF=0.298;EUR_AF=0.0686;SAS_AF=0.0767;AA=A|||;VT=SNP


That variant would be filtered out for the Gencode annotation under my criteria as it is an intron variant, but kept in the annovar annotation as it is a synonymous SNV.

0
Entering edit mode

Well there's instantly 1 strange occurrence because ANNOVAR has labelled this as falling within LHX4 whereas in Gencode it is ACBD6 (LHX4 is a long gene that 'runs into' ACBD6).

You may just have to look through each file manually using vim (vi) and see where they differ. Once you see a pattern (of difference) emerging, you'll know how to manage the problem.

0
Entering edit mode

Thanks, I will have a look. However even once I recognise a pattern, how can I tell what is correct? If there is no canonical annotation?

1
Entering edit mode

I usually look at things in the UCSC Genome Browser for clarification. If there's a discrepancy, it can be due to situations like the LHX4 and ACBD6 genes - there are countless examples like that where genes 'overlap' each other, making annotation difficult.

Also, I think you'll find that there is disagreement on which transcript is indeed the canonical one, which neither helps. Some regard the longest isoform as canonical, whilst for others it's the one that's most expressed (which may not necessarily be the longest). Then again, isoforms are expressed at different levels in different tissues.

If you want to make your data available, I'd be happy to look, but if it's important unpublished data then obviously don't!

0
Entering edit mode

I'd be extremely grateful if you could take a quick look. Is there anywhere you recommend for me to make it available? Thanks.

0
Entering edit mode

You can probably make the file available on Google Drive, Outlook OneDrive, Box dot com, or DropBox and then share the link.

I just ran a test on chr22 1000 Genomes variants and didn't notice anything unusual.

Kevin

0
Entering edit mode

I think I have solved the issue. Many variant types such as intergenic variants are classified as synonymous SNVs by Annovar and were therefore being included in my analysis, whilst being filtered out with Gencode. Many thanks for all your help.

0
Entering edit mode

Great, but that's also an interesting finding as to why they would be called synonymous. Could be an error. Anyway, glad that you got it right.