I have annotated 1000 genomes phase 3 data using Annovar, whilst also using the Gencode annotation (available here: http://www.internationalgenome.org/category/annotation/) and seem to have wildly different numbers of variants. Here are the steps I used to parse the data:
Filter out MNPs and Indels using bcftools.
Use vcftools to remove all variants that do not sit within exon co-ordinates withing bedfiles (obtained using biomaRt).
At this point I annotated using annovar, and then in both the annovar and gencode cases filtered out variants leaving only stop lost, stop gained, synonymous and non-synonymous variants, ensuring to use the correct terminology in each case (as Gencode use the ensembl codes (https://www.ensembl.org/info/genome/variation/predicted_data.html) whilst annovar does not).
At this point I would expect to have two annotated data sets of roughly the same size. However, the annovar annotated data is roughly 900,000 variants in size, whilst the Gencode data set is only 55,400 variants in size.
This is a huge discrepancy and I am not sure which is best to conduct my analysis on, if indeed any.
As an aside, for the Gencode dataset I filtered variants by the consequence determined in the first transcript.
How have you tabulated annotated variant numbers?
Can the discrepancy be explained by known splice-isoforms? Some genes have upward of 15 splice-isoforms because they are well studied. For many others, we've little idea how many isoforms there may be.
Hi Kevin, thank you for your reply. I have simply counted the number of remaining variants following filtering out all but missense variants etc. I am using Python for this so have the variants stored in a list. The list is roughly 900k long for the annovar annotation and 55,400 for the Gencode annotation.
In regards to splice-isoforms, how can I address this? Whilst Gencode provides the results for multiple transcripts, annovar provides only a refSeq ID. For my analysis I need the consequence of the SNP, as well as the coding sequence of the gene it belongs to. As the coding sequence depends on the transcript, how does one choose which to use? I initially chose the first transcript (canonical isoform) as it is longest.
Either way, I have two vastly different data sets with little idea of how to proceed from here?
It would help a lot to see examples from the top of the two datasets that you've got. For example, what has ANNOVAR got for rs1234, and what has Gencode got?
Here is a variant, without any annotation:
Here is the Gencode annotation:
And here is the Annovar annotation:
That variant would be filtered out for the Gencode annotation under my criteria as it is an intron variant, but kept in the annovar annotation as it is a synonymous SNV.
Well there's instantly 1 strange occurrence because ANNOVAR has labelled this as falling within LHX4 whereas in Gencode it is ACBD6 (LHX4 is a long gene that 'runs into' ACBD6).
You may just have to look through each file manually using
vi) and see where they differ. Once you see a pattern (of difference) emerging, you'll know how to manage the problem.
Thanks, I will have a look. However even once I recognise a pattern, how can I tell what is correct? If there is no canonical annotation?
I usually look at things in the UCSC Genome Browser for clarification. If there's a discrepancy, it can be due to situations like the LHX4 and ACBD6 genes - there are countless examples like that where genes 'overlap' each other, making annotation difficult.
Also, I think you'll find that there is disagreement on which transcript is indeed the canonical one, which neither helps. Some regard the longest isoform as canonical, whilst for others it's the one that's most expressed (which may not necessarily be the longest). Then again, isoforms are expressed at different levels in different tissues.
If you want to make your data available, I'd be happy to look, but if it's important unpublished data then obviously don't!
I'd be extremely grateful if you could take a quick look. Is there anywhere you recommend for me to make it available? Thanks.
You can probably make the file available on Google Drive, Outlook OneDrive, Box dot com, or DropBox and then share the link.
I just ran a test on chr22 1000 Genomes variants and didn't notice anything unusual.
I think I have solved the issue. Many variant types such as intergenic variants are classified as synonymous SNVs by Annovar and were therefore being included in my analysis, whilst being filtered out with Gencode. Many thanks for all your help.
Great, but that's also an interesting finding as to why they would be called synonymous. Could be an error. Anyway, glad that you got it right.