Question

Annovar - RNA level variant positions

9.4 years ago

jacobsen.jeremy ▴ 40

I have run Annovar on GATK output after inserting a row for the end locus (following the Annovar "prepare input file" tutorial). The command I am using is: annotate_variation.pl -out gatk -build hg19 example/gatkfile humandb/ -dbtype knownGene so that I can get UCSC transcript annotations.

For simple insertions/deletions I can pull out a protein sequence from hg_19knownPep to see if the variant position information (for instance G952A) is correct. I wrote code to do this for all non-synonymous SNVs and the Annovar annotations are correct for all of them.

On the other hand, this is not the case when I look at the RNA level. For instance, take the Annovar entry:

frameshift substitution    NBPF8:uc031pny.1:exon2:c.116_116delinsGAA,    chr1    144615250    144615250    G    GAA

When I get the RNA sequence from the Annovar HG19 reference file for uc031pny.1 I notice the following which is causing me confusion:

G is at chr1:144615250 in IGV forward strand (check)
But when I get the mRNA sequence from Annovar's KnownGeneMrna, the nucleotide at position 116 is T. This is pretty consistent for all the substitutions and deletions in the output. I think I'm misinterpreting something but I'm not sure what. Any help would be excellent.

Thanks,
Jeremy

SNP annovar RNA-Seq • 2.4k views


Ram · Answer 1 · 2014-11-17

9.4 years ago

Stoploss25 ▴ 10

It may be that 116 is the position from the start of the exon (in this case exon 2), not the start of the transcript.


Ram · Answer 2 · 2014-11-17

So I took a close look at another example:

line23941    frameshift substitution    CDK18:uc009xbm.1:exon6:c.428_430G,    chr1    205495889    205495891    GCT    G.

Since the variant is reported in exon 6, I calculated the offset contributed by the first 5 exons in the transcript. Values were taken from Annovar's "knownGene" reference:

exon end      exon start    length
205492753     205492610     143
205493485     205493359     126
205494323     205494266     57
205495307     205495192     115
205495589     205495494     95

              total offset  536

This means that the position from the start of the transcript should be 536+428 = 964 (if these are from the exon start). The actual position from the start is 565 according to "knownGeneMrna". Also, according to IGV, the deletion is between exons 7 and 8 (not 6).
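The offset arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation described in this post: the coordinates are the five exon start/end pairs quoted above, and the variable names are made up for illustration.

```python
# Exon (start, end) pairs as quoted from the knownGene reference above.
exons = [
    (205492610, 205492753),
    (205493359, 205493485),
    (205494266, 205494323),
    (205495192, 205495307),
    (205495494, 205495589),
]

# Length of each exon is end minus start.
lengths = [end - start for start, end in exons]
total_offset = sum(lengths)

# Position expected if c.428 were counted from the start of exon 6.
expected_pos = total_offset + 428

print(lengths)
print(total_offset, expected_pos)
```

The result (964) does not match the 565 observed in knownGeneMrna, which is what rules out the "counted from the exon start" hypothesis.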

Ram · Answer 3 · 2014-11-21

9.4 years ago

jacobsen.jeremy ▴ 40

It turns out that UTRs are included in the RNA sequence assigned to the UCSC accession number. The 428 in c.428_430G is the 428th nucleotide from the start of the CDS, not from the start of the transcript.
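To map a c. position onto the mRNA sequence, the length of the 5' UTR has to be added. Below is a minimal sketch of that conversion, assuming UCSC knownGene-style coordinates (0-based starts, exclusive ends, with cdsStart/cdsEnd given on the genome regardless of strand). The function names and the toy transcript are hypothetical, not part of Annovar. For the CDK18 example, the numbers in this thread imply a 5' UTR of 565 - 428 = 137 nt.

```python
def utr5_length(strand, cds_start, cds_end, exon_starts, exon_ends):
    """Count transcribed bases that precede the first coding base.

    Assumes knownGene-style coordinates: 0-based starts, exclusive ends.
    """
    length = 0
    for start, end in zip(exon_starts, exon_ends):
        if strand == "+":
            # Exonic bases lying to the left of cdsStart.
            length += max(0, min(end, cds_start) - start)
        else:
            # On the minus strand the 5' UTR lies to the right of cdsEnd.
            length += max(0, end - max(start, cds_end))
    return length


def cds_to_mrna_pos(c_pos, utr5):
    """Convert a 1-based CDS position (c.N) to a 1-based mRNA position."""
    return utr5 + c_pos


# Toy transcript (hypothetical coordinates): two exons, CDS from 150 to 350.
toy_utr5 = utr5_length("+", 150, 350, [100, 300], [200, 400])
print(toy_utr5, cds_to_mrna_pos(1, toy_utr5))

# CDK18 example, using the 137 nt 5' UTR implied by 565 - 428.
print(cds_to_mrna_pos(428, 137))
```

The mRNA position returned here is 1-based, so when indexing a Python string of the knownGeneMrna sequence, subtract 1.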
