Question: VEP annotation doubt with NM (NM_000000.1_dup19)
2.4 years ago
Cristian.perez wrote:

Hi community!

I'm annotating variants with the VEP software and I'm finding some unexpected transcript data of the type:

  • NM_014938.4_dupl16
  • NM_001170637.2_dupl3
    1       206516261       .       C       T       47      PASS CSQ=T|non_coding_transcript_exon_variant|MODIFIER|SRGAP2|23380|Transcript|NM_001170637.2_dupl3|mRNA|1/20||NM_001170637.2_dupl3.1:n.65C>T||65|||||||1||SNV|EntrezGene|||||||||||C|C||||||||||||||||||||||||||||||||||||||||||||||||||||||||||1:206516261-206516261|0.4996565||||,
    T|missense_variant|MODERATE|SRGAP2|23380|Transcript|NM_015326.4|protein_coding|1/20||NM_015326.4:c.65C>T|NP_056141.2:p.Arg289Trp|864|865|289|R/W|Cgg/Tgg|||1||SNV|EntrezGene||YES||||NP_056141.2|||||C|C|OK|||||||||||||||||||||||||||||||||||0.63580||||T|T||||||||||2|||||||1:206516261-206516261|0.4996565||||       GT:DP:VD:AD:AF:RD:ALD   0/1:9:3:6,3:0.3333:6,0:3,0

Searching the VEP website and the internet, I can't find any reference to this kind of "dupl" suffix. Has anyone come across this? I don't know whether they are alternative versions of the transcript, or why they are not listed as transcripts in their own right.

Thanks in advance!


Edit: Added an example of a variant with the VEP annotation of a dupl transcript (NM_015326.3_dupl3)

Edit2: Using Ensembl VEP version 91.1 with cache v91

vep • 810 views
modified 2.4 years ago by Emily_Ensembl21k • written 2.4 years ago by Cristian.perez50

could you post the variants (VCF records) that cause this annotation?

written 2.4 years ago by cpad011214k

Also, which column of your VEP output are you finding this notation in?

written 2.4 years ago by Emily_Ensembl21k

Hi Emily, it's the parameter that references the transcript, the "Feature" column (I'm actually outputting in a VCF format).

written 2.4 years ago by Cristian.perez50

Thanks, will try to trace.

written 2.4 years ago by Emily_Ensembl21k

Are you using GRCh37?

written 2.4 years ago by Emily_Ensembl21k

Yes, version 91 of GRCh37

written 2.4 years ago by Cristian.perez50

I think NM_015326.3_dupl3.1 and the other entries mentioned in the OP are feature (transcript) names in that build. Variation Reporter for NC_000001.10:g.206516261C>T on GRCh37.p13 (AR-105, dbSNP v149) doesn't list a coding variant at position 65 but at position 322 (NM_015326.4:c.322C>T), and it has only one annotation instead of the two shown above.

modified 2.4 years ago • written 2.4 years ago by cpad011214k

I supposed it was something like that. What intrigues us is why they are named "duplXX". We thought they might be duplicates of other transcripts, or reference transcripts with duplicated exons, but seeing "dupl16" was really strange.

written 2.4 years ago by Cristian.perez50
2.4 years ago
Emily_Ensembl wrote:

We are investigating these. It looks like some RefSeq transcripts (e.g. NM_001170637.3) have been duplicated in Ensembl's other_features database with a lower version number and this dupl suffix (e.g. NM_001170637.2_dupl3). This has been propagated across to the VEP cache, which is why you're seeing them. We don't currently know why, but we believe that you can just leave them out of your analyses for now.

modified 2.4 years ago • written 2.4 years ago by Emily_Ensembl21k

We've uncovered the source. This occurs in our pipeline that imports the RefSeq transcripts. If we find that there are two with the same ID (e.g. NM_001170637.3 and NM_001170637.2), the pipeline adds this dupl suffix, instead of the sensible option of just deleting the older one. We're not sure why we've written the pipeline in this way, as it seems a bit silly, but we will fix it. In the meantime, as I said before, just ignore them.

I'm really sorry about this.
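
For anyone who wants to ignore these entries programmatically, here is a minimal Python sketch that drops CSQ entries whose Feature ends in a dupl suffix. It assumes the Feature field is the seventh pipe-separated field (index 6), as in the record posted in the question; the real order is given by the "Format:" string in the CSQ header line of your VCF, so check that first. The function name `drop_dupl_entries` is made up for illustration and is not part of VEP.

```python
import re

# Matches a trailing "_dupl<number>" as seen in e.g. NM_001170637.2_dupl3
DUPL_SUFFIX = re.compile(r"_dupl\d+$")


def drop_dupl_entries(csq_value, feature_index=6):
    """Return the CSQ value without entries whose Feature has a dupl suffix.

    csq_value     -- the text after "CSQ=" in the INFO column
    feature_index -- position of the Feature field (assumption: 6, taken
                     from the example record; verify against the VCF header)
    """
    kept = []
    for entry in csq_value.split(","):
        fields = entry.strip().split("|")
        feature = fields[feature_index] if len(fields) > feature_index else ""
        if not DUPL_SUFFIX.search(feature):
            kept.append(entry.strip())
    return ",".join(kept)


# Shortened versions of the two entries from the question
csq = (
    "T|non_coding_transcript_exon_variant|MODIFIER|SRGAP2|23380|Transcript|"
    "NM_001170637.2_dupl3|mRNA,"
    "T|missense_variant|MODERATE|SRGAP2|23380|Transcript|"
    "NM_015326.4|protein_coding"
)
print(drop_dupl_entries(csq))
```

Only the NM_015326.4 entry is left after filtering. If every entry for a variant carries a dupl suffix, the function returns an empty string, so decide beforehand how you want to handle such records.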

written 2.4 years ago by Emily_Ensembl21k

Thanks Emily. Something is worrying me though... Shouldn't this happen whenever there are two transcripts with the same ID but different versions? Should I always take the latest version of the transcript for each variant? I'm raising this because, for the same variant, I'm finding two different versions of the same ID (NM_001170637.3 and NM_001170637.2 [NOT a real example, but if you need one I can find it]).

I imagine that the "dupl" error comes from a particular release and is not a general issue. Right now I'm checking every variant to remove old versions of a RefSeq ID if there's a newer one.
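
A rough Python sketch of that kind of check, in case it is useful to others: given the Feature IDs annotated for one variant, keep only the highest version of each RefSeq accession. The function name `keep_latest_versions` and the regular expression are my own assumptions about how the IDs look (accession, dot, version, optional dupl suffix), and the IDs in the example are illustrative, not taken from a real variant.

```python
import re

# Assumed ID shape: two-letter prefix, underscore, digits, ".version",
# and optionally the "_dupl<number>" suffix discussed in this thread.
REFSEQ_ID = re.compile(r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)(?:_dupl\d+)?$")


def keep_latest_versions(features):
    """Keep, per RefSeq accession, only the IDs with the highest version.

    IDs that do not look like versioned RefSeq accessions are kept as-is.
    The original order of the list is preserved.
    """
    parsed = [REFSEQ_ID.match(f) for f in features]

    # First pass: highest version seen for each accession
    latest = {}
    for match in parsed:
        if match:
            acc = match.group("accession")
            latest[acc] = max(latest.get(acc, 0), int(match.group("version")))

    # Second pass: drop anything older than the highest version
    kept = []
    for feature, match in zip(features, parsed):
        if match is None or int(match.group("version")) == latest[match.group("accession")]:
            kept.append(feature)
    return kept


print(keep_latest_versions(["NM_001170637.2_dupl3", "NM_001170637.3", "NM_015326.4"]))
```

Note that this only removes a dupl entry when a newer version of the same accession is annotated on the same variant. In the record from the question, NM_001170637.2_dupl3 appears next to NM_015326.4, which is a different accession, so a version check alone would not remove it.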

written 2.4 years ago by Cristian.perez50

There are 36 of them in the up-to-date database, so it's not universal. For various historical and political reasons, we have two different snapshots of the RefSeq database which we merge together, so it will only affect transcripts that were updated between those two snapshots.

written 2.4 years ago by Emily_Ensembl21k
