VEP annotation doubt with NM (NM_000000.1_dup19)
3.6 years ago

Hi community,

I'm annotating variants with VEP and I'm finding some unexpected transcript IDs such as:

• NM_014938.4_dupl16
• NM_001170637.2_dupl3
    1       206516261       .       C       T       47      PASS CSQ=T|non_coding_transcript_exon_variant|MODIFIER|SRGAP2|23380|Transcript|NM_001170637.2_dupl3|mRNA|1/20||NM_001170637.2_dupl3.1:n.65C>T||65|||||||1||SNV|EntrezGene|||||||||||C|C||||||||||||||||||||||||||||||||||||||||||||||||||||||||||1:206516261-206516261|0.4996565||||,
T|missense_variant|MODERATE|SRGAP2|23380|Transcript|NM_001170637.3|protein_coding|1/20||NM_001170637.3:c.65C>T|NP_001164108.1:p.Arg289Trp|864|865|289|R/W|Cgg/Tgg|||1||SNV|EntrezGene||||||NP_001164108.1|||||C|C|OK|||||||||||||||||||||||||||||||||||0.63580||||T|T||||||||||2|||||||1:206516261-206516261|0.4996565||||,
T|missense_variant|MODERATE|SRGAP2|23380|Transcript|NM_001300952.1|protein_coding|1/18||NM_001300952.1:c.65C>T|NP_001287881.1:p.Arg289Trp|864|865|289|R/W|Cgg/Tgg|||1||SNV|EntrezGene||||||NP_001287881.1|||||C|C|OK|||||||||||||||||||||||||||||||||||0.63580||||T|T||||||||||2|||||||1:206516261-206516261|0.4996565||||,
T|non_coding_transcript_exon_variant|MODIFIER|SRGAP2|23380|Transcript|NM_015326.3_dupl3|mRNA|1/20||NM_015326.3_dupl3.1:n.65C>T||65|||||||1||SNV|EntrezGene||YES|||||||||C|C||||||||||||||||||||||||||||||||||||||||||||||||||||||||||1:206516261-206516261|0.4996565||||,


Searching the VEP website and the wider internet, I can't find any reference to this kind of "dupl" suffix. Has anyone come across this? I don't know whether these are alternative versions of the transcript, or why they are not treated as transcripts in their own right.

Cristian.

Edit: Added an example variant with the "dupl" VEP annotation (NM_015326.3_dupl3)

Edit2: Using Ensembl VEP version 91.1 with cache v91

Could you post the variants (VCF records) that cause this annotation?

Also, which column of your VEP output are you finding this notation in?

Hi Emily, it's the parameter that references the transcript, the "Feature" column (I'm actually outputting in a VCF format).

Thanks, will try to trace.

Are you using GRCh37?

Yes, version 91 of GRCh37

I think NM_015326.3_dupl3.1 and the other entries mentioned in the OP are feature (transcript) names in that build. Variation Reporter for NC_000001.10:g.206516261C>T on GRCh37.p13 (AR-105, dbSNP v149) doesn't list a coding variant at position 65; it lists one at position 322 (NM_015326.4:c.322C>T), and it shows only one annotation rather than the two mentioned above.

I supposed it was something like that. What intrigues us is why they are named "duplXX". We thought they might be duplicates of other transcripts, or reference transcripts with duplicated exons, but seeing "dupl16" was really strange.

3.6 years ago

We are investigating these. It looks like some RefSeq transcripts (e.g. NM_001170637.3) have been duplicated in Ensembl's other_features database with a lower version number and this dupl suffix (e.g. NM_001170637.2_dupl3). This has been propagated across to the VEP cache, which is why you're seeing them. We don't currently know why, but we believe that you can just ignore them in your analyses for now.

We've uncovered the source. This occurs in our pipeline that imports the RefSeq transcripts. If it finds two with the same ID (e.g. NM_001170637.3 and NM_001170637.2), the pipeline adds this dupl suffix instead of taking the sensible option of just deleting the older one. We're not sure why we wrote the pipeline this way, as it seems a bit silly, but we will fix it. In the meantime, as I said before, just ignore them.
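For anyone who wants to ignore these entries programmatically, here is a minimal Python sketch that drops them from a CSQ value. It is not part of VEP: the function name `drop_dupl_entries` is made up for illustration, the suffix pattern (`_dupl` followed by digits) is inferred from the examples in this thread, and the default `feature_index=6` assumes the Feature field sits in the seventh position, as in the records posted above. Check the `Format:` string in your own VCF header before relying on that index.

```python
import re

# Assumption: the suffix is "_dupl" plus digits, as in NM_001170637.2_dupl3.
DUPL_SUFFIX = re.compile(r"_dupl\d+$")

def drop_dupl_entries(csq_value, feature_index=6):
    """Return a CSQ value without entries whose Feature ID ends in _duplN.

    feature_index is the 0-based position of the Feature field; 6 matches
    the records in this thread but is an assumption for other outputs.
    """
    kept = []
    for entry in csq_value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # skip empty pieces left by trailing commas
        fields = entry.split("|")
        if len(fields) > feature_index and DUPL_SUFFIX.search(fields[feature_index]):
            continue  # duplicated older transcript version: ignore it
        kept.append(entry)
    return ",".join(kept)
```

Entries without the suffix pass through unchanged, so the function can be run over every record.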

Thanks Emily. Something is worrying me though. Shouldn't this happen whenever there are two transcripts with the same ID but different versions? Should I always take the latest version of the transcript for each variant? I'm asking because, for the same variant, I'm finding two different versions of the same ID (e.g. NM_001170637.3 and NM_001170637.2; NOT a real example, but I can find one if you need it).

I imagine that the "dupl" error comes from a particular release and is not a general issue. Right now I'm checking every variant and removing old versions of a RefSeq ID if there's a newer one.
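A rough Python sketch of that check, for illustration only: the function name `keep_latest_versions` and the ID pattern are my own assumptions (RefSeq-style `accession.version`, optionally followed by `_duplN`), and the IDs in the usage example are the illustrative ones from this thread, not a confirmed real case.

```python
import re

# Assumption: IDs look like "NM_001170637.3", optionally with "_duplN" appended.
REFSEQ_ID = re.compile(r"^(?P<acc>[A-Z]{2}_\d+)\.(?P<ver>\d+)(?P<dupl>_dupl\d+)?$")

def keep_latest_versions(features):
    """Keep only the highest version of each RefSeq accession.

    IDs carrying a _duplN suffix are dropped; IDs that do not match the
    RefSeq-style pattern are passed through untouched.
    """
    # First pass: record the highest version seen for each accession.
    latest = {}
    for feat in features:
        m = REFSEQ_ID.match(feat)
        if m and not m.group("dupl"):
            acc, ver = m.group("acc"), int(m.group("ver"))
            latest[acc] = max(latest.get(acc, ver), ver)
    # Second pass: keep the newest versions, preserving input order.
    kept = []
    for feat in features:
        m = REFSEQ_ID.match(feat)
        if m is None:
            kept.append(feat)
        elif not m.group("dupl") and int(m.group("ver")) == latest[m.group("acc")]:
            kept.append(feat)
    return kept

ids = ["NM_001170637.2_dupl3", "NM_001170637.3", "NM_001170637.2", "NM_001300952.1"]
print(keep_latest_versions(ids))
```

The two passes keep the original ordering of the surviving IDs, which makes it easy to map them back to their CSQ entries.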

There are 36 of them in the up-to-date database, so it's not universal. For various historical and political reasons, we have two different snapshots of the RefSeq database which we merge together, so it will only affect transcripts that were updated between those two snapshots.