GFF3 converter/parser multiple tRNAs with the same name/ID
6.1 years ago
seraphya • 0

I am editing a quick-and-dirty parser that turns an .sqn file into a GFF3 file. My main change was adding support for introns and exons. I then noticed that if a tRNA coding for the same amino acid with the same anticodon appears twice, both copies get the same name and ID. If they are on different strands, the validator I was using throws an error. This made me suspicious that I was doing things wrong.

Should they all be called ID=trnI(gau);name=trnI(gau),

or should I have ID=trnI(gau)01;name=trnI(gau)01 and ID=trnI(gau)02;name=trnI(gau)02,

or something else?

I know the current output doesn't meet my needs right now, because when I import the GFF3 file into Geneious it combines the duplicates into one annotation. However, I will eventually be submitting to GenBank, so I don't want oddly formatted tRNA annotation names.

6.1 years ago
Juke34 ★ 6.4k

Hi Seraphya,

As the SO (Sequence Ontology) project's GFF3 specification states, IDs are not necessarily unique:

In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines. All lines that share an ID collectively represent a single feature.

So this can be the case for a CDS, for example.

But in your case, even if you have identical tRNAs, I assume they are not at the same position in your genome (otherwise it would be duplicated information). Consequently, they need different IDs, even if they are on the same strand.

Let me know if it's not clear.


That is helpful, but I am still left wondering how to implement this. I guess the names could be the same, but the IDs have to be unique. I was just wondering whether there is a standard way of labeling the IDs in a GFF3 file so that they are all unique, for example:

ID=trnI(gau)01;name=trnI(gau) , ID=trnI(gau)02;name=trnI(gau)


I have never heard of a specific standard for that purpose. I think you can follow your instinct about what works best, or take inspiration from how others do it (ENSEMBL?).

What I usually do is give each tRNA an ID like "tRNA-1", with a number that starts at 1 and is incremented for every tRNA. I don't take the type of tRNA into account, but you can.
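A minimal sketch of that numbering in Python, assuming each tRNA is already available as a dict with a "name" key. The function name `assign_trna_ids` and the dict layout are made up for illustration and are not part of any existing parser. The descriptive name goes into the Name attribute, which is allowed to repeat, while the ID comes from a simple counter.

```python
def assign_trna_ids(trnas, prefix="tRNA"):
    """Return GFF3 attribute strings with a unique, incrementing ID per tRNA.

    `trnas` is an iterable of dicts with a "name" key (hypothetical layout).
    """
    attributes = []
    for number, trna in enumerate(trnas, start=1):
        # The ID is just the counter; the descriptive name is kept in Name.
        attributes.append(f"ID={prefix}-{number};Name={trna['name']}")
    return attributes


# Two copies of trnI(gau) get different IDs but share the same Name.
trnas = [{"name": "trnI(gau)"}, {"name": "trnA(ugc)"}, {"name": "trnI(gau)"}]
for attribute_string in assign_trna_ids(trnas):
    print(attribute_string)
```

If you would rather keep the tRNA type in the ID, the same idea works with one counter per name instead of a single global counter.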

6.1 years ago

It's true that the same ID may appear on multiple lines in a GFF3 file, but I think it's a bit misleading to say that IDs are not unique. IDs are unique, and when the same ID appears on multiple lines it is because the corresponding feature is a multi-line (discontinuous) feature. With the CDS example, if the coding sequence is split across 4 exons, there will be 4 CDS entries in the GFF3 file. But those 4 lines are not separate features: they collectively represent a single feature, and so they share the same (unique) ID. Juke34 and I are essentially saying the same thing, but I think how you say it is important.
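To make the distinction concrete, here is a small Python sketch that groups GFF3 lines by their ID attribute. The function name `group_by_id` and the example records are invented for illustration, and the attribute parsing is deliberately simplified (no handling of URL-escaped characters). The two CDS lines collapse into one feature with two locations, while the two identical tRNAs stay separate because their IDs differ.

```python
from collections import defaultdict


def group_by_id(gff3_lines):
    """Map each ID to the list of (start, end) locations that carry it."""
    features = defaultdict(list)
    for line in gff3_lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        # Column 9 holds semicolon-separated tag=value pairs.
        attributes = dict(
            pair.split("=", 1) for pair in columns[8].split(";") if pair
        )
        feature_id = attributes.get("ID")
        if feature_id is not None:
            features[feature_id].append((int(columns[3]), int(columns[4])))
    return dict(features)


example = [
    "chr1\tsketch\tCDS\t100\t200\t.\t+\t0\tID=cds1;Parent=mrna1",
    "chr1\tsketch\tCDS\t300\t400\t.\t+\t1\tID=cds1;Parent=mrna1",
    "chr1\tsketch\ttRNA\t900\t972\t.\t+\t.\tID=tRNA-1;Name=trnI(gau)",
    "chr1\tsketch\ttRNA\t5000\t5072\t.\t-\t.\tID=tRNA-2;Name=trnI(gau)",
]
for feature_id, locations in group_by_id(example).items():
    print(feature_id, locations)
```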

As for what the actual IDs should be, it really doesn't matter. Something simple like numbering the tRNAs (as suggested by Juke34) is probably your best bet. The only purpose of the ID is to define parent-child relationships, so as long as the value of the ID attribute matches the value of the Parent attribute of the children, it doesn't matter what the ID is. In fact, some programs or scripts will change the ID values, and as long as the parent-child relationships are preserved this is completely valid. If there is any other information you want to preserve (such as a gene name or its predicted function), it should really be stored in a different attribute such as Name or Note.
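As a rough illustration of why the ID values themselves are arbitrary, the Python sketch below renumbers every ID and rewrites the matching Parent values so the relationships survive. The helper names (`parse_attributes`, `format_attributes`, `renumber_ids`) and the sample attribute strings are hypothetical, and the parsing assumes plain tag=value pairs without escaped characters.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into an ordered dict of tag -> value."""
    return dict(pair.split("=", 1) for pair in column9.split(";") if pair)


def format_attributes(attributes):
    """Join a tag -> value dict back into a GFF3 attribute column."""
    return ";".join(f"{tag}={value}" for tag, value in attributes.items())


def renumber_ids(attribute_strings, prefix="feature"):
    """Replace every ID with prefix-N and update Parent values to match."""
    parsed = [parse_attributes(text) for text in attribute_strings]

    # First pass: give each distinct old ID exactly one new ID, so lines
    # that shared an ID (multi-line features) still share one afterwards.
    new_ids = {}
    for attributes in parsed:
        old_id = attributes.get("ID")
        if old_id is not None and old_id not in new_ids:
            new_ids[old_id] = f"{prefix}-{len(new_ids) + 1}"

    # Second pass: rewrite ID and Parent; Name and other tags are untouched.
    # A Parent that points at an unknown ID raises KeyError.
    renamed = []
    for attributes in parsed:
        if "ID" in attributes:
            attributes["ID"] = new_ids[attributes["ID"]]
        if "Parent" in attributes:
            parents = attributes["Parent"].split(",")
            attributes["Parent"] = ",".join(new_ids[p] for p in parents)
        renamed.append(format_attributes(attributes))
    return renamed


original = [
    "ID=geneA;Name=trnI(gau)",
    "ID=rnaA;Parent=geneA;Name=trnI(gau)",
    "ID=geneB;Name=trnI(gau)",
    "ID=rnaB;Parent=geneB;Name=trnI(gau)",
]
for attribute_string in renumber_ids(original):
    print(attribute_string)
```

The Name values pass through unchanged, which is why descriptive labels such as trnI(gau) are safer there than in the ID.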