Question: In the GFF3 format, can one eukaryotic mRNA contain more than one protein-coding sequence (i.e. be polycistronic)?
I0110 wrote:

Below is a simple example of a GFF3 file:

1   T1  gene    3631    4605    .   +   .   ID=ATNG01010
1   T1  mRNA    3631    4605    .   +   .   ID=ATNG01010.1;Parent=ATNG01010
1   T1  exon    3631    3913    .   +   .   ID=ATNG01010:exon:1;Parent=ATNG01010.1
1   T1  CDS 3860    3913    .   +   0   ID=ATNG01010:CDS:1;Parent=ATNG01010.1
1   T1  exon    3996    4276    .   +   .   ID=ATNG01010:exon:2;Parent=ATNG01010.1
1   T1  CDS 3996    4260    .   +   2   ID=ATNG01010:CDS:2;Parent=ATNG01010.1
1   T1  exon    4486    4605    .   +   .   ID=ATNG01010:exon:3;Parent=ATNG01010.1

My question is: if we found another coding sequence, encoding a different protein and spanning 3752 to 3904, what should the GFF3 file look like? It seems to me that GFF3 allows only one protein-coding gene per mRNA. If that is not the case, could anyone show me an example? Thank you!

annotation gff3 • 582 views
written 14 months ago by I0110
mbens wrote:

In principle, you can define an arbitrary number of CDS features per mRNA. The Parent attribute of each CDS indicates which mRNA it belongs to. If a CDS feature spans multiple lines (a discontinuous feature), all of those lines must share the same ID, so that together they represent a single CDS. Read that way, your example already contains two different protein-coding sequences for mRNA 'ATNG01010.1', namely 'ATNG01010:CDS:1' and 'ATNG01010:CDS:2', because the two CDS lines have different IDs. You could add a third one using the same pattern.

GFF Specification: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
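
To see how many distinct CDS features each mRNA carries under this reading of the specification, one could group the CDS lines by Parent and ID. The following is only a minimal Python sketch: the helper names `parse_attributes` and `cds_features_per_mrna` are made up for illustration, it assumes tab-separated columns, and it is not a full GFF3 parser (no escaping, no directives). CDS lines without an ID are collected under a placeholder key.

```python
from collections import defaultdict

# The seven data lines from the question, rebuilt with real tab separators.
ROWS = [
    ("1", "T1", "gene", "3631", "4605", ".", "+", ".", "ID=ATNG01010"),
    ("1", "T1", "mRNA", "3631", "4605", ".", "+", ".", "ID=ATNG01010.1;Parent=ATNG01010"),
    ("1", "T1", "exon", "3631", "3913", ".", "+", ".", "ID=ATNG01010:exon:1;Parent=ATNG01010.1"),
    ("1", "T1", "CDS", "3860", "3913", ".", "+", "0", "ID=ATNG01010:CDS:1;Parent=ATNG01010.1"),
    ("1", "T1", "exon", "3996", "4276", ".", "+", ".", "ID=ATNG01010:exon:2;Parent=ATNG01010.1"),
    ("1", "T1", "CDS", "3996", "4260", ".", "+", "2", "ID=ATNG01010:CDS:2;Parent=ATNG01010.1"),
    ("1", "T1", "exon", "4486", "4605", ".", "+", ".", "ID=ATNG01010:exon:3;Parent=ATNG01010.1"),
]
EXAMPLE = "\n".join("\t".join(row) for row in ROWS)


def parse_attributes(column9):
    """Split 'key=value;key=value' into a dict of lists (values may be comma lists)."""
    attrs = {}
    for field in column9.strip().strip(";").split(";"):
        if field:
            key, _, value = field.partition("=")
            attrs[key] = value.split(",")
    return attrs


def cds_features_per_mrna(gff3_text):
    """Return {parent ID: {CDS ID: [(start, end), ...]}} for all CDS lines."""
    result = defaultdict(lambda: defaultdict(list))
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        cds_id = attrs.get("ID", ["<no ID>"])[0]  # placeholder when ID is absent
        for parent in attrs.get("Parent", []):
            result[parent][cds_id].append((int(cols[3]), int(cols[4])))
    return result


features = cds_features_per_mrna(EXAMPLE)
for parent, cds_by_id in features.items():
    print(parent, "has", len(cds_by_id), "CDS feature(s):", dict(cds_by_id))
```

Lines that share one CDS ID end up as several segments of the same feature; lines with different IDs, as in the question, are counted as separate features.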

written 14 months ago by mbens

Hi mbens, thanks for your help. What I actually meant is: can one gene/mRNA contain more than one open reading frame? I mean ORFs, not CDS. I have updated the question. My apologies.

written 14 months ago by I0110

I don't understand how that is different. Do you mean polycistronic transcripts? Or maybe uORFs (upstream open reading frames)?

In case of polycistronic transcripts:

  • define both genes with different ID attributes (e.g. ID=geneA and ID=geneB)
  • define a single mRNA feature (e.g. ID=mrnaX) and list all the genes it spans in its Parent attribute (e.g. Parent=geneA,geneB)
  • define both ORFs/CDS (e.g. ID=CDSx and ID=CDSy) with the mRNA as their Parent (e.g. Parent=mrnaX)
  • add a Derives_from attribute to each ORF/CDS to indicate the gene it originates from (e.g. Derives_from=geneA and Derives_from=geneB)
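
The four steps above can be sketched in Python as a small generator of GFF3 lines. This is only an illustration under stated assumptions: the function names `gff3_line` and `polycistronic_gff3` are hypothetical, the gene boundaries and the second CDS are invented coordinates (only the 3752 to 3904 range comes from the question), each CDS is a single segment with phase 0, and exon lines are left out for brevity.

```python
def gff3_line(seqid, source, ftype, start, end, strand, phase, **attributes):
    """Format one GFF3 line; the score column is left as '.'."""
    column9 = ";".join(f"{key}={value}" for key, value in attributes.items())
    return "\t".join(
        [seqid, source, ftype, str(start), str(end), ".", strand, phase, column9]
    )


def polycistronic_gff3(seqid, source, strand, mrna_id, cistrons):
    """Build GFF3 text for one mRNA that carries several genes/CDS.

    cistrons: list of dicts with keys gene_id, gene (start, end),
    cds_id and cds (start, end).
    """
    lines = ["##gff-version 3"]
    # Step 1: one gene feature per cistron, each with its own ID.
    for c in cistrons:
        lines.append(
            gff3_line(seqid, source, "gene", *c["gene"], strand, ".", ID=c["gene_id"])
        )
    # Step 2: a single mRNA spanning all genes, listing every gene as Parent.
    mrna_start = min(c["gene"][0] for c in cistrons)
    mrna_end = max(c["gene"][1] for c in cistrons)
    parents = ",".join(c["gene_id"] for c in cistrons)
    lines.append(
        gff3_line(seqid, source, "mRNA", mrna_start, mrna_end, strand, ".",
                  ID=mrna_id, Parent=parents)
    )
    # Steps 3 and 4: one CDS per cistron, Parent is the shared mRNA and
    # Derives_from points back to the gene the CDS belongs to.
    for c in cistrons:
        lines.append(
            gff3_line(seqid, source, "CDS", *c["cds"], strand, "0",
                      ID=c["cds_id"], Parent=mrna_id, Derives_from=c["gene_id"])
        )
    return "\n".join(lines)


# Hypothetical coordinates; only 3752..3904 is taken from the question.
EXAMPLE = polycistronic_gff3(
    "1", "T1", "+", "mrnaX",
    [
        {"gene_id": "geneA", "gene": (3631, 4000), "cds_id": "CDSx", "cds": (3752, 3904)},
        {"gene_id": "geneB", "gene": (4001, 4605), "cds_id": "CDSy", "cds": (4100, 4399)},
    ],
)
print(EXAMPLE)
```

For spliced CDS, each segment would be written as its own line sharing the same CDS ID, with the phase computed per segment.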

In the case of uORFs, I am not aware of a special GFF3 definition. I would add two CDS features and use the 'Note' attribute to indicate that one of them is a uORF.

EDIT: According to the Sequence Ontology, you could use 'five_prime_open_reading_frame' as the type (3rd column) for upstream open reading frames.

written 14 months ago by mbens

Brilliant, mbens! Indeed, "polycistronic" is exactly the term I was looking for and the one I should have used in this question. The gene I have been working on is annotated as one gene with one transcript, but Ribo-seq data suggest two possible ORFs with different peptide sequences. I just found a similar case in an Arabidopsis gene model. They apparently define it as one gene with different transcripts, although the two transcripts are identical and only the CDS part differs (see below). I guess both your suggestion and their method would work for creating a GFF3 file. Thanks again.

From their gff file:

Chr5    TAIR10  gene    758374  760382  .   +   .   ID=AT5G03190;Note=protein_coding_gene;Name=AT5G03190
Chr5    TAIR10  mRNA    758374  760382  .   +   .   ID=AT5G03190.1;Parent=AT5G03190;Name=AT5G03190.1;Index=1
Chr5    TAIR10  protein 758793  760148  .   +   .   ID=AT5G03190.1-Protein;Name=AT5G03190.1;Derives_from=AT5G03190.1
Chr5    TAIR10  exon    758374  760382  .   +   .   Parent=AT5G03190.1
Chr5    TAIR10  five_prime_UTR  758374  758792  .   +   .   Parent=AT5G03190.1
Chr5    TAIR10  CDS 758793  760148  .   +   0   Parent=AT5G03190.1,AT5G03190.1-Protein;
Chr5    TAIR10  three_prime_UTR 760149  760382  .   +   .   Parent=AT5G03190.1
Chr5    TAIR10  mRNA    758374  760382  .   +   .   ID=AT5G03190.2;Parent=AT5G03190;Name=AT5G03190.2;Index=1
Chr5    TAIR10  protein 758539  760148  .   +   .   ID=AT5G03190.2-Protein;Name=AT5G03190.2;Derives_from=AT5G03190.2
Chr5    TAIR10  exon    758374  758660  .   +   .   Parent=AT5G03190.2
Chr5    TAIR10  five_prime_UTR  758374  758538  .   +   .   Parent=AT5G03190.2
Chr5    TAIR10  CDS 758539  758660  .   +   0   Parent=AT5G03190.2,AT5G03190.2-Protein;
Chr5    TAIR10  exon    758843  760382  .   +   .   Parent=AT5G03190.2
Chr5    TAIR10  CDS 758843  760148  .   +   1   Parent=AT5G03190.2,AT5G03190.2-Protein;
Chr5    TAIR10  three_prime_UTR 760149  760382  .   +   .   Parent=AT5G03190.2
Chr5    TAIR10  mRNA    758374  760382  .   +   .   ID=AT5G03190.3;Parent=AT5G03190;Name=AT5G03190.3;Index=1
Chr5    TAIR10  protein 758539  758676  .   +   .   ID=AT5G03190.3-Protein;Name=AT5G03190.3;Derives_from=AT5G03190.3
Chr5    TAIR10  exon    758374  760382  .   +   .   Parent=AT5G03190.3
Chr5    TAIR10  five_prime_UTR  758374  758538  .   +   .   Parent=AT5G03190.3
Chr5    TAIR10  CDS 758539  758676  .   +   0   Parent=AT5G03190.3,AT5G03190.3-Protein;
Chr5    TAIR10  three_prime_UTR 758677  760382  .   +   .   Parent=AT5G03190.3
written 14 months ago by I0110