Question: Should the stop codon be counted as part of the CDS?
2
3.2 years ago, -_- (Canada) wrote:

I am analyzing the GTF file at http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz and trying to understand gene structure. Based on what I observe, I have summarized the following relationships:

  1. UTR is part of exon
  2. CDS is part of exon
  3. start codon is part of CDS, hence part of exon, too
  4. stop codon is neither part of UTR nor part of CDS, but it's still part of exon.

Therefore, given a transcript id, if I sum the length of each type of sequences, the following relationship should hold:

L_{exon} = L_{CDS} + L_{UTR} + L_{stop_codon}

I am only considering features whose source is protein_coding. When I check this relationship for all 90273 transcript IDs in the GTF file, it holds for 99.85% of transcripts. For the remaining 0.15%, or 113 transcripts, it doesn't hold: the left side is off by 1.
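For reference, a check like this could be scripted roughly as follows. This is only a sketch, not the code actually used for the numbers above. It assumes the release-75 layout, where column 2 (source) holds the biotype and the attributes column carries a quoted transcript_id. The function names feature_lengths and violations are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: attributes look like ... transcript_id "ENST..."; ...
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def feature_lengths(lines, source="protein_coding"):
    """Sum feature lengths per transcript and feature type from GTF lines."""
    totals = defaultdict(lambda: defaultdict(int))
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[1] != source:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue  # e.g. gene rows, which carry no transcript_id
        start, end = int(fields[3]), int(fields[4])
        # GTF coordinates are 1-based and inclusive
        totals[match.group(1)][fields[2]] += end - start + 1
    return totals


def violations(totals):
    """Return {transcript_id: L_exon - (L_CDS + L_UTR + L_stop_codon)} where nonzero."""
    bad = {}
    for tid, t in totals.items():
        diff = t["exon"] - (t["CDS"] + t["UTR"] + t["stop_codon"])
        if diff != 0:
            bad[tid] = diff
    return bad
```

The lines can be streamed straight from the compressed file with gzip.open(path, "rt"), so the whole GTF never has to be held in memory.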

When I look closely at several of the 113 anomalous cases, the relationship fails for the same reason in each of the 4 cases I have examined. All 4 have split stop codons, meaning that part of the stop codon is in one exon (e.g. 2 bases) and the rest is in another exon (e.g. 1 base). Strangely, the first 2 bases don't count as part of the CDS but the 1-base part does, which doesn't quite make sense to me. Could this be an error in the GTF file?
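To go beyond checking a handful of cases by hand, the split stop codons could be flagged programmatically. The sketch below is an illustration under assumptions, not the method used above: it treats any transcript with more than one stop_codon row as "split" and counts how many stop codon bases are also covered by a CDS row. The function name split_stop_codon_overlaps is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: attributes look like ... transcript_id "ENST..."; ...
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def split_stop_codon_overlaps(lines):
    """Map each transcript with a multi-row stop codon to the number of
    stop codon bases that are also annotated as CDS."""
    stops = defaultdict(list)
    cds = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        interval = (int(fields[3]), int(fields[4]))
        if fields[2] == "stop_codon":
            stops[match.group(1)].append(interval)
        elif fields[2] == "CDS":
            cds[match.group(1)].append(interval)

    result = {}
    for tid, segments in stops.items():
        if len(segments) < 2:
            continue  # stop codon lies within a single exon
        shared = 0
        for s_start, s_end in segments:
            for c_start, c_end in cds.get(tid, []):
                # length of the intersection of two inclusive intervals
                shared += max(0, min(s_end, c_end) - max(s_start, c_start) + 1)
        result[tid] = shared
    return result
```

If stop codons were consistently excluded from the CDS, every value in the returned mapping would be 0. A value of 1 would correspond to the off-by-one pattern described above.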

Below, I pasted a concrete example with the problematic region highlighted.

[Image: GTF entries for the example transcript, with the problematic region highlighted]

The entries are sorted by the start column. The summed lengths for each feature type are:

CDS            1993 
UTR            2363 
exon           4358 
start_codon    3    
stop_codon     3

Applying the above formula, the left side is 4358, the right side is 1993 + 2363 + 3 = 4359, and they DON'T match.

4

Thank you for reporting this.

This is indeed an edge case we are not dealing with correctly. We will look into fixing this and are hoping to provide updated files for release 86.

Please note that the file you are pointing to will not be updated, as we do not replace archived data. However, you can always access the latest version of our files in this directory: ftp://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/

As the gene set for GRCh37 has been frozen, it will contain the same annotation as release 75, but with updates to the GTF format and bug fixes where necessary.

Regards, Magali

written 3.2 years ago by Magali_Ensembl

Hi Mag, thanks for your reply. Can you confirm whether the stop codon should be counted as part of the CDS, please? You could also post it as an answer and I will accept it.

modified 3.2 years ago • written 3.2 years ago by -_-

In GTF format, stop codons are not part of the CDS, which is also why they are provided as separate features (see http://mblab.wustl.edu/GTF22.html).

In GFF3 format, stop (and start) codons are part of the CDS and are implicit in the CDS coordinates (see https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md).

Hope that helps, Magali

written 3.2 years ago by Magali_Ensembl
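To make the difference between the two conventions concrete, here is a rough sketch of how GTF-style CDS intervals (stop codon excluded) could be turned into GFF3-style ones (stop codon included). It is an illustration only, not how Ensembl builds its files. It assumes 1-based inclusive coordinates and that CDS segments in different exons never abut each other. The function name cds_with_stop_codon is made up.

```python
def cds_with_stop_codon(cds, stop_codon):
    """Union of CDS and stop_codon intervals, merging overlapping or
    directly adjacent segments. Works for either strand, because only
    genomic coordinates are compared."""
    merged = []
    for start, end in sorted(list(cds) + list(stop_codon)):
        if merged and start <= merged[-1][1] + 1:
            # overlaps or touches the previous segment: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(segment) for segment in merged]


# Stop codon inside one exon: the last CDS segment simply grows by 3 bases.
print(cds_with_stop_codon([(11, 87)], [(88, 90)]))
# Split stop codon: 2 bases extend the last CDS segment, 1 base becomes its own segment.
print(cds_with_stop_codon([(11, 48)], [(49, 50), (101, 101)]))
```

Because the merge is a plain interval union, a stop codon base that was already listed as CDS is not counted twice.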
1

This is almost guaranteed to be a bug in how the file was made. Please contact the Ensembl help desk and let them know.

written 3.2 years ago by Devon Ryan
0
3.2 years ago, NAVANEETHAN.R wrote:

Yes, the stop codon is part of the CDS in general.

written 3.2 years ago by NAVANEETHAN.R
3

No, stop codons are not part of the CDS. The nucleotide sequence that is translated to amino acids is the CDS.

written 3.2 years ago by Jenez