Should the stop codon be counted as part of the CDS?
3
2
4.8 years ago
-_- ▴ 890

I am analyzing the GTF file at http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz and trying to understand the gene structure. Based on my observations, I have summarized the following relationships:

1. UTR is part of exon
2. CDS is part of exon
3. start codon is part of CDS, hence part of exon, too
4. stop codon is neither part of UTR nor part of CDS, but it's still part of exon.

Therefore, given a transcript id, if I sum the lengths of all features of each type, the following relationship should hold:

L_{exon} = L_{CDS} + L_{UTR} + L_{stop_codon}


I am only considering features whose source is protein_coding. When I check this relationship for all 90273 transcript ids in the GTF file, it holds for 99.85% of transcripts. For the remaining 0.15%, or 113 transcripts, it doesn't hold, and the left side is off by 1.
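A minimal sketch of how such a check could be run, assuming the release-75 layout in which column 2 (source) holds the biotype and every non-gene row carries a `transcript_id` attribute. The names `feature_lengths` and `mismatches` are hypothetical helpers for illustration, not part of any library.

```python
import re
from collections import defaultdict

# Assumption: attributes follow the usual GTF form: transcript_id "ENST...";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def feature_lengths(gtf_lines, source="protein_coding"):
    """Sum feature lengths per transcript and feature type.

    GTF coordinates are 1-based and inclusive, so length = end - start + 1.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[1] != source:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue  # e.g. gene rows, which have no transcript_id
        start, end = int(fields[3]), int(fields[4])
        totals[match.group(1)][fields[2]] += end - start + 1
    return totals


def mismatches(totals):
    """Return {transcript_id: L_exon - (L_CDS + L_UTR + L_stop_codon)} where nonzero."""
    bad = {}
    for tid, lengths in totals.items():
        expected = lengths["CDS"] + lengths["UTR"] + lengths["stop_codon"]
        if lengths["exon"] != expected:
            bad[tid] = lengths["exon"] - expected
    return bad
```

To run it on the real file, the lines could be read with `gzip.open(path, "rt")`, where `path` points to a local copy of the GTF, and passed to `feature_lengths`.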

I looked closely at 4 of the 113 anomalous cases, and in all 4 the relationship fails for the same reason: the stop codon is split, meaning part of it lies in one exon (e.g. 2 bases) and the rest lies in another exon (e.g. 1 base). Strangely, the first 2 bases don't count as part of the CDS but the 1-base part does, which doesn't quite make sense to me. Could this be an error in the GTF file?
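The pattern described above can be searched for directly. The sketch below is one possible way to do it, assuming the rows have already been reduced to `(transcript_id, feature, start, end)` tuples; the function name `split_stop_codons` is hypothetical. It reports, for each transcript whose stop codon spans more than one row, the stop codon fragments that overlap a CDS interval.

```python
from collections import defaultdict


def split_stop_codons(records):
    """Find split stop codons whose fragments overlap a CDS interval.

    records: iterable of (transcript_id, feature, start, end) tuples with
    1-based, inclusive coordinates (the GTF convention).
    Returns {transcript_id: [overlapping stop_codon fragments]} for every
    transcript that has two or more stop_codon rows.
    """
    cds = defaultdict(list)
    stops = defaultdict(list)
    for tid, feature, start, end in records:
        if feature == "CDS":
            cds[tid].append((start, end))
        elif feature == "stop_codon":
            stops[tid].append((start, end))

    report = {}
    for tid, fragments in stops.items():
        if len(fragments) < 2:
            continue  # stop codon lies within a single exon
        report[tid] = [
            (s, e)
            for s, e in sorted(fragments)
            # two closed intervals overlap when each starts before the other ends
            if any(s <= cds_end and cds_start <= e for cds_start, cds_end in cds[tid])
        ]
    return report
```

An empty list for a transcript means its stop codon is split but no fragment is also annotated as CDS; a non-empty list points at the fragment that is double counted.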

Below, I pasted a concrete example with the problematic region highlighted.

The entries are sorted by the start column. The summed lengths for each feature type are:

CDS            1993
UTR            2363
exon           4358
start_codon    3
stop_codon     3


Applying the above formula, the left side is 4358, the right side is 1993 + 2363 + 3 = 4359, and they DON'T match.

gtf cds stop codon • 2.7k views
4

Thank you for reporting this.

This is indeed an edge case we are not dealing with correctly. We will look into fixing this and are hoping to provide updated files for release 86.

Please note that the file you are pointing to will not be updated, as we do not replace archived data. However, you can always access the latest version of our files in this directory: ftp://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/

As the gene set for GRCh37 has been frozen, it will contain the same annotation as release 75, but with updates to the GTF format and bug fixes where necessary.

Regards, Magali

0

Hi Mag, thanks for your reply. Can you confirm whether the stop codon should be counted as part of the CDS, please? You could also post it as an answer and I will accept it.

0

In GTF format, stop codons are not part of the CDS, which is also why they are provided as separate features http://mblab.wustl.edu/GTF22.html

In GFF3 format, stop (and start) codons are part of the CDS and are implicit to the CDS coordinates https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
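To illustrate the difference between the two conventions, here is a rough sketch of how GTF-style CDS intervals (stop codon excluded) could be turned into GFF3-style CDS intervals (stop codon included) for a single transcript. The function name `gff3_style_cds` is hypothetical, and the sketch assumes all intervals belong to one transcript on one strand. It does not recompute the phase column.

```python
def gff3_style_cds(cds, stop_codons):
    """Merge GTF CDS intervals with stop_codon intervals.

    Both arguments are lists of (start, end) tuples with 1-based, inclusive
    coordinates. Intervals that overlap or are directly adjacent are fused,
    so a stop codon that follows the last CDS segment extends that segment,
    while a stop codon fragment in a separate exon becomes its own segment.
    """
    merged = []
    for start, end in sorted(list(cds) + list(stop_codons)):
        if merged and start <= merged[-1][1] + 1:
            # overlapping or adjacent: extend the previous segment
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Because the merge works on sorted genomic coordinates, it behaves the same on either strand: on the minus strand the stop codon sits at the low-coordinate end and is fused with the first interval instead of the last.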

Hope that helps, Magali

1

This is almost guaranteed to be a bug in how the file was made. Please contact the Ensembl help desk and let them know.

1
4.8 years ago

Yes, the stop codon is part of the CDS in general.

3

No, stop codons are not part of the CDS. The nucleotide sequence that is translated to amino acids is the CDS.

0
17 months ago
Juke34 ★ 5.7k

As @Magali_Ensembl nicely put it, in GTF format stop codons are not part of the CDS, while in GFF3 they are. I wrote a review of these formats here, if you are interested:

https://github.com/NBISweden/GAAS/blob/master/annotation/CheatSheet/gxf.md

Read the section "Problem encountered due to lack of standardization" for a similar problem encountered with stop codons.

But as usual there is an exception: GenBank. At least until GTF2.2, it sounds like they were including the stop codon in the CDS. That is what is mentioned here: http://mblab.wustl.edu/GTF22.html, but I never checked. I would love to know how it is nowadays and, if it has changed, when they changed it.
