Question: Find stop codon for coding transcript in gencode gtf if not given?
20 months ago, mary.a.wood.91 wrote:

I'm trying to determine how to find the stop codon position for a protein-coding transcript if there is no stop codon feature listed for it in a GENCODE GTF file.

For example, an insulin transcript (ENST00000421783.1) is listed as a protein-coding transcript in the GENCODE GRCh37 GTF, and has start codon, CDS, and exon features listed, but no stop codon:

chr11   HAVANA  transcript  2181013 2182388 .   -   .   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_num_mappings 1; remap_status "full_contig"; remap_target_status "overlap";
chr11   HAVANA  exon    2182015 2182388 .   -   .   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 1; exon_id "ENSE00001725765.1_1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2160785-2161158"; remap_status "full_contig";
chr11   HAVANA  CDS 2182015 2182201 .   -   0   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 1; exon_id "ENSE00001725765.1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2160785-2160971"; remap_status "full_contig";
chr11   HAVANA  start_codon 2182199 2182201 .   -   0   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 1; exon_id "ENSE00001725765.1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2160969-2160971"; remap_status "full_contig";
chr11   HAVANA  exon    2181013 2181102 .   -   .   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 2; exon_id "ENSE00001623769.1_1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2159783-2159872"; remap_status "full_contig";
chr11   HAVANA  CDS 2181013 2181102 .   -   2   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 2; exon_id "ENSE00001623769.1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2159783-2159872"; remap_status "full_contig";
chr11   HAVANA  UTR 2182202 2182388 .   -   .   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 1; exon_id "ENSE00001725765.1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2160972-2161158"; remap_status "full_contig";

If you look at the transcript sequence in Ensembl (http://grch37.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000254647;r=11:2181013-2182388;t=ENST00000421783 ), there does not appear to be an in-frame stop codon. Is this truly a protein-coding transcript?

In general, it also seems like there aren't stop codon features given for a large proportion of transcripts in the gtf file. How can you determine the stop codon positions for these sequences without having to search through nucleotide sequence for each transcript?


Any specific reason you are still using GRCh37? In both GRCh38 and GRCh37 this transcript is annotated as having an incomplete 3' CDS.

written 20 months ago by genomax (modified 20 months ago)

This was just one example - but this means the annotation is incomplete, right? And this seems to be the case for a lot of transcripts. Why do these annotations end up incomplete, and why so often? Is resolving a stop codon fairly difficult?

written 20 months ago by mary.a.wood.91

Since the transcript has been retained over time, there must be enough evidence of its presence, but clearly the full sequence is lacking. That may be the case with many rare or alternative transcripts.

written 19 months ago by genomax

You're quite right genomax, and the 'CDS 3'incomplete' flag in the transcript table is also present in Ensembl GRCh37 and indicates that this information is missing. There is also protein evidence for this transcript from UniProtKB as you can see in the transcript table. You can look at the 'Supporting Evidence' section in the transcript tab to see what evidence has been used to support the transcript structure. Therefore there is evidence of a protein product, but the cDNA or EST evidence is not present to support the full length of the non-coding sections of the transcript.
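The same incompleteness is visible in the GTF itself: the example records in the question carry tag "cds_end_NF" and tag "mRNA_end_NF". As a minimal sketch, assuming a tab-separated GENCODE-style GTF whose attributes each end in a semicolon, the following lists protein-coding transcripts that have CDS records but no stop_codon record, and notes whether each one carries the cds_end_NF tag. The function names are hypothetical, not part of any library.

```python
import re
from collections import defaultdict

# Matches attributes such as: gene_id "X";  level 2;  tag "cds_end_NF";
ATTR_RE = re.compile(r'(\S+) "?([^";]+)"?;')


def parse_attributes(field):
    """Return a list of (key, value) pairs; keys such as 'tag' may repeat."""
    return ATTR_RE.findall(field)


def transcripts_without_stop(gtf_lines):
    """Map transcript_id -> True/False (has cds_end_NF tag) for
    protein-coding transcripts with CDS records but no stop_codon record."""
    features = defaultdict(set)  # transcript_id -> feature types seen
    tags = defaultdict(set)      # transcript_id -> tag values seen
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GTF record
        attrs = parse_attributes(cols[8])
        info = dict(attrs)
        if info.get("transcript_type") != "protein_coding":
            continue  # also skips gene records, which lack this key
        tid = info.get("transcript_id")
        if tid is None:
            continue
        features[tid].add(cols[2])
        tags[tid].update(value for key, value in attrs if key == "tag")
    return {
        tid: "cds_end_NF" in tags[tid]
        for tid, seen in features.items()
        if "CDS" in seen and "stop_codon" not in seen
    }
```

Passing an open file handle for the GTF as gtf_lines works, since the function only iterates over lines. Transcripts reported with False would be the ones worth a closer look, because their missing stop codon is not explained by this tag.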

written 19 months ago by Erin_Ensembl

19 months ago, Istvan Albert (University Park, USA) wrote:

By definition, the start and stop codons should be included in features named CDS:

CDS: A contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.

http://www.sequenceontology.org/browser/current_svn/term/SO:0000316

Hence the end coordinate of a CDS should indicate the last base of the stop codon.

In practice, and surprisingly enough, there are inconsistencies, and not all data sources obey the standard when naming the features. The easiest way to check is to load your GFF into IGV and then visually verify where the start and stop codons are (use the Show Translations feature on the sequence track).
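Before opening IGV, a quick arithmetic check on the CDS coordinates can narrow things down. The sketch below is only an illustration with hypothetical function names: it sums the CDS segment lengths and reports the two places a stop codon could sit, depending on whether the file includes the stop codon in the CDS (as in the SO definition quoted above) or places it immediately downstream (as I understand the GTF convention to be). It assumes 1-based inclusive coordinates and a stop codon that is not split across an intron.

```python
def cds_length(cds_intervals):
    """Total CDS length for a list of (start, end) pairs, 1-based inclusive."""
    return sum(end - start + 1 for start, end in cds_intervals)


def candidate_stop_positions(cds_intervals, strand):
    """Return genomic (start, end) pairs for the final three CDS bases and
    for the three bases immediately downstream of the CDS.

    Assumption: the stop codon is not split by an intron, so both
    candidates are contiguous on the genome.
    """
    lowest = min(start for start, _ in cds_intervals)
    highest = max(end for _, end in cds_intervals)
    if strand == "+":
        last_codon = (highest - 2, highest)
        downstream = (highest + 1, highest + 3)
    else:
        # On the minus strand the 3' end of the CDS is its lowest coordinate.
        last_codon = (lowest, lowest + 2)
        downstream = (lowest - 3, lowest - 1)
    return {"last_cds_codon": last_codon, "next_three_bases": downstream}


# CDS segments of ENST00000421783.1 from the question (minus strand)
ins_cds = [(2182015, 2182201), (2181013, 2181102)]
print(cds_length(ins_cds), cds_length(ins_cds) % 3)
print(candidate_stop_positions(ins_cds, "-"))
```

For the INS example the total CDS length is 277 bases, which leaves a remainder of 1 when divided by 3, so the CDS does not end on a codon boundary. The downstream candidate (2181010-2181012) also falls outside the annotated transcript, which starts at 2181013. Both observations are consistent with the cds_end_NF tag: the stop codon for this transcript simply is not in the annotated sequence.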

written 19 months ago by Istvan Albert (modified 19 months ago)

What I'm curious about is why so many protein-coding transcripts are poorly annotated (i.e. lack an annotated start or stop codon). Around one third of the transcripts described as protein coding in the GENCODE release 27 GTF file lack a start codon feature, a stop codon feature, or both. Why are these features so inconsistently available?

written 19 months ago by mary.a.wood.91