12.2% of protein coding transcripts in GRCh38.cds.all.fa (from Ensembl db) don't start with 'ATG'. Why is this?
7.1 years ago
pwg46 ▴ 480

Please refer to the title. Are some of the CDS in Ensembl's cds file simply bad data? 

ensembl cds grch38 transcript atg • 2.7k views

It's known, but maybe it's more common than previously thought: http://en.wikipedia.org/wiki/Start_codon#Eukaryotes

7.1 years ago
Denise CS ★ 5.2k

No, it's not bad data. In addition to the non-AUG start, it could be because they are annotated as CDS 5' incomplete and the start was therefore left open. See this example. CDS 5' (or CDS 3') incomplete transcripts are manually annotated by the HAVANA team and displayed in Ensembl as part of the GENCODE gene set.
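As a rough way to look at the two explanations above, one could tabulate the first codon of the non-ATG sequences. The sketch below uses base R on a few made-up toy sequences (not Ensembl data), and the helper `is_near_cognate` is hypothetical. It flags codons that differ from ATG at exactly one position; that is only a heuristic, since a 5' incomplete CDS can begin with any codon, including a near-cognate one.

```r
# Toy CDS sequences, invented for illustration (not taken from Ensembl).
cds_toy <- c(
  tx_atg        = "ATGGCCAAGTGA",
  tx_ctg        = "CTGGCCAAGTGA",
  tx_incomplete = "GCCAAGGTTTGA",
  tx_atg2       = "ATGAAATAG"
)

# First three bases of each sequence.
first_codon <- toupper(substr(cds_toy, 1, 3))
starts_with_atg <- first_codon == "ATG"

# Hypothetical helper: TRUE when a 3-base codon differs from ATG
# at exactly one position (a "near-cognate" start codon).
is_near_cognate <- function(codon) {
  vapply(
    strsplit(codon, ""),
    function(bases) sum(bases != c("A", "T", "G")) == 1,
    logical(1)
  )
}

# Which first codons occur among the non-ATG sequences, and which of
# them are one substitution away from ATG?
non_atg <- first_codon[!starts_with_atg]
print(table(non_atg))
print(is_near_cognate(non_atg))
```

On real data, a large share of first codons that are not near-cognate would point toward 5' incomplete annotation rather than non-AUG initiation, but the annotation itself remains the authoritative source.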

7.1 years ago
Chris S. ▴ 310

You can also check the FASTA deflines in the file, specifically the status and transcript types. Restricting to protein_coding transcripts with status ensembl_havana_transcript:known, I count only 58 CDS (0.25%) that don't start with ATG. Here's some R code with details.

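library(Biostrings)  # provides readDNAStringSet()

url <- "ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz"
gz <- basename(url)  # Homo_sapiens.GRCh38.cds.all.fa.gz
download.file(url, destfile = gz, mode = "wb")
system(paste("gunzip", gz))
cds <- readDNAStringSet(sub("\\.gz$", "", gz))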

length(cds)
[1] 99436

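# status: second space-separated field of the defline, e.g. "havana:known"
status <- gsub("[^ ]+ ([^ ]+).*", "\\1", names(cds))
# atg: TRUE when the CDS begins with ATG
atg <- substr(cds, 1, 3) == "ATG"
# ttype: value of the transcript_biotype: field, e.g. "protein_coding"
ttype <- gsub(".*transcript_biotype:([^ ]+).*", "\\1", names(cds))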

table(status, atg)

                                    atg
status                               FALSE  TRUE
  ensembl_havana_transcript:known      102 22937
  ensembl_havana_transcript:novel       35  1611
  ensembl_havana_transcript:putative    37  1250
  ensembl:known                        266 10041
  ensembl:novel                         83   434
  havana:known                        4620 31920
  havana:novel                        2617  6072
  havana:putative                     4222 12728

table(atg[status=="ensembl_havana_transcript:known" & ttype=="protein_coding"])

FALSE  TRUE
   58 22511
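To make the pattern in the status table easier to see, the counts can be turned into per-status percentages. This is a minimal base R sketch that re-enters the numbers printed above by hand (the names `counts` and `pct_non_atg` are just illustrative); with the real data, `prop.table(table(status, atg), 1)` would give the same kind of summary directly.

```r
# Counts copied from the table(status, atg) output above.
counts <- matrix(
  c( 102, 22937,
      35,  1611,
      37,  1250,
     266, 10041,
      83,   434,
    4620, 31920,
    2617,  6072,
    4222, 12728),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    status = c("ensembl_havana_transcript:known",
               "ensembl_havana_transcript:novel",
               "ensembl_havana_transcript:putative",
               "ensembl:known",
               "ensembl:novel",
               "havana:known",
               "havana:novel",
               "havana:putative"),
    atg = c("FALSE", "TRUE")
  )
)

# Percentage of CDS per status that do not start with ATG.
pct_non_atg <- round(100 * counts[, "FALSE"] / rowSums(counts), 1)
print(sort(pct_non_atg, decreasing = TRUE))

# Share of all non-ATG CDS that come from havana-only annotations.
havana_only <- grepl("^havana:", rownames(counts))
print(round(100 * sum(counts[havana_only, "FALSE"]) / sum(counts[, "FALSE"]), 1))
```

The non-ATG sequences are concentrated in the havana-only rows, which fits the explanation that many of them are manually annotated transcripts with an incomplete 5' end.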
 