Question

12.2% of protein coding transcripts in GRCh38.cds.all.fa (from Ensembl db) don't start with 'ATG'. Why is this?


9.4 years ago

pwg46 ▴ 540

Please refer to the title. Are some of the CDS in Ensembl's cds file simply bad data?

grch38 cds transcript ensembl • 3.7k views



It's known, but maybe it's more common than previously thought: http://en.wikipedia.org/wiki/Start_codon#Eukaryotes

ADD REPLY • link 9.4 years ago by Ying W ★ 4.2k

Ram · Answer 1 · 2014-11-18


9.4 years ago

Denise CS ★ 5.2k

No, it's not bad data. In addition to genuine non-AUG starts, it could be because the transcripts are annotated as CDS 5' incomplete, so the start was left open. See this example. CDS 5' (or CDS 3') incomplete transcripts are manually annotated by the HAVANA team and displayed in Ensembl as part of the GENCODE gene set.
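As a rough way to tell the two cases apart, here is a minimal sketch in base R that sorts first codons into ATG, near-cognate (one base away from ATG, the usual candidates for non-AUG initiation) and everything else. The sequences and the function name `classify_start` are invented for illustration, and the classification is only a heuristic: a first codon alone cannot prove that a CDS is 5' incomplete.

```r
# Toy sequences, made up for illustration (not taken from the Ensembl file)
cds_toy <- c(tx1 = "ATGGCCTAA", tx2 = "CTGGCCTAA",
             tx3 = "GCCAAATAA", tx4 = "ATGAAATGA")

# Hypothetical helper: label each sequence by how far its first codon is from ATG
classify_start <- function(seqs) {
  first <- substr(seqs, 1, 3)
  mismatches <- vapply(strsplit(first, ""), function(b) {
    if (length(b) < 3) return(3L)      # too short to hold a codon
    sum(b != c("A", "T", "G"))         # positions differing from ATG
  }, integer(1))
  out <- ifelse(mismatches == 0, "ATG",
         ifelse(mismatches == 1, "near-cognate", "other"))
  names(out) <- names(seqs)
  out
}

start_class <- classify_start(cds_toy)
table(start_class)
```

With real data, the same function could be applied to `as.character()` of the sequences read from the FASTA file. Entries in the "other" class are the more likely candidates for an open 5' end, but that would still need checking against the annotation.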


Ram · Answer 2 · 2014-11-18

You can also check the FASTA deflines in the file, specifically the status and transcript types. Restricting to ensembl_havana_transcript:known, I count 58 protein-coding CDS (0.25%) that don't start with ATG. Here's some R code with details.

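library(Biostrings)  # provides readDNAStringSet

url <- "ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz"
destfile <- basename(url)
download.file(url, destfile, mode = "wb")
# readDNAStringSet reads gzip-compressed FASTA directly, so no gunzip step is needed
cds <- readDNAStringSet(destfile)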

length(cds)
[1] 99436

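# second space-delimited field of the defline, e.g. "havana:known"
status <- gsub("[^ ]+ ([^ ]+).*", "\\1", names(cds))
# TRUE when the CDS begins with ATG
atg <- substr(cds, 1, 3) == "ATG"
# value following "transcript_biotype:" in the defline
ttype <- gsub(".*transcript_biotype:([^ ]+).*", "\\1", names(cds))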

table(status, atg)

                                    atg
status                               FALSE  TRUE
  ensembl_havana_transcript:known      102 22937
  ensembl_havana_transcript:novel       35  1611
  ensembl_havana_transcript:putative    37  1250
  ensembl:known                        266 10041
  ensembl:novel                         83   434
  havana:known                        4620 31920
  havana:novel                        2617  6072
  havana:putative                     4222 12728

table(atg[status=="ensembl_havana_transcript:known" & ttype=="protein_coding"])

FALSE  TRUE
   58 22511
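To make the pattern in the status table easier to read, here is a small sketch that turns the counts into the percentage of non-ATG CDS per status. The numbers are copied by hand from the table above so the snippet runs without the FASTA file. The object names are only illustrative, and the overall figure covers just the rows shown in that table.

```r
# Counts copied from the table(status, atg) output above
counts <- matrix(
  c( 102, 22937,
      35,  1611,
      37,  1250,
     266, 10041,
      83,   434,
    4620, 31920,
    2617,  6072,
    4222, 12728),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    status = c("ensembl_havana_transcript:known",
               "ensembl_havana_transcript:novel",
               "ensembl_havana_transcript:putative",
               "ensembl:known", "ensembl:novel",
               "havana:known", "havana:novel", "havana:putative"),
    atg = c("FALSE", "TRUE")))

# Row-wise percentage of CDS that do not start with ATG
pct_non_atg <- round(100 * prop.table(counts, margin = 1)[, "FALSE"], 1)
print(sort(pct_non_atg, decreasing = TRUE))

# Overall percentage across the rows listed above
overall <- 100 * sum(counts[, "FALSE"]) / sum(counts)
print(round(overall, 1))
```

Sorted this way, the havana-only statuses sit at the top and the merged ensembl_havana_transcript entries at the bottom, which fits the explanation that manually annotated, partial models account for most of the non-ATG starts.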