What's the difference between the terms CDS and ORF?
In more detail:
The region of a nucleotide sequence from the start codon (ATG) to the stop codon is called the open reading frame (ORF).
Gene finding, especially in prokaryotes, starts with a search for open reading frames (ORFs). An ORF is a stretch of DNA that begins with a start codon (usually, but not always, ATG) and ends with one of the three termination codons (TAA, TAG, TGA). Depending on the starting point, there are six possible ways (three on the forward strand and three on the complementary strand) of translating any nucleotide sequence into an amino acid sequence according to the genetic code. These are called reading frames.
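The six-frame search described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming ATG as the sole start codon and the standard stop codons; the function names (`reverse_complement`, `find_orfs`) and the `min_codons` parameter are made up for this example and are not taken from any gene-finding tool.

```python
# Minimal sketch: scan all six reading frames for ATG-to-stop ORFs.
# Assumption: ATG is the only start codon considered here.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end, orf) tuples.

    start and end are 0-based offsets on the strand being scanned;
    end is exclusive and the reported ORF includes its stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Step through the strand one codon at a time in this frame.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


for orf in find_orfs("CCATGAAATAGGG"):
    print(orf)  # ('+', 2, 2, 11, 'ATGAAATAG')
```

Real ORF finders usually also apply a minimum length of tens or hundreds of codons, because short ORFs like the one above occur by chance all the time.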
Eukaryotic gene finding is a different task altogether, because eukaryotic genes are not continuous: they are interrupted by intervening noncoding sequences called introns. Moreover, genetic information is organized differently in eukaryotes and prokaryotes.
The coding sequence (CDS) is the actual region of DNA that is translated to form protein. While an ORF found in genomic DNA may contain introns as well, the CDS refers to those nucleotides (the concatenated coding parts of the exons) that can be divided into codons which are actually translated into amino acids by the ribosomal translation machinery.
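To make the "concatenated exons" idea concrete, here is a small Python sketch that joins exon intervals from a toy gene and translates the result. The gene sequence, the exon coordinates and the helper names (`splice_cds`, `translate`) are all invented for illustration; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def splice_cds(gene, coding_exons):
    """Join the coding exon intervals (0-based, end-exclusive) into a CDS."""
    return "".join(gene[start:end] for start, end in coding_exons)


def translate(cds):
    """Translate a CDS codon by codon; '*' marks a stop codon."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


# Hypothetical toy gene: exon 1, an 11-nt intron (GT...AG), exon 2.
gene = "ATGGCC" + "GTAAGTTTCAG" + "AAATGA"
coding_exons = [(0, 6), (17, 23)]

cds = splice_cds(gene, coding_exons)
print(cds)             # ATGGCCAAATGA
print(translate(cds))  # MAK*
```

In a real annotation the exon coordinates would come from the record itself (for example a GenBank CDS feature with a `join(...)` location), not from hand-written tuples.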
Mainly: CDS means only that the sequence is known to be transcribed and, therefore, it is coding for something -- neither gene nor protein has to be known. Any full mRNA sequence (obtained from cDNA sequencing) will have a full coding sequence. ORF is usually predicted based on DNA sequence and not proven to be transcribed.
There is a lot of evidence of non-canonical translation that begins at non-AUG sense codons. For example, translation may begin at CUG, GUG or ACG (see http://www.sciencedirect.com/science/article/pii/0378111990900856). It is therefore more meaningful to define ORFs as stop to stop (rather than start to stop).
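The stop-to-stop definition is easy to sketch in Python. The function below is a hypothetical helper written for this thread, not part of any tool; it reports only stretches bounded by a stop codon on both sides and scans the forward strand only, both of which are simplifying assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_to_stop_orfs(seq, min_codons=1):
    """Return (frame, start, end, orf) for stretches between two in-frame stops.

    start and end are 0-based and end-exclusive; the flanking stop
    codons themselves are not included in the reported stretch.
    """
    seq = seq.upper()
    results = []
    for frame in range(3):
        seg_start = None  # stays None until the first stop in this frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if seg_start is not None and (i - seg_start) // 3 >= min_codons:
                    results.append((frame, seg_start, i, seq[seg_start:i]))
                seg_start = i + 3
    return results


# This toy sequence has no ATG at all, but it has an open stretch that
# begins with CTG, one of the non-AUG start codons mentioned above.
print(stop_to_stop_orfs("TAACTGGCTTGATT"))  # [(0, 3, 9, 'CTGGCT')]
```

A start-to-stop search restricted to ATG would report nothing for this sequence, which is the practical argument for the stop-to-stop definition.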
ORF (Open Reading Frame) is best seen as a hypothesis of a protein coding region. It is the stretch of DNA between a start codon and the next stop codon. It is not a hypothesis of the whole protein coding region in eukaryotes (due to introns). CDS should be the whole coding region.
Both the start and the stop 'codon' could be found purely by chance in an intergenic region that does not actually code for any protein, so not every ORF corresponds to a protein. An ORF will also be found between the actual start codon of a protein-coding gene and the next stop codon. It is quite possible that this stop codon lies in an intron, in which case the ORF includes an exon and part of an intron. Since introns are mostly just random sequence, a stop codon can easily occur there by chance. If the intron by chance does not contain a stop 'codon' (i.e. the three nucleotides TAA, TAG or TGA in the same reading frame as the exon), then the ORF will continue until it meets a stop codon: either a random one in the next intron, or the genuine stop at the end of the gene.
If the length of an intron without a stop is not a multiple of 3 nucleotides, then it will introduce a frameshift, and the next stop could easily occur within the next exon. If it is a multiple of 3, it will introduce false amino acids into the ORF as the ORF continues through the intron and into the next exon. These sorts of errors are not uncommon in gene annotation, since intron detection is complex, and if the ORF 'reads through', the intron might not be annotated until cDNA sequences are compared to the genome sequence.
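The three intron cases can be reproduced with a toy gene in Python. The exon and intron sequences below are invented for illustration and `orf_from_start` is a hypothetical helper; the sketch simply reads the unspliced genomic sequence from the true start codon until the first in-frame stop.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_from_start(seq, start=0):
    """Read codons from `start` and return the ORF through the first
    in-frame stop codon, or None if the sequence ends without a stop."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return seq[start:i + 3]
    return None


exon1, exon2 = "ATGGCC", "AAATGA"
true_cds = exon1 + exon2  # the spliced coding sequence, 4 codons

# Case 1: the intron contains an in-frame stop, so the ORF ends inside it.
with_stop = exon1 + "GTATAACAG" + exon2
print(orf_from_start(with_stop))      # ATGGCCGTATAA

# Case 2: intron length is a multiple of 3 and has no in-frame stop.
# The ORF reads through and gains three false codons (GTA CGT CAG).
multiple_of_3 = exon1 + "GTACGTCAG" + exon2
print(orf_from_start(multiple_of_3))  # ATGGCCGTACGTCAGAAATGA

# Case 3: intron length is not a multiple of 3, so exon 2 is read out
# of frame and its genuine stop codon (TGA) is never seen.
frameshift = exon1 + "GTACGTTCAG" + exon2
print(orf_from_start(frameshift))     # None
```

In none of the three cases does the ORF read from genomic DNA equal the spliced CDS, which is why ORF prediction alone is not enough for eukaryotic genes.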
If you want to see a demonstration of these ideas, try getting a sequence from GenBank for a gene that contains a 5'-UTR leader sequence, exons, introns and a 3'-UTR. The CDS will be annotated as such and will consist of exonic regions only. Take this gene sequence and run it through NCBI ORFfinder, which will outline all the potential ORFs. Some of these, but not all, will be the actual coding parts.
CDS (coding DNA sequence): only the sequence that is translated into protein.
ORF (open reading frame): the entire gene sequence, i.e. 5'-UTR + transcript (all exons + introns) + 3'-UTR.
An ORF is the part of the mRNA sequence, starting at an initiation codon (usually AUG), that terminates either at a stop codon (UAA, UAG or UGA in the standard genetic code; TAA, TAG or TGA in DNA notation) or at the end of the sequence if no stop codon is found in the same phase; the latter case means that the mRNA sequence is incomplete. Usually, the AUG codon is embedded in a longer, less well-defined sequence context (for example, the Kozak sequence in vertebrates).
I would define an open reading frame (ORF) as any stretch of nucleotide sequence from a start codon to a stop codon (whether or not it codes for protein), whereas a coding sequence (CDS) is a nucleotide sequence that is believed to code for protein. A CDS can correspond to an individual exon of a protein-coding gene or represent the complete (spliced) sequence of a protein-coding transcript.
Multiple genes can be encoded in a single reading frame in prokaryotes.
Therefore, besides intron removal, which was mentioned in another answer, this is another important difference between what actually gets transcribed (ORF) and what gets translated (CDS), and it further motivates their distinction.
Such open reading frames are called "multicistronic" and are described for example in this article: https://blog.addgene.org/plasmids-101-multicistronic-vectors That article mentions two mechanisms by which this can work:
Viruses (notably positive-sense RNA viruses) also have mechanisms that allow a single mRNA to be translated into multiple proteins; this is mentioned, for example, in this presentation about the COVID-19 virus.
An easy example from Wikipedia for understanding the difference between ORF and CDS: a sample sequence showing three different possible reading frames. Start codons are highlighted in purple, and stop codons are highlighted in red.
Can a CDS contain sequences that aren't exons? I'm asking because I found a CDS that is longer than the joined exons.
cited in http://www.bmbtrj.org/article.asp?issn=2588-9834;year=2018;volume=2;issue=3;spage=163;epage=167;aulast=Dwivedi#ft13
Is this the first citation for a biostars post?