Question: Still confused about exons versus CDS
lilla.davim (France) wrote, 5.6 years ago:

Hello,

I thought I had understood the difference between the 2 terms but I am afraid I still need a clear explanation. Is the following correct?

  • Exon: A sequence which remains present in a mature RNA.
  • CDS: A sequence which remains present in a mature RNA and codes for a protein (i.e. gets translated).

Based on these definitions, I would expect CDS to be necessarily included in exons. Now, on the UCSC online page for "Get Genomic Sequence Near Gene", I have the following (mutually exclusive) display choices:

  1. Exons in upper case, everything else in lower case
  2. CDS in upper case, UTR in lower case

I would therefore expect fewer nucleotides in upper case with option 2 than with option 1.

But if I compare the results for the 2 options on the same sequence, I observe the following:

  • A) Entire sequences in upper case in option 1 become lower case in option 2
  • B) Entire sequences in lower case in option 1 become upper case in option 2

I can understand A (part of the exons which are UTR and thus non-coding become lower case in option 2), but I don't understand at all why B also happens.

Any clue?

Thanks for your help.


Hello Adrian,

Ok so your definitions correspond to mine, i.e., CDS are included in exons.

You could be looking at something with very small Introns and large UTRs, in which case option 2 will have more lower case than option 1.

I guess option 2 should always have more (or equal) lower case bases than option 1.

Maybe the reason why you observe A and B is because your gene is not protein coding? ergo, no CDS?

But in that case, why does B happen, instead of everything simply being lower case with option 2?

modified 5.6 years ago • written 5.6 years ago by lilla.davim

B) should never happen. Can you give an example gene? That would seem to be an error in the annotation (though the UCSC annotations aren't that great, use Ensembl).
 

written 5.6 years ago by Devon Ryan

Here is an example in the 1st entry of the following fasta, which goes from lower to upper case. All other parameters are the same, only the display options are different:

With option 1:

http://genome.ucsc.edu/cgi-bin/hgc?hgsid=381409833_XLyyszThcNrKOH4dtiDua1kPTT1k&g=htcDnaNearGene&i=uc002wxs.3&c=chr20&l=30946146&r=31027122&o=knownGene&boolshad.hgSeq.promoter=0&hgSeq.promoterSize=1000&hgSeq.utrExon5=on&boolshad.hgSeq.utrExon5=0&hgSeq.cdsExon=on&boolshad.hgSeq.cdsExon=0&hgSeq.utrExon3=on&boolshad.hgSeq.utrExon3=0&hgSeq.intron=on&boolshad.hgSeq.intron=0&boolshad.hgSeq.downstream=0&hgSeq.downstreamSize=1000&hgSeq.granularity=feature&hgSeq.padding5=0&hgSeq.padding3=0&hgSeq.splitCDSUTR=on&boolshad.hgSeq.splitCDSUTR=0&hgSeq.casing=cds&boolshad.hgSeq.maskRepeats=0&hgSeq.repMasking=lower&submit=submit

With option 2:

http://genome.ucsc.edu/cgi-bin/hgc?hgsid=381409833_XLyyszThcNrKOH4dtiDua1kPTT1k&g=htcDnaNearGene&i=uc002wxs.3&c=chr20&l=30946146&r=31027122&o=knownGene&boolshad.hgSeq.promoter=0&hgSeq.promoterSize=1000&hgSeq.utrExon5=on&boolshad.hgSeq.utrExon5=0&hgSeq.cdsExon=on&boolshad.hgSeq.cdsExon=0&hgSeq.utrExon3=on&boolshad.hgSeq.utrExon3=0&hgSeq.intron=on&boolshad.hgSeq.intron=0&boolshad.hgSeq.downstream=0&hgSeq.downstreamSize=1000&hgSeq.granularity=feature&hgSeq.padding5=0&hgSeq.padding3=0&hgSeq.splitCDSUTR=on&boolshad.hgSeq.splitCDSUTR=0&hgSeq.casing=exon&boolshad.hgSeq.maskRepeats=0&hgSeq.repMasking=lower&submit=submit

 

written 5.6 years ago by lilla.davim

If you tell it to include introns and select "CDS in upper case, UTR in lower case", then the case of the introns will probably be whatever it is in the genome to begin with (upper case in the example you gave). There's no option for "CDS in upper case, everything else in lower case" as there is for exons.

modified 5.6 years ago • written 5.6 years ago by Devon Ryan
Adrian Pelin (Canada) wrote, 5.6 years ago:

Exons = gene - introns

CDS = gene - introns - UTRs

therefore also:

CDS = Exons - UTRs

Hope this helps in clarifying things. Whether your expectation holds depends on the organism you are looking at. You could be looking at something with very small Introns and large UTRs, in which case option 2 will have more lower case than option 1.

Maybe the reason why you observe A and B is because your gene is not protein coding? ergo, no CDS?
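
The three relations above can be checked with plain set arithmetic on base positions. This is only a minimal sketch in Python on a made-up toy gene: the coordinates and the names `gene`, `introns` and `utrs` are invented for illustration and do not come from any real annotation.

```python
# Toy gene of 30 bases, positions 0..29 (hypothetical coordinates).
gene = set(range(0, 30))
introns = set(range(10, 20))                    # a single intron
utrs = set(range(0, 4)) | set(range(26, 30))    # 5' UTR and 3' UTR

exons = gene - introns           # Exons = gene - introns
cds = gene - introns - utrs      # CDS = gene - introns - UTRs

# therefore also: CDS = Exons - UTRs
assert cds == exons - utrs
# and every CDS position lies inside an exon
assert cds <= exons

print(len(exons), len(cds))  # prints: 20 12
```

Because the CDS is a subset of the exons, a display that upper-cases only the CDS can never upper-case more bases than one that upper-cases all exons, as long as everything else is lower case in both.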

 

Bert Overduin (Edinburgh Genomics, The University of Edinburgh) wrote, 5.6 years ago:

Hello Lilla,

Your understanding of exons and CDS is correct.

It's just that the UCSC formatting options are confusing. Some experimenting of my own suggests that:

"Exons in upper case, everything else in lower case" means:

  • UTRs in upper case
  • CDS in upper case
  • introns in lower case

"CDS in upper case, UTR in lower case" means:

  • UTRs in lower case
  • CDS in upper case
  • introns in upper case (!!!!)
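
The behaviour observed above can be written down as a small function. This is a sketch in Python of what the experiment suggests, not UCSC's actual code. The function name `apply_casing`, the mode names and the per-base label string are all hypothetical, and the rule that introns come out in upper case in the second mode is simply taken from the observation above.

```python
def apply_casing(seq, labels, mode):
    """Return seq re-cased base by base.

    labels: string of the same length as seq, one character per base:
            'U' = UTR, 'C' = CDS, 'I' = intron (hypothetical encoding).
    mode:   'exon' -> UTR and CDS upper, introns lower
            'cds'  -> CDS upper, UTR lower, introns upper
                      (the behaviour observed in this thread)
    """
    if len(seq) != len(labels):
        raise ValueError("seq and labels must have the same length")
    if mode == "exon":
        upper = {"U", "C"}
    elif mode == "cds":
        upper = {"C", "I"}
    else:
        raise ValueError(f"unknown mode: {mode}")
    return "".join(
        base.upper() if lab in upper else base.lower()
        for base, lab in zip(seq, labels)
    )


seq = "acgtacgtacgtacgt"
labels = "UUCCCIIIICCCUUUU"   # 5' UTR, CDS, intron, CDS, 3' UTR

print(apply_casing(seq, labels, "exon"))  # ACGTAcgtaCGTACGT
print(apply_casing(seq, labels, "cds"))   # acGTACGTACGTacgt
```

In this toy example the UTR bases go from upper to lower case (observation A) and the intron bases go from lower to upper case (observation B) when switching from the first mode to the second.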

Hope this explains.

 

 


Hello,

Thanks a lot for your reply, which makes things much clearer now. The only thing that remains completely unclear is UCSC's rationale for implementing things this way! By the way, is there any other method (using UCSC, Ensembl or something else) to generate, for a given sequence in an assembly (GRCh37 or GRCh38 is fine for me), CDS in upper case and everything else in lower case?

Thanks!
 

written 5.6 years ago by lilla.davim

That is a question for the UCSC genome browser team, I'm afraid.

As for an easy way to get CDSs in upper case and the rest of the sequence in lower case, I am not aware of any. You should also keep in mind that doing this for a transcript sequence and doing it for a genomic sequence can give different results. While a transcript has only one CDS (or none, if it is non-coding), a genomic sequence can, because of alternative splicing, contain several CDSs that (partially) overlap each other.
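
To illustrate the point about genomic sequences, here is a minimal Python sketch that merges the CDS segments of several transcripts into the genomic regions that are coding in at least one of them. The transcript names and coordinates are invented, and intervals are assumed to be 0-based and half-open.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical CDS segments of two transcripts of the same gene.
cds_by_transcript = {
    "tx1": [(100, 200), (300, 400)],
    "tx2": [(150, 250), (300, 350), (500, 600)],
}

all_cds = [iv for ivs in cds_by_transcript.values() for iv in ivs]
print(merge_intervals(all_cds))  # [(100, 250), (300, 400), (500, 600)]
```

Position 220, for example, is coding in `tx2` but not in `tx1`, so "CDS in upper case" on a genomic sequence is only well defined once you either pick one transcript or decide to take the union.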

 

written 5.6 years ago by Bert Overduin

You could always just use R or biopython/bioperl. They'd take longer to get what you want, but then you would know that the output is exactly what's desired.
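
As a rough idea of what such a script could look like, here is a minimal sketch in plain Python (no Biopython is needed for this part). It assumes you already have the genomic sequence as a string and the CDS segments of one transcript as 0-based, half-open coordinates relative to that string. The function name `cds_upper_rest_lower` and the example data are made up.

```python
def cds_upper_rest_lower(seq, cds_intervals):
    """Upper-case bases inside CDS intervals, lower-case everything else."""
    out = list(seq.lower())
    for start, end in cds_intervals:
        if not 0 <= start <= end <= len(seq):
            raise ValueError(f"interval out of range: {(start, end)}")
        # Same-length slice assignment: one upper-case character per base.
        out[start:end] = seq[start:end].upper()
    return "".join(out)


seq = "ACGTACGTACGTACGTACGT"
cds = [(4, 8), (12, 16)]   # hypothetical CDS segments

print(cds_upper_rest_lower(seq, cds))  # acgtACGTacgtACGTacgt
```

In practice the intervals would come from an annotation file. GTF/GFF coordinates are 1-based and inclusive, so they would need converting (start minus 1, end unchanged) and shifting by the start of the extracted region before being passed to a function like this.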
 

written 5.6 years ago by Devon Ryan