Question: Why Total Exon Length Of 2/3Rds Of Ucsc Knowngenes Is Non-Divisible By 3?
6
gravatar for Chronos
7.5 years ago by
Chronos570
Germany
Chronos570 wrote:

When examining the divisibility of total exon length of all UCSC knownGene transcripts (77614), 66.9% of them are not divisible, and even if I adjust for both UTRs and ignore all transcripts with cdsStart = cdsEnd (presumably non-coding), I still get 50% of non-divisible cDNAs.

Here's my script to evaluate divisibility:

#!/usr/bin/python
import fileinput
print "name", "length", "%3", "adj.", "adj%3"

# subset of columns: name, txStart, txEnd, cdsStart, cdsEnd, exonsStart, exonsEnd
for line in fileinput.input("7_columns.txt"):
    name, txs, txe, cds, cde, es, ee = line.strip("\n").split("\t")
    # exon starts and ends
    exons = es.strip(",").split(",")
    exone = ee.strip(",").split(",")

    total_length = 0

    if cds != cde:
        # adjust = min(delta(txEnd, cdsEnd), delta(txEnd, last_exon_start) +
        # min(delta(cdsStart, txStart), delta(first_exon_end, txEnd)) ; 
        # without these min()'s, applying adjust yields negative lengths
        adjust = min(int(txe) - int(cde), int(txe) - int(exons[-1])) + min(int(cds) - int(txs), int(exone[0]) - int(txe))
    else:
        # this N helps filter output for non-coding entries
        print "N",
        adjust = 0

    for e_index in range(len(exons)):
        # sum up all exon lengths
        total_length += int(exone[e_index]) - int(exons[e_index])
        # adjust for UTRs
        adjusted = total_length - adjust
    if adjusted % 3 != 0:
        print name, total_length, total_length % 3, adjusted, adjusted % 3
fileinput.close()

(I counted transcripts by piping output to grep -v N | wc -l.)

I did read http://biostar.stackexchange.com/questions/7042/are-all-exon-nucleotide-sequence-counts-divisible-by-3-in-the-human-genome first.

I'd appreciate any hints, including things to try and change in the script.

As a directly related question - how can cdsStart be within the second exon? What is the meaning of the 1st exon in this case? Here's an example (cdsStart is 3bps up from the end of 2nd exon):

select name, strand, txstart, txend, cdsstart, cdsend, exonstarts, exonends from knowngene where name='uc001aau.2';
name       | strand | txstart | txend  | cdsstart | cdsend |      exonstarts       |       exonends        
------------+--------+---------+--------+----------+--------+-----------------------+-----------------------
uc001aau.2 | +      |  323891 | 328580 |   324342 | 325605 | 323891,324287,324438, | 324060,324345,328580,
codon length exon ucsc • 2.2k views
ADD COMMENTlink modified 7.5 years ago by Dave Lunt2.0k • written 7.5 years ago by Chronos570

this would only be alarming if you could find an example where no splicing isoforms exist that are divisible by 3

ADD REPLYlink written 7.5 years ago by Jeremy Leipzig18k

Based on what Pierre has written below, it looks like UCSC transcripts themselves represent alternative splicing isoforms. How do I treat the "extra" nucleotides in non-divisible isoforms - just discard them? That doesn't seem right.

ADD REPLYlink written 7.5 years ago by Chronos570
9
gravatar for Pierre Lindenbaum
7.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

I think there is a bug in your program: the translation can start in any exon, and not necessarily in the first exon. In that case, the 5' UTR is the result of the concatenation of several exons.

edit:

using the following awk script, I got only length%3==0:

BEGIN   {
    FS="[\t]";
    }

    {
    cdsStart=int($6);
    cdsEnd=int($7);
    if(cdsStart>=cdsEnd) next;
    exonCount=int($8);
    split($9,exonStarts,"[,]");
    split($10,exonEnds,"[,]");
    len=0;

    for(i=1;i<= exonCount;i++)
        {
        if(cdsStart>=exonEnds[i])
            {
            continue;
            }
        if(cdsEnd<exonStarts[i])
            {
            continue;
            }
        beg=exonStarts[i];
        if(cdsStart>=exonStarts[i] && cdsStart<exonEnds[i])
            {
            beg=cdsStart;
            }
        end=exonEnds[i];
        if(cdsEnd>=exonStarts[i] && cdsEnd<exonEnds[i])
            {
            end=cdsEnd;
            } 
        len+=(end-beg);
        }
    printf("%s\t%d\t%d\n",$3,len,len%3);
    }

usage:

mysql -N  (...) -D hg19 -e 'select * from knownGene  ' | awk -f script.awk
ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by Pierre Lindenbaum120k

Basically you're saying that to get cDNA, I have to glue together all exons, discarding everything before cdsStart and after cdsEnd, correct? I'll try that and will report back.

If translation starts in e.g. the 2nd exon - shouldn't everything before it be considered either an intron or a UTR or both? Or do UCSC transcripts also include alternative transcripts? (This hasn't occurred to me before, I'm more used to Ensembl's gene model with sub-transcripts.)

ADD REPLYlink written 7.5 years ago by Chronos570

I've updated my answer. See above.

ADD REPLYlink written 7.5 years ago by Pierre Lindenbaum120k

ok, I think the notion of all transcripts also including alternative splicing isoforms helps a lot - thanks! Now the only question is what I do with any "extra" nucleotides beyond 3*n - if there any after the recalculation.

ADD REPLYlink written 7.5 years ago by Chronos570

Thanks for the code!

ADD REPLYlink written 7.5 years ago by Chronos570

Thanks again, your interpretation does make all cDNAs divisible.

ADD REPLYlink written 7.5 years ago by Chronos570
2
gravatar for Dave Lunt
7.5 years ago by
Dave Lunt2.0k
Hull, UK
Dave Lunt2.0k wrote:

Hi chronos

it might also be worth thinking about the nature of exons, cDNA and CDS. Some of the difficulty might arise from the terminology? When you say "Basically you're saying that to get cDNA, I have to glue together all exons, discarding everything before cdsStart and after cdsEnd, correct?" I think you might mean CDS not cDNA?

Exons can occur in the 5'UTR, the coding region (CDS), or the 3'UTR. All these three regions are found in the mRNA, and therefore the cDNA, but only the CDS is translated into protein in the cell. (I'm assuming that you have filtered to deal only with protein-coding transcripts, there can be a number of RNA-encoding genes with exons in the genome too). You first need to find which exons are found exclusively within the CDS and which aren't, N.B. exons can also span the boundary between the CDS and either UTR. The sum of the intra-CDS exons should be divisable by 3 (although I guess there could be some weird exceptions with RNA-editing and the like), but if you include UTR-exons there is no reason your total should be divisible by 3 since they don't code for a protein.

A single 5'UTR exons is therefore why CDS-start can be in the second exon.

I am not sure how many exons in human map to each of these 3 regions but in zebrafish it is 2% 5'UTR, 0.5% 3'UTR and 97.5% CDS.

It sounds like you might have script issues sorted out now, but I thought this description might help (a bit) too.

ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by Dave Lunt2.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1617 users visited in the last hour