Obtaining only coding exons from UCSC table browser
8.8 years ago

Hi all. I'm having trouble producing a file that contains only coding exons, with the UTRs excluded. I've obtained a file from the UCSC Table Browser that looks something like this:

#name    cdsStart    cdsEnd    exonCount    exonStarts    exonEnds
NM_017436    43088895    43089957    3    43088126,43091496,43116802,    43090003,43091637,43116876,
NM_001173466    53701272    53715249    15    53701239,53701628,53701835,53702065,53702218,53702508,53702743,53702940,53703384,53708081,53708877,53709118,53709510,53714348,53715126,    53701497,53701713,53701917,53702133,53702312,53702599,53702804,53703065,53703505,53708225,53708924,53709210,53709566,53714476,53715412,

I have the cdsStart and cdsEnd, but what I want is to incorporate them into exonStarts and exonEnds so I can use this file for further analysis. For example, this is what I would want my output to look like:

#name    cdsStart    cdsEnd    exonCount    exonStarts    exonEnds
NM_017436    43088895    43089957    3    43088895,    43089957,

For this example, cdsStart and cdsEnd both fall within the first exon, so I only want that exon to appear in my file. Is there an easy way to carry this out from the Table Browser, or do I need to modify the file? If so, any suggestions on how to do that?

Thank you!

If you select the gene track of interest, ask for a particular region (say chr21:33031597-33041570), request the output in BED format, and on the next page check "exons plus 0 bases at each end", you'll end up with this format, which may be what you're looking for:

chr21   33031934    33032154    NM_000454_exon_0_0_chr21_33031935_f 0   +
chr21   33036102    33036199    NM_000454_exon_1_0_chr21_33036103_f 0   +
chr21   33038761    33038831    NM_000454_exon_2_0_chr21_33038762_f 0   +
chr21   33039570    33039688    NM_000454_exon_3_0_chr21_33039571_f 0   +
chr21   33040783    33041243    NM_000454_exon_4_0_chr21_33040784_f 0   +

This is the best solution if you don't need to bulk process data using MySQL.

Does this give you every individual exon for the region? Are they only coding exons?

There's an option to select "coding exons" instead of "exons plus X bases at each end".

Perfect. Thank you!

Ram 43k

Read each line into an object with exonStarts and exonEnds as arrays. Replace the first element of the exonStarts array with the cdsStart value and the last element of the exonEnds array with the cdsEnd value.
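
A rough sketch of that idea in Python, assuming whitespace-separated columns in the order shown in the question. The names `Transcript`, `parse_row`, and `trim_outer_exons` are made up for illustration, and the trimming step is only valid when the CDS starts in the first exon and ends in the last one.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    name: str
    cds_start: int
    cds_end: int
    exon_starts: list
    exon_ends: list


def parse_row(line):
    """Parse one table row; coordinate columns look like '1,2,3,'."""
    name, cds_start, cds_end, _exon_count, starts, ends = line.split()
    return Transcript(
        name,
        int(cds_start),
        int(cds_end),
        [int(x) for x in starts.split(",") if x],  # drop the empty trailing item
        [int(x) for x in ends.split(",") if x],
    )


def trim_outer_exons(tx):
    """Replace the first exon start and last exon end with the CDS bounds.

    Assumes the CDS begins in the first exon and ends in the last exon.
    """
    tx.exon_starts[0] = tx.cds_start
    tx.exon_ends[-1] = tx.cds_end
    return tx
```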

What's curious is that this was a program one of my friends had to write as part of an interview. Is that the case with you as well?

I've been able to switch the first and last values like you proposed, but the problem is that cdsStart and cdsEnd are not necessarily in the first or last exons.

Haha, not for an interview. It's just an intermediate step for further analysis.

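Oh, I forgot that cases exist where entire exons can be UTRs. To address the worst-case scenario, you can process all exons.

Compare each exon to cdsStart and cdsEnd. If exonStarts[i] > cdsStart, then cdsStart falls in the previous exon, so set exonStarts[i-1] = cdsStart. You will encounter this first, so stop checking for cdsStart after you assign it to the right exonStart (maybe set a flag). If no exon start exceeds cdsStart, it belongs to the last exon. Similarly, if exonEnds[i] >= cdsEnd, set exonEnds[i] = cdsEnd and exit the loop. Exons before the one that received cdsStart, or after the one that received cdsEnd, are entirely UTR and can be dropped.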

This will work as long as you're processing both exonStarts and exonEnds arrays simultaneously.
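
A different way to reach the same result is to clip every exon to the interval between cdsStart and cdsEnd and discard exons that fall entirely outside it. This is plain interval clipping, not the flag-based loop described above. The sketch assumes whitespace-separated columns in the order shown in the question; the function names are made up for illustration. Note that it also updates exonCount to the number of exons kept, which the example output in the question does not do.

```python
def clip_exons_to_cds(cds_start, cds_end, exon_starts, exon_ends):
    """Return (starts, ends) for the exon portions that lie inside the CDS."""
    starts, ends = [], []
    if cds_start >= cds_end:
        # Empty CDS (e.g. a non-coding transcript): nothing to keep.
        return starts, ends
    for start, end in zip(exon_starts, exon_ends):
        # Skip exons that lie entirely in the UTR on either side.
        if end <= cds_start or start >= cds_end:
            continue
        starts.append(max(start, cds_start))
        ends.append(min(end, cds_end))
    return starts, ends


def parse_coords(field):
    """Turn a '1,2,3,' field into a list of ints."""
    return [int(x) for x in field.split(",") if x]


def format_coords(values):
    """Turn a list of ints back into the '1,2,3,' form."""
    return "".join(f"{v}," for v in values)


def coding_only(line):
    """Rewrite one table row so that it keeps only the coding exon portions."""
    name, cds_start, cds_end, _count, starts, ends = line.split()
    new_starts, new_ends = clip_exons_to_cds(
        int(cds_start), int(cds_end), parse_coords(starts), parse_coords(ends)
    )
    return "\t".join([
        name, cds_start, cds_end, str(len(new_starts)),
        format_coords(new_starts), format_coords(new_ends),
    ])
```

Applied to the NM_017436 row from the question, this keeps a single exon running from 43088895 to 43089957, since the other two exons start after cdsEnd.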
