Question

Retrieve Coding Exons In Ucsc Table Browser

6

13.2 years ago

User 6659 ▴ 970

Hi

I'm sorry for the basic question but I am confused about the UCSC table browser and would really appreciate some clarification.

I believe the UCSC table browser is based on an underlying genomic database that is conceptually analogous to, say, the Ensembl database, in that it contains tables of information about genomic features.

I believe tables that have positional information can be displayed as tracks on the genome browser. These are the main tables. I believe that other tables that contain descriptive information about the main tables are auxiliary tables. Both types of table can be downloaded as text files from the browser, but information in auxiliary tables is obtained by linking from a main table.

I don't understand how the database handles basic biological models, such as the fact that a gene has many exons which can be arranged in different combinations to give different transcripts. For example, in the knownGene table I was able to find a list of exons for a gene, but I didn't know how to find the transcripts for a gene. The best I could find was a table called all_mrna, which seemed to link back to the RefSeq gene table by a field called qname. But I couldn't tell from looking at the mRNA which of the gene's exons were in the mRNA.

But if you want to download the sequence for the known genes table, you can get just the coding exons, so how does the browser know which exons are transcribed?

I don't understand how this is done. How would I use the table browser to get table data for all coding exons in a format such as chr, start, end? An example of an SQL query to achieve this against the underlying database would be much appreciated.

thanks a lot

ucsc exon • 18k views

ADD COMMENT • link updated 13.1 years ago by Paulo Nuin ★ 3.7k • written 13.2 years ago by User 6659 ▴ 970

0

Can anyone provide the sql query?

ADD REPLY • link 13.1 years ago by User 6659 ▴ 970

score 12 · Answer 1 · 2011-03-05

13.2 years ago

Sequencegeek ▴ 740

Use knownGene.txt to get transcripts with exon annotations

To get only the coding exons, I usually parse the knownGene.txt table. For each transcript it includes the exon starts, exon ends, coding start, and coding end coordinates (0-based start, 1-based end). Also, the gene ID is referenced for every row (transcript).

There will be a one to many relationship between gene ID and transcript ID, but to get all coding exons you don't need to worry about which genes each transcript belongs to.

How to get the coding exons

The coding exons are the (transcript exons - UTRs). Note that the UTR can span multiple exons. So for each transcript you need to know which exon the 5'/3' UTR is within and truncate it. Here's an example (Note this will look a little different than UCSC):

transcript Name: uc002quc.3
strand: +
TSS: 57058
TSE: 68122
coding Start: 57248
coding End: 67717
Number of exons: 8
Exon Starts: 57058,57487,58389,59553,62435,66729,67295,67447,
Exon Ends: 57288,57523,58674,59847,62486,66834,67348,68122,
Gene Name: KIR2DL5B
5' Coding Stat: na
3' Coding Stat: na

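The 5' coding start is within the first exon (57058 < 57248 < 57288). So if you replace the exon start "57058" with "57248", and likewise replace the end of the last exon "68122" with the coding end "67717" (67447 < 67717 < 68122), you will be left with only the coding exons. Any exon that lies entirely within a UTR should be dropped altogether.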
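The truncation described above can be sketched in a few lines of Python. This is only a minimal illustration, not UCSC's own code: the function name `coding_exons` is made up for the example, and it assumes the exon lists have already been split into integers. The coordinates are the ones from the uc002quc.3 example.

```python
def coding_exons(exon_starts, exon_ends, cds_start, cds_end):
    """Clip each exon to the coding interval and drop exons lying entirely in a UTR.

    Coordinates are treated as half-open intervals (0-based start, 1-based end).
    """
    if cds_start >= cds_end:
        # No coding region annotated for this transcript.
        return []
    clipped = []
    for start, end in zip(exon_starts, exon_ends):
        new_start = max(start, cds_start)
        new_end = min(end, cds_end)
        if new_start < new_end:
            # The exon overlaps the coding region, so keep the overlapping part.
            clipped.append((new_start, new_end))
    return clipped


# Values copied from the uc002quc.3 example.
starts = [57058, 57487, 58389, 59553, 62435, 66729, 67295, 67447]
ends = [57288, 57523, 58674, 59847, 62486, 66834, 67348, 68122]
result = coding_exons(starts, ends, 57248, 67717)

for start, end in result:
    print(start, end)
```

For this transcript only the first start and the last end change; the six internal exons are already fully coding.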

Odd Scenarios

Often the 3'UTR or 5'UTR will be unannotated, or the transcript could be non-coding.

If it is unannotated, the coding start and/or coding end will be the same as the transcription start site.

I'm not sure of the best way to filter these out, but I personally cross-reference the gene IDs to another UCSC table that has coding/non-coding information and mark them as such. You don't want exons in your coding list if they aren't translated (non-coding gene).
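One simple first-pass filter, short of cross-referencing another table, is to drop transcripts whose coding interval is empty. The sketch below assumes that a transcript with no annotated coding region has its coding start equal to its coding end (both set to the same transcript boundary, as described above). The helper name `has_coding_region` and the two rows are hypothetical and only there for illustration.

```python
def has_coding_region(cds_start, cds_end):
    """Return True when the coding interval is non-empty.

    Assumption: transcripts without an annotated CDS have cds_start == cds_end.
    """
    return cds_start < cds_end


# Hypothetical rows, not real UCSC records.
transcripts = [
    {"name": "tx_coding", "txStart": 1000, "txEnd": 5000, "cdsStart": 1200, "cdsEnd": 4800},
    {"name": "tx_noncoding", "txStart": 7000, "txEnd": 9000, "cdsStart": 7000, "cdsEnd": 7000},
]

coding_only = [t for t in transcripts if has_coding_region(t["cdsStart"], t["cdsEnd"])]
print([t["name"] for t in coding_only])
```

This will not catch a transcript where only one UTR is unannotated, so cross-referencing a table with coding/non-coding information is still the more thorough check.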

Best of Luck.

EDIT: The gene ID is NOT referenced for every row. I'm honestly not sure how to construct genes easily via UCSC; I usually collect transcripts from every database and make a custom gene set.

ADD COMMENT • link 13.2 years ago by Sequencegeek ▴ 740

0

Thanks for your reply. What do you mean that there is a CDS start/end for each transcript? Do you mean there are multiple rows per gene, one per transcript? Also, what do you mean that the 3'UTR or 5'UTR is unannotated? Where are they annotated? If they are unannotated, do you mean that the CDS start and end are blank because they are the same as the tx start and end?

ADD REPLY • link 13.2 years ago by User 6659 ▴ 970

0

Every row in knownGene is a transcript. So, yes, a gene can be thought of as being made up of multiple rows (which ones is not a trivial question). The 6th and 7th columns of the knownGene table contain the coding start and end positions. These matter because a transcript has untranslated regions (5' and 3'): EXON DOES NOT EQUAL CODING. When DNA is transcribed into RNA, the RNA is spliced and the regions denoted by exons are ligated into the mRNA. But this mRNA still contains the 3' and 5' UTRs. So to get the CODING exons you must know the positions where coding starts and stops (6th and 7th columns).
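To make the column layout concrete, here is a rough Python sketch that reads one tab-separated knownGene.txt row and prints its coding exons as chr, start, end lines. It assumes the first ten columns are name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds. The helper names are made up, and the chromosome value `chrN` is a placeholder because the example above does not give one.

```python
KNOWN_GENE_COLUMNS = [
    "name", "chrom", "strand", "txStart", "txEnd",
    "cdsStart", "cdsEnd", "exonCount", "exonStarts", "exonEnds",
]


def parse_row(line):
    """Turn one tab-separated knownGene row into a dict with typed fields."""
    fields = line.rstrip("\n").split("\t")
    row = dict(zip(KNOWN_GENE_COLUMNS, fields))  # extra trailing columns are ignored
    for key in ("txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount"):
        row[key] = int(row[key])
    for key in ("exonStarts", "exonEnds"):
        # The lists are comma-separated with a trailing comma, so skip empty pieces.
        row[key] = [int(x) for x in row[key].split(",") if x]
    return row


def coding_bed_lines(row):
    """Return 'chrom<TAB>start<TAB>end' lines for the coding part of each exon."""
    lines = []
    for start, end in zip(row["exonStarts"], row["exonEnds"]):
        s = max(start, row["cdsStart"])
        e = min(end, row["cdsEnd"])
        if s < e:
            lines.append(f"{row['chrom']}\t{s}\t{e}")
    return lines


# Row rebuilt from the uc002quc.3 example; "chrN" is a placeholder chromosome.
example = "\t".join([
    "uc002quc.3", "chrN", "+", "57058", "68122", "57248", "67717", "8",
    "57058,57487,58389,59553,62435,66729,67295,67447,",
    "57288,57523,58674,59847,62486,66834,67348,68122,",
])

row = parse_row(example)
bed = coding_bed_lines(row)
print("\n".join(bed))
```

Looping `parse_row` over every line of the file and concatenating the output would give a chr, start, end listing of all coding exons, with exons shared between transcripts appearing more than once unless duplicates are removed.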

ADD REPLY • link 13.2 years ago by Sequencegeek ▴ 740

0

Re: unannotated. In essence they are unknown, but instead of being left blank the value is set equal to the transcription start site. I'm not sure why this is...

I just mentioned it because it has caused me problems in the past...

ADD REPLY • link 13.2 years ago by Sequencegeek ▴ 740

0

that is a useful catch to know! thanks a lot

ADD REPLY • link 13.2 years ago by User 6659 ▴ 970

0

I've just spotted your edit. What do you mean the gene ID isn't referenced for every row? I've just downloaded genes from a human chromosome and every row has a gene ID. Do you mean not every gene has coding/non-coding info in other tables?

ADD REPLY • link 13.2 years ago by User 6659 ▴ 970

0

Sorry for the late reply. UCSC apparently doesn't have concrete gene IDs, but you can use a separate table to cross-reference the "ucx009.id"-type ID to a standard gene symbol. I can't seem to find that table right now, though...

ADD REPLY • link 13.1 years ago by Sequencegeek ▴ 740

score 2 · Answer 2 · 2011-03-06

13.1 years ago

Paulo Nuin ★ 3.7k

BioMart is your best option: simpler, cleaner and faster than UCSC, even though it is based on Perl.

ADD COMMENT • link 13.1 years ago by Paulo Nuin ★ 3.7k

0

Hi - I'm not planning on using UCSC. I was just investigating it. It seemed remiss not to understand how a tool as famous as this works.

ADD REPLY • link 13.1 years ago by User 6659 ▴ 970