3
4
10.5 years ago
Neeraj ▴ 150

Hi,

Can anyone help me download the exon coordinates of all the genes in the human genome (hg19)?

neeraj

exon coordinates hg human • 22k views
2

Mostly, although that question asked for a Bioperl solution.

20
10.5 years ago

Using mysql:

mysql -u anonymous -h ensembldb.ensembl.org -P 5306 -D homo_sapiens_core_61_37f -A \
-e 'select S.stable_id,R.name,E.seq_region_start,E.seq_region_end,E.seq_region_strand from exon as E,seq_region as R,exon_stable_id as S where R.seq_region_id=E.seq_region_id and S.exon_id=E.exon_id'
+-----------------+------+------------------+----------------+-------------------+
| stable_id       | name | seq_region_start | seq_region_end | seq_region_strand |
+-----------------+------+------------------+----------------+-------------------+
| ENSE00002029850 | 5    |         94120533 |       94120602 |                -1 |
| ENSE00002069321 | 4    |         17835922 |       17836146 |                 1 |
| ENSE00002048418 | 5    |        123731640 |      123731794 |                -1 |
| ENSE00001815244 | 6    |         13711167 |       13711796 |                -1 |
| ENSE00001363151 | 2    |          1507720 |        1507851 |                 1 |
| ENSE00001737796 | 1    |         40537122 |       40537924 |                 1 |
| ENSE00001800436 | 2    |        165208630 |      165208733 |                -1 |
| ENSE00001255746 | 10   |         93683822 |       93683847 |                 1 |
| ENSE00001844789 | 14   |         74523609 |       74523683 |                 1 |
| ENSE00002137765 | 8    |         17541844 |       17542051 |                -1 |
(...)


Using UCSC & awk:

curl  -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" | gunzip -c |\
awk '{n=int($8); split($9,S,/,/);split($10,E,/,/); for(i=1;i<=n;++i) {printf("%s,%s,%s,%s,%s\n",$1,$2,$3,S[i],E[i]);} }'
uc001aaa.3,chr1,+,11873,12227
uc001aaa.3,chr1,+,12612,12721
uc001aaa.3,chr1,+,13220,14409
uc010nxq.1,chr1,+,11873,12227
uc010nxq.1,chr1,+,12594,12721
uc010nxq.1,chr1,+,13402,14409
uc010nxr.1,chr1,+,11873,12227
uc010nxr.1,chr1,+,12645,12697
uc010nxr.1,chr1,+,13220,14409
uc009vis.2,chr1,-,14362,14829

1

split($9,S,/,/) splits column $9 (a comma-separated list of exonStarts) and puts the pieces into the array S. split($10,E,/,/) does the same for column $10 (a comma-separated list of exonEnds) and puts the pieces into the array E.

0

@Pierre, can you explain the awk command a little? It's hard for me to follow (but I already upvoted anyway).

0

And $8 is the number of exons.
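
For anyone else following along, here is the same awk program spread over several lines with comments. It is only a sketch of the one-liner from the answer, not a replacement for it. The function name explode_exons is made up for illustration, and the sample record is reconstructed from the uc001aaa.3 output shown in the answer: columns 4 to 7 (txStart, txEnd, cdsStart, cdsEnd) are placeholders I filled in, not values copied from the real file.

```shell
# Hypothetical helper: read knownGene-style rows on stdin, print one line per exon.
explode_exons() {
  awk '{
    n = int($8)           # column 8: exonCount
    split($9,  S, ",")    # column 9: exonStarts, comma-separated, into array S
    split($10, E, ",")    # column 10: exonEnds, comma-separated, into array E
    for (i = 1; i <= n; ++i) {
      # transcript name, chromosome, strand, exon start, exon end
      printf("%s,%s,%s,%s,%s\n", $1, $2, $3, S[i], E[i])
    }
  }'
}

# Sample row (columns 4-7 are placeholder values, see the note above).
printf 'uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t11873,12612,13220,\t12227,12721,14409,\n' |
  explode_exons
```

The trailing comma in columns 9 and 10 gives split() one extra empty element, which is why the loop runs to exonCount rather than to the length of the array. Also keep in mind that UCSC tables store exon starts 0-based, so they are one less than the 1-based starts reported by Ensembl.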

0

Thanks Pierre, that helps.

0

I have an issue with this: if I want only the exons of the main isoform, how can I extract them? In this file the same exon appears under several transcripts, for example uc001aaa.3,chr1,+,11873,12227 and uc010nxq.1,chr1,+,11873,12227.
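
One partial workaround, offered as a sketch only: if the repeated coordinates are the problem, the per-exon output can be collapsed so that each chrom/strand/start/end combination is printed once, whichever transcript it came from. This does not select a main isoform; it only removes the repeats. The function name unique_exons is made up for illustration, and the input is assumed to be the comma-separated output of the awk command in the answer above.

```shell
# Hypothetical helper: drop the transcript ID (column 1) and keep the first
# occurrence of each chrom,strand,start,end combination.
unique_exons() {
  awk -F ',' '!seen[$2 FS $3 FS $4 FS $5]++ { print $2 "," $3 "," $4 "," $5 }'
}

# Four lines taken from the output shown in the answer.
printf '%s\n' \
  'uc001aaa.3,chr1,+,11873,12227' \
  'uc001aaa.3,chr1,+,12612,12721' \
  'uc010nxq.1,chr1,+,11873,12227' \
  'uc010nxq.1,chr1,+,12594,12721' | unique_exons
```

The seen[] array remembers each coordinate key, so the second chr1,+,11873,12227 line is skipped while the two exons that differ only in their start (12612 and 12594) are both kept.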

5
10.5 years ago

Several ways of doing this have been mentioned previously. The one I like most, because of its simplicity, is BioMart: select "martview", choose the latest "Ensembl genes" database and the latest human dataset, and then select the attributes you need in the "structures" section (there you will find an "exon" subsection with "Exon Chr Start (bp)" and "Exon Chr End (bp)"), without applying any filter at all.

0

Thanks Jorge, it really helps me. Thanks a lot.

0

Why is the result from BioMart different from the result from the Ensembl API?

3
10.5 years ago
Neilfws 49k

Search BioStar and you will find a number of solutions to this problem. Mostly they use BioMart, as outlined by Jorge, or the UCSC genome browser database tables, as described in the answer pointed to by Pierre.

One point: which set of exon coordinates do you want? There are several, depending on the gene prediction method used - e.g. UCSC, RefSeq or Ensembl transcripts.

If you start from UCSC tables, select "mammal, human, hg19" and then "Genes and Gene Prediction tracks" under "group", you will see the various gene models. Select one of those and choose "describe table schema" to see how exons are stored. Then go back to the main tables page, from where you should be able to download the exon data. This can also be done programmatically or through a SQL query to the UCSC MySQL server.
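
As a rough sketch of the SQL route, something like the following should work against the public UCSC MySQL server, assuming a mysql client and network access. The helper names exon_sql and fetch_exons are made up for illustration. The host name genome-mysql.soe.ucsc.edu and the user genome are what I understand the public server to use, so check the UCSC documentation if the connection fails. The column names are those of the knownGene table; other gene-model tables may differ, which is what "describe table schema" will tell you.

```shell
# Hypothetical helper: build the query for a genePred-style table.
# $1 = table name (e.g. knownGene), $2 = optional row limit (default 5).
exon_sql() {
  printf 'SELECT name, chrom, strand, exonCount, exonStarts, exonEnds FROM %s LIMIT %d' \
    "$1" "${2:-5}"
}

# Hypothetical helper: run the query on the public UCSC server (needs network).
fetch_exons() {
  mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -B -N \
    -e "$(exon_sql "$@")"
}

# Show the SQL that would be sent; uncomment the last line to actually run it.
exon_sql knownGene 3; echo
# fetch_exons knownGene 3
```

Remove the LIMIT clause to fetch the whole table. The exonStarts and exonEnds columns come back as comma-separated lists, one row per transcript, so they still need to be split into one line per exon afterwards.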