Exon Coordinates Of Hg19 Genome Download
13.1 years ago
Neeraj ▴ 150

Hi,

Can anyone help me download the exon coordinates of all the genes in the human genome (hg19)?

neeraj

exon coordinates hg human • 25k views

Mostly, although that question asked for a Bioperl solution.

13.1 years ago

Using mysql:

mysql -u anonymous -h ensembldb.ensembl.org -P 5306 -D homo_sapiens_core_61_37f -A \
 -e 'select S.stable_id,R.name,E.seq_region_start,E.seq_region_end,E.seq_region_strand from exon as E,seq_region as R,exon_stable_id as S where R.seq_region_id=E.seq_region_id and S.exon_id=E.exon_id'
+-----------------+------+------------------+----------------+-------------------+
| stable_id       | name | seq_region_start | seq_region_end | seq_region_strand |
+-----------------+------+------------------+----------------+-------------------+
| ENSE00002029850 | 5    |         94120533 |       94120602 |                -1 | 
| ENSE00002069321 | 4    |         17835922 |       17836146 |                 1 | 
| ENSE00002048418 | 5    |        123731640 |      123731794 |                -1 | 
| ENSE00001815244 | 6    |         13711167 |       13711796 |                -1 | 
| ENSE00001363151 | 2    |          1507720 |        1507851 |                 1 | 
| ENSE00001737796 | 1    |         40537122 |       40537924 |                 1 | 
| ENSE00001800436 | 2    |        165208630 |      165208733 |                -1 | 
| ENSE00001255746 | 10   |         93683822 |       93683847 |                 1 | 
| ENSE00001844789 | 14   |         74523609 |       74523683 |                 1 | 
| ENSE00002137765 | 8    |         17541844 |       17542051 |                -1 | 
(...)
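
The query output above follows Ensembl conventions: chromosome names without a "chr" prefix, 1-based inclusive coordinates, and strand written as 1 or -1. If the result needs to sit next to UCSC-style data (0-based start, "chr" prefix, +/- strand), a conversion along these lines may help. This is only a sketch, assuming the query was run with mysql's -N and -B options so the output is tab-separated with no header; the function name ensembl_to_bed is made up for illustration.

```shell
# Hypothetical helper: convert tab-separated rows of
#   stable_id  name  seq_region_start  seq_region_end  seq_region_strand
# into BED6 lines (0-based start, end unchanged).
ensembl_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    strand = ($5 == 1) ? "+" : "-"
    # Only the start is shifted by one. The "chr" prefix is naive:
    # MT and scaffold names are not remapped to UCSC names.
    print "chr" $2, $3 - 1, $4, $1, 0, strand
  }'
}

# Example with two rows taken from the output above.
printf 'ENSE00002029850\t5\t94120533\t94120602\t-1\nENSE00002069321\t4\t17835922\t17836146\t1\n' | ensembl_to_bed
```

The score column is filled with 0 because the exon table carries no score.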

Using UCSC & awk:

curl  -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" | gunzip -c |\
 awk '{n=int($8); split($9,S,/,/);split($10,E,/,/); for(i=1;i<=n;++i) {printf("%s,%s,%s,%s,%s\n",$1,$2,$3,S[i],E[i]);} }' 
uc001aaa.3,chr1,+,11873,12227
uc001aaa.3,chr1,+,12612,12721
uc001aaa.3,chr1,+,13220,14409
uc010nxq.1,chr1,+,11873,12227
uc010nxq.1,chr1,+,12594,12721
uc010nxq.1,chr1,+,13402,14409
uc010nxr.1,chr1,+,11873,12227
uc010nxr.1,chr1,+,12645,12697
uc010nxr.1,chr1,+,13220,14409
uc009vis.2,chr1,-,14362,14829

split($9,S,/,/) = split the column $9 ($9 is a comma-separated list of exonStarts) and put the result into the variable S. split($10,E,/,/) = split the column $10 ($10 is a comma-separated list of exonEnds) and put the result into the variable E.


@Pierre, can you explain the awk command a little? It's hard for me to follow (but I already upvoted anyway).


and $8 is the number of exons
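
To see what the two split calls and $8 do, the loop can be run on a single record. A minimal sketch: the record below is made up for illustration and laid out in knownGene column order (exon count in column 8, comma-separated starts in column 9, ends in column 10, each list ending with a trailing comma); the function name show_exons is hypothetical.

```shell
# One made-up record in knownGene column order (columns after exonEnds omitted).
record=$(printf 'tx1\tchr1\t+\t100\t900\t100\t900\t3\t100,400,700,\t200,500,900,')

# Hypothetical helper: report what split() returns, then list the exons.
show_exons() {
  awk '{
    n  = int($8)              # exonCount
    ns = split($9,  S, /,/)   # S[1..ns] = exon starts
    ne = split($10, E, /,/)   # E[1..ne] = exon ends
    # The trailing comma yields one extra, empty element, so the loop
    # runs to n (the exon count) rather than to the array length.
    printf("n=%d starts=%d ends=%d\n", n, ns, ne)
    for (i = 1; i <= n; ++i) printf("exon %d: %s-%s\n", i, S[i], E[i])
  }'
}

printf '%s\n' "$record" | show_exons
```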


thanks Pierre, that helps.


I have an issue with this: if I want only the exons of the main isoform, how can I extract them? The file contains identical exons listed under different transcripts, such as uc001aaa.3,chr1,+,11873,12227 and uc010nxq.1,chr1,+,11873,12227.
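
On the repeated lines: an exon is printed once for every transcript that contains it, so identical coordinates under different uc IDs are expected. One possible way to collapse them is sketched here; it keeps the first transcript ID seen for each distinct chromosome, strand, start and end. It does not choose a "main" isoform, which would need an extra table naming one transcript per gene. The function name unique_exons is made up for illustration.

```shell
# Hypothetical helper: read the comma-separated lines produced by the awk
# command in the answer (id,chrom,strand,start,end) and keep one line per
# distinct exon, ignoring the transcript ID when comparing.
unique_exons() {
  awk -F, '!seen[$2 FS $3 FS $4 FS $5]++'
}

# Example with four lines from the answer's output; the third repeats the first exon.
printf '%s\n' \
  'uc001aaa.3,chr1,+,11873,12227' \
  'uc001aaa.3,chr1,+,12612,12721' \
  'uc010nxq.1,chr1,+,11873,12227' \
  'uc010nxq.1,chr1,+,12594,12721' | unique_exons
```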

13.1 years ago

Several ways of doing this have been mentioned previously. The one I like most, for its simplicity, is BioMart: select "martview", choose the latest "Ensembl genes" database and the latest human dataset, then pick the attributes you need in the "structures" section (its "exon" subsection offers "Exon Chr Start (bp)" and "Exon Chr End (bp)"), without applying any filter at all.


Thanks Jorge, it really helps me. Thanks a lot.


Why is the result from BioMart different from the result obtained with the Ensembl API?

13.1 years ago
Neilfws 49k

Search BioStar and you will find a number of solutions to this problem. Mostly they use BioMart, as outlined by Jorge, or the UCSC genome browser database tables, as described in the answer pointed to by Pierre.

One point: which set of exon coordinates do you want? There are several, depending on the gene prediction method used - e.g. UCSC, RefSeq or Ensembl transcripts.

If you start from UCSC tables, select "mammal, human, hg19" and then "Genes and Gene Prediction tracks" under "group", you will see the various gene models. Select one of those and choose "describe table schema" to see how exons are stored. Then go back to the main tables page, from where you should be able to download the exon data. This can also be done programmatically or through a SQL query to the UCSC MySQL server.
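
The SQL route mentioned above might look like the following. This is a sketch, not a tested recipe: it assumes UCSC's public MySQL server at genome-mysql.soe.ucsc.edu (user genome, no password) and uses refGene as the example gene model. Both function names are made up, and the table name can be swapped for whichever track was chosen in the Table Browser, provided it has the same exonCount, exonStarts and exonEnds columns.

```shell
# Hypothetical helper: fetch one row per transcript from a UCSC gene-prediction table.
# -N drops the header, -B gives tab-separated output, -A skips auto-rehash.
ucsc_transcripts() {
  table=${1:-refGene}
  mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -B -D hg19 \
    -e "select name, chrom, strand, exonCount, exonStarts, exonEnds from ${table}"
}

# Hypothetical helper: expand those rows into one tab-separated line per exon.
per_exon() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    n = int($4); split($5, S, /,/); split($6, E, /,/)
    for (i = 1; i <= n; ++i) print $1, $2, $3, S[i], E[i]
  }'
}

# Usage (needs network access and a mysql client):
#   ucsc_transcripts refGene | per_exon > refGene_exons.tsv
```

Selecting the columns by name keeps the awk step independent of table-specific extras such as the leading bin column.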
