How to download all CpG island data for hg38 or hg19 from UCSC?
4.9 years ago
winjorchen ▴ 60

Hi friends: How can I download all the CpG island data for hg38 or hg19 from UCSC? Is there a CpG island database? Thanks.

genome alignment sequence next-gen • 9.8k views
4.9 years ago

For hg19, you can grab the cpgIslandExt table from UCSC's goldenpath service, and use BEDOPS sort-bed to build a sorted BED4+ file:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/cpgIslandExt.txt.gz \
   | gunzip -c \
   | awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }' \
   | sort-bed - \
   > cpgIslandExt.hg19.bed

Per the table schema for this file, the first four columns are the island's genomic interval and name. The remaining columns are the island length, the number of CpGs in the island, the number of C and G bases in the island, the percentage of the island that is CpG, the percentage of the island that is C or G, and the ratio of observed (cpgNum) to expected (numC*numG/length) CpG in the island.
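To make the column order concrete, here is a minimal sketch that pairs each output column with a field name. The names follow the UCSC cpgIslandExt schema as I understand it (with the `bin` column dropped), and the sample row is only illustrative, so check both against the schema and your own file:

```shell
# Illustrative row in the BED layout described above (treat the values as a demo only).
row=$(printf 'chr1\t28735\t29810\tCpG:116\t1075\t116\t787\t21.6\t73.2\t0.83')

# Assumed column names, in output order, based on the cpgIslandExt schema.
names="chrom chromStart chromEnd name length cpgNum gcNum perCpg perGc obsExp"

# Pair each column name with its value, one "name=value" per line.
labeled=$(printf '%s\n' "$row" | awk -F'\t' -v names="$names" '{
    n = split(names, col, " ")
    for (i = 1; i <= n; i++) print col[i] "=" $i
}')
printf '%s\n' "$labeled"
```

The same `-F'\t'` field numbering can be reused to filter the BED file, for example on island length (column 5) or observed/expected ratio (column 10).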

You can do the same thing for hg38, with a slight tweak to the URL:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cpgIslandExt.txt.gz \
   | gunzip -c \
   | awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }' \
   | sort-bed - \
   > cpgIslandExt.hg38.bed

The schema is the same between builds, but you can take a look at it here.


Thanks for the answer! Unfortunately, it has an error. When you call awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }', you print a substring starting with the first occurrence of the string found in field 7. So, if the string found in field 7 also occurs earlier in the row, then it'll print from that point instead of field 7. Indeed, on the 11th line of the supplied file, you suddenly have 13 columns instead of 11.

Here's a longer but correct snippet: awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, $7, $8, $9, $10, $11, $12 }'

And the whole code block:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cpgIslandExt.txt.gz \
   | gunzip -c \
   | awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, $7, $8, $9, $10, $11, $12 }' \
   | sort-bed - \
   > cpgIslandExt.hg38.bed
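
The difference between the two awk calls can be reproduced on a single toy row. This is only a sketch: the row below is invented so that the island length (439) also occurs inside the start coordinate (10439), and `count_cols` is a hypothetical helper that exists only for this demo.

```shell
# Toy input row in the cpgIslandExt.txt layout; values are invented so that
# the length field ("439") also appears inside chromStart ("10439").
row=$(printf '585\tchr1\t10439\t10878\tCpG: 40\t439\t40\t300\t18.2\t68.3\t0.80')

# Original approach: index() returns the FIRST match of $7 anywhere in the line,
# so the substring starts inside chromStart rather than at field 7.
buggy=$(printf '%s\n' "$row" \
    | awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }')

# Corrected approach: print the remaining fields explicitly.
fixed=$(printf '%s\n' "$row" \
    | awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, $7, $8, $9, $10, $11, $12 }')

# Hypothetical helper: count tab-separated columns in a single line.
count_cols() { printf '%s\n' "$1" | awk -F'\t' '{ print NF }'; }

echo "original:  $(count_cols "$buggy") columns"
echo "corrected: $(count_cols "$fixed") columns"
```

Running `awk -F'\t' '{ print NF }' cpgIslandExt.hg38.bed | sort | uniq -c` on the real output is a quick way to confirm that every row has the same number of columns.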

Thanks, it is helpful!

4.9 years ago
EagleEye 7.2k

You can use the UCSC Table Browser.


Thanks! It is an easy way to get it; I never found this way before!
