Get GC content, Conservation score and repeat fraction from UCSC?
4.3 years ago
star ▴ 350

I have a large file of genomic coordinates, as shown below.

Data:

chr10   42383177    42384128
chr10   42384129    42385080
chr10   42385081    42386032
chr10   42386033    42386984
chr10   42386985    42387936
chr10   42387937    42388888
chr10   42388889    42389840
chr10   42389841    42390793

I would like to extract some features from UCSC like:

  • GC content
  • conservation score (phastCons)
  • repeat fraction (RepeatMasker)

Is there any way to download the whole genomic table from UCSC, or a script that gets those scores for my given coordinates? How can I get those features from UCSC?

Many thanks in advance!

4.3 years ago
Luis Nassar ▴ 650

Hello,

These data sets are all available in full from our download server (as well as from our API and public MySQL server for programmatic point access). Here is an example of the data locations for GRCh38/hg38:

I would also like to mention that we have a Data Integrator tool (http://genome.ucsc.edu/cgi-bin/hgIntegrator). If you upload your BED file, you can select tracks to overlap with your positions, and the tool annotates each position according to those tracks. I made a small BED file with your example positions and chose the following tables for hg38:

  • GC Percent
  • Conservation - Cons 100 verts
  • RepeatMasker

I then configured some of the output fields (to reduce redundancy and simplify the example) and got the following output:

# hgIntegrator: database=hg38 region=genome Wed Jan 15 09:14:18 2020
#ct_UserTrack_3545.chrom    ct_UserTrack_3545.chromStart    ct_UserTrack_3545.chromEnd  gc5BaseBw.valueAverage  phastCons100way.valueAverage    rmsk.repName    rmsk.repClass   rmsk.repFamily
chr10   42383177    42384128    35.247108           AluSc           SINE    Alu
chr10   42383177    42384128                        MER57A-int      LTR     ERV1
chr10   42383177    42384128                        LTR19A          LTR     ERV1
chr10   42384129    42385080    47.024185           LTR19A          LTR     ERV1
chr10   42384129    42385080                        LTR19-int       LTR     ERV1
chr10   42384129    42385080                        AluYi6_4d       SINE    Alu
chr10   42384129    42385080                        HERVFH19-int    LTR     ERV1
chr10   42385081    42386032    44.058885           HERVFH19-int    LTR     ERV1
chr10   42385081    42386032                        AluSx1          SINE    Alu
chr10   42386033    42386984    44.794953           AluSx1          SINE    Alu
chr10   42386033    42386984                        HERVFH19-int    LTR     ERV1
chr10   42386985    42387936    43.238696           HERVFH19-int    LTR     ERV1
chr10   42386985    42387936                        AluSg           SINE    Alu
chr10   42386985    42387936                        HERVFH19-int    LTR     ERV1
chr10   42387937    42388888    45.467928           HERVFH19-int    LTR     ERV1
chr10   42387937    42388888                        AluSx1          SINE    Alu
chr10   42387937    42388888                        HERVFH19-int    LTR     ERV1
chr10   42388889    42389840    41.808623           HERVFH19-int    LTR     ERV1
chr10   42389841    42390793    46.491597           HERVFH19-int    LTR     ERV1
chr10   42389841    42390793                        AluSz           SINE    Alu
chr10   42389841    42390793                        HERVFH19-int    LTR     ERV1

You'll notice, though, that this region has no conservation scores, which looks to be due to repeats. If you have further questions, you can email us at genome@soe.ucsc.edu.
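
Because the Data Integrator output repeats each region once per overlapping repeat, it can help to collapse it to one record per region. The following is only a minimal sketch, assuming the downloaded file is tab-delimited with the eight columns shown in the header and that empty fields are kept as empty strings; the function name `collapse_integrator_output` is hypothetical and not part of any UCSC tool.

```python
from typing import Dict, Tuple


def collapse_integrator_output(text: str) -> Dict[Tuple[str, int, int], dict]:
    """Group Data Integrator rows by region (assumed 8 tab-separated columns)."""
    regions: Dict[Tuple[str, int, int], dict] = {}
    for line in text.splitlines():
        # Skip blank lines and the '#' header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Pad short rows so unpacking never fails on trailing empty columns.
        fields += [""] * (8 - len(fields))
        chrom, start, end, gc, cons, rep_name, _rep_class, _rep_family = fields[:8]
        key = (chrom, int(start), int(end))
        record = regions.setdefault(
            key, {"gc": None, "phastCons": None, "repeats": []}
        )
        if gc:
            record["gc"] = float(gc)
        if cons:
            record["phastCons"] = float(cons)
        # Keep each repeat name once, in order of first appearance.
        if rep_name and rep_name not in record["repeats"]:
            record["repeats"].append(rep_name)
    return regions
```

Each value then holds the average GC percent, the average phastCons score (or `None` where the region has no score, as here), and the list of repeat names seen for that region.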


Many thanks, Luis.

I have two more questions:

  • What is the difference between phastCons100way and phastConsElements100way, and which one should be used?

  • I could not find 'phastCons100way.valueAverage'; there is only 'phastCons100way.value', which is for 100 species when I check the output file. Is there any option that gives me the average of the scores?
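
If only un-averaged scores are available, a per-region average can also be computed after the fact. The sketch below is an illustration under stated assumptions rather than how the Data Integrator does it: scores are given as non-overlapping (start, end, score) intervals on the same chromosome as the region, in 0-based half-open coordinates, and the name `mean_score` is hypothetical. The optional flag mirrors the distinction the UCSC `bigWigAverageOverBed` utility makes between its `mean` (covered bases only) and `mean0` (uncovered bases counted as zero) columns.

```python
from typing import Iterable, Optional, Tuple

# (start, end, score) in 0-based half-open coordinates; assumed non-overlapping.
ScoredInterval = Tuple[int, int, float]


def mean_score(
    region_start: int,
    region_end: int,
    scored: Iterable[ScoredInterval],
    count_uncovered_as_zero: bool = False,
) -> Optional[float]:
    """Length-weighted mean of scores over one region.

    Returns None when no scored base overlaps the region (or, with
    count_uncovered_as_zero=True, when the region has zero length).
    """
    total = 0.0
    covered = 0
    for start, end, score in scored:
        # Clip each scored interval to the region before weighting it.
        lo = max(start, region_start)
        hi = min(end, region_end)
        if hi > lo:
            total += score * (hi - lo)
            covered += hi - lo
    if count_uncovered_as_zero:
        length = region_end - region_start
        return total / length if length > 0 else None
    return total / covered if covered else None
```

For a region with no conservation data at all, such as the repeat-rich example above, the default mode returns `None` rather than a misleading zero.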

4.3 years ago

Index and query the downloaded files with tabix, or just use bedtools intersect.
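
To show what such an intersection boils down to for the repeat fraction, here is a minimal pure-Python sketch of the overlap arithmetic. It is not how tabix or bedtools are implemented; it assumes the repeat intervals have already been fetched for the same chromosome as the region and use 0-based half-open coordinates, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based half-open


def merge_intervals(intervals: Iterable[Interval]) -> List[List[int]]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def repeat_fraction(region: Interval, repeats: Iterable[Interval]) -> float:
    """Fraction of the region's bases covered by at least one repeat."""
    start, end = region
    length = end - start
    if length <= 0:
        return 0.0
    covered = 0
    for r_start, r_end in merge_intervals(repeats):
        # Count only the part of each merged repeat that lies in the region.
        overlap = min(end, r_end) - max(start, r_start)
        if overlap > 0:
            covered += overlap
    return covered / length
```

Merging first matters because RepeatMasker annotations can overlap one another, and summing raw overlaps would then overstate the fraction.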


@Pierre: Sorry, I forgot to mention that I would like to download these features (GC, conservation, and so on) from UCSC. I have edited the question.
