How to I automate getting a conservation score from UCSC if I have exon coordinates?
1
0
Entering edit mode
8.5 years ago

I have a list of exon coordinates and I need the per base conservation (phastcons100ways) from the UCSC Table browser. I know how to do it using GUI but I'm trying to automate it either by command line or scripting it because the job at hand is huge and the GUI restricts it to a 1000 defined regions

genome alignment ChIP-Seq • 2.0k views
ADD COMMENT
1
Entering edit mode
8.5 years ago

Use UCSC GoldenPath to download wig files, converting them to BED with BEDOPS wig2bed.

Here's an example of how to get phyloP 46-way scores for hg19:

$ cd work_directory
$ rsync -avz --progress rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate .    
...
$ for fn in `ls vertebrate/*.gz`; \
    do gunzip -c $fn \
        | wig2bed - \
        > ${fn%%.*}.bed; \
    done

You can modify this for other conservation score collections and reference genomes, as needed.

Once you have BED files containing conservation signal, you can use BEDOPS bedmap to restrict the score set to your exon coordinates.

First, sort your exon coordinates:

$ sort-bed exons.unsorted.bed > exons.bed

Then map exons to conservation scores, e.g. for chrX:

$ bedmap --echo --echo-map-score --delim '\t' exons.bed chrX.phyloP46way.bed > answer.chrX.bed

To do this map step over all chromosomes:

$ for chr in `seq 1 22` X Y; do bedmap --echo --echo-map-score --delim '\t' exons.bed chr${chr}.phyloP46way.bed > answer.chr${chr}.bed; done

This loop could be parallelized with GNU Parallel or a computation cluster. In any case, doing all this on the command-line gets around browser limitations.

ADD COMMENT

Login before adding your answer.

Traffic: 5261 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6