Average PhastCon score for a bed file
2
3
Entering edit mode
7.5 years ago

I have a bed file and wanted to calculated average PhastCon score over the regions, is there a tool which could do so? I tried using cmotifs but it submits jobs and at times does not produce timely results.

phastCon • 4.8k views
10
Entering edit mode
7.5 years ago

You could use bedmap --mean here, to map PhastCon signal onto your sorted regions.

First, grab PhastCon data from UCSC and convert WIG to a compressed form of BED called Starch, using a tool called wig2starch:

$rootPath="http://hgdownload.cse.ucsc.edu/goldenpath/hg19/phastCons100way/hg19.100way.phastCons"$ for i in seq 1 22 X Y; \

do \
echo "getting PhastCon data for chromosome chr${i}..."; \ wigFn="chr${i}.phastCons100way.wigFix"; \
url="${rootPath}/${wigFn}.gz"; \
wget -qO- ${url} | gunzip -c - | wig2starch - >${wigFn}.starch; \
done


We could have used wig2bed instead of wig2starch, but note that the final set of PhastCons files could take between 10-20 GB compressed, and uncompressed BED files could require about 10x more disk space. It takes a little longer to make Starch files, but it will take less disk space. I'd recommend some kind of compression, if you plan to do repeated map calculations.

Second, map PhastCon signal files to your sorted regions file by chromosome:

$for i in seq 1 22 X Y; \ do \ echo "mapping signal for chromosome chr${i}..."; \
phastConFn="chr${i}.phastCons100way.wigFix.starch"; \ bedmap --echo --mean --chrom chr${i} regions.bed ${phastConFn} > regions_with_avg_phastcons.${i}.bed;  \
done


Third (optionally), union the per-chromosome results to a single file, using bedops --everything:

$bedops --everything regions_with_avg_phastcons.*.bed > regions_with_avg_phastcons.bed  Doing the bedmap step with --chrom allows parallelization with grid job schedulers or GNU Parallel, splitting the tasks by chromosome. This could reduce calculation time down to that required for the largest chromosome of regions. One thing I would note is that the average value calculated is over scores for mapped elements. If you have gaps of scores along your mapped region, those gaps would not contribute to the calculation of the mean signal. I do not think you can interpolate PhastCon conservation signal between a gap, but it has been a while since I have worked with this particular dataset - its authors could probably confirm or negate this. It might generally be useful to use bedmap with the --skip-unmapped option to filter out results for unmapped regions. ADD COMMENT 0 Entering edit mode @AlexI I am getting the following error: Error: Error: stat() failed on: regions.bed mapping signal for chromosome chr18... May use bedmap --help for more help.  For all the chromosomes. ADD REPLY 0 Entering edit mode Errors of the type "stat() failed" usually mean that the file does not exist or is otherwise not accessible. I'd double-check your input files are available and accessible. ADD REPLY 0 Entering edit mode Hi Alex the file is present there and I also checked the input format. ADD REPLY 0 Entering edit mode I'm not really sure: there's not much I can troubleshoot as described. Double-check to be sure that the specific inputs exist and are accessible, when the loop fails (chr18?). Make sure you have sufficient disk space and file permissions. Perhaps break down the loop to individual chromosomes, until you find the problem file related to chr18. ADD REPLY 0 Entering edit mode Hi Alex to troubleshoot it I just took a single chromosome file, and executed the following command: bedmap --echo --mean --chrom chr1 regions.bed chr1.phastCons100way.wigFix.starch  May use bedmap --help for more help. Error: Error: stat() failed on: regions.bed  And so to see if bedmap is working at all on the file I did : bedmap --echo regions.bed  Let me try some other combinations Thanks, a very useful code though. ADD REPLY 0 Entering edit mode Is regions.bed sorted per sort-bed? I can't immediately think of any other issues, but that's not to say this isn't a bug. If it helps, if you can post regions.bed somewhere I can download it (Dropbox, Google Drive, etc.), I can offer to try to replicate the error on my end, as the Phastcon data are available from UCSC. ADD REPLY 0 Entering edit mode Sure that would be great, where to can I send a link to download the file? ADD REPLY 0 Entering edit mode See my profile page for email address. ADD REPLY 1 Entering edit mode Or email: reynolda - at - uw.edu ADD REPLY 1 Entering edit mode I did some spot checks against the regions you sent by email, and I could not reproduce your issue. For example: $ bedmap --echo --mean --chrom chr10 regions.bed chr10.phastCons100way.wigFix.starch
chr10   75255759        75255782        -       ENSG00000107758:E1.1|0.999435
$bedmap --echo --mean --chrom chrY regions.bed chrY.phastCons100way.wigFix.starch$


I suggest again double-checking your files, permissions, and so forth. From the error message you reported earlier, it looks like something is wrong specifically at chromosome chr18, so I suggest that you maybe start your investigations there. I'm sorry you're having problems, and I hope you find a solution!

0
Entering edit mode

Thanks for you help Alex.

0
Entering edit mode
7.5 years ago

If the files is small and this is a one time thing, you can extract it from ucsc table browser easily.

Go to ucsc, upload the track, go to table browser, specify the phastcons track you want, click on intersection and select your file, there it is.

If you will be doing this frequently, I would suggest downloading the wig file and writing a script for this.