Not with R, but you could use BEDOPS tools:
Convert the per-chromosome phyloP WIG files on UCSC's servers to
sorted BED using wig2bed.
Prepare your genomic regions into a sorted BED file containing
per-base elements.
Run bedmap on the phyloP data and your regions to map per-base
phyloP scores to your per-base regions.
Retrieving phyloP data and converting it to sorted BED
We can make per-chromosome phyloP BED files and ultimately use these with bedmap or other BEDOPS tools. As an example, we can use the vertebrate hg19 phyloP data located at http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/
$ rsync -avz --progress rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate .
...
$ for fn in `ls vertebrate/*.gz`; \
do gunzip -c $fn \
| wig2bed - \
> ${fn%%.*}.bed; \
done
Preparing your regions-of-interest
Let's assume your regions-of-interest are already in a four-column (or more) BED file called roi.contiguous.unsorted.bed. To make sure it works with BEDOPS correctly, we sort it and create a file called roi.contiguous.bed:
$ sort-bed roi.contiguous.unsorted.bed > roi.contiguous.bed
Next, we want to split the ranges into per-base elements. We assume there is an ID field in the fourth column. We also assume that the ID for a given region-of-interest is uniquely named. When splitting our regions, we can use this ID as a unique prefix, in order to specifically and uniquely identify with which range the per-base element associates:
$ awk ' \
{ \
regionChromosome = $1; \
regionStart = $2; \
regionStop = $3; \
regionID = $4; \
baseIdx = 0; \
for (baseStart = regionStart; baseStart < regionStop; baseStart++) { \
baseStop = baseStart + 1; \
print regionChromosome"\t"baseStart"\t"baseStop"\t"regionID"-"baseIdx; \
baseIdx++; \
} \
}' roi.contiguous.bed > roi.perBase.bed
Run head or less on the roi.perBase.bed file so that you see what this script does. When we map phyloP scores, we will be able to map them base-by-base.
Note: If your BED file does not have an ID field, or region IDs are not unique, it is trivial to use awk to add or replace the ID field with a unique, per-row identifier.
Performing bedmap operations
Finally, we're ready to do the map operation and see which elements associate:
$ for chrLabel in `seq 1 22` X Y; \
do \
mapFn=/path/to/phyloP/chr${chrLabel}.bed; \
bedmap --chrom chr${chrLabel} --echo --echo-map-score roi.perBase.bed $mapFn \
> perBase.chr${chrLabel}.answer.bed; \
done
The file perBase.chr${chrLabel}.answer.bed (where chr${chrLabel} is replaced with the chromosome name from the per-chromosome phyloP files) contains per-base-split regions-of-interest along with the phyloP scores that associate with each base.
To explain the bedmap statement, the --echo operator prints out the region-of-interest (a per-base element from your original regions-of-interest) and --echo-map-score prints out the phyloP score that associates with that base, for each chromosome.
In the case where no phyloP score maps to a base, a NAN is printed.
You may or may not want to use sed or other tools to replace NAN with a zero, particularly if the presence of a zero is a biologically meaningful score, while the absence of a score is also biologically meaningful, but in a different way.
Some more work to do
A couple more things you could do that might make this easier, if you have to automate this process:
Concatenate the per-chromosome phyloP BED files into one master BED
file.
Merge the split elements in the perBase.chr${chrLabel}.answer.bed
result back into contiguous ranges, using the ID prefix to group
them. In a column adjacent to the newly re-merged contiguous range,
print all the scores into a row vector that is delimited with some
useful character.
Doing step 1 would remove the need to use a for loop on the bedmap statement. You then just have one bedmap operation against the master phyloP file, instead of looping through each per-chromosome phyloP file.
Step 2 would let you more easily process contiguous regions. For example, you might be working with a window around a set of transcription start sites and you want to know what the mean conservation is across each window.
Hope this gets you started!
Thanks for sharing details and also for phastcons in a separate post. One correction: I think there should be
--chrom $chroption inbedmap --echo --echo-map-score roi.perBase.bed $fn...else it is echoing all chromosomes from roi.perBase.bed. That is strange given$fnonly has one chromosome and mapping seems to be based on start-stop columns and ignoring chromosome column.You're right that
--chrom $chris needed. The fileroi.perBase.bedis the "reference file" and mapping will be done against that. So even if no overlaps are found in the map file$fn(or nowchr${chrLabel}, to track chromosome names), elements will be reported from all chromosomes of the reference file (unless--skip-unmappedis used, but that is not useful here). Good catch.Maybe you can give some suggestions on a good strategy of assigning the unique per row identifier. Not sure what is most appropriate to this method just given a simple bed file. My main interest is to get the scores from the merged elements which would represent a single score per interval within the bed. My bed file looks as below. Thanks!
In Python:
Then:
The file
out.bedwill be sorted and contain per-base positions within each element.The ID field is effectively unique by using the form
A-B, whereAis the element or row index, andBis the position or base index within the element. I use zero-padding to make the result easier to troubleshoot/read.If all the elements in
in.bedare of the same length, this makes mapping and later aggregation very fast, as you can split the ID field into element index and base position index. The base index can directly reference the contents within an array at a specific position.This just may be my lack on Python understanding but in the initial python script where is the bed file input, that file requiring the unique identifiers? Thank you!
Maybe I misunderstand. I'm assuming you have a BED file that you want to map and aggregate on a per-base level with phyloP conservation scores (or whatever per-base signal, but phyloP for the purposes of this question). This BED file is made up of regions of interest, like TFBS or promoters, etc. What are you trying to do?
That is exactly right, at the end Id like to have a conservation score based on phyloP for each interval in my bed file to sort for downstream analysis. I was just asking about the python script you listed and how I can input this initial bed file into it so as to add unique identifiers to it before proceeding with your initial example in this post. Thanks!
If you need to strip the header, you could do:
Then you'd use
out.bedfor scoring.Thats perfect thank you!
Quick question, tried doing mapping as you suggested with the following code:
And got this error, not sure why its adding answer.bed to the phyloP files in this instance? Thanks!
Can you print a snippet of the results from this?
I suspect you may need to use the second variable (the "basename") in some form, instead of the raw
$fn. I'd troubleshoot there.ok its really strange but I redid the import of mm10 phyloP as follows:
This gave a directory
mm10.60way.phyloP60waywhich contained all of these the necessary .gz files, following this I ran:Which I thought would give me individual .bed files for use with bedmap however it concatenated all of them it seems into one file named mm10.bed where the only chromosome listed is chrY for some odd reason, can you see why this would be? I guess I can wig2bed all of them individually but its just odd.
Here's what I see:
I think you may want a slight tweak:
In other words, use
${fn##*/}and not${fn%%.*}. Maybe prefix${fn##*/}with another directory.Bash variables like these are a huge pain in the ass. So I'd suggest getting into the habit of using
echoto debug and sanity-check what your source and destination paths look like, before running theforloop with real data.This worked perfectly, thanks for tip in terms of checking the bash variables and adjusting them. I sometimes neglect to think of these path issues. In terms of concatenating the per-chromosome phyloP BED files would you suggest just using bedtools merge or something different? Also I was hoping you could give a suggestion on a good way to merge the
perBase.$fn.answer.bedback into continuous ranges, since at the end I would like just a regular interval bed file where every interval has some conservation score that I can thereafter sort by. Thanks again!Let's say you have a per-base BED file that looks like this, where score or signal data are in the fifth column:
You can get contiguous ranges from this via:
You could then use
bedmapto get the mean score or signal over the merged range:The
bedopstool does set operations. Thebedmaptool does map operations. Here, we map signal to genomic intervals, and we do some specified operation on that signal.If you want something other than a mean, there are other signal operations: https://bedops.readthedocs.io/en/latest/content/reference/statistics/bedmap.html#score-operations
You could get the median conservation score over the merged range, for instance.
You can also specify multiple signal operations:
This gives you the mean and standard deviation of signal over merged regions.
Using the
--delim '\t'operator puts these results into separate tab-delimited columns, for example, to make it easy to pipe tosortto get a sorted list:Note that I change the result's extension from
bedtotxt, to serve as a reminder that the output is no longersort-bed-sorted, and so I would need to do sorting before use with set operation tools.That works great thanks!
Still having a little trouble establishing the right variable for mapping my data with unique IDs to the phyloP per chromosome files, my data looks as follows:
I run the following commands, with similarly changed variables before using bedmap:
I get this output which appears to be missing the scores entirely, not sure why the
--echo-map-scoreflag does not add scores and only the pipe:Any idea as to why? Thanks again!
It is likely because those positions have no scores associated with them.
If you do something like this:
and if you get no answer back, that means there is no element in
perBase.chr1.answer.bedthat has a score associated with it.If it helps with processing results, you can add the
--skip-unmappedoperator to thebedmapcommand and it will leave out elements in the output where there are no mappings between reference and map elements, i.e.:So two questions, for phyloP to have a score at a particular base on lets say chromosome 1, what criteria needs to be met in terms of level of conservation? Looking this up as well, just surprised that so many of these have nothing associated though when I just scan these in UCSC for many I see at least some conservation - though maybe thats just a bias on my end.
Second question, if I invoke the
--skip-unmappedflag during mapping wouldn't this preclude me from thereafter collapsing into contiguous intervals?Thanks for your help!
I'd take a very careful look at the distribution of your observed phyloP scores before imputing zeros in gaps. If there's a gap where conservation could not be determined over the species alignment, then putting in a zero could say this position is neutral when it simply isn't known. That could be problematic.
That caveat given, here is an ugly way you could use
bedmapto deal with that case:If you're doing column aggregation for elements of equal length, this should give you a score of some kind at every per-base position over your set of elements.
An unreleased version of
bedmapincludes the option to report a zero score where there are unmapped reference elements, but it may be a little longer before the next version of BEDOPS comes out.If you are more comfortable leaving out zeros, you might simply aggregate over the non-NaNs. Say you have two vectors
(2, 10, NAN)and(3, 2, 1). You would take the average of the first two elements, and leave the third element as it is. Not great, but if you have a bunch of such vectors, you could think about calculating a per-base or per-position standard deviation or variance specific to thenat that position, and return this along with the mean or other per-base statistic.In the case of missing data, you might impute a value of zero for neutral conservation or selection. But this really depends on the data, and I don't remember exactly how this works for phyloP. To do imputation you'd need an extra step. I'm on my phone so I'd have to wait until I'm in front of a proper keyboard to describe the specifics.