You can use the --fraction-ref <val>
operator to specify what fraction of the reference element (read element) must overlap the ROI:
$ bedmap --echo --echo-map --fraction-ref 1 reads.bed roi.bed > readsWithinROIs.bed
Using --fraction-ref 1
means 100% of the read (reference element) must overlap an ROI (map element). The value <val>
can be above 0 and less than or equal to 1, depending on how little or how much stringency is needed.
Here, the file readsWithinROIs.bed
will contain reads and mapped ROIs, where the read is contained entirely within the ROI.
However, it will be somewhere within the ROI, not necessarily aligned to the ROI's start or end. If the read must line up with the start or end of the ROI, then you have another step to do.
Because the readsWithinROIs.bed
file contains the coordinates of each read and its associated ROI (due to using --echo
and --echo-map
operators), you can use a tool like awk
(or Perl, Python, etc.) to filter them where the start or end coordinate of the read matches the start or end coordinate of an ROI.
For simplicity, let's say each read has one and only one ROI associated with it (though you could possibly have more than one ROI, depending on what your ROIs look like - you might have overlapping ROIs - which would require changes to this script).
In this simplified case, you can use awk
to test the start and end bounds of each read and whether either of those bounds match with those of the parent ROI - and if there's a match, we print the whole line:
$ bedmap --echo --echo-map --fraction-ref 1 reads.bed roi.bed \
| awk ' \
BEGIN { \
FS = "|"; \
} \
{ \
readBed = $1; \
roiBed = $2; \
split(readBed, readBedFields, "\t"); \
split(roiBed, roiBedFields, "\t"); \
readStart = readBedFields[2]; \
readStop = readBedFields[3]; \
roiStart = roiBedFields[2]; \
roiStop = roiBedFields[3]; \
if ((readStart == roiStart) || (readStop == roiStop)) { \
print $0; \
} \
}' > readsWithinROIsAlignedToEdges.bed
Please, see the updated question.
Please see the update to my answer.
Thank you! Yeah, I think that is exactly what I need (only I will use --fraction-map, not ref...). I will probably rewrite your awk script in Python and post it here, because I don't know awk so that I can better control my pipeline.
If you use
--fraction-map
, you'd reverse the order of ROI and read files:Note that, in this case, you will almost certainly have multiple reads mapped to each ROI. Which is not a problem, except that you'll need to do two split operations on each line of this output:
|
) to split the ROI element from the reads; and,;
) to get to each read element.If you do it the other way around, by mapping reads to ROIs, then you can probably test each read directly, without having to do two split operations. Either way is fine, one way might be a little more work.
OK, thank you for this comment, you are probably right in general. I will give it a little more thought though in the context of future steps of my analysis, my way may prove to be better. I need to count these reads for each ROI, so if I use
--fraction-map
and simultaneously filter and count reads for each ROI, that should be significantly faster, than analysing the output of--fraction-ref
.If you need to count filtered reads per ROI, then your way will be easier to do at the same time as you parse the output, definitely.