Mapping bed file to the ends of bed features of another file
1
0
Entering edit mode
9.9 years ago
flyamer ▴ 60

Hi!

I want to do the following thing: I want to map a bed file with my reads to a bed file with my regions of interest, but I want to filter them, so that I obtain only the ones, that map exactly to an end of a ROI. The reads are of varying length, so I can't just make small ROIs and map them so that they are identical (--exact in bedmap, -r 1.0 in bedtools map). How should I do that?

UPDATE

This is what I mean: I want to keep green reads, but filter out red ones, and a ROI is black

bed mapping • 2.4k views
ADD COMMENT
2
Entering edit mode
9.9 years ago

You can use the --fraction-ref <val> operator to specify what fraction of the reference element (read element) must overlap the ROI:

$ bedmap --echo --echo-map --fraction-ref 1 reads.bed roi.bed > readsWithinROIs.bed

Using --fraction-ref 1 means 100% of the read (reference element) must overlap an ROI (map element). The value <val> can be above 0 and less than or equal to 1, depending on how little or how much stringency is needed.

Here, the file readsWithinROIs.bed will contain reads and mapped ROIs, where the read is contained entirely within the ROI.

However, it will be somewhere within the ROI, not necessarily aligned to the ROI's start or end. If the read must line up with the start or end of the ROI, then you have another step to do.

Because the readsWithinROIs.bed file contains the coordinates of each read and its associated ROI (due to using --echo and --echo-map operators), you can use a tool like awk (or Perl, Python, etc.) to filter them where the start or end coordinate of the read matches the start or end coordinate of an ROI.

For simplicity, let's say each read has one and only one ROI associated with it (though you could possibly have more than one ROI, depending on what your ROIs look like - you might have overlapping ROIs - which would require changes to this script).

In this simplified case, you can use awk to test the start and end bounds of each read and whether either of those bounds match with those of the parent ROI - and if there's a match, we print the whole line:

$ bedmap --echo --echo-map --fraction-ref 1 reads.bed roi.bed \
    | awk ' \
        BEGIN { \
            FS = "|"; \
        } \
        { \
            readBed = $1; \
            roiBed = $2; \
            split(readBed, readBedFields, "\t"); \
            split(roiBed, roiBedFields, "\t"); \
            readStart = readBedFields[2]; \
            readStop = readBedFields[3]; \
            roiStart = roiBedFields[2]; \
            roiStop = roiBedFields[3]; \
            if ((readStart == roiStart) || (readStop == roiStop)) { \
                print $0; \
            } \
        }' > readsWithinROIsAlignedToEdges.bed
ADD COMMENT
0
Entering edit mode
Ok, please, wait a couple of minutes, I will make an image of what I mean.
ADD REPLY
0
Entering edit mode

Please, see the updated question.

ADD REPLY
0
Entering edit mode

Please see the update to my answer.

ADD REPLY
0
Entering edit mode

Thank you! Yeah, I think that is exactly what I need (only I will use --fraction-map, not ref...). I will probably rewrite your awk script in Python and post it here, because I don't know awk so that I can better control my pipeline.

ADD REPLY
0
Entering edit mode

If you use --fraction-map, you'd reverse the order of ROI and read files:

$ bedmap --echo --echo-map --fraction-map 1 roi.bed reads.bed | ...

Note that, in this case, you will almost certainly have multiple reads mapped to each ROI. Which is not a problem, except that you'll need to do two split operations on each line of this output:

  1. Split on the pipe character (|) to split the ROI element from the reads; and,
  2. Split the reads on the semi-colon character (;) to get to each read element.

If you do it the other way around, by mapping reads to ROIs, then you can probably test each read directly, without having to do two split operations. Either way is fine, one way might be a little more work.

ADD REPLY
0
Entering edit mode

OK, thank you for this comment, you are probably right in general. I will give it a little more thought though in the context of future steps of my analysis, my way may prove to be better. I need to count these reads for each ROI, so if I use --fraction-map and simultaneously filter and count reads for each ROI, that should be significantly faster, than analysing the output of --fraction-ref.

ADD REPLY
0
Entering edit mode

If you need to count filtered reads per ROI, then your way will be easier to do at the same time as you parse the output, definitely.

ADD REPLY

Login before adding your answer.

Traffic: 2144 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6