Question: Getting All Reads That Align To A Region In Compact Bed Format Using Bedtools?
gravatar for user
7.9 years ago by
United States
user870 wrote:

I'm trying to find all the reads (by name) from a BAM file that align to various regions in a bed file. Right now I can do this with bedtools using intersectBed:

intersectBed -abam reads.bam -wo -f 1 -b regions.bed -bed

From this one can parse all the read ids that land in every interval in regions.bed, but it's not very compact. Is there a way to get bedtools to natively transform this into a more compact format, e.g.

chr1 x y .... read_id1,read_id2,read_id3

where chr1 x y is a given interval in regions.bed and the comma separated read_id1,... is the list of read ids from reads.bam that fall in that interval. In this compact format, the output BED file would have at most as many entries as there are regions in regions.bed, whereas with the -wo option it can be even larger than the number of reads in reads.bam. Thanks.

bedtools bed gff bam rna-seq • 3.4k views
ADD COMMENTlink modified 7.9 years ago by Alex Reynolds31k • written 7.9 years ago by user870
gravatar for Aaronquinlan
7.9 years ago by
United States
Aaronquinlan11k wrote:

One way to do this is to turn things around a bit, by making your BAM file be the "B" input. This way, the "driving" "A" file will be your regions.bed. This example below requires that your BAM file is sorted by position. Now, intersect won't allow BAM as the B file directly (it will soon), but you can use bedtobam to convert it to BED and pipe this to intersect. Using the -sorted option to limit memory consumption (and requiring that both the BED and the BAM are position sorted; sort -k1,1 -k2,2n for your BED file), you can pipe the output to the groupby option to get the result you'd like. The collapse option in groupby takes all values from a specific column (in this case the read ID is the 8th column) and creates a comma separated list from them. The list is returned based on changes in the columns upon which you are grouping --- in this case, columns 1-4 (-g 1-4), which represent the columns in my example regions file. Below is an example with some test files I have. If you are unfamiliar with groupby, there are more details here.

$  bedtools bamtobed -i testingData/NA18152.bam | head -3
chr1    554304    554637    NA18152-SRR007381.35051    16    -
chr1    554304    554618    NA18152-SRR007381.637219    16    -
chr1    554304    554656    NA18152-SRR007381.730912    16    -

Intersect the regions with the BAM file (NOTE: I have annotated the output to show where each output group starts). In this example, the BAM read name is the 8th column and the overlap (-wo) is the 11th column.

$ bedtools bamtobed -i NA18152.sorted.bam | \
   bedtools intersect -a regions.bed -b -  -sorted -wo
# group 1
chr1    713984    714547    CpG:_60    chr1    714220    714373    NA18152-SRR007381.251923    31    +    153
chr1    713984    714547    CpG:_60    chr1    714220    714368    NA18152-SRR007381.831825    37    +    148
# group 2
chr1    858970    861632    CpG:_257    chr1    860064    860310    NA18152-SRR007381.329161    10    -    246
# group 3
chr1    875730    878363    CpG:_246    chr1    875876    876203    NA18152-SRR007381.732122    9    +    327
# group 4
chr1    933387    937410    CpG:_413    chr1    936966    937135    NA18152-SRR007381.925947    12    -    169
# group 5
chr1    1109314    1110145    CpG:_59    chr1    1108986    1109385    NA18152-SRR007381.659411    23    +    71
chr1    1109314    1110145    CpG:_59    chr1    1108992    1109433    NA18152-SRR007381.615088    38    +    119
chr1    1109314    1110145    CpG:_59    chr1    1108992    1109347    NA18152-SRR007381.677905    43    +    33
chr1    1109314    1110145    CpG:_59    chr1    1109029    1109424    NA18152-SRR007381.217465    44    -    110
chr1    1109314    1110145    CpG:_59    chr1    1109038    1109346    NA18152-SRR007381.1495841    46    +    32

Now use group by to condense the read names and overlap columns

$ bedtools bamtobed -i NA18152.sorted.bam | \
   bedtools intersect -a regions.bed -b -  -sorted | \
      bedtools groupby -g 1-4 -c 8,11 -o collapse,collapse \
          head -5
chr1    713984    714547    CpG:_60    NA18152-SRR007381.251923,NA18152-SRR007381.831825    153,148
chr1    858970    861632    CpG:_257    NA18152-SRR007381.329161    246
chr1    875730    878363    CpG:_246    NA18152-SRR007381.732122    327
chr1    933387    937410    CpG:_413    NA18152-SRR007381.925947    169
chr1    1109314    1110145    CpG:_59    NA18152-SRR007381.659411,NA18152-SRR007381.615088,NA18152-SRR007381.677905,NA18152-SRR007381.217465,NA18152-SRR007381.1495841,NA18152-SRR007381.1061710,NA18152-SRR007381.209479,NA18152-SRR007381.293009,NA18152-SRR007381.350684,NA18152-SRR007381.1386200,NA18152-SRR007381.176689,NA18152-SRR007381.369558,NA18152-SRR007381.744765,NA18152-SRR007381.1321023,NA18152-SRR007381.838421,NA18152-SRR007381.991283,NA18152-SRR007381.1385310,NA18152-SRR007381.1039387,NA18152-SRR007381.1380869,NA18152-SRR007381.551813,NA18152-SRR007381.1295807,NA18152-SRR007381.1443162,NA18152-SRR007381.403094,NA18152-SRR007381.130789,NA18152-SRR007381.448068,NA18152-SRR007381.678372,NA18152-SRR007381.1300780,NA18152-SRR007381.160158,NA18152-SRR007381.1454803,NA18152-SRR007381.467939,NA18152-SRR007381.1405856,NA18152-SRR007381.1057252,NA18152-SRR007381.1062561,NA18152-SRR007381.85329,NA18152-SRR007381.722618,NA18152-SRR007381.878135    71,119,33,110,32,131,107,142,374,374,454,314,355,281,331,268,232,226,118,272,280,272,274,157,160,137,202,62,62,54,54,42,25,23,10,10
ADD COMMENTlink modified 11 months ago by RamRS30k • written 7.9 years ago by Aaronquinlan11k

@Aaronquinlan: I believe the above example won't work if you don't pass intersectBed the -wao option - the command as you wrote it will output a BED that has only 4 columns and doesn't output the read id information which groupBy relies on (i.e. the -c 8 parameter). Also on my end, if I pass -wao, the read column id is actually the 9th column (1-based) so it needs to be -c 9 (c-8 will give the read start position). Perhaps I misunderstood your answer?

ADD REPLYlink modified 7.9 years ago • written 7.9 years ago by user870

I've updated the example to hopefully make it more clear to you what is happening. The example also now uses -wo.

ADD REPLYlink written 7.8 years ago by Aaronquinlan11k

Thanks! Final issue: if I pass intersectBed -split, it will split junction reads into distinct intervals. My goal is to look for reads that are in total containment within the region (which is not the same as splitting them). Is there a way to achieve this for junction reads as well? So if you have (a, d) in the intervals file regions.bed and you have a junction read spanning the two exons (a, b), (c, d) (where a < b < c < d), then I would like to count that junction read as strictly contained within (a, d) but not count it towards (a, b) or (c, d) if those intervals are in regions.bed. Edit: actually, I think -split takes care of this, perhaps I'm wrong.

ADD REPLYlink modified 7.8 years ago • written 7.8 years ago by user870
gravatar for Alex Reynolds
7.9 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

The BEDOPS bedmap tool with the --echo, --echo-map-id, --delim and --multidelim options can also answer this very quickly. You can pipe in reads into bedmap by using the bam2bed script:

$ bam2bed.csh < reads.bam \
    | bedmap --echo --echo-map-id --delim '\t' --multidelim ',' sorted-regions.bed - \
    > answer.bed

The --echo option returns the region element, while --echo-map-id returns a list of read elements which overlap the region by one or more bases.

The answer.bed file will look like:

[ region-A in BED format ] \t [ comma-delimited list of read IDs overlapping region-A ]
[ region-B in BED format ] \t [ comma-delimited list of read IDs overlapping region-B ] 

As the answer is BED-formatted output, you can very easily pipe this to downstream processes that consume BED data.

Just be sure to sort regions, if not sorted, e.g.:

$ sort-bed regions.bed > sorted-regions.bed

And, likewise, for reads:

$ bam2bed.csh < reads.bam | sort-bed - > sorted-reads.bed

Or to put it together with the overall analysis:

$ bam2bed.csh < reads.bam \
    | sort-bed - \
    | bedmap --echo --echo-map-id --delim '\t' --multidelim ',' sorted-regions.bed - \
    > answer.bed
ADD COMMENTlink modified 11 months ago by RamRS30k • written 7.9 years ago by Alex Reynolds31k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1359 users visited in the last hour