Question

Reporting The Bam Reads Overlapping A Set Of Intervals With Bedtools

2

Entering edit mode

13.7 years ago

User 9996 ▴ 840

I am trying to use bedtools to pull out the reads falling directly within a set of BED coordinates. While this command does it successfully:

intersectBed -abam mybam.bam -b intervals.gff -wa -wb -f 1 | coverageBed -abam stdin -b intervals.gff

I find that it loses key information that I need. I'd like to get a listing of the BAM reads -- getting at least their ID -- split by exon. In other words, all the read IDs that fall into the first interval in intervals.gff, all the read IDs that fall into the second interval in intervals.gff... ideally, it would also report the CIGAR string for these reads, but I'd settle for just the ID.

Is there a way to report these reads, such that it's easy to tell from the output which set of reads landed in a given interval in the input BED file?

Thanks you.

bedtools bam next-gen sequencing bed • 8.5k views

ADD COMMENT • link updated 13.7 years ago by Alex Reynolds 36k • written 13.7 years ago by User 9996 ▴ 840

score 3 · Answer 1 · 2011-11-09

3

Entering edit mode

13.7 years ago

Aaronquinlan 12k

A memory-efficient way to do this is to use the new -sorted option in intersectBed. This requires your GFF and BAM files to be sorted by chrom then start. For the BAM it would be:

samtools sort mybam.bam mybam.sort

For the GFF it would be:

sort -k1,1 -k4,4n intervals.gff > intervals.sort.gff

The -sorted option in bedtools does not yet support BAM (work in progress), but you could just use bamToBed in a FIFO to get around this:

intersectBed -a intervals.gff -b <(bamToBed -i mybam.sort.bam) -sorted -wa -wb

This will report, for each interval in the GFF, a new line for each overlapping read. If you wanted to denormalize (collapse) this output to one line per interval with a list of read Ids, you could use the groupBy command as Istvan mentions. Something like below will generate a comma separated list of read ids (column 13 of the output) for each interval.

intersectBed -a intervals.gff -b <(bamToBed -i mybam.sort.bam) -sorted -wa -wb | \
      groupBy -g 1,2,3,4,5,6,7,8,9 -c 13 -o collapse

ADD COMMENT • link 13.7 years ago by Aaronquinlan 12k

0

Entering edit mode

Wow, groupBy really needs short-cuts like the -g 1-9.

ADD REPLY • link 13.7 years ago by Aaronquinlan 12k

0

Entering edit mode

Hi Aaron, I noticed that you have answered quite a few questions in this focus area. So, I thought may be I can post my question here. I am interested in extracting the number of paired-end reads aligned to each member of a large set of regions. I have used BEDTools/IntersectBED tool so far; it works well and returns reads overlapping my regions. But I am only interested in their count; and given the fact that I have quite a large number of regions, the output of intersectBED will be pretty large. Is there a way to prompt intersectBED to only return counts?

ADD REPLY • link 11.8 years ago by Noushin N ▴ 620

1

Entering edit mode

Have you tried the -c option? http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html#c-reporting-the-number-of-overlapping-features

ADD REPLY • link 11.8 years ago by Aaronquinlan 12k

0

Entering edit mode

Thank you so much for the very quick reply! I browsed biostars and SeqAnswers since posting the initial question, and came across coverageBED. Do you know the pros/cons of using -c option vs coverageBED? will they return identical results? and is there a way to make them return the number of paired-end reads as opposed to counting each member of pair once? [ presumably, these counts should be ~2x what I am looking for; but just to be safe!]

ADD REPLY • link 11.8 years ago by Noushin N ▴ 620

score 1 · Answer 2 · 2011-11-09

You might look into the bedops and bedmap tools in the open-source BEDOPS suite.

The bedops tool performs many kinds of set operations (including intersection and element-of operations) on UCSC BED-formatted data.

The bedmap tool performs calculations on the signal/score information that are in four- or five-column BED files, which are map-ped to a reference BED file.

You may need to convert intermediate formats to BED to use these tools, but they are designed to be fast, accurate and low-profile.

(Disclaimer: I am one of the authors of this suite, but not of bedops or bedmap.)

score 0 · Answer 3 · 2011-11-09

I believe that the groupBy tool might give you what you need. I would transform to BED and stick the CIGAR attribute into a column, for example as the name.

Then once the intersect is done and the reads are listed next to the exons sort the file by the columns that you want to group on (exon coordinates I guess) and run the groupBy tool with the collapse operator.