Question: Reporting The Bam Reads Overlapping A Set Of Intervals With Bedtools
2
gravatar for User 9996
7.2 years ago by
User 9996790
User 9996790 wrote:

I am trying to use bedtools to pull out the reads falling directly within a set of BED coordinates. While this command does it successfully:

intersectBed -abam mybam.bam -b intervals.gff -wa -wb -f 1 | coverageBed -abam stdin -b intervals.gff

I find that it loses key information that I need. I'd like to get a listing of the BAM reads -- getting at least their ID -- split by exon. In other words, all the read IDs that fall into the first interval in intervals.gff, all the read IDs that fall into the second interval in intervals.gff... ideally, it would also report the CIGAR string for these reads, but I'd settle for just the ID.

Is there a way to report these reads, such that it's easy to tell from the output which set of reads landed in a given interval in the input BED file?

Thanks you.

ADD COMMENTlink written 7.2 years ago by User 9996790
3
gravatar for Aaronquinlan
7.2 years ago by
Aaronquinlan10k
United States
Aaronquinlan10k wrote:

A memory-efficient way to do this is to use the new -sorted option in intersectBed. This requires your GFF and BAM files to be sorted by chrom then start. For the BAM it would be:

samtools sort mybam.bam mybam.sort

For the GFF it would be:

sort -k1,1 -k4,4n intervals.gff > intervals.sort.gff

The -sorted option in bedtools does not yet support BAM (work in progress), but you could just use bamToBed in a FIFO to get around this:

intersectBed -a intervals.gff -b <(bamToBed -i mybam.sort.bam) -sorted -wa -wb

This will report, for each interval in the GFF, a new line for each overlapping read. If you wanted to denormalize (collapse) this output to one line per interval with a list of read Ids, you could use the groupBy command as Istvan mentions. Something like below will generate a comma separated list of read ids (column 13 of the output) for each interval.

intersectBed -a intervals.gff -b <(bamToBed -i mybam.sort.bam) -sorted -wa -wb | \
      groupBy -g 1,2,3,4,5,6,7,8,9 -c 13 -o collapse
ADD COMMENTlink written 7.2 years ago by Aaronquinlan10k

Wow, groupBy really needs short-cuts like the -g 1-9.

ADD REPLYlink written 7.2 years ago by Aaronquinlan10k

Hi Aaron, I noticed that you have answered quite a few questions in this focus area. So, I thought may be I can post my question here. I am interested in extracting the number of paired-end reads aligned to each member of a large set of regions. I have used BEDTools/IntersectBED tool so far; it works well and returns reads overlapping my regions. But I am only interested in their count; and given the fact that I have quite a large number of regions, the output of intersectBED will be pretty large. Is there a way to prompt intersectBED to only return counts?

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Noushin N530
1

Have you tried the -c option? http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html#c-reporting-the-number-of-overlapping-features

ADD REPLYlink written 5.3 years ago by Aaronquinlan10k

Thank you so much for the very quick reply! I browsed biostars and SeqAnswers since posting the initial question, and came across coverageBED. Do you know the pros/cons of using -c option vs coverageBED? will they return identical results? and is there a way to make them return the number of paired-end reads as opposed to counting each member of pair once? [ presumably, these counts should be ~2x what I am looking for; but just to be safe!]

ADD REPLYlink written 5.3 years ago by Noushin N530
1
gravatar for Alex Reynolds
7.2 years ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

You might look into the bedops and bedmap tools in the open-source BEDOPS suite.

The bedops tool performs many kinds of set operations (including intersection and element-of operations) on UCSC BED-formatted data.

The bedmap tool performs calculations on the signal/score information that are in four- or five-column BED files, which are map-ped to a reference BED file.

You may need to convert intermediate formats to BED to use these tools, but they are designed to be fast, accurate and low-profile.

(Disclaimer: I am one of the authors of this suite, but not of bedops or bedmap.)

ADD COMMENTlink written 7.2 years ago by Alex Reynolds27k
0
gravatar for Istvan Albert
7.2 years ago by
Istvan Albert ♦♦ 78k
University Park, USA
Istvan Albert ♦♦ 78k wrote:

I believe that the groupBy tool might give you what you need. I would transform to BED and stick the CIGAR attribute into a column, for example as the name.

Then once the intersect is done and the reads are listed next to the exons sort the file by the columns that you want to group on (exon coordinates I guess) and run the groupBy tool with the collapse operator.

ADD COMMENTlink written 7.2 years ago by Istvan Albert ♦♦ 78k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1232 users visited in the last hour