Question

Double counts with bedtools intersect?

0

Entering edit mode

21 months ago

Zeng Hao ▴ 40

Hi everyone,

I have mapped some PacBio long reads onto an assembled genome and I'm interested in finding out reads that span (encompass) some regions of interest.

To do that, I'm using the following command, which should return reads that cover my ROI coordinates in a .bed file.

bedtools intersect -abam mappedReads.bam -b ROI.bed -F 1.0 -c -bed > ROIreads.bed

What I noticed, however, is that the output always double counts what appears to be a single read; something like this:

contig1 876600 923835 m64269e_211012_003412/93128595/54266_87870 60 + 869186 924692 0,0,0 1 3156, 0,
contig1 876600 923835 m64269e_211012_003412/93128595/54266_87870 60 + 869186 924692 0,0,0 1 3156, 0,

What might be the possible reasons for this output?

One potential source of this problem that I can think of is that I have converted my PacBio subreads data (which comes in .bam and .xml files using samtools instead of the BAM2fastx, which means that it does not reference the native .xml file from the sequencing run. I did not use the recommended BAM2fastx as it was reporting that I have a missing file; but that is a separate issue. However, if this is indeed the problem, please let me know.

Thank you all!

bedtools mapping alignment • 1.4k views

ADD COMMENT • link 20 months ago by Zeng Hao ▴ 40

1

Entering edit mode

Update: The double counting phenomenon is indeed due to an error with the conversion using samtools. Counting with BAM2fastx resolves this issue completely. The .xml file is also unnecessary for the conversion.

ADD REPLY • link 20 months ago by Zeng Hao ▴ 40

score 2 · Accepted Answer · 2024-01-18

Not sure, but hopefully this bedmap/bam2bed combination will help give the answer you're looking for:

bedmap --echo --skip-unmapped --fraction-ref 1 <(bam2bed -R < reads.bam) roi.bed > answer.bed

The bam2bed tool uses samtools to turn the BAM into BED. If you get double-counts via this method, also, then there may be an issue with conversion via samtools, so at the very least it may be a useful diagnostic.

You can also pass a region and a BAM file into samtools to extract reads which overlap the singular region by one or more bases. This is not the 100% threshold you are after, but you might investigate the output to see if you get double reads, so this may be yet another approach to test the fidelity of read conversion via samtools.