Hi everyone,
I have mapped some PacBio long reads onto an assembled genome and I'm interested in finding out reads that span (encompass) some regions of interest.
To do that, I'm using the following command, which should return reads that cover my ROI coordinates in a .bed file.
bedtools intersect -abam mappedReads.bam -b ROI.bed -F 1.0 -c -bed > ROIreads.bed
What I noticed, however, is that the output always double counts what appears to be a single read; something like this:
contig1 876600 923835 m64269e_211012_003412/93128595/54266_87870 60 + 869186 924692 0,0,0 1 3156, 0,
contig1 876600 923835 m64269e_211012_003412/93128595/54266_87870 60 + 869186 924692 0,0,0 1 3156, 0,
What might be the possible reasons for this output?
One potential source of this problem that I can think of is that I have converted my PacBio subreads data (which comes in .bam and .xml files using samtools instead of the BAM2fastx, which means that it does not reference the native .xml file from the sequencing run. I did not use the recommended BAM2fastx as it was reporting that I have a missing file; but that is a separate issue. However, if this is indeed the problem, please let me know.
Thank you all!
Update: The double counting phenomenon is indeed due to an error with the conversion using
samtools. Counting withBAM2fastxresolves this issue completely. The.xmlfile is also unnecessary for the conversion.