Efficiently merge two BAM files while retaining reads from only one file in overlapping regions
1
0
Entering edit mode
2.2 years ago
kiranchari • 0

I have a WGS BAM file that is fairly large (>150GB) and a smaller BAM file (<5GB) with reads in a small 10Mbp region. I want to (efficiently) merge the two BAM files while retaining reads from only the smaller BAM file in the overlapping 10Mbp region.

My current solution is to first use "bedtools intersect" to remove reads overlapping the 10Mbp region from the big bam file, then to merge this new bam file with the smaller bam using samtools merge.

bedtools intersect -abam big.bam -b 10Mbp.bed -v > temp.bam
samtools merge -o output.bam temp.bam small.bam

Is there a more efficient way to do this? The bedtools command alone is taking a long while to run on the 150GB file.

bedtools samtools bam pysam • 1.6k views
ADD COMMENT
1
Entering edit mode
2.2 years ago

i would

bedtools bamtobed -i small.bam | cut -f 1,2,3 | bedtools merge > small.bed
samtools view -M -L small.bed -O BAM -o temp.bam big.bam
samtools index temp.bam # i'm not sure you need this
samtools merge -o output.bam temp.bam small.bam

or (I'm not sure samtools merge uses 'index jumping')

bedtools bamtobed -i small.bam | cut -f 1,2,3 | bedtools merge > small.bed
samtools merge -L small.bed -o output.bam big.bam small.bam

or

bedtools bamtobed -i small.bam | cut -f 1,2,3 | bedtools merge > small.bed

java -jar picard.jar BedToIntervalList \
      I=small.bed \
      O=small.interval_list \
      SD=/path/to/ref.dict

java -jar picard.jar MergeSamFiles \
      INTERVALS=small.interval_list
      I=small.bam \
      I=big.bam \
      O=output.bam
ADD COMMENT
0
Entering edit mode

Thanks for your suggestions. You may have misunderstood my question. I need output.bam to contain all regions in big.bam, not just small.bed. Within the small.bed region, I want reads from small.bam but not big.bam; outside this region I want reads from big.bam.

ADD REPLY
0
Entering edit mode

ah !

bedtools bamtobed -i small.bam | cut -f 1,2,3 | bedtools merge | sort -t $'\t' -k1,1 -k2,2n > small.bed
bedtools complement -i small.bed -g chroms.tsv > complement.bed
samtools view -M -L complement.bed -O BAM -o temp.bam big.bam
samtools merge -o output.bam temp.bam small.bam
ADD REPLY
0
Entering edit mode

Thanks. This ran in about 11 hours. Happy to accept as answer if you would like to edit your original reply above.

ADD REPLY

Login before adding your answer.

Traffic: 1960 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6