Question: Best Way To Merge A Many Thousand Small Bam Files Into One Big Bam File?
18
8.0 years ago by
2184687-1231-83-4.9k wrote:

I've got a few thousand small bam files produced against the exact same reference, and I want to merge them into one single big bam file. What is the best way to do that?

Should I do this iteratively or can I pass a long list of bam files to samtools/picard/etc in one go?

Edited, since this is now partially solved. In my terminal, the methods below work for up to 4092 files. More than that raises an error:

samtools merge all.bam *.bam
samtools merge all.bam `find /basedir/ -name "*myfiles*.bam"`
samtools merge all.bam /basedir/*/???/*myfiles*.bam
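A ceiling near 4092 files could come from either the command-line length limit or the per-process open-files limit. The snippet below is only a quick diagnostic sketch that prints both values for the current shell; it does not say which of the two samtools actually hit.

```shell
# Print two limits that can cap how many files a single command accepts.
arg_max=$(getconf ARG_MAX)   # maximum bytes for arguments plus environment
open_max=$(ulimit -n)        # maximum open file descriptors per process
echo "ARG_MAX=${arg_max} bytes, open files=${open_max}"
```

If the open-files value is close to the number of files that still works, raising it with ulimit -n (where permitted) may be worth trying before restructuring the merge.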
merge picard samtools bam • 52k views
ADD COMMENTlink modified 5 months ago by mmfansler310 • written 8.0 years ago by 2184687-1231-83-4.9k

I have 1679 sorted bam files named sorted.bam.0000.bam to sorted.bam.1679.bam. How do I merge them all into a single sorted bam file? Could you please give me a script that uses my file names as an example? Thank you

ADD REPLYlink written 3.0 years ago by amazingworldpr0
14
6.2 years ago by
Erik Garrison2.1k
Somerville, MA
Erik Garrison2.1k wrote:

Bamtools handles this very cleanly. Note that this properly constructs the header, which might matter if you have different read groups in each file.

bamtools merge -list files.bamlist -out merged.bam

You can also supply a region via the -region flag, using the format 12:2232..3328.
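The list file is assumed here to be plain text with one BAM path per line. A minimal sketch for building it follows; make_bamlist is a hypothetical helper name, and the -list option itself may depend on your bamtools version.

```shell
# make_bamlist DIR LIST: write every *.bam path under DIR to LIST, one per line.
make_bamlist() {
    find "$1" -type f -name '*.bam' | sort > "$2"
}
```

For example, make_bamlist /basedir files.bamlist followed by the bamtools merge command above.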

ADD COMMENTlink written 6.2 years ago by Erik Garrison2.1k
1

That one worked perfectly. It just included all @RG lines with their respective IDs. No downstream errors (at least for now)

ADD REPLYlink written 5.7 years ago by mazzottidr30
1

which one?? That one worked perfectly??

ADD REPLYlink written 3.3 years ago by Shicheng Guo7.5k

Hi!

I have 950 bam files to merge, so your solution interested me very much! But my bamtools merge program does not have a -list option. Which version of bamtools do you have ?

Thank you

Maria

 

ADD REPLYlink written 5.0 years ago by maria.bernard13010

13
8.0 years ago by
Stefano Berri4.1k
Cambridge, UK
Stefano Berri4.1k wrote:

Both are possible, but the second is probably the better and easier option.

Put all your bam files in a folder (or create a folder with symbolic links to all the bam files you want to merge) and then run

samtools merge finalBamFile.bam *.bam

see here http://samtools.sourceforge.net/samtools.shtml for options about samtools (header and so on)

Editing to address the issue with many files.

You can take a three-step approach.

First you print the header

 samtools view -H bar/foo/anyFile.bam > allSeq.sam

then you append sequences from all files using find

 find ./ -name \*.bam -exec samtools view {} \; >> allSeq.sam

and finally convert to bam

samtools view -bh allSeq.sam > allSeq.bam

I can't test it at the moment, but it should work.

ADD COMMENTlink modified 8.0 years ago • written 8.0 years ago by Stefano Berri4.1k
4

man xargs

That should get you around the 4092 files problem (which will be a command line length limit in your shell, if I understand things correctly)
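As an alternative to xargs, here is a two-stage sketch that keeps file names out of the command line entirely. It assumes your samtools build supports merge -b (a file listing input BAMs, one per line) and that no path contains a newline; merge_in_batches is a hypothetical helper name.

```shell
# merge_in_batches DIR BATCH_SIZE OUT
# Stage 1 merges BATCH_SIZE files at a time; stage 2 merges the batch results.
merge_in_batches() {
    dir=$1
    size=$2
    out=$3
    tmp=$(mktemp -d) || return 1
    find "$dir" -type f -name '*.bam' | sort > "$tmp/all.list"
    split -l "$size" "$tmp/all.list" "$tmp/chunk."     # chunk.aa, chunk.ab, ...
    for chunk in "$tmp"/chunk.*; do
        samtools merge -b "$chunk" "$chunk.bam" || return 1
        echo "$chunk.bam" >> "$tmp/stage2.list"
    done
    samtools merge -b "$tmp/stage2.list" "$out" && rm -r "$tmp"
}
```

Each merge still opens every file in its list at once, so BATCH_SIZE should stay below the open-files limit. Something like merge_in_batches /basedir 1000 all.bam would be a reasonable first try; write OUT outside DIR so a rerun does not pick it up as an input.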

ADD REPLYlink written 7.9 years ago by Alaw90
2

But make sure to use backticks around the find statement. They got scrubbed from the comment for some reason

ADD REPLYlink written 8.0 years ago by Docroberson280
1

Perhaps more like this: samtools merge all.bam find /mydirs/ -name *.bam

ADD REPLYlink written 8.0 years ago by Docroberson280

Thanks very much. I don't have them on the same dir, but this works: ~/samtools merge all.bam ~/mydirs/??/??/??/mybam.*.bam

ADD REPLYlink written 8.0 years ago by 2184687-1231-83-4.9k

@DocRoberson: it seems the find is not actually needed, because the shell already expands the glob pattern, at least for a few thousand files.

ADD REPLYlink written 8.0 years ago by 2184687-1231-83-4.9k

@DocRoberson: it seems the find is not actually needed, because the shell already expands the glob pattern. It works for up to 4092 files in my terminal.

ADD REPLYlink written 8.0 years ago by 2184687-1231-83-4.9k

9
7.9 years ago by
Ryan Thompson3.4k
TSRI, La Jolla, CA
Ryan Thompson3.4k wrote:

This should work on an arbitrary number of bam files (including more than 4096).

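find "$BAM_DIR" -name '*.bam' | {
    read firstbam                      # first path from find; its header is reused for the output
    samtools view -h "$firstbam"       # header plus reads
    while read bam; do                 # 'bam' holds each remaining path, one per line
        samtools view "$bam"           # reads only, no header
    done
} | samtools view -ubS - | samtools sort - merged   # samtools 0.1.x sort syntax
samtools index merged.bam
ls -l merged.bam merged.bam.bai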
ADD COMMENTlink written 7.9 years ago by Ryan Thompson3.4k
1

This is the best technique... samtools merge is wonky/segfaults sometimes.

ADD REPLYlink written 6.6 years ago by earonesty220
1

the most amazing bash script I have ever seen. It's amazing

ADD REPLYlink written 3.5 years ago by Shicheng Guo7.5k
1

In line 4, 'while read bam': can you explain what 'bam' is in this context?

ADD REPLYlink written 3.3 years ago by Shicheng Guo7.5k

6
8.0 years ago by
Frenkiboy240
Frenkiboy240 wrote:

If you do not need the final bam file to be sorted, the fastest way would be samtools cat:

samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [...]
ADD COMMENTlink written 8.0 years ago by Frenkiboy240
1

What's the role of header.sam here? How do I create this sam file?

ADD REPLYlink written 3.5 years ago by Shicheng Guo7.5k

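It supplies the SAM header that is written at the top of the concatenated output. You can create the header with samtools view -H (capital H prints only the header; lowercase -h prints the header followed by the reads).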
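Put together, a header-only SAM file can be written with samtools view -H and then passed to samtools cat. The wrapper below is a minimal sketch; cat_bams is a hypothetical name, and it assumes the header of the first input is appropriate for all of them.

```shell
# cat_bams OUT IN1 [IN2 ...]: concatenate BAMs, reusing the header of IN1.
cat_bams() {
    out=$1
    shift
    samtools view -H "$1" > "$out.header.sam" || return 1   # header only
    samtools cat -h "$out.header.sam" -o "$out" "$@"
}
```

Usage would look like cat_bams out.bam in1.bam in2.bam. The result is not re-sorted, so it is only coordinate-sorted if the inputs were supplied in a suitable order.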

ADD REPLYlink written 3 months ago by Kevin Blighe43k
1
3.3 years ago by
Shicheng Guo7.5k
Shicheng Guo7.5k wrote:

Sharing a Perl script to merge separate BAMs by chromosome (chr1-chr2,chrX,chrY,chrM)

#!/usr/bin/perl
# Merge BAM files per chromosome.
# BAM dir: /home/sguo/bam
# OUT dir: /home/sguo/mergeBam
use strict;
use warnings;
use Cwd;
my $bamdir="/home/sguo/bam";
chdir $bamdir or die "Cannot chdir to $bamdir: $!";
my @file=glob("*bam");
my $outdir="/home/sguo/mergeBam";
my %sam;
# Collect the distinct prefixes (the part of each file name before the first dot).
foreach my $file(@file){
    my ($sam,undef)=split /\./,$file;
    $sam{$sam}=$sam;
}
# For each prefix, write a shell script that builds a header, concatenates
# the matching BAMs and indexes the result, then launch it in the background.
foreach my $sam(sort keys %sam){
    my $script="$outdir/$sam.bam.merge.sh";
    open(my $out, ">", $script) or die "Cannot write $script: $!";
    print $out "cd $bamdir\n";
    my @file=glob("$sam*bam");
    my $file=join(" ",@file);
    print "$sam.bam.merge.sh\n";
    print $out "samtools view -H $file[0]>$outdir/$sam.header\n";
    print $out "samtools cat -h $outdir/$sam.header -o $outdir/$sam.bam $file\n";
    print $out "samtools index $outdir/$sam.bam\n";
    close $out;
    system("sh $script &");
}
ADD COMMENTlink written 3.3 years ago by Shicheng Guo7.5k
3

a perl script generating a bash script calling a C program ?

ADD REPLYlink written 3.3 years ago by Pierre Lindenbaum120k
1

I would still rely on a simple loop in bash rather than write an entire Perl script for this, provided a bash script can read all my files in one go and merge them. I am not considering the computational cost (memory and time) here, but I believe it should be lower.

ADD REPLYlink written 3.3 years ago by ivivek_ngs4.8k
1

I like Ryan Thompson's bash script, but it reports some warnings in my case and I don't know why. Has anybody else seen any warnings?

ADD REPLYlink written 3.3 years ago by Shicheng Guo7.5k
1

It would help to know what the warnings were ...

ADD REPLYlink written 3.3 years ago by george.ry1.1k

[Image: screenshot of the warning output; not preserved]

ADD REPLYlink written 3.3 years ago by Shicheng Guo7.5k
1

The post that you copied is >4 years old, so it's using the syntax from samtools 0.1. The latest versions of samtools use a different syntax for samtools sort ... which is exactly what the error says.

Change

samtools sort - merged

to

samtools sort -o merged.bam -

You will probably also want to include some multithreading (-@), so read that usage information carefully.
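For reference, the whole pipeline with that change applied might look like the sketch below. It assumes a samtools 1.x build that auto-detects SAM on standard input (so the intermediate view -ubS step is dropped); stream_merge and its argument order are made up for illustration.

```shell
# stream_merge DIR OUT [THREADS]: stream all BAMs under DIR into one sorted, indexed OUT.
stream_merge() {
    bam_dir=$1
    out=$2
    threads=${3:-1}
    find "$bam_dir" -type f -name '*.bam' | {
        IFS= read -r first || exit 1
        samtools view -h "$first"          # header plus reads from the first file
        while IFS= read -r bam; do
            samtools view "$bam"           # reads only from every other file
        done
    } | samtools sort -@ "$threads" -o "$out" - || return 1
    samtools index "$out"
}
```

Something like stream_merge "$BAM_DIR" /some/other/dir/merged.bam 4 would run it; keep OUT outside DIR so that a rerun does not feed the previous result back in as an input.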

ADD REPLYlink written 3.3 years ago by george.ry1.1k

I noticed it as soon as I posted. Shame on me...

ADD REPLYlink written 3.3 years ago by Shicheng Guo7.5k

I have 1679 sorted bam files named sorted.bam.0000.bam to sorted.bam.1679.bam. How do I merge them all into a single sorted bam file? Could you please give me a script that uses my file names as an example? Thank you

ADD REPLYlink written 3.0 years ago by amazingworldpr0

If those are pieces from the same bam file that are created during sorting then they should be automatically deleted once the sort process is complete. If they are still around then it is likely that your sorting did not complete properly. You can delete the pieces and then redo the sorting.

ADD REPLYlink written 3.0 years ago by genomax68k
1
5 months ago by
mmfansler310
MSKCC | New York, NY
mmfansler310 wrote:

Two-Stage Multithreaded Version for Sorted BAMs

While this thread already has some great answers, I wanted to suggest a parallelized version that is robust to open file limits (e.g., > 4096 files). This requires GNU parallel.

Code

find $BAM_DIR -name '*.bam' |
  parallel -j8 -N4095 -m --files samtools merge -u - |
  parallel --xargs samtools merge -@8 merged.bam {}";" rm {}

Overview

This will take all BAM files in $BAM_DIR and run eight (-j8) separate single-threaded merge operations, with the input files (mostly) equally distributed among the different jobs. This results in temporary files which are then merged into merged.bam in a multithreaded operation. The temporary files are deleted at the end.

Options

One need not keep the number of simultaneous merge operations in the first round of merging (-j8) in correspondence with the number of threads used for the second round (-@8). It's likely the first round will be bottlenecked by too much simultaneous writing, so you may want to keep that lower.

Use the -N flag to change the maximum number of arguments to be given to each first round merge operation. Here 4095 is just the common open files limit minus one (for the output file).

The -u flag is there so the temporary files will be uncompressed, since we're deleting them in the end. That can be removed if you have concerns about storage space for the temp files.
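Rather than hard-coding 4095 for -N, the value could be derived from the current shell's limit. This is a rough sketch of the "limit minus one" rule described above; max_merge_inputs is a hypothetical helper, and the 4095 fallback for a non-numeric limit (such as "unlimited") is an arbitrary choice.

```shell
# max_merge_inputs: print the open-files limit minus one, for use as parallel's -N.
max_merge_inputs() {
    limit=$(ulimit -n)
    case $limit in
        ''|*[!0-9]*) echo 4095 ;;             # not a number, e.g. "unlimited"
        *)           echo $((limit - 1)) ;;
    esac
}
```

It would be used as -N"$(max_merge_inputs)". Since each process also holds descriptors for standard input, output and error, leaving a few more spare may be safer.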

ADD COMMENTlink written 5 months ago by mmfansler310
2

Really elegant :) Still, as the temporary files that are created (*.par) will be stored in memory (/tmp) running this on many files might create memory issues, especially as the intermediates are uncompressed. I did a quick test with 5000 files, 64000 reads (50bp) each and this gave about 3Gb per intermediate file after the first merge, so this will for sure overload the memory once the files get larger. It might be better to store the intermediate files after the first merge to disk to reduce the number of files and then run the second merge separately.

ADD REPLYlink modified 5 months ago • written 5 months ago by ATpoint17k
1

Yes, good point! That is effectively what I do in practice. GNU parallel respects the TMPDIR environment variable, so if you have that specified to a local scratch disk, that's where the intermediates will go.
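A minimal sketch of that setup, to be run in the same shell before the parallel pipeline. use_scratch is a hypothetical helper and the directory you pass is whatever scratch location your system provides.

```shell
# use_scratch DIR: create DIR if needed and export it as TMPDIR for later commands.
use_scratch() {
    mkdir -p "$1" || return 1
    TMPDIR=$1
    export TMPDIR
}
```

After calling use_scratch with a path on the local scratch disk, the intermediate files from the first merge round should land there instead of /tmp.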

ADD REPLYlink written 5 months ago by mmfansler310