Question: Find bams with identical names and merge them
14 days ago by
brendes20
Gloucester, MA
brendes20 wrote:

I'm trying to merge different bam files with the same name (and, theoretically, the same sequence). They are the same set of libraries sequenced at two different times, and I want to merge the matching pairs in order to simulate greater sequencing depth.

I can do this just fine with two bams using picard MergeSamFiles or samtools merge, but the issue is that I have 96 bams in each folder. I'd like to do this programmatically, not manually, and be able to reproduce the process with different datasets in the future.

My hunch says that the simplest way to do this would be to use the bam filenames: loop over my two folders, find the bams that share a filename, and merge each pair into a single output file, but my shell chops are still rough and I am hitting a wall.

I've been starting with getting a list of bams of interest (I've also attempted dumping this list to a file with -fprint0):

find_bams() {
    find "$run_folder" -type f -name "*.bam"
}

Then, what I think I should do is loop over the list:

for i in $(find_bams)
    do
        s=$(basename "$i" .bam)
        picard MergeSamFiles I="$i" O="$s".bam
    done

This is where I get stuck. First, there need to be two input files to merge, and second, those two inputs must have matching filenames.

This is more of a shell scripting problem than a bioinformatics software problem, but I imagine I'm not the only one who has had to do this. Any help would be greatly appreciated.

Edit: Solved with the help of h.mon and finswimmer; see comments

bam sam samtools picard shell • 168 views
ADD COMMENTlink modified 12 days ago by cpad011210k • written 14 days ago by brendes20

People probably don't want to write this for you...why don't you share what you have already, and why it's not working?

ADD REPLYlink written 14 days ago by swbarnes24.7k

Definitely not my intention to have people write it for me. I updated the post with what I've got. Thanks for following up.

ADD REPLYlink written 14 days ago by brendes20
13 days ago by
h.mon22k
Brazil
h.mon22k wrote:

Unless you want to write a more general script to tackle the problem in a more robust fashion, I would try a simpler approach. If the sets of bams have exactly the same names (as you said they have), I would do something like:

cd /path/to/bams1
for i in *.bam
do
    picard MergeSamFiles I=/path/to/bams1/"$i" I=/path/to/bams2/"$i" O=/path/to/merged/"$i"
done

If you keep the same structure of folders and filenames for future datasets, this simple solution will work just fine.
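
If the two folders might not match exactly, a guarded variant of the loop above could look like the sketch below. It only prints the commands instead of running them, and the function name `print_merge_cmds` and its three arguments are illustrative assumptions, not part of the original answer.

```shell
# Sketch only: print one picard command per BAM that exists in both folders.
# print_merge_cmds, dir1, dir2 and outdir are hypothetical names.
print_merge_cmds() {
    local dir1="$1" dir2="$2" outdir="$3"
    local f name
    for f in "$dir1"/*.bam; do
        [ -e "$f" ] || continue              # the glob matched nothing
        name=$(basename "$f")
        if [ -f "$dir2/$name" ]; then
            # Both copies exist: emit the merge command for this pair.
            printf 'picard MergeSamFiles I=%s I=%s O=%s\n' \
                "$dir1/$name" "$dir2/$name" "$outdir/$name"
        else
            # No partner in the second folder: report it and move on.
            printf 'skipping %s: no partner in %s\n' "$name" "$dir2" >&2
        fi
    done
}
```

Calling `print_merge_cmds /path/to/bams1 /path/to/bams2 /path/to/merged` lists the commands for review. Once they look right, the `printf` in the first branch could be swapped for the real picard call.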

ADD COMMENTlink written 13 days ago by h.mon22k

I adapted this to work for my situation:

folder1=$1
folder2=$2

# Since I expect both folders to have the same filenames,
# checking just the first folder should be fine.
find_ds_bams () {
    find "$folder1"/*/full -type f -name "*.bam" -exec basename {} \;
}

for i in $(find_ds_bams)
do
    s=$(basename "$i" .bam)
    $picardcmd MergeSamFiles \
        I="$(readlink -f "$folder1")/$s/full/$i" \
        I="$(readlink -f "$folder2")/$s/full/$i" \
        O="./merged/$s.merged.bam"
done

I must have been thinking too much. Thanks a lot!

ADD REPLYlink written 13 days ago by brendes20
13 days ago by
finswimmer8.9k
Germany
finswimmer8.9k wrote:

You can try this:

$ find /path/to/run_folder/ -type f -name "*.bam" \
| awk '{n=split($0, filename, "/"); input[filename[n]] = input[filename[n]]$0" "} END { for(file in input) {gsub(/ $/, "", input[file]); print file" "input[file]}}' \
| parallel --colsep " " 'samtools merge {}'

fin swimmer

ADD COMMENTlink written 13 days ago by finswimmer8.9k

Thank you for your response. I found a working solution via the shell, but will explore awk further.

ADD REPLYlink modified 13 days ago • written 13 days ago by brendes20
12 days ago by
cpad011210k
India
cpad011210k wrote:
 $ ls test1/*.bam
test1/bam1.bam  test1/bam2.bam  test1/bam3.bam  test1/bam4.bam  test1/bam8.bam

 $ ls test2/*.bam
test2/bam1.bam  test2/bam2.bam  test2/bam3.bam  test2/bam4.bam  test2/bam5.bam

 $ find test1 test2 -type f -name '*.bam' | xargs basename -a | sort | uniq -d | parallel --dry-run picard MergeSamFiles I=test1/{} I=test2/{} O={.}.bam

picard MergeSamFiles I=test1/bam1.bam I=test2/bam1.bam O=bam1.bam
picard MergeSamFiles I=test1/bam2.bam I=test2/bam2.bam O=bam2.bam
picard MergeSamFiles I=test1/bam3.bam I=test2/bam3.bam O=bam3.bam
picard MergeSamFiles I=test1/bam4.bam I=test2/bam4.bam O=bam4.bam
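
In the dry run above, bam5.bam and bam8.bam drop out silently because each exists in only one folder. A small complementary check, sketched here with a hypothetical function name `unpaired_bams`, lists those leftover names. It assumes no file name repeats within a single folder.

```shell
# Sketch only: list BAM names that occur in just one of the two folders.
# unpaired_bams is a hypothetical helper name.
unpaired_bams() {
    find "$1" "$2" -type f -name '*.bam' |
        sed 's#.*/##' |    # strip the directory part, keep the file name
        sort |
        uniq -u            # keep names that appear exactly once
}
```

With the example folders above, `unpaired_bams test1 test2` would print bam5.bam and bam8.bam, one per line.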
ADD COMMENTlink modified 12 days ago • written 12 days ago by cpad011210k