Question

Samtools merge BAM files using snakemake

1

Entering edit mode

3.7 years ago

User000 ▴ 690

Hello,

I have two questions regarding merging the BAM files. I have several libraries 4 lanes each. I demultiplexed, mapped against the reference genome, gave the read group name. I am using snakemake and samtools merge.

These are my input files:

Bob_1_CL1.bam
Bob_2_CL1.bam
Bob_3_CL1.bam
Bob_4_CL1.bam
John_1_CL40.bam
John_2_CL40.bam
John_3_CL40.bam
John_4_CL40.bam

This is the code I am trying:

library=["Bob", "John"]
(LIBRARIES, SAMPLES)=glob_wildcards("path/to/{library}_1_{sample}.bam")

rule all:
    input: 
        expand("{library}_{sample}.merged.bam", library=LIBRARIES, sample=SAMPLES)

rule samtools merge:
    input:
        lane1="path/to/{library}_1_{sample}.bam",
        lane2="path/to/{library}_2_{sample}.bam",
        lane3="path/to/{library}_3_{sample}.bam",
        lane4="path/to/{library}_4_{sample}.bam"
    output:
        outf = "{library}_{sample}.merged.bam"
    threads: 4
    shell:
        """/Tools/samtools-1.10/samtools merge -@ {threads} {input.lane1} {input.lane2} {input.lane3} {input.lane4} -o {output.outf}"""

I am trying to merge 4 lanes in one single BAM file in order to get Bob_CL1.merged.bam and John_CL40.merged.bam, the problem is that CL number (NOTE: it is not always CL) is different for different libraries, i.e. CL1 is present in Bob, but not in John, and CL40 is for John but not in Bob. so it is giving me an error:

Missing input files for rule merge:
John_1_CL1.bam
John_2_CL1.bam
John_3_CL1.bam
John_4_CL1.bam

So my questions are:

How to solve this?
Do I have to merge the lanes in one single BAM file prior to marking duplicates?

samtools snakemake bam • 2.5k views

ADD COMMENT • link updated 3.7 years ago by Devon Ryan 104k • written 3.7 years ago by User000 ▴ 690

score 2 · Accepted Answer · 2020-08-13

2

Entering edit mode

3.7 years ago

Devon Ryan 104k

If you print the output of expand("{library}_{sample}.merged.bam", library=LIBRARIES, sample=SAMPLES) you'll see the problem. Replace expand() with something like ['{}_{}.merged.bam'.format(x, y) for x, y in zip(LIBRARIES, SAMPLES)]
Yes, PCR duplicates are library-level events, so you need to process the entire library at once.

ADD COMMENT • link 3.7 years ago by Devon Ryan 104k

0

Entering edit mode

thanks a lot. I tried with your solution, but this is changing the positions of the file names. Input: Bob_1_Smith_1, Bob_2_Smith_1,Bob_3_Smith_1,Bob_4_Smith_1 to Bob_Smith_1_1, Bob_Smith_2_1,Bob_Smith_3_1,Bob_Smith_4_1 and consequently does not find the file names.

ADD REPLY • link 3.7 years ago by User000 ▴ 690

0

Entering edit mode

Your input now looks nothing like the example you gave in the original post. Where is Smith coming from? Print out LIBRARIES and SAMPLES to debug this.

ADD REPLY • link 3.7 years ago by Devon Ryan 104k

0

Entering edit mode

this are the parents of the RIL populations that were sequenced as well. I have mentioned above that it is not always CL.

ADD REPLY • link 3.7 years ago by User000 ▴ 690

0

Entering edit mode

Please post the contents of LIBRARIES and SAMPLES.

ADD REPLY • link 3.7 years ago by Devon Ryan 104k

0

Entering edit mode

They are a lot. Apart from 4 parents which has this format: Bob_1_Smith_1, Bob_2_Smith_1,Bob_3_Smith_1,Bob_4_Smith_1 I have CL from 1 to 400. I hope this helps, because I do not know if I can post the contents here.

ADD REPLY • link 3.7 years ago by User000 ▴ 690

0

Entering edit mode

That shouldn't be either LIBRARIES or SAMPLES. We can't help you further if you can't provide debugging output.

ADD REPLY • link 3.7 years ago by Devon Ryan 104k

0

Entering edit mode

Sorry, you are right. I printed a small part of it, should be something like this:

['Bob', 'John', 'Arnold']    
['CL185', 'CL203', 'CL192', 'CL211', 'Cap_2', 'CL472', 'CL236', 'CL269', 'CL222', 'CL216', 'CL65', 'CL235', 'CL350', 'CL306', 'CL442', 'CL430', 'CL201', 'Smith_4', 'CL459', 'CL177', 'CL478', 'CL171', 'CL94', 'CL274', 'James_3', 'CL398', 'CL50', 'CL1', 'CL248', 'CL389', 'CL254', 'CL301', 'CL394', 'CL411', 'CL133', 'CL265', 'CL377', 'CL368', 'CL69', 'CL386', 'CL104', 'James_4', 'Smith_2']

ADD REPLY • link 3.7 years ago by User000 ▴ 690

1

Entering edit mode

Since there isn't a 1:1 relationship between the LIBRARIES and SAMPLES you'll either need to use something other than glob_wildcards or modify the results such that the number of libraries matches the number of samples (you'd repeat the libraries so the lists make sense together).

You may want to restrict the pattern matching in outf so that LIBRARIES can only match Bob, John, and Arnold. That should solve the remaining issues.

ADD REPLY • link 3.7 years ago by Devon Ryan 104k

0

Entering edit mode

Actually, the number of libraries match the number of samples, sorry, my bad. Regarding the second part of your answer, could you give more details please? May be a link where I can find more information? Thanks

ADD REPLY • link 3.7 years ago by User000 ▴ 690

0

Entering edit mode

Here is a place to start with pattern matching in snakemake (and python more generally).

ADD REPLY • link 3.7 years ago by Devon Ryan 104k

0

Entering edit mode

Seems link is missing :)

ADD REPLY • link 3.7 years ago by User000 ▴ 690

1

Entering edit mode

https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards

Sorry about that!

ADD REPLY • link 3.7 years ago by Devon Ryan 104k

0

Entering edit mode

so I just put - instead of _ between the LIBRARIES and SAMPLES in the output file, so it does not confuse.. seems to be working.

ADD REPLY • link 3.7 years ago by User000 ▴ 690