Question: Samtools merge BAM files using snakemake
0
gravatar for User000
5 weeks ago by
User000380
User000380 wrote:

Hello,

I have two questions regarding merging the BAM files. I have several libraries 4 lanes each. I demultiplexed, mapped against the reference genome, gave the read group name. I am using snakemake and samtools merge.

These are my input files:

Bob_1_CL1.bam
Bob_2_CL1.bam
Bob_3_CL1.bam
Bob_4_CL1.bam
John_1_CL40.bam
John_2_CL40.bam
John_3_CL40.bam
John_4_CL40.bam

This is the code I am trying:

library=["Bob", "John"]
(LIBRARIES, SAMPLES)=glob_wildcards("path/to/{library}_1_{sample}.bam")

rule all:
    input: 
        expand("{library}_{sample}.merged.bam", library=LIBRARIES, sample=SAMPLES)

rule samtools merge:
    input:
        lane1="path/to/{library}_1_{sample}.bam",
        lane2="path/to/{library}_2_{sample}.bam",
        lane3="path/to/{library}_3_{sample}.bam",
        lane4="path/to/{library}_4_{sample}.bam"
    output:
        outf = "{library}_{sample}.merged.bam"
    threads: 4
    shell:
        """/Tools/samtools-1.10/samtools merge -@ {threads} {input.lane1} {input.lane2} {input.lane3} {input.lane4} -o {output.outf}"""

I am trying to merge 4 lanes in one single BAM file in order to get Bob_CL1.merged.bam and John_CL40.merged.bam, the problem is that CL number (NOTE: it is not always CL) is different for different libraries, i.e. CL1 is present in Bob, but not in John, and CL40 is for John but not in Bob. so it is giving me an error:

Missing input files for rule merge:
John_1_CL1.bam
John_2_CL1.bam
John_3_CL1.bam
John_4_CL1.bam

So my questions are:

  1. How to solve this?
  2. Do I have to merge the lanes in one single BAM file prior to marking duplicates?
snakemake samtools bam • 210 views
ADD COMMENTlink modified 5 weeks ago by Devon Ryan96k • written 5 weeks ago by User000380
2
gravatar for Devon Ryan
5 weeks ago by
Devon Ryan96k
Freiburg, Germany
Devon Ryan96k wrote:
  1. If you print the output of expand("{library}_{sample}.merged.bam", library=LIBRARIES, sample=SAMPLES) you'll see the problem. Replace expand() with something like ['{}_{}.merged.bam'.format(x, y) for x, y in zip(LIBRARIES, SAMPLES)]
  2. Yes, PCR duplicates are library-level events, so you need to process the entire library at once.
ADD COMMENTlink written 5 weeks ago by Devon Ryan96k

thanks a lot. I tried with your solution, but this is changing the positions of the file names. Input: Bob_1_Smith_1, Bob_2_Smith_1,Bob_3_Smith_1,Bob_4_Smith_1 to Bob_Smith_1_1, Bob_Smith_2_1,Bob_Smith_3_1,Bob_Smith_4_1 and consequently does not find the file names.

ADD REPLYlink written 5 weeks ago by User000380

Your input now looks nothing like the example you gave in the original post. Where is Smith coming from? Print out LIBRARIES and SAMPLES to debug this.

ADD REPLYlink written 5 weeks ago by Devon Ryan96k

this are the parents of the RIL populations that were sequenced as well. I have mentioned above that it is not always CL.

ADD REPLYlink written 5 weeks ago by User000380

Please post the contents of LIBRARIES and SAMPLES.

ADD REPLYlink written 5 weeks ago by Devon Ryan96k

They are a lot. Apart from 4 parents which has this format: Bob_1_Smith_1, Bob_2_Smith_1,Bob_3_Smith_1,Bob_4_Smith_1 I have CL from 1 to 400. I hope this helps, because I do not know if I can post the contents here.

ADD REPLYlink written 5 weeks ago by User000380

That shouldn't be either LIBRARIES or SAMPLES. We can't help you further if you can't provide debugging output.

ADD REPLYlink written 5 weeks ago by Devon Ryan96k

Sorry, you are right. I printed a small part of it, should be something like this:

['Bob', 'John', 'Arnold']    
['CL185', 'CL203', 'CL192', 'CL211', 'Cap_2', 'CL472', 'CL236', 'CL269', 'CL222', 'CL216', 'CL65', 'CL235', 'CL350', 'CL306', 'CL442', 'CL430', 'CL201', 'Smith_4', 'CL459', 'CL177', 'CL478', 'CL171', 'CL94', 'CL274', 'James_3', 'CL398', 'CL50', 'CL1', 'CL248', 'CL389', 'CL254', 'CL301', 'CL394', 'CL411', 'CL133', 'CL265', 'CL377', 'CL368', 'CL69', 'CL386', 'CL104', 'James_4', 'Smith_2']
ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by User000380
1

Since there isn't a 1:1 relationship between the LIBRARIES and SAMPLES you'll either need to use something other than glob_wildcards or modify the results such that the number of libraries matches the number of samples (you'd repeat the libraries so the lists make sense together).

You may want to restrict the pattern matching in outf so that LIBRARIES can only match Bob, John, and Arnold. That should solve the remaining issues.

ADD REPLYlink written 5 weeks ago by Devon Ryan96k

Actually, the number of libraries match the number of samples, sorry, my bad. Regarding the second part of your answer, could you give more details please? May be a link where I can find more information? Thanks

ADD REPLYlink written 5 weeks ago by User000380

Here is a place to start with pattern matching in snakemake (and python more generally).

ADD REPLYlink written 5 weeks ago by Devon Ryan96k

Seems link is missing :)

ADD REPLYlink written 4 weeks ago by User000380
1

https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards

Sorry about that!

ADD REPLYlink written 4 weeks ago by Devon Ryan96k

so I just put - instead of _ between the LIBRARIES and SAMPLES in the output file, so it does not confuse.. seems to be working.

ADD REPLYlink written 5 weeks ago by User000380
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1136 users visited in the last hour