Samtools merge BAM files using snakemake
1
0
Entering edit mode
9 months ago
User000 ▴ 440

Hello,

I have two questions regarding merging the BAM files. I have several libraries 4 lanes each. I demultiplexed, mapped against the reference genome, gave the read group name. I am using snakemake and samtools merge.

These are my input files:

Bob_1_CL1.bam
Bob_2_CL1.bam
Bob_3_CL1.bam
Bob_4_CL1.bam
John_1_CL40.bam
John_2_CL40.bam
John_3_CL40.bam
John_4_CL40.bam


This is the code I am trying:

library=["Bob", "John"]
(LIBRARIES, SAMPLES)=glob_wildcards("path/to/{library}_1_{sample}.bam")

rule all:
input:
expand("{library}_{sample}.merged.bam", library=LIBRARIES, sample=SAMPLES)

rule samtools merge:
input:
lane1="path/to/{library}_1_{sample}.bam",
lane2="path/to/{library}_2_{sample}.bam",
lane3="path/to/{library}_3_{sample}.bam",
lane4="path/to/{library}_4_{sample}.bam"
output:
outf = "{library}_{sample}.merged.bam"
shell:
"""/Tools/samtools-1.10/samtools merge -@ {threads} {input.lane1} {input.lane2} {input.lane3} {input.lane4} -o {output.outf}"""


I am trying to merge 4 lanes in one single BAM file in order to get Bob_CL1.merged.bam and John_CL40.merged.bam, the problem is that CL number (NOTE: it is not always CL) is different for different libraries, i.e. CL1 is present in Bob, but not in John, and CL40 is for John but not in Bob. so it is giving me an error:

Missing input files for rule merge:
John_1_CL1.bam
John_2_CL1.bam
John_3_CL1.bam
John_4_CL1.bam


So my questions are:

1. How to solve this?
2. Do I have to merge the lanes in one single BAM file prior to marking duplicates?
samtools snakemake bam • 602 views
2
Entering edit mode
9 months ago
1. If you print the output of expand("{library}_{sample}.merged.bam", library=LIBRARIES, sample=SAMPLES) you'll see the problem. Replace expand() with something like ['{}_{}.merged.bam'.format(x, y) for x, y in zip(LIBRARIES, SAMPLES)]
2. Yes, PCR duplicates are library-level events, so you need to process the entire library at once.
0
Entering edit mode

thanks a lot. I tried with your solution, but this is changing the positions of the file names. Input: Bob_1_Smith_1, Bob_2_Smith_1,Bob_3_Smith_1,Bob_4_Smith_1 to Bob_Smith_1_1, Bob_Smith_2_1,Bob_Smith_3_1,Bob_Smith_4_1 and consequently does not find the file names.

0
Entering edit mode

Your input now looks nothing like the example you gave in the original post. Where is Smith coming from? Print out LIBRARIES and SAMPLES to debug this.

0
Entering edit mode

this are the parents of the RIL populations that were sequenced as well. I have mentioned above that it is not always CL.

0
Entering edit mode

Please post the contents of LIBRARIES and SAMPLES.

0
Entering edit mode

They are a lot. Apart from 4 parents which has this format: Bob_1_Smith_1, Bob_2_Smith_1,Bob_3_Smith_1,Bob_4_Smith_1 I have CL from 1 to 400. I hope this helps, because I do not know if I can post the contents here.

0
Entering edit mode

That shouldn't be either LIBRARIES or SAMPLES. We can't help you further if you can't provide debugging output.

0
Entering edit mode

Sorry, you are right. I printed a small part of it, should be something like this:

['Bob', 'John', 'Arnold']
['CL185', 'CL203', 'CL192', 'CL211', 'Cap_2', 'CL472', 'CL236', 'CL269', 'CL222', 'CL216', 'CL65', 'CL235', 'CL350', 'CL306', 'CL442', 'CL430', 'CL201', 'Smith_4', 'CL459', 'CL177', 'CL478', 'CL171', 'CL94', 'CL274', 'James_3', 'CL398', 'CL50', 'CL1', 'CL248', 'CL389', 'CL254', 'CL301', 'CL394', 'CL411', 'CL133', 'CL265', 'CL377', 'CL368', 'CL69', 'CL386', 'CL104', 'James_4', 'Smith_2']

1
Entering edit mode

Since there isn't a 1:1 relationship between the LIBRARIES and SAMPLES you'll either need to use something other than glob_wildcards or modify the results such that the number of libraries matches the number of samples (you'd repeat the libraries so the lists make sense together).

You may want to restrict the pattern matching in outf so that LIBRARIES can only match Bob, John, and Arnold. That should solve the remaining issues.

0
Entering edit mode

0
Entering edit mode

Here is a place to start with pattern matching in snakemake (and python more generally).

0
Entering edit mode

1
Entering edit mode
0
Entering edit mode

so I just put - instead of _ between the LIBRARIES and SAMPLES in the output file, so it does not confuse.. seems to be working.