Hello,
I have two questions regarding merging the BAM
files. I have several libraries 4 lanes each. I demultiplexed, mapped against the reference genome, gave the read group name. I am using snakemake
and samtools merge
.
These are my input files:
Bob_1_CL1.bam
Bob_2_CL1.bam
Bob_3_CL1.bam
Bob_4_CL1.bam
John_1_CL40.bam
John_2_CL40.bam
John_3_CL40.bam
John_4_CL40.bam
This is the code I am trying:
library=["Bob", "John"]
(LIBRARIES, SAMPLES)=glob_wildcards("path/to/{library}_1_{sample}.bam")
rule all:
input:
expand("{library}_{sample}.merged.bam", library=LIBRARIES, sample=SAMPLES)
rule samtools merge:
input:
lane1="path/to/{library}_1_{sample}.bam",
lane2="path/to/{library}_2_{sample}.bam",
lane3="path/to/{library}_3_{sample}.bam",
lane4="path/to/{library}_4_{sample}.bam"
output:
outf = "{library}_{sample}.merged.bam"
threads: 4
shell:
"""/Tools/samtools-1.10/samtools merge -@ {threads} {input.lane1} {input.lane2} {input.lane3} {input.lane4} -o {output.outf}"""
I am trying to merge 4 lanes in one single BAM
file in order to get Bob_CL1.merged.bam
and John_CL40.merged.bam
, the problem is that CL number (NOTE: it is not always CL) is different for different libraries, i.e. CL1 is present in Bob, but not in John, and CL40 is for John but not in Bob. so it is giving me an error:
Missing input files for rule merge:
John_1_CL1.bam
John_2_CL1.bam
John_3_CL1.bam
John_4_CL1.bam
So my questions are:
- How to solve this?
- Do I have to merge the lanes in one single
BAM
file prior to marking duplicates?
thanks a lot. I tried with your solution, but this is changing the positions of the file names.
Input: Bob_1_Smith_1, Bob_2_Smith_1,Bob_3_Smith_1,Bob_4_Smith_1
toBob_Smith_1_1, Bob_Smith_2_1,Bob_Smith_3_1,Bob_Smith_4_1
and consequently does not find the file names.Your input now looks nothing like the example you gave in the original post. Where is
Smith
coming from? Print outLIBRARIES
andSAMPLES
to debug this.this are the parents of the RIL populations that were sequenced as well. I have mentioned above that it is not always CL.
Please post the contents of
LIBRARIES
andSAMPLES
.They are a lot. Apart from 4 parents which has this format:
Bob_1_Smith_1, Bob_2_Smith_1,Bob_3_Smith_1,Bob_4_Smith_1
I have CL from 1 to 400. I hope this helps, because I do not know if I can post the contents here.That shouldn't be either LIBRARIES or SAMPLES. We can't help you further if you can't provide debugging output.
Sorry, you are right. I printed a small part of it, should be something like this:
Since there isn't a 1:1 relationship between the LIBRARIES and SAMPLES you'll either need to use something other than
glob_wildcards
or modify the results such that the number of libraries matches the number of samples (you'd repeat the libraries so the lists make sense together).You may want to restrict the pattern matching in outf so that LIBRARIES can only match Bob, John, and Arnold. That should solve the remaining issues.
Actually, the number of libraries match the number of samples, sorry, my bad. Regarding the second part of your answer, could you give more details please? May be a link where I can find more information? Thanks
Here is a place to start with pattern matching in snakemake (and python more generally).
Seems link is missing :)
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
Sorry about that!
so I just put
-
instead of_
between theLIBRARIES
andSAMPLES
in the output file, so it does not confuse.. seems to be working.