snakemake rule input from multiple directories and proceed them separatly
2
0
Entering edit mode
4.2 years ago

Hi there!

I have a new problem with snakemake. There is mutiple directories containing fastq files:

  • DATASET/rep1/*.fastq
  • DATASET/rep2/*.fastq
  • DATASET/rep3/*.fastq

I want to merge each fastq from each repositories separatly. This piece of code work well:

for barcode in `ls DATASET`
do
    cat DATASET/${barcode}/*.fastq > MERGE/${barcode}_merged.fastq
done

But I have no clue about how translate this into a snakemake rule, especially the input. Possible to have some tips and solve this problem?

Thanks

snakemake MinION • 4.7k views
ADD COMMENT
1
Entering edit mode

Are the prefixes for *.fastq the same?

ADD REPLY
0
Entering edit mode

indeed. In each rep, fastq are the same.

ADD REPLY
2
Entering edit mode
4.2 years ago

Since the prefixes are the same in each replicate, just choose one at random and use the following:

rule merge:
    input: "DATASET/{barcode}/some_prefix.fastq"
    output: "MERGED/{barcode}_merged.fastq"
    shell: """
        cat DATASET/{wildcards.barcode}/* > {output}
        """

If you needed something more generalizable, I would suggest making a dictionary with keys "barcode" and values the list of fastq files in each directory. You can then use a lambda function as the input for the merge rule.

ADD COMMENT
0
Entering edit mode

Thx, this working perfectly! I'll try to make this dictionary.

ADD REPLY
1
Entering edit mode

If you want an example (I mess up lambda functions 100% of the time without one), you can see one here.

ADD REPLY
0
Entering edit mode

So i've just created this dict:

list_fastq={}
for barcode in os.listdir('DATASET/'):
    list_fastq[barcode]=os.listdir('DATASET/'+barcode)

I'm supposed to write an input like this?:

 input: lambda wildcards: expand("DATASET/{barcode}/{read}", read=list_fastq[wildcards.read])

I'm probably wrong when using this dictionnary.

ADD REPLY
1
Entering edit mode

Something like that, make sure to print(list_fastq) so you can ensure you're getting something sane there.

ADD REPLY
1
Entering edit mode

Huge thanks for youre help and the dictionary advice! This code work:

(BARCODE) = os.listdir('DATASET/')

list_fastq={}
for barcode in os.listdir('DATASET/'):
    list_fastq[barcode]=os.listdir('DATASET/'+barcode)

rule all:
    input:
        merged_file = expand('MERGED/{barcode}_merged.fastq',barcode=BARCODE)

rule merge:
    input: 
        lambda wildcards: expand("DATASET/{barcode}/{read}", read=list_fastq[wildcards.barcode],barcode=BARCODE)
    output: 
        "MERGED/{barcode}_merged.fastq"
    shell: 
        """
        cat DATASET/{wildcards.barcode}/* > {output}
        """

best regards

ADD REPLY
0
Entering edit mode

Would you happen to know how to set this up for files that are being dynamically generated? I'm actually running into a similar issue where files are being generated in a previous snakemake rule and I'd like to aggregate them so I can merge them. I've tried running a checkpoint with this but I'm unable to get this working, if you have any suggestions, I would appreciate it! Here is some sample code below, my checkpoint is supposed to take files from the specificity rule and aggregate them in rule merge.

checkpoint unpredictable_file_generator:
input:
    "results/initial_specificity_out/pre_filter/w{window}_t{threshold}_c{composition}_e{enrich_score}_cn{copy_num}"
output:
    directory("results/initial_specificity_out/check_merge/w{window}_t{threshold}_c{composition}_e{enrich_score}_cn{copy_num}"),
shell:
    """
    mkdir -p {output}

    for some_file in {input}/*
    do
        outfile=$(basename "$some_file")
        if [[ "$outfile" =~ ^c.* ]]
        then
            touch "{output}/$outfile"
        fi
    done
    """

rule intermediate_step:
input:
    "results/initial_specificity_out/check_merge/w{window}_t{threshold}_c{composition}_e{enrich_score}_cn{copy_num}/{file_id}.txt"
output:
    "results/initial_specificity_out/intermediate/w{window}_t{threshold}_c{composition}_e{enrich_score}_cn{copy_num}/{file_id}.txt"
shell:
    """
    cp {input} {output}
    """


def gather_from_checkpoint(wildcards):
checkpoint_output = checkpoints.unpredictable_file_generator.get(**wildcards).output[0]
print(checkpoint_output)
return expand("results/initial_specificity_out/intermediate/w{window}_t{threshold}_c{composition}_e{enrich_score}_cn{copy_num}/{file_id}.txt",
       window=wildcards.WINDOW,
       threshold=wildcards.THRESH,
       composition=wildcards.COMPOS,
       enrich_score=wildcards.ENRICH,
       copy_num=wildcards.COPY_N,
       file_id=glob_wildcards(os.path.join(checkpoint_output, "{file_id}.txt")).file_id)

rule merge:
input:
    gather_from_checkpoint
output:
    "results/initial_specificity_out/merge/w{window}_t{threshold}_c{composition}_e{enrich_score}_cn{copy_num}_merged.bed"
conda:
    'envs/tigerfish_pipeline.yml'
params:
    mfree="100G",
    h_rt="120:0:0"
shell:
    """
    # Generate a file for each run
    # Containing the names of all files output by 'unknown_file_generator'
    echo {input} | tr ' ' $'\\n' > {output}
    """
ADD REPLY
1
Entering edit mode
4.2 years ago

A bit rough, but this should work:

import glob,os

DATADIR="./DATASET"

rule main:
    input: [f"{DATADIR}/merged/{os.path.basename(d)}.fastq" for d in glob.glob(f"{DATADIR}/*")]

rule merge:
    input:
        lambda wc: sorted([f for f in glob.glob(f"{DATADIR}/{wc.d}/*.fastq")])
    output:
        f"{DATADIR}/merged/{{d}}.fastq"
    shell:"""
        cat {input} > {output}
    """
ADD COMMENT

Login before adding your answer.

Traffic: 3310 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6