Question

snakemake rule input from multiple directories and proceed them separatly

0

Entering edit mode

4.2 years ago

regue.hadrien ▴ 40

Hi there!

I have a new problem with snakemake. There is mutiple directories containing fastq files:

DATASET/rep1/*.fastq
DATASET/rep2/*.fastq
DATASET/rep3/*.fastq

I want to merge each fastq from each repositories separatly. This piece of code work well:

for barcode in `ls DATASET`
do
    cat DATASET/${barcode}/*.fastq > MERGE/${barcode}_merged.fastq
done

But I have no clue about how translate this into a snakemake rule, especially the input. Possible to have some tips and solve this problem?

Thanks

snakemake MinION • 4.7k views

ADD COMMENT • link updated 4.2 years ago by Devon Ryan 104k • written 4.2 years ago by regue.hadrien ▴ 40

1

Entering edit mode

Are the prefixes for *.fastq the same?

ADD REPLY • link 4.2 years ago by Devon Ryan 104k

0

Entering edit mode

indeed. In each rep, fastq are the same.

ADD REPLY • link 4.2 years ago by regue.hadrien ▴ 40

1

Entering edit mode

4.2 years ago

Tomás Di Domenico ▴ 30

A bit rough, but this should work:

import glob,os

DATADIR="./DATASET"

rule main:
    input: [f"{DATADIR}/merged/{os.path.basename(d)}.fastq" for d in glob.glob(f"{DATADIR}/*")]

rule merge:
    input:
        lambda wc: sorted([f for f in glob.glob(f"{DATADIR}/{wc.d}/*.fastq")])
    output:
        f"{DATADIR}/merged/{{d}}.fastq"
    shell:"""
        cat {input} > {output}
    """

ADD COMMENT • link 4.2 years ago by Tomás Di Domenico ▴ 30

score 2 · Accepted Answer · 2020-02-03

2

Entering edit mode

4.2 years ago

Devon Ryan 104k

Since the prefixes are the same in each replicate, just choose one at random and use the following:

rule merge:
    input: "DATASET/{barcode}/some_prefix.fastq"
    output: "MERGED/{barcode}_merged.fastq"
    shell: """
        cat DATASET/{wildcards.barcode}/* > {output}
        """

If you needed something more generalizable, I would suggest making a dictionary with keys "barcode" and values the list of fastq files in each directory. You can then use a lambda function as the input for the merge rule.

ADD COMMENT • link 4.2 years ago by Devon Ryan 104k

0

Entering edit mode

Thx, this working perfectly! I'll try to make this dictionary.

ADD REPLY • link 4.2 years ago by regue.hadrien ▴ 40

1

Entering edit mode

If you want an example (I mess up lambda functions 100% of the time without one), you can see one here.

ADD REPLY • link 4.2 years ago by Devon Ryan 104k

0

Entering edit mode

So i've just created this dict:

list_fastq={}
for barcode in os.listdir('DATASET/'):
    list_fastq[barcode]=os.listdir('DATASET/'+barcode)

I'm supposed to write an input like this?:

 input: lambda wildcards: expand("DATASET/{barcode}/{read}", read=list_fastq[wildcards.read])

I'm probably wrong when using this dictionnary.

ADD REPLY • link 4.2 years ago by regue.hadrien ▴ 40

1

Entering edit mode

Something like that, make sure to print(list_fastq) so you can ensure you're getting something sane there.

ADD REPLY • link 4.2 years ago by Devon Ryan 104k

1

Entering edit mode

Huge thanks for youre help and the dictionary advice! This code work:

(BARCODE) = os.listdir('DATASET/')

list_fastq={}
for barcode in os.listdir('DATASET/'):
    list_fastq[barcode]=os.listdir('DATASET/'+barcode)

rule all:
    input:
        merged_file = expand('MERGED/{barcode}_merged.fastq',barcode=BARCODE)

rule merge:
    input: 
        lambda wildcards: expand("DATASET/{barcode}/{read}", read=list_fastq[wildcards.barcode],barcode=BARCODE)
    output: 
        "MERGED/{barcode}_merged.fastq"
    shell: 
        """
        cat DATASET/{wildcards.barcode}/* > {output}
        """

best regards

ADD REPLY • link 4.2 years ago by regue.hadrien ▴ 40

0

Entering edit mode

Would you happen to know how to set this up for files that are being dynamically generated? I'm actually running into a similar issue where files are being generated in a previous snakemake rule and I'd like to aggregate them so I can merge them. I've tried running a checkpoint with this but I'm unable to get this working, if you have any suggestions, I would appreciate it! Here is some sample code below, my checkpoint is supposed to take files from the specificity rule and aggregate them in rule merge.

checkpoint unpredictable_file_generator:
input:
    "results/initial_specificity_out/pre_filter/w{window}_t{threshold}_c{composition}_e{enrich_score}_cn{copy_num}"
output:
    directory("results/initial_specificity_out/check_merge/w{window}_t{threshold}_c{composition}_e{enrich_score}_cn{copy_num}"),
shell:
    """
    mkdir -p {output}

    for some_file in {input}/*
    do
        outfile=$(basename "$some_file")
        if [[ "$outfile" =~ ^c.* ]]
        then
            touch "{output}/$outfile"
        fi
    done
    """

rule intermediate_step:
input:
    "results/initial_specificity_out/check_merge/w{window}_t{threshold}_c{composition}_e{enrich_score}_cn{copy_num}/{file_id}.txt"
output:
    "results/initial_specificity_out/intermediate/w{window}_t{threshold}_c{composition}_e{enrich_score}_cn{copy_num}/{file_id}.txt"
shell:
    """
    cp {input} {output}
    """


def gather_from_checkpoint(wildcards):
checkpoint_output = checkpoints.unpredictable_file_generator.get(**wildcards).output[0]
print(checkpoint_output)
return expand("results/initial_specificity_out/intermediate/w{window}_t{threshold}_c{composition}_e{enrich_score}_cn{copy_num}/{file_id}.txt",
       window=wildcards.WINDOW,
       threshold=wildcards.THRESH,
       composition=wildcards.COMPOS,
       enrich_score=wildcards.ENRICH,
       copy_num=wildcards.COPY_N,
       file_id=glob_wildcards(os.path.join(checkpoint_output, "{file_id}.txt")).file_id)

rule merge:
input:
    gather_from_checkpoint
output:
    "results/initial_specificity_out/merge/w{window}_t{threshold}_c{composition}_e{enrich_score}_cn{copy_num}_merged.bed"
conda:
    'envs/tigerfish_pipeline.yml'
params:
    mfree="100G",
    h_rt="120:0:0"
shell:
    """
    # Generate a file for each run
    # Containing the names of all files output by 'unknown_file_generator'
    echo {input} | tr ' ' $'\\n' > {output}
    """

ADD REPLY • link 3.5 years ago by eaguil • 0