Snakemake Megahit error
0
0
Entering edit mode
10 weeks ago
blackadder • 0

Hello everyone!

A few days ago I started using Snakemake for the first time. I am having an issue when I am trying to run the megahit rule in my pipeline.

It gives me the following error "Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). ......"

So initially it runs and then crashes with the above error. I implemented the solution with the directory() option in my pipeline but I think its not a good practice since, for various reasons, you can loose files without even knowing it.

Is there a way to run the rule without using the directory() ?

I would appreciate any help on the issue!

Thanking you in advance

sra = []

with open("run_ids") as f:
    for line in f:
        sra.append(line.strip())


rule all:
    input:
        expand("raw_reads/{sample}/{sample}.fastq", sample=sra),
        expand("trimmo/{sample}/{sample}.trimmed.fastq", sample=sra),
        expand("megahit/{sample}/final.contigs.fa", sample=sra)

rule download:
    output:
        "raw_reads/{sample}/{sample}.fastq"
    params:
        "--split-spot --skip-technical"
    log:
        "logs/fasterq-dump/{sample}.log"
    benchmark:
        "benchmarks/fastqdump/{sample}.fasterq-dump.benchmark.txt"
    threads: 8
    shell:
        """
        fasterq-dump {params} --outdir /home/raw_reads/{wildcards.sample}  {wildcards.sample} -e {threads}
        """

rule trim:
    input:
        "raw_reads/{sample}/{sample}.fastq"
    output:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    params:
        "HEADCROP:15 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"
    log:
        "logs/trimmo/{sample}.log"
    benchmark:
        "benchmarks/trimmo/{sample}.trimmo.benchmark.txt"
    threads: 6
    shell:
        """
        trimmomatic SE -phred33 -threads {threads} {input} trimmo/{wildcards.sample}/{wildcards.sample}.trimmed.fastq {params}
        """

rule megahit:
    input:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    output:
        "megahit/{sample}/final.contigs.fa"
    params:
        "-m 0.7 -t"
    log:
        "logs/megahit/{sample}.log"
    benchmark:
        "benchmarks/megahit/{sample}.megahit.benchmark.txt"
    threads: 10
    shell:
        """
        megahit -r {input} -o {output} -t {threads}
        """
megahit Snakemake • 803 views
ADD COMMENT
1
Entering edit mode

Instead of shell, you can use run.

rule megahit:
    ....
    output:
        "megahit/{sample}/final.contigs.fa"
    run:
        outdir = os.path.dirname(output[0])
        shell('megahit -o {outdir} ...')

I don't know anything about megahit. For your desired output, reading the help file suggests you'd also need --out-prefix final.

ADD REPLY
0
Entering edit mode

does this help? megahit [options] {-1 <pe1> -2 <pe2> | --12 <pe12> | -r <se>} [-o <out_dir>] Output is a directory. To megahit, you are supplying a file as an output.

ADD REPLY
0
Entering edit mode

Yes, you are right but I think that in this case does not make a big change..

ADD REPLY
0
Entering edit mode
Yes, you are right but I think that in this case does not make a big change..

what do you mean by that? @ cfos4698 said the same thing. Copy/pasting for you: The error makes sense because megahit is trying to make a directory based on your {output}, but your {output} is actually a fasta file.

ADD REPLY
0
Entering edit mode

I understand that megahit outputs a directory and I am trying to output a file... But i dont want to output a directory because I think its a bad practice...I am trying for a way to make it work without using the snakemake directory() function.

ADD REPLY
0
Entering edit mode

Can you try following, retaining every thing else as same:

rule megahit:
    input:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    output:
        "megahit/{sample}/final.contigs.fa"
    params:
        "-m 0.7 -t"
    log:
        "logs/megahit/{sample}.log"
    benchmark:
        "benchmarks/megahit/{sample}.megahit.benchmark.txt"
    threads: 10
    shell:
        """
        megahit -r {input} -o megahit/{sample} -t {threads}
        """

Btw, megahit can't overwrite. Delete if there are any folders in megahit folder or create a new directory for megahit to store output.

ADD REPLY
0
Entering edit mode

When I run this I am getting:

NameError: The name 'sample' is unknown in this context. Did you mean 'wildcards.sample'?

And when I change the to -o megahit/{wildcards.sample} I am getting:

megahit: Output directory /home/megahit/SRR11192682 already exists, please change the parameter -o to another value to avoid overwriting.

ADD REPLY
0
Entering edit mode

can you rename /home/megahit/SRR11192682 to /home/megahit/SRR11192682_bak and rerun the code?

ADD REPLY
0
Entering edit mode

I tried the following:

  • Shell command:

-o megahit/{wildcards.sample}_bak

  • Output command:

output: "megahit/{sample}_bak/final.contigs.fa"

For both cases I got the following:

MissingOutputException in line 81 of /home/nipy/Snakefile: Job Missing files after 5 seconds: megahit/SRR11192682_bak/final.contigs.fa This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.

I tried also adding the --latency-wait but nothing changed.

ADD REPLY
0
Entering edit mode

Don't change the rule. Rename the existing directory.

ADD REPLY
0
Entering edit mode

As I indicated below, --out-prefix is needed.

  Output options:
      -o/--out-dir             <string>       output directory [./megahit_out]
      --out-prefix             <string>       output prefix (the contig file will be OUT_DIR/OUT_PREFIX.contigs.fa)
ADD REPLY
0
Entering edit mode

I think this is the issue OP is facing:

$ megahit -r test.fa -o out
2021-11-15 20:38:44 - MEGAHIT v1.2.9
2021-11-15 20:38:44 - Using megahit_core with POPCNT and BMI2 support
2021-11-15 20:38:44 - Convert reads to binary library
2021-11-15 20:38:44 - b'INFO  sequence/io/sequence_lib.cpp  :   75 - Lib 0 (/home/prot/Desktop/test/test.fa): se, 5 reads, 15 max length'
2021-11-15 20:38:44 - b'INFO  utils/utils.h                 :  152 - Real: 0.0007\tuser: 0.0029\tsys: 0.0000\tmaxrss: 7452'
2021-11-15 20:38:44 - k-max reset to: 29 
2021-11-15 20:38:44 - Start assembly. Number of CPU threads 8 
2021-11-15 20:38:44 - k list: 21,29 
2021-11-15 20:38:44 - Memory used: 7254009446
2021-11-15 20:38:44 - Extract solid (k+1)-mers for k = 21 
2021-11-15 20:38:44 - Build graph for k = 21 
2021-11-15 20:38:45 - Assemble contigs from SdBG for k = 21
2021-11-15 20:38:45 - Local assembly for k = 21
2021-11-15 20:38:45 - Extract iterative edges from k = 21 to 29 
2021-11-15 20:38:45 - Merging to output final contigs 
2021-11-15 20:38:45 - 0 contigs, total 0 bp, min 0 bp, max 0 bp, avg 0 bp, N50 0 bp
2021-11-15 20:38:45 - ALL DONE. Time elapsed: 0.714562 seconds 

Mon 15  8:38PM test % megahit -r test.fa -o out

megahit: Output directory /home/prot/Desktop/test/out already exists, please change the parameter -o to another value to avoid overwriting.

solution (either remove or rename the existing):

Mon 15  8:38PM test % rm -rf out 

Mon 15  8:38PM test % megahit -r test.fa -o out

2021-11-15 20:39:04 - MEGAHIT v1.2.9
2021-11-15 20:39:04 - Using megahit_core with POPCNT and BMI2 support
2021-11-15 20:39:04 - Convert reads to binary library
2021-11-15 20:39:04 - b'INFO  sequence/io/sequence_lib.cpp  :   75 - Lib 0 (/home/prot/Desktop/test/test.fa): se, 5 reads, 15 max length'
2021-11-15 20:39:04 - b'INFO  utils/utils.h                 :  152 - Real: 0.0007\tuser: 0.0032\tsys: 0.0000\tmaxrss: 7448'
2021-11-15 20:39:04 - k-max reset to: 29 
2021-11-15 20:39:04 - Start assembly. Number of CPU threads 8 
2021-11-15 20:39:04 - k list: 21,29 
2021-11-15 20:39:04 - Memory used: 7254009446
2021-11-15 20:39:04 - Extract solid (k+1)-mers for k = 21 
2021-11-15 20:39:04 - Build graph for k = 21 
2021-11-15 20:39:04 - Assemble contigs from SdBG for k = 21
2021-11-15 20:39:04 - Local assembly for k = 21
2021-11-15 20:39:04 - Extract iterative edges from k = 21 to 29 
2021-11-15 20:39:04 - Merging to output final contigs 
2021-11-15 20:39:04 - 0 contigs, total 0 bp, min 0 bp, max 0 bp, avg 0 bp, N50 0 bp
2021-11-15 20:39:04 - ALL DONE. Time elapsed: 0.720330 seconds 

Instead User seems to have changed the rule, but didn't change other relevant lines. Megahit can't overwrite by default. Instead it suggests to give another directory to write.

ADD REPLY
0
Entering edit mode

Snakemake is expecting megahit/{sample}_bak/final.contigs.fa. The MissingOutputException comes from not using --out-prefix final.

ADD REPLY
0
Entering edit mode

OP was troubleshooting this problem.

megahit: Output directory /home/megahit/SRR11192682 already exists, please change the parameter -o to another value to avoid overwriting.

I suggested him to rename to existing output SRR11192682. Instead OP changed the rule. But I guess OP hasn't edited rest of the snakefile as per the new change.

ADD REPLY
0
Entering edit mode

Hello both and thank you for your help! With your input I managed to make a sloppy solution!

rule megahit:
input:
    "trimmo/{sample}/{sample}.trimmed.fastq"
output:
    "megahit/{sample}/{sample}.contigs.fa"
params:
    "0.7"
log:
    "logs/megahit/{sample}.log"
benchmark:
    "benchmarks/megahit/{sample}.megahit.benchmark.txt"
threads: 10
shell:
    """
    megahit --12 {input} -m {params} -o {wildcards.sample} --out-prefix /home/megahit/{wildcards.sample}/{wildcards.sample} -t {threads}
    rm -r {wildcards.sample}
    """

Indeed the --out-prefix helped!

So, now the rule runs. Only the {wildcards.sample}.contigs.fa is being send to /megahit/{sample} directory and the rest of the megahit output is being sent to separate directories for each run ID

I think its not ideal but it works for now.

ADD REPLY
0
Entering edit mode

The error makes sense because megahit is trying to make a directory based on your {output}, but your {output} is actually a fasta file. Can you try the following?

rule megahit:
    input:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    output:
        "megahit/{sample}/final.contigs.fa"
    params:
        outdir = "megahit/{sample}"
    log:
        "logs/megahit/{sample}.log"
    benchmark:
        "benchmarks/megahit/{sample}.megahit.benchmark.txt"
    threads: 10
    shell:
        """
        megahit -r {input} -o {params.outdir} -t {threads} -m 0.7
        """
ADD REPLY
0
Entering edit mode

Hello! Thank you for the input. If i do what you suggest I am getting the following error:

megahit: output directory /home/megahit/SRR11192684 already exists, please change the parameter -o to another value to avoid overwriting

ADD REPLY

Login before adding your answer.

Traffic: 1858 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6