Question

Snakemake errors and input files

1

24 months ago

Jonathan Yoou ▴ 60

Hi, I'm trying to use Snakemake to build a pipeline for my analyses.

I've just started, so I want to make a simple workflow (BLAST amino acid query sequences against my database), but I can't figure out why it keeps failing. My code is below:

# Configuration file
configfile: "BLAST-config.yaml"

# Contents of BLAST-config.yaml:
QUERY_PATH_faa: /home/user/study/faa_input/
OUTPUT_PATH_master: /home/user/study/blast_output
DataBaseProt: /home/user/db/protein_ref
BLASTParams: "-evalue 0.01 -perc_identity 70 -word_size 10 sorthits 3"


# FILENAMES_faa contains the list of file names without extensions
FILENAMES_faa = glob_wildcards(config["QUERY_PATH_faa"]+"{fname}.faa").fname


rule all:
    input:
        expand(config["OUTPUT_PATH_master"]+"/BLAST/{filename}.RAW", filename=FILENAMES_faa)



rule RAWBLASTVF:
    input:
        expand(config["QUERY_PATH_faa"]+"{filename}.faa", filename=FILENAMES_faa)
    output:
        expand(config["OUTPUT_PATH_master"]+"BLAST/{filename}.RAW", filename=FILENAMES_faa)
    threads:
        30
    shell:
        """
        blastp {config[BLASTParams]}    \
            -db "{config[DataBaseProt]}"       \
            -query {input}      \
            -out {output}       \
            -outfmt 7   \
            -num_threads {threads}
        """

And the error I got is:

MissingInputException in line ~~ of Snakefile:
Missing input files for rule all:
Path/to/Sample_A.RAW
Path/to/Sample_B.RAW
Path/to/Sample_C.RAW

Does anyone have an idea what's wrong with the code? Thank you in advance.

snakemake wms analysis • 868 views

ADD COMMENT • link 23 months ago by Jonathan Yoou ▴ 60

score 2 · Answer 1 · 2022-04-29

2

24 months ago

Jesse ▴ 740

I think your path strings are getting mangled because of missing slashes.

In the RAWBLASTVF rule you have the output as: config["OUTPUT_PATH_master"]+"BLAST/{filename}.RAW"

And the all rule is looking for this input: config["OUTPUT_PATH_master"]+"/BLAST/{filename}.RAW"

And the configuration file has: OUTPUT_PATH_master: /home/user/study/blast_output

...so I think RAWBLASTVF is creating /home/user/study/blast_outputBLAST/{filename}.RAW (no / between those two strings).

I'm a fan of Python's pathlib for handling this sort of thing. The Path class has a very simple interface, Path objects can be dropped in place of regular strings without much fuss, and it can save you from cases like this.
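As a rough sketch of the difference, using the paths from the question: the first part reproduces the mismatch between the two concatenated patterns, and the second builds the same pattern with pathlib. PurePosixPath is used here only so the result is the same on any platform (in a Snakefile on Linux, Path would behave the same way), and converting back with str() is a cautious assumption so that the rest of the Snakefile keeps working with plain strings.

```python
from pathlib import PurePosixPath

# Value from the config file in the question (note: no trailing slash).
output_root = "/home/user/study/blast_output"

# Plain string concatenation: the two patterns differ by a single slash,
# so "rule all" asks for files that RAWBLASTVF never promises to create.
rule_all_target = output_root + "/BLAST/{filename}.RAW"
rule_blast_output = output_root + "BLAST/{filename}.RAW"
print(rule_all_target)
print(rule_blast_output)

# pathlib inserts the separator itself, so the pattern is built the same
# way everywhere. The {filename} wildcard is just part of the file name.
blast_pattern = str(PurePosixPath(output_root) / "BLAST" / "{filename}.RAW")
print(blast_pattern)

# A trailing slash in the config value makes no difference to the result.
query_pattern = str(PurePosixPath("/home/user/study/faa_input/") / "{fname}.faa")
print(query_pattern)
```

Defining the pattern once and reusing the same variable in both rules would also avoid this kind of mismatch.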

ADD COMMENT • link 24 months ago by Jesse ▴ 740

0

Hi, thank you for your comment! I've resolved that problem, but unfortunately I've run into another error. What I want to do is run BLAST on each file named in the list FILENAMES_faa, but with the rule above, Snakemake passes all the files in the list to a single command, so BLAST fails. (That is, I want to BLAST A.faa, B.faa, and C.faa one by one, but Snakemake passes them as "A.faa, B.faa, C.faa", which is three files for one BLAST run.)

Do you have any idea how to resolve this? I tried using a for loop in Python, but that also produces an error.

ADD REPLY • link 23 months ago by Jonathan Yoou ▴ 60

0

Sorry I just saw this comment. You should change your RAWBLASTVF rule to just take a single file as input and make a single file as output (with the {filename} wildcard left unspecified in both). Then your existing "all" rule will still ask for all the various blast output files it wants as input, and instead of having Snakemake think it should run RAWBLASTVF just once to make everything, it'll notice that it needs to run that rule many times separately.

In other words, the first few lines of that rule definition can actually just look like:

rule RAWBLASTVF:
    input:
        config["QUERY_PATH_faa"]+"{filename}.faa"
    output:
        config["OUTPUT_PATH_master"]+"BLAST/{filename}.RAW"

(But include whatever fixes for file paths you've already made!) That way, Snakemake can do its thing and fill in the right filenames for each separate blastp call; no other code (like loops) is needed. Make sense?
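To illustrate the idea in plain Python: the sketch below is a simplified stand-in for what Snakemake does, not its actual implementation. Each target requested by "all" is matched against the rule's output pattern, and the captured wildcard value is used to fill in the input pattern, giving one job per file. The helper pattern_to_regex is hypothetical, the sample names are the ones from the error message in the question, and the paths assume the slash fix has been applied.

```python
import re

def pattern_to_regex(pattern):
    """Hypothetical helper: turn 'dir/{name}.ext' into a compiled regex
    with one named group per {wildcard}."""
    # re.split with a capture group alternates: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal text
        else:
            regex += f"(?P<{part}>.+)"      # wildcard matches any text
    return re.compile(regex)

# Patterns as in the single-file version of RAWBLASTVF (paths assumed fixed).
input_pattern = "/home/user/study/faa_input/{filename}.faa"
output_pattern = "/home/user/study/blast_output/BLAST/{filename}.RAW"

# What "rule all" requests: one concrete output file per sample.
samples = ["Sample_A", "Sample_B", "Sample_C"]
targets = [output_pattern.format(filename=s) for s in samples]

# For every requested file, work out the wildcard value and the matching input.
output_regex = pattern_to_regex(output_pattern)
jobs = []
for target in targets:
    wildcards = output_regex.fullmatch(target).groupdict()
    jobs.append((input_pattern.format(**wildcards), target))

# One separate blastp call per file, with no explicit loop in the rule itself.
for query, out in jobs:
    print(f"blastp -query {query} -out {out}")
```

With expand() in the rule's input and output, there are no wildcards left to match, so the whole list ends up in a single job, which is the behavior described above.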

You might also find the -p (--printshellcmds) and -r (--reason) options helpful when troubleshooting, especially combined with -n (--dryrun). The first shows the commands that will be run, so you can make sure they look like what you expect; the second shows why a rule is being run (missing output file, updated input file, etc.); and the third means nothing is actually run yet. There's also --debug-dag, which can help with more complicated workflows by explaining which rules are being selected to supply which output files.

ADD REPLY • link 23 months ago by Jesse ▴ 740

0

Sorry for the late reply, and thank you for your super detailed comment! It helped me fix the error and make the pipeline cleaner!

ADD REPLY • link 23 months ago by Jonathan Yoou ▴ 60