Question: how to index different genomes with snakemake with different parameters?
1
gravatar for Assa Yeroslaviz
4 weeks ago by
Assa Yeroslaviz1.2k
Munich
Assa Yeroslaviz1.2k wrote:

I would like to use snakemake to analyze my data sets. As I am going to work with different organisms, I would like snakemake to create a folder for each of them when indexing the genome.

let's say I would like to work with human and mouse data.

In my config.yaml file I have the following snippet:

organism:
 "Dmel"
# "Dpse"

Each time I am working with an organism I can comment out the others in the list.

Below are the rules for getting the fastA files as well as the annotation and the indexing step (as an example) with STAR.

#### get reference genomic data for the mapping (fasta and gtf files). Links are added in the config file
rule get_genome:
   output:
         fastA="genome/genome.fa",
         gtf="genome/genome.gtf"
   shell:
         """
         wget -nc -O - {config[fastA]} | gunzip -c - > {output.fastA}
         wget -nc -O - {config[gtf]}   | gunzip -c - > {output.gtf}
         """

### Indexing the reference genome

rule star_index:
   input:
          fasta="genome/genome.fa",
          gtf="genome/genome.gtf"
   output:
          directory("genome/starIndex/")
   threads: 16
   params:
         prefix = {config["organism"]}
   shell:
          "mkdir -p {output} && "
          "STAR --runThreadN {threads} "
          "--outFileNamePrefix {output}{params.prefix} "
          "--runMode genomeGenerate "
          "--genomeDir {output} "
          "--limitGenomeGenerateRAM {config[RAM]} "
          "--genomeSAindexNbases {config[SAindex]} "
          "--genomeFastaFiles {input.fasta} "
          "--sjdbGTFfile {input.gtf} "
          "--sjdbOverhang 100"

I would like to know how to add the value from the config file in the organism, e.g. config[organism] into the rules, so that, when I download the fastA and gtf they will be renamed accordingly.

I have tried to add config[organism] to the path, but it gives me an error. I have also tried this for example:

rule get_genome:
   output:
         fastA="genome/{config[organism]}.fa",
         gtf="genome/{config[organism]}.gtf"

But it just creates the files {config[organism]}.gtf and {config[organism]}.fa.

Can someone please help me understand how I can add the parameter from the config file to the different paths in the various rules?

thanks

mapping snakemake indexing • 114 views
ADD COMMENTlink modified 4 weeks ago by Eric Lim1.6k • written 4 weeks ago by Assa Yeroslaviz1.2k
3
gravatar for Eric Lim
4 weeks ago by
Eric Lim1.6k
Stoke Therapeutics, Inc
Eric Lim1.6k wrote:

You'd need wildcards. https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards

[~/Data/scratch/tmp/biostar]$ cat wildcards.py
rule:
    input: expand("{organisms}/starIndex/", organisms=config['organism'])

rule get_genome:
    output: fa = touch('{organism}/{organism}.fa'),
            gtf = touch('{organism}/{organism}.gtf')
    run:
        # remove `touch` and implement logic to download fa and gtf
        # you can access organism via wildcards, i.e. config[wildcards.organism]
        pass

rule star_index:
    input: fa = '{organism}/{organism}.fa',
           gtf = '{organism}/{organism}.gtf'
    output: directory("{organism}/starIndex/")
    shell: 'mkdir -p {output}'

[~/Data/scratch/tmp/biostar]$ snakemake -s wildcards.py --config organism=GRCh38
...
[~/Data/scratch/tmp/biostar]$ tree GRCh38/
GRCh38/
├── GRCh38.fa
├── GRCh38.gtf
└── starIndex

1 directory, 2 files

In this case, config['organism'] can be a list of organisms and the workflow will expand itself. Hope the example helps.

ADD COMMENTlink written 4 weeks ago by Eric Lim1.6k

thanks for this great example. I will try it immediately. I was just wondering what you mean in the run comment.

Something in the way adding a list in the config file like:

organism:
     ["GRCh38", "GRCm38", "BDGP6.22"]

But where does snakemake know how to get the link for each of these genomes?

ADD REPLYlink written 4 weeks ago by Assa Yeroslaviz1.2k
2

That'd be completely up to your design in config.yaml. For instance, if your yaml looks like this,

[~/Data/scratch/tmp/biostar]$ cat config.yaml 
organisms:
  human:
    url: google.com
  mouse:
    url: yahoo.com

You can access the url via config['organisms'][wildcards.organism]['url'] in get_genome.

Your recipe can then be defined with the following,

rule:
    input: expand("{organisms}/starIndex/", organisms=config['organisms'].keys())
ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Eric Lim1.6k

I don't know what happened, I thought I have got it, but it seems to be not working anymore.

This is how my config.yaml file look like

organism:
  Dmel:
    fasta: "ftp://ftp.ensembl.org/pub/current_fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.22.dna.toplevel.fa.gz"
    gtf: "ftp://ftp.ensembl.org/pub/current_gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.22.98.gtf.gz"
  Dpse:
    fasta: "ftp://ftp.ensemblgenomes.org/pub/current/metazoa/fasta/drosophila_pseudoobscura/dna/Drosophila_pseudoobscura.Dpse_3.0.dna.toplevel.fa.gz"
    gtf: "ftp://ftp.ensemblgenomes.org/pub/current/metazoa/gtf/drosophila_pseudoobscura/Drosophila_pseudoobscura.Dpse_3.0.45.gtf.gz"

and than I have my snakefile

configfile: "config.yaml"

rule all:
    input:
        expand("genome/{org}/starIndex/", org=config['organism']),

    #### get reference genomic data for the mapping (fasta and gtf files). Links are added in the config file
rule get_genome:
   output:
         fastA="genome/{org}.fa",
         gtf="genome/{org}.gtf"
   shell:
         """
         wget -nc -O - {config['organism'][wildcards.org]['fasta']} | gunzip -c - > {output.fastA}
         wget -nc -O - {config['organism'][wildcards.org]['gtf']}   | gunzip -c - > {output.gtf}
         """      
### Indexing the reference genome  
rule star_index:
   input:
          fasta="genome/{org}.fa",
          gtf="genome/{org}.gtf"
   output:
          directory("genome/{org}/starIndex/")
   threads: 16
   params:
         prefix = lambda wildcards: "{org}".format(org=wildcards.org)
   shell:
          "mkdir -p {output} && "
          "STAR --runThreadN {threads} "
   ...

But when I try to run it I get this error message:

$snakemake -s getGenome_IndexGenome.Snakefile  -np -j 30
Building DAG of jobs...
Job counts:
        count   jobs
        1       all
        2       get_genome
        2       star_index
        5

[Mon Nov 18 10:59:16 2019]
rule get_genome:
    output: genome/Dmel.fa, genome/Dmel.gtf
    jobid: 4
    wildcards: org=Dmel

RuleException in line 16 of /local/Assa/projects/automation/P135.automation/getGenome_IndexGenome.Snakefile:
NameError: The name "'organism'" is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}

But the path to my fasta file should be correct, as it seems from the command

print(config['organism']['Dmel']['fasta'])

which gives me this ftp://ftp.ensembl.org/pub/current_fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.22.dna.toplevel.fa.gz

Where do I go wrong here?

ADD REPLYlink written 17 days ago by Assa Yeroslaviz1.2k
1

Sorry for the late response. I was away this weekend. Snakemake's bracket markup only replaces a variable with its string representation. In other words, it doesn't evaluate code. To accomplish what you're trying to do, I'd use something like a param.

rule get_genome:
    output: fa="genome/{org}.fa"
    params: url = lambda wildcards: config['organism'][wildcards.org]['fasta']
    shell: 'wget -nc -O - {params.url} | gunzip -c - > {output.fa}'
ADD REPLYlink written 17 days ago by Eric Lim1.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1752 users visited in the last hour