Question

Using snakemake in order to download reads and run hisat2 mapping

0

Entering edit mode

3.3 years ago

Chvatil ▴ 130

Hello everyone, I would need help in order to download and run a hisat2 pipeline on snakemake.

In fact I have to run 3 different hisat2 mapping on 3 assemblies.

For that I have a dataframe where I have the reads IDs I want to map for each assembly name such as :

SRA_accession Assembly 
SRR1                 Assembly1
SRR2                 Assembly1
SRR3                 Assembly1
SRR1                 Assembly2
SRR2                 Assembly2
SRR3                 Assembly2
SRR1                 Assembly3
SRR2                 Assembly3

where we can find the assemblies here: /data/Genomes/AssemblyN/Assembly.fa

and SRA reads here for each assembly :

/data/Genomes/AsemblyN/reads/

So for now I use a python script in order to generate the following scripts (here it is an example for Assembly1)

#I download all the reads by doing: (Here the sra_accession are fake of course)

/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/prefetch --max-size 100000000 SRR1 && /beegfs/data/bguinet/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/fasterq-dump -t /data/Genomes/Assembly1/reads/ --threads 10 -f  SRR1 -O /data/Genomes/Asembly1/reads/
pigz --best /data/Genomes/AsemblyN/reads/SRR1* 

/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/prefetch --max-size 100000000 SRR2 && /beegfs/data/bguinet/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/fasterq-dump -t /data/Genomes/Assembly1/reads/ --threads 10 -f  SRR2-O /data/Genomes/Asembly1/reads/
pigz --best /data/Genomes/AsemblyN/reads/SRR2* 

/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/prefetch --max-size 100000000 SRR3 && /beegfs/data/bguinet/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/fasterq-dump -t /data/Genomes/Assembly1/reads/ --threads 10 -f  SRR3 -O /data/Genomes/Asembly1/reads/
pigz --best /data/Genomes/AsemblyN/reads/SRR3*

#Then I run the hisat2 soft

hisat2 --dta -k 1 -q -x mapping_index -1 SRR1_1.fastq.gz,SRR2_1.fastq.gz,SRR3_1.fastq.gz -2 SRR1_2.fastq.gz,SRR2_2.fastq.gz,SRR3_2.fastq.gz  | samtools view -o mapping_Assembly1.bam 2> stats_mapping.txt

But I wondered if someone had an idea in order to do it simply with a snakemake pipeline ? I do not know how to handle the fact that I cant to run specific reads for specific assemblies and how to add a list of read ids on Hisat2. I'm really new in this topic and it would be amazing if someone can help me on that.

Thank you very much and have a nice day.

hisat2 snakemake python • 2.0k views

ADD COMMENT • link updated 3.2 years ago by Jeremy Leipzig 22k • written 3.3 years ago by Chvatil ▴ 130

score 2 · Accepted Answer · 2021-04-26

2

Entering edit mode

3.3 years ago

Jeremy Leipzig 22k

Read your dataframe into a pandas dataframe and create explicit targets from it (Assembly1, Assembly2)

Write lookup functions that are able to identify inputs from outputs

def lookupSRR(assembly){.... return list of SRRs}

def getSRRxargs(mapping){... return commasep SRRs}

Write recipes that are able to render those targets

rule prefetch:
     input: lambda wildcards: lookupSRR(wildcards.assembly)
     ...
rule mapping:
     params: xarg = lambda wildcards: getSRRxargs(wildcards.mapping)

You may need to throw in some "I'm done" sentinel flags to make this work.

ADD COMMENT • link 3.3 years ago by Jeremy Leipzig 22k

0

Entering edit mode

thank you very much, it helps a lot !

ADD REPLY • link 3.2 years ago by Chvatil ▴ 130

0

Entering edit mode

Hello @Jeremy Leipzig I used the code but now when I try to run the prefetch rule its says :

Missing input files for rule Download_reads:
SRR1799851
SRR2000538
SRR5867420
SRR1290861
...

Which is normal since the file does not exist, they will be written.

Here is the rule I used :

rule prefetch:
     input:
      read= lambda wildcards: lookupSRR('{species}'.format(species=wildcards.species))
     output:
      outfile="/beegfs/user/Genomes/{species}/Mapping/Download_run.log"
     shell:
      """
      /beegfs/user/sratoolkit.2.11.0-ubuntu64/bin/prefetch --max-size 100000000 {input.read} && /beegfs/user/sratoolkit.2.11.0-ubuntu64/bin/fasterq-dump -t /beegfs/data/user/Genomes/{$
      /beegfs/data/user/pigz --best /beegfs/data/user/Genomes/{wildcards.species}/Mapping/Transcriptomic_reads/{input.read}*\n")
      touch {output.outfile}
      """

ADD REPLY • link 3.2 years ago by Chvatil ▴ 130

0

Entering edit mode

ok yes sorry can you change input: to params: in your prefetch rule (and input.read becomes params.read)

ADD REPLY • link 3.2 years ago by Jeremy Leipzig 22k

0

Entering edit mode

ok thank you, now if I well understand the code will have only one job where it will download all reads at once right? How can I manage to create one job for each read? For Instance I have 13 reads to download and I would like to create 13 jobs. Because the code here provided a list of read names, so the prefetch command download one after another instead of downloading only one.

ADD REPLY • link 3.2 years ago by Chvatil ▴ 130

0

Entering edit mode

please put your whole snakefile up in github so we can see it

ADD REPLY • link 3.2 years ago by Jeremy Leipzig 22k

0

Entering edit mode

Sure, here is the adresse : https://github.com/StromTroopers/biostar_ex/blob/main/README.md

ADD REPLY • link 3.2 years ago by Chvatil ▴ 130

0

Entering edit mode

Maybe you want to communicate in a chat instead ?

ADD REPLY • link 3.2 years ago by Chvatil ▴ 130

0

Entering edit mode

so prefetch is multithreaded, you just need to provide it a space-delimited list of accessions, so just write a function to return that for each set of SRR's you need to fetch for an assembly

ADD REPLY • link 3.2 years ago by Jeremy Leipzig 22k

0

Entering edit mode

yes but in order to speedup the process I would like to create one job (with mutliple core) for each read independently. Do you think it is possible?

ADD REPLY • link 3.2 years ago by Chvatil ▴ 130

0

Entering edit mode

yes but then the download rule cannot be looking up multiple files - just offload that to another rule or function and then simply ask for all the files in your all rule

ADD REPLY • link 3.2 years ago by Jeremy Leipzig 22k