Using snakemake to download reads and run hisat2 mapping
3 months ago
Chvatil ▴ 60

Hello everyone, I need help downloading reads and running a hisat2 pipeline with snakemake.

In fact I have to run 3 different hisat2 mappings on 3 assemblies.

For that I have a dataframe with the read IDs I want to map for each assembly name, like this:

SRA_accession  Assembly
SRR1           Assembly1
SRR2           Assembly1
SRR3           Assembly1
SRR1           Assembly2
SRR2           Assembly2
SRR3           Assembly2
SRR1           Assembly3
SRR2           Assembly3

where we can find the assemblies here: /data/Genomes/AssemblyN/Assembly.fa

and SRA reads here for each assembly :

/data/Genomes/AssemblyN/reads/

So for now I use a python script to generate the following commands (here is an example for Assembly1):

#I download all the reads by doing (the SRA accessions here are fake, of course):

/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/prefetch --max-size 100000000 SRR1 && /beegfs/data/bguinet/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/fasterq-dump -t /data/Genomes/Assembly1/reads/ --threads 10 -f SRR1 -O /data/Genomes/Assembly1/reads/
pigz --best /data/Genomes/Assembly1/reads/SRR1*

/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/prefetch --max-size 100000000 SRR2 && /beegfs/data/bguinet/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/fasterq-dump -t /data/Genomes/Assembly1/reads/ --threads 10 -f SRR2 -O /data/Genomes/Assembly1/reads/
pigz --best /data/Genomes/Assembly1/reads/SRR2*

/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/prefetch --max-size 100000000 SRR3 && /beegfs/data/bguinet/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/fasterq-dump -t /data/Genomes/Assembly1/reads/ --threads 10 -f SRR3 -O /data/Genomes/Assembly1/reads/
pigz --best /data/Genomes/Assembly1/reads/SRR3*

#Then I run hisat2

hisat2 --dta -k 1 -q -x mapping_index -1 SRR1_1.fastq.gz,SRR2_1.fastq.gz,SRR3_1.fastq.gz -2 SRR1_2.fastq.gz,SRR2_2.fastq.gz,SRR3_2.fastq.gz 2> stats_mapping.txt | samtools view -b -o mapping_Assembly1.bam -

But I wondered if someone had an idea of how to do this simply with a snakemake pipeline? I do not know how to handle the fact that I want to run specific reads for specific assemblies, nor how to pass a list of read IDs to hisat2. I'm really new to this topic and it would be amazing if someone could help me with it.

Thank you very much and have a nice day.

hisat2 snakemake python
3 months ago

Read your dataframe into a pandas dataframe and create explicit targets from it (Assembly1, Assembly2)

Write lookup functions that are able to identify inputs from outputs

def lookupSRR(assembly): ...   # returns a list of SRRs

def getSRRxargs(mapping): ...  # returns comma-separated SRRs
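Those two lookup helpers could be sketched like this (a minimal sketch; in a real Snakefile you would read the mapping table from a file with pd.read_csv, and the separator and file name are assumptions; the column names come from the question's dataframe):

```python
import pandas as pd
from io import StringIO

# Inline data only to keep the sketch self-contained; in practice:
# samples = pd.read_csv("samples.tsv", sep="\t")
samples = pd.read_csv(StringIO(
    "SRA_accession\tAssembly\n"
    "SRR1\tAssembly1\n"
    "SRR2\tAssembly1\n"
    "SRR3\tAssembly1\n"
    "SRR1\tAssembly2\n"
), sep="\t")

def lookupSRR(assembly):
    """Return the list of SRR accessions associated with one assembly."""
    return samples.loc[samples["Assembly"] == assembly, "SRA_accession"].tolist()

def getSRRxargs(assembly, suffix):
    """Return a comma-separated read-file list, as hisat2 -1/-2 expects."""
    return ",".join(f"{srr}{suffix}" for srr in lookupSRR(assembly))

print(lookupSRR("Assembly1"))                   # ['SRR1', 'SRR2', 'SRR3']
print(getSRRxargs("Assembly1", "_1.fastq.gz"))  # SRR1_1.fastq.gz,SRR2_1.fastq.gz,SRR3_1.fastq.gz
```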

Write recipes that are able to render those targets

rule prefetch:
     input: lambda wildcards: lookupSRR(wildcards.assembly)
     ...
rule mapping:
     params: xarg = lambda wildcards: getSRRxargs(wildcards.mapping)

You may need to throw in some "I'm done" sentinel flags to make this work.
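Put together, those steps might look roughly like the following Snakefile skeleton (Snakemake syntax, not plain Python; a sketch only, in which the samples.tsv name, the index name, the reads/ and downloads/ layout, and the .done sentinel files are assumptions, not taken from the question):

```python
# Snakefile sketch -- Snakemake syntax
import pandas as pd

samples = pd.read_csv("samples.tsv", sep="\t")
ASSEMBLIES = samples["Assembly"].unique().tolist()

def lookupSRR(assembly):
    return samples.loc[samples["Assembly"] == assembly, "SRA_accession"].tolist()

rule all:
    input:
        expand("mapping_{assembly}.bam", assembly=ASSEMBLIES)

rule prefetch:
    params:
        srrs=lambda wc: " ".join(lookupSRR(wc.assembly))
    output:
        touch("downloads/{assembly}.done")  # "I'm done" sentinel flag
    shell:
        "prefetch --max-size 100000000 {params.srrs} && "
        "for s in {params.srrs}; do fasterq-dump --threads 10 -f $s -O reads/{wildcards.assembly}; done"

rule mapping:
    input:
        "downloads/{assembly}.done"
    params:
        r1=lambda wc: ",".join(f"reads/{wc.assembly}/{s}_1.fastq.gz" for s in lookupSRR(wc.assembly)),
        r2=lambda wc: ",".join(f"reads/{wc.assembly}/{s}_2.fastq.gz" for s in lookupSRR(wc.assembly)),
    output:
        "mapping_{assembly}.bam"
    shell:
        "hisat2 --dta -k 1 -q -x {wildcards.assembly}_index "
        "-1 {params.r1} -2 {params.r2} 2> stats_{wildcards.assembly}.txt "
        "| samtools view -b -o {output} -"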


thank you very much, it helps a lot !


Hello @Jeremy Leipzig, I used the code, but now when I try to run the prefetch rule it says:

Missing input files for rule Download_reads:
SRR1799851
SRR2000538
SRR5867420
SRR1290861
...

This is normal, since the files do not exist yet; they will be written.

Here is the rule I used :

rule prefetch:
     input:
      read=lambda wildcards: lookupSRR(wildcards.species)
     output:
      outfile="/beegfs/user/Genomes/{species}/Mapping/Download_run.log"
     shell:
      """
      /beegfs/user/sratoolkit.2.11.0-ubuntu64/bin/prefetch --max-size 100000000 {input.read} && /beegfs/user/sratoolkit.2.11.0-ubuntu64/bin/fasterq-dump -t /beegfs/data/user/Genomes/{$
      /beegfs/data/user/pigz --best /beegfs/data/user/Genomes/{wildcards.species}/Mapping/Transcriptomic_reads/{input.read}*
      touch {output.outfile}
      """

Ok, yes, sorry: can you change input: to params: in your prefetch rule (and input.read becomes params.read)?
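With that change, the corrected rule might look like this (a sketch in Snakemake syntax; the long shell line from the rule above is truncated in the post, so it is only abbreviated here, and tool paths are shortened):

```python
# params (not input) carries the SRR list, since the accessions
# are names to download, not files that already exist on disk
rule prefetch:
    params:
        read=lambda wildcards: lookupSRR(wildcards.species)
    output:
        outfile="/beegfs/user/Genomes/{species}/Mapping/Download_run.log"
    shell:
        """
        prefetch --max-size 100000000 {params.read}
        touch {output.outfile}
        """
```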


Ok, thank you. Now, if I understand correctly, the code will have only one job that downloads all reads at once, right? How can I create one job for each read? For instance, I have 13 reads to download and I would like to create 13 jobs, because the code here provides a list of read names, so the prefetch command downloads them one after another instead of downloading only one.


Please put your whole snakefile up on GitHub so we can see it.


Maybe you want to communicate in a chat instead?


So prefetch is multithreaded; you just need to provide it a space-delimited list of accessions. Just write a function that returns that list for each set of SRRs you need to fetch for an assembly.
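For example, a small helper along those lines (a sketch; the srr_table mapping of assembly name to accession list mirrors the dataframe in the question):

```python
def getSRRfetchArgs(assembly, srr_table):
    """Space-delimited accession list, suitable for a single
    multi-accession prefetch call."""
    return " ".join(srr_table[assembly])

srr_table = {"Assembly1": ["SRR1", "SRR2", "SRR3"]}
print(getSRRfetchArgs("Assembly1", srr_table))  # SRR1 SRR2 SRR3
```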


Yes, but to speed up the process I would like to create one job (with multiple cores) for each read independently. Do you think that is possible?


Yes, but then the download rule cannot be looking up multiple files. Just offload that lookup to another rule or function, and then simply ask for all the files in your all rule.
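Concretely, that could look like the following (a sketch in Snakemake syntax: one download job per accession via an {srr} wildcard, with the all rule requesting every per-accession sentinel file; the reads/ layout and .done sentinels are assumptions, and samples is the pandas dataframe of the mapping table):

```python
rule all:
    input:
        [f"reads/{row.Assembly}/{row.SRA_accession}.done"
         for row in samples.itertuples()]

rule download_one:
    output:
        touch("reads/{assembly}/{srr}.done")  # one sentinel per accession
    threads: 10
    shell:
        "prefetch --max-size 100000000 {wildcards.srr} && "
        "fasterq-dump --threads {threads} -f {wildcards.srr} -O reads/{wildcards.assembly}"
```

Each accession then becomes its own target, so Snakemake can schedule the 13 downloads as 13 independent jobs.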

