Hey Guys,
I have a question regarding RNA-seq analysis of cleaned data that a company processed as follows: first, Cutadapt [1] and in-house Perl scripts were used to remove reads containing adaptor contamination, low-quality bases, and undetermined bases. Sequence quality was then verified using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
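As I understand it, that cleaning step corresponds roughly to something like the following (the adapter sequence and file names are my guesses, not the company's actual command):
# trim adapters (TruSeq prefix assumed), quality-trim at Q20, drop reads containing N bases
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -q 20 --max-n 0 \
    -o sample_R1.trimmed.fq.gz -p sample_R2.trimmed.fq.gz \
    sample_R1.fq.gz sample_R2.fq.gz
# verify quality of the trimmed reads
fastqc sample_R1.trimmed.fq.gz sample_R2.trimmed.fq.gz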
I am currently using HISAT2 to map reads to the genome from ftp://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/. However, I am running into issues with the actual HISAT2 parameters: either the mapped counts are too low or the sbatch job times out.
The following is my parameter set:
#!/bin/bash
#SBATCH --ntasks=56
#SBATCH --cpus-per-task=4
#SBATCH --time=96:00:00
#SBATCH --mem=160000
#SBATCH --job-name=RNAseq_Pipeline
#SBATCH --output=logs/slurm-%j.out
#SBATCH --error=logs/slurm-%j.err
module purge module add hisat2/2.2.1 HISAT2_PARAMS="--threads 8 \
--dta \
--no-unal \
--no-mixed \
--no-discordant \
--score-min L,0,-0.1 \
--mp 2,1 \
--rdg 5,1 \
--rfg 5,1 \
--max-seeds 20 \
--phred33 \
--ignore-quals \
--add-chrname \
--summary-file"
Any suggestions? Are the parameters too liberal, or is it simply the amount of data in my 9 samples?
You mention the "counts are too low" — why do you think so? Can you put a number on it?
Does the sbatch script run but time out, or is it stuck in the queue (and thus never even start)?
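For a concrete number: since you use --no-unal, unaligned reads are not written to the output, so don't compute the rate from the SAM/BAM alone; use the summary HISAT2 prints (or writes with --summary-file), e.g.:
# read totals and alignment rate as reported by HISAT2 itself (file name is a placeholder)
grep -E "reads; of these|overall alignment rate" sample_summary.txt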
At first sight there might be a syntax issue in your sbatch script: you'll need a ; (semicolon) or && between "purge" and the (second) "module". These are two separate commands, so you'll need to indicate them as such.
I also question your resource request: you ask for 56 tasks with 4 cores/threads each, apparently. Do you have that kind of resources available on your HPC system? Also, you only ask HISAT2 to run on 8 threads (which is fine; more will likely not add much), so perhaps consider reducing your sbatch resource requests.
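For example, something along these lines should be plenty for an 8-thread HISAT2 run (the memory figure is a rough guess; a mouse index needs far less than 160 GB):
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=96:00:00
module purge
module add hisat2/2.2.1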
Hi,
Thank you for your comment.
Regarding your question about the "counts are too low": I'm working with the cleaned data provided by the company and attempting to reproduce their mapped counts for this sample, so that I can establish a working script to apply to the rest of the samples. However, I'm losing approximately 50% of the mapped counts in the process.
Just to clarify, the module purge is handled separately in my code.
As for the task, the HPC system has sufficient resources. The only issue I encountered is the cancellation of my Slurm job due to the time limit: JOB 199685 ON c141601 CANCELLED AT 2025-05-05T10:46:07 DUE TO TIME LIMIT.
OK, good to hear on the command syntax (and HPC requests)!
I see; best would be to contact the company and ask for the details of their mapping process. Using a different aligner and/or different parameter settings will likely affect the outcome, but a difference of nearly 50% is a bit too much to attribute solely to those kinds of variables.
Also make sure to use the same version of the genome etc.
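For example (exact file name assumed from the usual Ensembl layout; verify it against the CHECKSUMS file in that directory):
wget ftp://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
gunzip Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
# build the HISAT2 index from the same release the company used
hisat2-build -p 8 Mus_musculus.GRCm39.dna.primary_assembly.fa grcm39_index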
You mention your job reaches the wall-time limit, but with what kind of parameter settings is that? Four days should be more than enough to reach the end with any parameter settings (and any input), especially given the resources you request.
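You can also check what the job actually did before it was cancelled using Slurm's accounting tools, e.g.:
sacct -j 199685 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State
# seff gives a compact efficiency report, if it is installed on your cluster
seff 199685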