Hey Guys,
I have a question regarding RNA-seq analysis of cleaned data that a company processed as follows: first, Cutadapt [1] and in-house Perl scripts were used to remove reads containing adaptor contamination, low-quality bases, and undetermined bases. Sequence quality was then verified using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
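As I understand it, that cleaning step corresponds roughly to something like the following (the adapter sequence and file names are my guesses, not the company's actual command):
# trim adapters (TruSeq prefix assumed), quality-trim at Q20, drop reads containing N bases
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -q 20 --max-n 0 \
    -o sample_R1.trimmed.fq.gz -p sample_R2.trimmed.fq.gz \
    sample_R1.fq.gz sample_R2.fq.gz
# verify quality of the trimmed reads
fastqc sample_R1.trimmed.fq.gz sample_R2.trimmed.fq.gz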
I am currently using HISAT2 to map reads to the genome from ftp://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/. However, I am running into issues with the actual HISAT2 parameters: either the mapped counts are too low or the sbatch job times out.
The following is my parameter set:
#!/bin/bash
#SBATCH --ntasks=56
#SBATCH --cpus-per-task=4
#SBATCH --time=96:00:00
#SBATCH --mem=160000
#SBATCH --job-name=RNAseq_Pipeline
#SBATCH --output=logs/slurm-%j.out
#SBATCH --error=logs/slurm-%j.err
module purge module add hisat2/2.2.1 HISAT2_PARAMS="--threads 8 \
--dta \
--no-unal \
--no-mixed \
--no-discordant \
--score-min L,0,-0.1 \
--mp 2,1 \
--rdg 5,1 \
--rfg 5,1 \
--max-seeds 20 \
--phred33 \
--ignore-quals \
--add-chrname \
--summary-file"
Any suggestions? Are the parameters too liberal, or is it simply the amount of data in my 9 samples?
You mention the "counts are too low" — why do you think so? Can you put a number on it?
Does the sbatch script run but time out, or is it stuck in the queue (and thus never even start)?
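For a concrete number: since you use --no-unal, unaligned reads are not written to the output, so don't compute the rate from the SAM/BAM alone; use the summary HISAT2 prints (or writes with --summary-file), e.g.:
# read totals and alignment rate as reported by HISAT2 itself (file name is a placeholder)
grep -E "reads; of these|overall alignment rate" sample_summary.txt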
At first sight there might be a syntax issue in your sbatch script: you'll need a ; (semicolon) or && between "purge" and the (second) "module". These are two separate commands, so you'll need to indicate them as such.
I also question your resource request: you ask for 56 tasks with 4 cores/threads each, apparently. Do you have that kind of resources available on your HPC system? Also, you only ask HISAT2 to run on 8 threads (which is fine; more will likely not add much), so perhaps consider reducing your sbatch resource requests.
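For example, something along these lines should be plenty for an 8-thread HISAT2 run (the memory figure is a rough guess; a mouse index needs far less than 160 GB):
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=96:00:00
module purge
module add hisat2/2.2.1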
Hi,
Thank you for your comment.
Regarding your question about the "counts are too low": I'm working with the cleaned data provided by the company and attempting to reproduce their mapped counts for this sample, so that I can establish a working script to apply to the rest of the samples. However, I'm losing approximately 50% of the mapped counts in the process.
Just to clarify, the module purge is handled separately in my code.
As for the task, the HPC system has sufficient resources. The only issue I encountered is the cancellation of my Slurm job due to the time limit: JOB 199685 ON c141601 CANCELLED AT 2025-05-05T10:46:07 DUE TO TIME LIMIT.
OK, good to hear on the command syntax (and HPC requests)!
I see; best would be to contact the company and ask for the details of their mapping process. Using a different aligner and/or different parameter settings will likely affect the outcome, but a difference of nearly 50% is a bit too much to attribute solely to those kinds of variables.
Also make sure to use the same version of the genome etc.
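For example (exact file name assumed from the usual Ensembl layout; verify it against the CHECKSUMS file in that directory):
wget ftp://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
gunzip Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
# build the HISAT2 index from the same release the company used
hisat2-build -p 8 Mus_musculus.GRCm39.dna.primary_assembly.fa grcm39_index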
You mention your job reaches the wall-time limit, but with what kind of parameter settings is that? Four days should be more than enough to reach the end with any parameter settings (and any input), especially given the resources you request.
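You can also check what the job actually did before it was cancelled using Slurm's accounting tools, e.g.:
sacct -j 199685 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State
# seff gives a compact efficiency report, if it is installed on your cluster
seff 199685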