Question

Does GNU parallel cause problems for Salmon mapping (RNA-seq)?

0

Entering edit mode

2.8 years ago

skjw1029 ▴ 70

I have a few thousand FastQ files from a single cell RNA-seq experiment that I want to align and quantify with Salmon. (Paired-end, Illumina Smart-Seq2, Human)

Because the file names vary, I thought using GNU parallel to feed the files to salmon would be simpler than creating a for loop script (I have limited experience with bash and regex).

This was my code (slurm stuff excluded):

module load parallel
module load salmon

parallel salmon quant -i /work/InternalMedicine/s184335/genome_folder/alias/hg38/salmon_sa_index/default/. -l A --validateMappings --gcBias --seqBias --threads 48 -o MDS_salmon_pseudoalignmentandquant -1 {} -2 {=s/_R1_001_val_1/_R2_001_val_2/=} ::: *_R1_001_val_1.fq.gz

Now, when I take a look at the output file, for most of the samples, I get something like this:

where the program reloads salmon before it finishes analyzing the previous sample, and moves on. In some cases, I will also get an error message saying salmon quant was invoked incorrectly.

Salmon quant output file

Salmon quant exception message

Strangely though, towards the end of the output file, it shows several samples that were successfully mapped and quantified by salmon (I think). However, if I go to the results folder, there is only one quant.sf file for one sample.

One last thing of note is that, when I submit the batch script (I'm using an HPC) the job gets interrupted every so often because of a node fail.

I have tried running salmon quant with only one sample and that seems to work okay.

Could GNU parallel somehow be causing these problems? Perhaps my script is problematic and makes Salmon prematurely move on to the next sample, without mapping/writing the results?

I'm quite lost as I can't see an obvious error message and I'm rather new to bioinformatics. Would appreciate any help.

Salmon parallel • 1.2k views

ADD COMMENT • link 2.8 years ago by skjw1029 ▴ 70

score 3 · Accepted Answer · 2021-06-30

3

Entering edit mode

2.8 years ago

GenoMax 141k

Strangely though, towards the end of the output file, it shows several samples that were successfully mapped and quantified by salmon (I think). However, if I go to the results folder, there is only one quant.sf file for one sample.

and

Perhaps my script is problematic and makes Salmon prematurely move on to the next sample, without mapping/writing the results?

Since you are using a single output folder for all jobs -o MDS_salmon_pseudoalignmentandquant each job is (over)writing files in the same folder. salmon is one of those programs that produces output files that have generic (and identical) names for every sample processed. Please set a new folder for each pair of files processed to hold the output. That should fix this problem. I am not a parallel user so you may need to look up how you can specify a variable output folder name in that command.

I don't know how these files are related but if some of these are biological replicates then you can specify more than one set of files as input (see the section on "Providing multiple read files to salmon" LINK).

salmon version you are using is old so please upgrade as the message suggests.

ADD COMMENT • link 2.8 years ago by GenoMax 141k

0

Entering edit mode

Thanks for the answer. I'll try to figure out how to set up a new folder for each pair of files processed using parallel.

But I'm wondering how this will also solve the problem of salmon moving on to the next pair of files before mapping and quantifying the previous pair.

As noted and pictured above:

where the program reloads salmon before it finishes analyzing the previous sample, and moves on.

Specifying a variable output folder will probably solve the issue with only one quant.sf file but I'm not sure how it affects salmon not mapping and quantifying most of the samples. It doesn't even seem to read the pair of files, stopping after it loads the reference transcriptome.

Also, the files are not biological replicates. They are just paired reads from each single cell, so I'm passing them to salmon two by two, with each sample being a single cell.

ADD REPLY • link 2.8 years ago by skjw1029 ▴ 70

2

Entering edit mode

how this will also solve the problem of salmon moving on to the next pair of files before mapping and quantifying the previous pair.

My hunch is that as soon as a new job starts it immediately overwrites the folder since you are using the same name for all jobs. Prior result data is thus lost and you end up with quant results of the last sample processed at end. When this is not the case e.g. Run1_out is folder for Run1 , Run2_out is folder for Run2 there would be no contention with directory names and writes happening to those directories.

salmon is fast enough that you could simply submit independent SLURM jobs and not use parallel at all. Something like this

for i in `ls -1 *R1*`; do name=$(basename ${i} _R1.fastq.gz); \
dname=$(dirname ${i}); echo ${dname}; echo ${name}; \
sbatch -p partition -t 1-0 -n 8 -N 1 --mem=30g -o ${name}_%J_salmon.out --wrap="salmon quant -i salmon_index/default -l A -1 ${dname}/${name}_R1.fastq.gz -2 ${dname}/${name}_R2.fastq.gz -p 8 --validateMappings -o ${name}_quant"; done

This will get you a separate output directory (with name of the sample) for every sample.

ADD REPLY • link 2.8 years ago by GenoMax 141k

0

Entering edit mode

Thanks mate, it looks like that was indeed the problem.

Also, thank you for the example code! I made it a single job since I didn't want to run thousands of individual jobs.

On a separate note, I really need to properly learn bash and reg ex.

Below is the final job script that I ran:

#!/bin/bash                                                                                                                                                                                                                                                                             
#Salmon pseudoalignment and quantification!!                                                                                                                                                                                                                                            

#SBATCH -J MDS_salmon        # Job name                                                                                                                                                                                                                                                 
#SBATCH -o MDS_salmon.o%j # stdout output file                                                                                                                                                                                                                                          
#SBATCH -e MDS_salmon.o%j # stderr output file                                                                                                                                                                                                                                          
#SBATCH -N 1                    # Number of nodes requested                                                                                                                                                                                                                             
#SBATCH --ntasks=1              # number of mpi tasks/threads requested                                                                                                                                                                                                                 
#SBATCH --cpus-per-task=32      # number of cpus/cores dedicated to a single task                                                                                                                                                                                                       
#SBATCH -t 10:00:00             # Run time (hh:mm:ss) - 10 hours                                                                                                                                                                     
#SBATCH --partition=256GB


module load salmon


for i in `ls -1 *R1*`; do name=$(basename ${i} _R1_001_val_1.fq.gz); \
dname=$(dirname ${i}); echo ${dname}; echo ${name}; \
salmon quant -i /work/InternalMedicine/s184335/genome_folder/alias/hg38/salmon_sa_index/default/. -l A -1 ${name}_R1_001_val_1.fq.gz -2 ${name}_R2_001_val_2.fq.gz --threads 48 --validateMappings --gcBias --seqBias -o ${name}_quant; done

I will try to do the same with parallel and update for posterity.

ADD REPLY • link 2.8 years ago by skjw1029 ▴ 70