How to speed up trimming using Trim Galore
2
3
3.4 years ago
lamia_203 ▴ 100

I am currently trying to follow a scRNA-seq pipeline, but I have run into a couple of problems. I am using Trim Galore to trim low-quality bases (quality score below 20) from paired-end reads. For one particular sample it has been running for over 12 hours and has only finished 4 of the 19 read pairs. Is there a reason for this, or any way to speed up the process? The fastq files for these samples are really large: some are over 5 million KB (about 5 GB), and there are 19 pairs of read files. Other samples worked fine before. Other tools do not work properly; only Trim Galore works well.

for i in *_1.fastq.gz; do
    trim_galore -q 20 --paired -o trimmed "$i" "${i%_1.fastq.gz}_2.fastq.gz"
done

RNA-Seq fastq trim • 5.9k views
2

other tools do not work properly, only trim galore works well

I find that hard to believe. There are multithreaded trimming tools (e.g. bbduk.sh from the BBMap suite) that will run as fast as your disk I/O and the number of available cores allow. That said, if your computer is I/O bound (e.g. you are using a regular spinning disk), then you may already be at the limit.

In addition to the solution below, you can also look into using parallel: Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

3
3.4 years ago

Depending on your machine's memory and number of CPUs available, you could try to parallelize it. The for loop you have does them one by one. Check current usage with top, and see how many you can do in parallel. If you estimate that you can run 10 jobs at the same time, you could do something like:

ls *_1.fastq.gz | xargs -P10 -I@ bash -c 'trim_galore -q 20 --paired -o trimmed "$1" "${1%_1.*.*}_2.fastq.gz"' _ @
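
One rough way to pick the number passed to -P is to start from the core count and leave one core free. This is only a sketch: suggest_jobs is a hypothetical helper name, it assumes nproc (GNU coreutils) or getconf is available, and it ignores memory and disk I/O, which may be the real limit. Each trim_galore job may also start more than one process, so treat the result as an upper bound.

```shell
# Hypothetical helper: suggest a number of parallel jobs from the core count.
suggest_jobs() {
    # Try nproc first, then getconf, and fall back to 1 if neither works.
    cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    # Leave one core free for the rest of the system, but never go below 1.
    echo $(( cores > 1 ? cores - 1 : 1 ))
}

echo "Suggested value for -P: $(suggest_jobs)"
```

It is still worth watching top during the first few minutes and lowering the number if the machine starts swapping or the disk is saturated.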


Otherwise, see if there is any parameter you can use in your Trim Galore call. For example, --dont_gzip could be faster, and you can gzip the files later.

More detail about the command: instead of looping, it pipes the list of files to xargs, which runs 10 parallel processes (-P10) and substitutes each input file for the @ in the command that follows. To keep a shell substitution similar to the one you had, it calls a subshell (bash -c). The _ is just a placeholder that gets assigned to $0 (the process name), so your input fastq.gz file becomes $1.
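
To check the pairing before committing hours of compute, the same xargs pattern can be run as a dry run. The sketch below puts echo in front of trim_galore so the commands are only printed; preview_pairs and the sample file names are made up for illustration, and -P1 is used so the output order is predictable.

```shell
# Hypothetical dry-run helper: reads forward-read file names on stdin and
# prints the trim_galore command that would be run for each pair.
preview_pairs() {
    # "_" fills $0 of the inner shell; the file name substituted for @ becomes $1.
    xargs -P1 -I@ bash -c 'echo trim_galore -q 20 --paired -o trimmed "$1" "${1%_1.*.*}_2.fastq.gz"' _ @
}

# Made-up file names; no files need to exist for the dry run.
printf '%s\n' sampleA_1.fastq.gz sampleB_1.fastq.gz | preview_pairs
```

Once the printed commands look right, removing echo and raising -P gives the real run.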

0

This solved my long term problem. Thanks for this.

3
3.4 years ago
mbk0asis ▴ 640

Use GNU Parallel!

Your for loop processes one sample at a time (it waits until one sample has finished before starting the next).

Use 'parallel' to process multiple samples simultaneously!

If you have reads like this:

sample1.R1.fq.gz
sample1.R2.fq.gz

sample2.R1.fq.gz
sample2.R2.fq.gz

and so on


try something like this:

ls -1 *.fq.gz | cut -d. -f1 | sort | uniq | parallel -j 10 'trim_galore --paired {}.R1.fq.gz {}.R2.fq.gz'

Note: change the -j parameter to adjust the number of jobs you want to run simultaneously.
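
The first half of that pipeline only builds the list of sample names, and it can be tested on its own. A small sketch, with sample_names as a made-up function name and sort -u standing in for sort | uniq:

```shell
# Hypothetical helper: turn a list of read files into unique sample names.
sample_names() {
    # Keep everything before the first dot, then sort and de-duplicate.
    cut -d. -f1 | sort -u
}

printf '%s\n' sample1.R1.fq.gz sample1.R2.fq.gz sample2.R1.fq.gz sample2.R2.fq.gz | sample_names
# prints:
# sample1
# sample2
```

Because cut splits on the first dot, a sample name that itself contains a dot would be truncated, so this approach assumes dot-free sample names.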

2

Shorter:

parallel --plus 'trim_galore --paired {...}.R1.fq.gz {...}.R2.fq.gz' ::: *R1.fq.gz
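
For anyone unfamiliar with {...}: with --plus it stands for the input with three extensions removed. The plain-shell sketch below mimics that with parameter expansion so the effect can be seen without GNU parallel installed; strip_three_ext is a made-up name, and this is an approximation of the behaviour rather than parallel's own implementation.

```shell
# Hypothetical helper: remove the last three dot-separated extensions.
strip_three_ext() {
    # ${1%.*.*.*} drops the shortest suffix of the form ".x.y.z".
    printf '%s\n' "${1%.*.*.*}"
}

strip_three_ext sample1.R1.fq.gz   # prints: sample1
```

So for sample1.R1.fq.gz the command line becomes trim_galore --paired sample1.R1.fq.gz sample1.R2.fq.gz.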

0

Much better! Thanks~

2

Note that if you use GNU parallel, you're required to cite it in your publications: https://github.com/martinda/gnu-parallel/blob/master/CITATION. That is why my answer suggested xargs -P instead for a simple use case like this one. xargs is also a standard tool on Linux distributions, which helps if you intend to redistribute your code.

However, one advantage of parallel is that you can distribute the load across different machines using -S machine1.example.com,machine2.example.com,machine3.example.com, though that's a little more involved.