Question: How to speed up trimming using Trim Galore
lamia_203 wrote:

I am currently trying to follow a scRNA-seq pipeline, but I have run into a couple of problems. When trimming a paired-end sample with Trim Galore to remove reads with a quality score below 20, one particular sample has been running for over 12 hours and has only finished 4 of the 19 read pairs. Is there a reason for this, or any way to speed this process up? The fastq files for these samples are really large - some are over 5 million KB (roughly 5 GB) - and there are 19 pairs of read files. Other samples worked fine before, and other tools do not work properly; only Trim Galore works well.

    for i in *_1.fastq.gz;
    do
        trim_galore \
            -q 20 \
            --paired \
            -o trimmed "$i" "${i%_1.fastq.gz}_2.fastq.gz";
    done
genomax replied:

other tools do not work properly; only Trim Galore works well

I find that hard to believe. There are threaded trimming tools (e.g. bbduk.sh from the BBMap suite) that will work as fast as your disk I/O and the number of available cores allow. That said, if your computer is I/O bound (e.g. you are using a regular spinning disk), then things may already be at their limit.
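For reference, a paired-end quality-trimming call with bbduk.sh might look roughly like the sketch below; the file names are placeholders, and the flags shown (qtrim=, trimq=, t=) should be checked against your local BBMap documentation.

    # sketch only: quality-trim both read ends at Q20 with bbduk.sh (BBMap)
    # adjust t= to the number of cores you have; the trimmed/ directory is assumed to exist
    bbduk.sh in1=sample_1.fastq.gz in2=sample_2.fastq.gz \
             out1=trimmed/sample_1.fastq.gz out2=trimmed/sample_2.fastq.gz \
             qtrim=rl trimq=20 t=8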

In addition to the solution below, you can also look into using GNU parallel: Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

mbk0asis wrote:

Use GNU Parallel!

Your for loop processes one sample at a time (it waits until one sample is finished before starting the next).

Use 'parallel' to process multiple samples simultaneously!

If you have reads like this:

sample1.R1.fq.gz
sample1.R2.fq.gz

sample2.R1.fq.gz
sample2.R2.fq.gz

and so on

try something like this:

ls -1 *.fq.gz | cut -d. -f1 | sort | uniq | parallel -j 10 'trim_galore --paired {}.R1.fq.gz {}.R2.fq.gz'

Note: change the -j parameter to adjust the number of jobs you want to run simultaneously.
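For the file names in the original question (*_1.fastq.gz / *_2.fastq.gz), a roughly equivalent call might look like the sketch below; -q 20, --paired and the trimmed output directory are taken from the original loop, and -j 4 is just an example value.

    # sketch adapted to the question's *_1.fastq.gz / *_2.fastq.gz naming
    ls -1 *_1.fastq.gz | sed 's/_1\.fastq\.gz$//' | sort -u | \
        parallel -j 4 'trim_galore -q 20 --paired -o trimmed {}_1.fastq.gz {}_2.fastq.gz'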
ole.tange replied:

Shorter:

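# with --plus, {...} strips up to three extensions, so sample1.R1.fq.gz becomes sample1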
parallel --plus 'trim_galore --paired {...}.R1.fq.gz {...}.R2.fq.gz' ::: *R1.fq.gz

mbk0asis replied:

Much better! Thanks~

manuel.belmadani replied:

Note that if you use GNU parallel, you're required to cite it in your publications: https://github.com/martinda/gnu-parallel/blob/master/CITATION, which is why my answer suggested xargs -P instead for a simple use case like this (besides, xargs is a standard tool on Linux distributions, which matters if you intend to redistribute your code).

However, one advantage of parallel is that you can distribute your load across different machines using -S machine1.example.com,machine2.example.com,machine3.example.com, though that's a little more involved.

manuel.belmadani wrote:

Depending on your machine's memory and the number of CPUs available, you could try to parallelize it. The for loop you have processes the samples one by one. Check current usage with top and see how many jobs you can run in parallel. If you estimate that you can run 10 jobs at the same time, you could do something like:

ls *_1.fastq.gz | xargs -P10 -I@ bash -c 'trim_galore -q 20 --paired -o trimmed "$1" "${1%_1.*.*}_2.fastq.gz"' _ @

Otherwise, see if there is any parameter in your Trim Galore call that could help (e.g. --dont_gzip could be faster, and you can gzip the output files later).
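For example, a minimal sketch of a single call with --dont_gzip added (the file names are placeholders; the uncompressed output can be gzipped afterwards):

    # sketch: skip on-the-fly compression, then compress the trimmed output later
    trim_galore -q 20 --paired --dont_gzip -o trimmed sample_1.fastq.gz sample_2.fastq.gz
    gzip trimmed/*.fq   # trimmed files typically end in _val_1.fq / _val_2.fq; adjust if needed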

More detail about the command: instead of looping, it pipes the list of files to xargs, which runs 10 parallel processes (-P10) and substitutes the @ in the following command with each input file. To keep a shell substitution similar to the one you had, it calls a subshell (bash -c); the _ is just a placeholder that becomes $0 (the process name), and your input fastq.gz file becomes $1.
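A quick way to see that substitution at work, using a hypothetical file name:

    # toy check of the $0/$1 mechanism used above (no trimming, just echo)
    echo sample_1.fastq.gz | xargs -I@ bash -c 'echo "R1=$1  R2=${1%_1.*.*}_2.fastq.gz"' _ @
    # prints: R1=sample_1.fastq.gz  R2=sample_2.fastq.gz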
