Question: How to speed up trimming using Trim Galore
lamia_203 wrote:

I am currently trying to follow a scRNA-seq pipeline, but I have run into a couple of problems. When trimming a paired-end sample with Trim Galore to remove reads with a quality score below 20, one particular sample has been running for over 12 hours and has only finished 4 of its 19 read pairs. Is there a reason for this, or any way we can speed this process up? The fastq files for these samples are really large - some are over 5 million KB (roughly 5 GB) - and there are 19 pairs of read files. Other samples worked fine before; other tools do not work properly for us, only Trim Galore works well.

    for i in *_1.fastq.gz; do
        trim_galore -q 20 --paired -o trimmed "$i" "${i%_1.fastq.gz}_2.fastq.gz"
    done

"other tools do not work properly, only trim galore works well"

I find that hard to believe. There are threaded trimming tools (e.g. bbduk.sh from the BBMap suite) that will run as fast as your disk I/O and the number of available cores allow. That said, if your computer is I/O bound (e.g. you are using a regular spinning disk), then things may already be at their limit.
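For reference, quality-only trimming with bbduk.sh might look roughly like the sketch below (parameter names as in the BBDuk help text; the sample_1/sample_2 file names are placeholders, and adapter removal would need additional ref=/ktrim= options):

    # Quality-trim both ends of a read pair to Q20 using 8 threads.
    # File names are placeholders; the trimmed/ directory must already exist.
    bbduk.sh in1=sample_1.fastq.gz in2=sample_2.fastq.gz \
             out1=trimmed/sample_1.fastq.gz out2=trimmed/sample_2.fastq.gz \
             qtrim=rl trimq=20 threads=8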

In addition to the solution below, you can also look into using GNU parallel: Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

-- genomax
mbk0asis wrote:

Use GNU Parallel!

Your for-loop processes one sample at a time (it waits until one sample is finished before starting the next).

Use 'parallel' to process multiple samples simultaneously!

If you have reads like this:

sample1.R1.fq.gz
sample1.R2.fq.gz

sample2.R1.fq.gz
sample2.R2.fq.gz

and so on

try something like this:

ls -1 *.fq.gz | cut -d. -f1 | sort | uniq | parallel -j 10 'trim_galore --paired {}.R1.fq.gz {}.R2.fq.gz'

Note: change the -j parameter to adjust the number of jobs you want to run simultaneously.

Shorter:

parallel --plus 'trim_galore --paired {...}.R1.fq.gz {...}.R2.fq.gz' ::: *R1.fq.gz
-- ole.tange

Much better! Thanks~

-- mbk0asis

Note that if you use GNU parallel, you're required to cite it in your publications: https://github.com/martinda/gnu-parallel/blob/master/CITATION, which is why my answer suggests xargs -P instead for a simple use case like this (in addition to xargs being a standard tool on Linux distributions, if you intend to redistribute your code).

However, one advantage of parallel is that you can distribute your load across different machines using -S machine1.example.com,machine2.example.com,machine3.example.com, though that's a little more involved.
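As a rough illustration only (the host names are placeholders, and this is shown single-end for simplicity; paired reads would need both files transferred):

    # Spread trimming jobs over two remote hosts.
    # --transferfile copies each input file to the remote host first and
    # --cleanup removes that copy afterwards; without --return the trimmed
    # output stays on the remote machine, so fetch it or add --return.
    parallel -S machine1.example.com,machine2.example.com \
             --transferfile {} --cleanup \
             'trim_galore -q 20 {}' ::: *.fq.gz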

-- manuel.belmadani
manuel.belmadani wrote:

Depending on your machine's memory and number of CPUs available, you could try to parallelize it. The for loop you have does them one by one. Check current usage with top, and see how many you can do in parallel. If you estimate that you can run 10 jobs at the same time, you could do something like:

ls *_1.fastq.gz | xargs -P10 -I@ bash -c 'trim_galore -q 20 --paired -o trimmed "$1" "${1%_1.*.*}_2.fastq.gz"' _ @

Otherwise, see if there are any parameters in your Trim Galore call that could help (e.g. --dont_gzip could be faster, and you can gzip your files later).
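For example, a sketch of the original loop with compression deferred (this assumes the trimmed output lands as uncompressed .fq files in trimmed/, which you can compress afterwards):

    # Same loop as in the question, but skip on-the-fly gzip compression
    # (--dont_gzip is a documented Trim Galore option) and compress later.
    for i in *_1.fastq.gz; do
        trim_galore -q 20 --paired --dont_gzip -o trimmed "$i" "${i%_1.fastq.gz}_2.fastq.gz"
    done
    gzip trimmed/*.fq    # or pigz for multi-threaded compression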

More detail about the xargs command above: instead of looping, it pipes the list of files to xargs, which runs 10 parallel processes (-P10) and substitutes the @ in the following command with each input file. To keep a shell substitution similar to the one you had, it calls a subshell (bash -c); _ is just a placeholder that becomes $0 (the process name), and your input fastq.gz file becomes $1.
