Question: Trim_galore on multiple fastq.gz files: syntax problem
m93170 wrote:

I was reading the following post about how to run trim_galore on multiple paired-end fastq.gz files: "TrimGalore! on multiple paired fastq files".

I installed GNU parallel and intended to use the same command suggested by eldronzhou:

find  path_to_fastq  -name "*_R1_merged.fastq.gz" | cut -d "_" -f1 | parallel -j 1 trim_galore --illumina --paired --fastqc -o trim_galore/ {}\_R1_merged.fastq.gz {}\_R2_merged.fastq.gz

However, my files are named slightly differently: XXX_XX_L008_R1_001.fastq.gz and XXX_XX_L008_R2_001.fastq.gz. Therefore I changed the command above to the following:

find  path_to_fastq  -name "*_R1_001.fastq.gz" | cut -d "R" -f1 | parallel -j 1 trim_galore --paired --fastqc -o trim_galore/ {}R1_001.fastq.gz {}R2_001.fastq.gz

However, I get the following error (showing up once for each pair of *fastq.gz files):

Please provide an even number of input files for paired-end FastQ trimming! Aborting ...

I'm guessing something is wrong in my syntax and somehow the order of the files I provide is wrong - maybe R1 and R2 are not given in the right pairs somehow? My *fastq.gz files are in a separate folder and I have 20 files (so 10 pairs).

I cannot work out what is wrong, any help would be deeply appreciated.

UPDATE

After running my command with --dry-run, I get the following output:

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970085_CTCAAGC_L008_R1_001.fastq.gz ../../fastq/E970085_CTCAAGC_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970096_CGAAGGT_L008_R1_001.fastq.gz ../../fastq/E970096_CGAAGGT_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970084_CCTTGTC_L008_R1_001.fastq.gz ../../fastq/E970084_CCTTGTC_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970090_TCAGAAG_L008_R1_001.fastq.gz ../../fastq/E970090_TCAGAAG_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970092_ACAGTAC_L008_R1_001.fastq.gz ../../fastq/E970092_ACAGTAC_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970038_CTGGTTG_L008_R1_001.fastq.gz ../../fastq/E970038_CTGGTTG_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970094_AGGACTG_L008_R1_001.fastq.gz ../../fastq/E970094_AGGACTG_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970073_CTAGGTC_L008_R1_001.fastq.gz ../../fastq/E970073_CTAGGTC_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970088_GGATCAT_L008_R1_001.fastq.gz ../../fastq/E970088_GGATCAT_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970095_CAGTCAT_L008_R1_001.fastq.gz ../../fastq/E970095_CAGTCAT_L008_R2_001.fastq.gz

I still can't see an obvious mistake...

UPDATE2

I think I may have found the mistake. It seems that when using GNU parallel, I end up with commands (see just above) lacking the quotes around "--outdir ...". So I think the answer is to escape the quotes:

find ../../fastq/ -name "*R1_001.fastq.gz" | cut -d "R" -f1 | parallel --dry-run -j 1 trim_galore --paired --fastqc_args \"--outdir /home/user/my_projects/project1/data/qc/trim-galore\" {}R1_001.fastq.gz {}R2_001.fastq.gz

# Output command I want (one example only)
trim_galore --paired --fastqc_args "--outdir /home/user/my_projects/project1/data/qc/trim-galore" ../../fastq/E970085_CTCAAGC_L008_R1_001.fastq.gz ../../fastq/E970085_CTCAAGC_L008_R2_001.fastq.gz
nsg trim_galore gnu parallel • 1.9k views
ADD COMMENTlink modified 17 months ago • written 17 months ago by m93170
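
The quoting problem described in UPDATE2 can be illustrated without trim_galore or GNU parallel. The sketch below only counts how many separate words a command receives; count_args is a hypothetical helper and /tmp/qc and the s_R1/s_R2 file names are placeholders. A plausible reading of the error, assuming --fastqc_args consumes exactly one following word, is that the unquoted form leaves the output path behind as a third "input file", giving an odd number of files.

```shell
#!/bin/sh
# Hypothetical helper (not part of trim_galore or GNU parallel):
# prints how many separate arguments a command would receive.
count_args() {
    echo "$#"
}

# Unquoted, as in the --dry-run output: "--outdir" and the path are two words.
count_args --paired --fastqc_args --outdir /tmp/qc s_R1_001.fastq.gz s_R2_001.fastq.gz   # prints 6

# Quoted, as in the intended command: "--outdir /tmp/qc" stays one word.
count_args --paired --fastqc_args "--outdir /tmp/qc" s_R1_001.fastq.gz s_R2_001.fastq.gz  # prints 5
```

The escaped quotes (\") in the UPDATE2 command are one way to make the quotes survive into the command that parallel builds.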

did you count the number of files in path_to_fastq directory? does the folder contain matching R1 and R2 files?

ADD REPLYlink written 17 months ago by cpad011212k

Yes, there are 20, so they are in 10 pairs. The folder contains a .txt and a .sha1 file but surely, given my command above, that should not be a problem?

ADD REPLYlink written 17 months ago by m93170

The syntax seems fine to me after checking with a few dummy files. Add --dry-run immediately after the parallel command and check the output of the dry run.

ADD REPLYlink written 17 months ago by cpad011212k

This is so bizarre.. The --dry-run returns that my files are in the right pairs!

ADD REPLYlink written 17 months ago by m93170

You would have completed the trim runs by now if you had run them serially :-)

ADD REPLYlink written 17 months ago by genomax75k

Well true haha but I intend to run this on over 100 samples eventually so I need to know how to do this. I seriously cannot understand what I'm doing wrong

ADD REPLYlink written 17 months ago by m93170

@ole.tange is the developer of parallel, so the answer below should work.

ADD REPLYlink written 17 months ago by genomax75k

Why do you need find path_to_fastq -name "*_R1_001.fastq.gz"? A simple ls -1 *_R1_001.fastq.gz should do. Make sure ls -1 *_R1_001.fastq.gz | wc -l gets an equal number as ls -1 *_R2_001.fastq.gz | wc -l.

ADD REPLYlink written 17 months ago by genomax75k
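
Comparing the two counts confirms that the totals agree, but not that each R1 file has its own R2 mate. A minimal sketch of a per-file check, assuming the *_R1_001.fastq.gz / *_R2_001.fastq.gz naming from the question; missing_mates is a hypothetical helper name:

```shell
#!/bin/sh
# List every R1 file in a directory that has no matching R2 file.
# Assumes the naming scheme from the question:
#   <sample>_R1_001.fastq.gz paired with <sample>_R2_001.fastq.gz
# missing_mates is a hypothetical helper name.
missing_mates() {
    dir=$1
    for r1 in "$dir"/*_R1_001.fastq.gz; do
        # If the glob matched nothing, the pattern itself is left in $r1.
        if [ ! -e "$r1" ]; then
            continue
        fi
        # Strip the R1 suffix and append the R2 suffix.
        r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "$r1"
        fi
    done
}

# Usage (prints nothing when every R1 file has a mate):
#   missing_mates path_to_fastq
```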

There are definitely the right number of files when I do those checks. I think it's something to do with my find command, which is not listing the R1 files in the same order as in the folder. I tried replacing find with ls -1 but I have the same problem. This is so confusing.

ADD REPLYlink modified 17 months ago • written 17 months ago by m93170

In general, for most tools, output directories and output files are never quoted. I am not sure about trim_galore's requirements. I think you should first run the program with a bare-bones command, e.g. remove --fastqc_args from the command above.

ADD REPLYlink written 17 months ago by cpad011212k
ole.tange3.6k wrote:

This:

find  path_to_fastq  -name "*_R1_001.fastq.gz" |
  parallel -j 1 trim_galore --paired --fastqc -o trim_galore/ {} {= s/_R1_/_R2_/ =}

or:

find  path_to_fastq  -name "*_R1_001.fastq.gz" |
  parallel --plus -j 1 trim_galore --paired --fastqc -o trim_galore/ {} {/_R1_/_R2_}

will run:

trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abaci_R1_001.fastq.gz path_to_fastq/abaci_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/aardvarks_R1_001.fastq.gz path_to_fastq/aardvarks_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/a_R1_001.fastq.gz path_to_fastq/a_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/aardvark_R1_001.fastq.gz path_to_fastq/aardvark_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abalones_R1_001.fastq.gz path_to_fastq/abalones_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abaft_R1_001.fastq.gz path_to_fastq/abaft_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abacus_R1_001.fastq.gz path_to_fastq/abacus_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abacuses_R1_001.fastq.gz path_to_fastq/abacuses_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/aback_R1_001.fastq.gz path_to_fastq/aback_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abalone_R1_001.fastq.gz path_to_fastq/abalone_R2_001.fastq.gz

If that does not work out of the box, try running each command by hand one at a time.

ADD COMMENTlink modified 17 months ago • written 17 months ago by ole.tange3.6k
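
For running or inspecting the commands one at a time without GNU parallel, the R1-to-R2 substitution from the answer above can be approximated in plain shell. This is a sketch under stated assumptions, not a replacement for the parallel commands: mate_of and dry_run_pairs are hypothetical helper names, and the trim_galore commands are only printed, never executed.

```shell
#!/bin/sh
# mate_of is a hypothetical helper: it derives the R2 path from an R1 path
# by replacing the first "_R1_" with "_R2_".
mate_of() {
    printf '%s\n' "$1" | sed 's/_R1_/_R2_/'
}

# dry_run_pairs (also hypothetical) prints, but does not run, one
# trim_galore command per R1 file found under the given directory.
dry_run_pairs() {
    find "$1" -name '*_R1_001.fastq.gz' | sort |
    while IFS= read -r r1; do
        echo trim_galore --paired --fastqc -o trim_galore/ "$r1" "$(mate_of "$r1")"
    done
}

# Example with a file name taken from the answer above:
mate_of path_to_fastq/abaci_R1_001.fastq.gz   # prints path_to_fastq/abaci_R2_001.fastq.gz

# Usage:
#   dry_run_pairs path_to_fastq
```

Each printed line can then be pasted and run by hand to see which command, if any, fails.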

Is it necessary to have the find line here? Can we not use:

parallel --plus -j 1 trim_galore --paired --fastqc -o trim_galore/ {} {/_R1_/_R2_} ::: path_to_fastq/*_R1_001.fastq.gz

ADD REPLYlink modified 17 months ago • written 17 months ago by cpad011212k

find is used because OP used find.

Your solution will often give the same result, but will fail if the files are in subdirs inside path_to_fastq or if there are so many that they do not fit on a single command line.

ADD REPLYlink written 17 months ago by ole.tange3.6k
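
The subdirectory case mentioned above can be shown with a small self-contained sketch. It builds a throwaway directory (the lane2 subdirectory and the a/b sample names are made up for illustration) and compares what a glob and find each see; count_glob and count_find are hypothetical helper names.

```shell
#!/bin/sh
# Count R1 files directly inside a directory (what a shell glob sees).
count_glob() {
    n=0
    for f in "$1"/*_R1_001.fastq.gz; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Count R1 files anywhere below a directory (what find sees).
count_find() {
    find "$1" -name '*_R1_001.fastq.gz' | wc -l | tr -d ' '
}

# Throwaway demo layout: one file at the top level, one in a subdirectory.
demo=$(mktemp -d)
mkdir "$demo/lane2"
touch "$demo/a_R1_001.fastq.gz" "$demo/lane2/b_R1_001.fastq.gz"
count_glob "$demo"   # prints 1
count_find "$demo"   # prints 2
rm -rf "$demo"
```

The second failure mode, too many files to fit on one command line, does not arise with find piped into parallel because the names arrive on standard input rather than as arguments.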