Question

Bash scripting FastQC for multiple fastq files in multiple directories

1

5.5 years ago

rc1253 ▴ 20

I am completely new to bioinformatics so I'm looking to learn how to do this.

I have multiple directories of fastq files: e.g. 10 directories, one per time point in the series, each containing treatment and control directories, each of which contains rep1, rep2 and rep3.

For example: T9/Infected/Rep1/*.fastq.gz.

I'm looking to create a loop to run fastQC on each fastq file instead of having to submit a separate job for each directory.

I'd then like to output the FastQC results either to a single directory or, if possible, to a directory corresponding to each rep - e.g. rep1 results go into a folder called rep1, and so on.

RNA-Seq bash fastQC • 13k views

ADD COMMENT • link updated 3.1 years ago by DareDevil ★ 4.3k • written 5.5 years ago by rc1253 ▴ 20

3

Try this with GNU Parallel (for .fastq files):

find . -name "*.fastq" | parallel --dry-run fastqc -o {//}/ {}

Example dry-run output is (for fastq files):

fastqc -o ./test1/ ./test1/test1.fastq
fastqc -o ./test2/ ./test2/test1.fastq

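This finds all .fastq files under the current directory and writes each FastQC report into the directory that holds the corresponding input file. For gzipped files, change the pattern to "*.fastq.gz". Remove the --dry-run option once you have checked the dummy run.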

ADD REPLY • link 5.5 years ago by cpad0112 21k

0

Thanks for the response. Can you just help me clarify what you said because I'm a rookie at this: so does the '.' after find dictate the directory that the find command will look in? So I could put this as for example T9 and it will look for the fastq files in all the subdirectories in this directory? Then it will pass these files into the fastqc job?

ADD REPLY • link 5.5 years ago by rc1253 ▴ 20

1

find . -name "*.fastq"

. represents the current directory: look in the current directory and all of its subdirectories for files with the .fastq extension.

parallel --dry-run fastqc -o {//}/ {}

parallel is the command provided by GNU Parallel. --dry-run tells it not to execute anything but to do a dummy run, i.e. print the commands that would be executed. -o is fastqc's output directory option. {} denotes the input line (it could be anything, but in this case it is the output of the find command: each .fastq file with its path). {//} is a GNU Parallel replacement string that expands to only the directory of the input, without the file name or its extension. The / after it is simply a literal /, with no special meaning.
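For readers who want to see the same substitution without GNU Parallel, here is a minimal plain-bash sketch. It is only an illustration: the function name fastqc_per_dir and the DRY_RUN variable are made up for this example, it assumes fastqc is on PATH and that file names contain no newlines, and it runs the files one after another rather than in parallel.

```bash
#!/usr/bin/env bash
# Plain-shell equivalent of:
#   find ROOT -name "*.fastq.gz" | parallel fastqc -o {//}/ {}
# Hypothetical helper; set DRY_RUN=1 to print the commands instead of running them.
fastqc_per_dir() {
    local root="${1:-.}" fq
    find "$root" -type f -name '*.fastq.gz' |
    while IFS= read -r fq; do
        # dirname "$fq" plays the role of {//}; "$fq" plays the role of {}
        ${DRY_RUN:+echo} fastqc -o "$(dirname "$fq")/" "$fq" </dev/null
    done
}

# Example (prints one fastqc command per file found under T9):
#   DRY_RUN=1; fastqc_per_dir T9
```

With the layout from the question, each report would land next to its input, e.g. in T9/Infected/Rep1/.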

ADD REPLY • link 5.5 years ago by cpad0112 21k

0

The first argument after a find command is the directory to start looking in. . is shorthand for ‘my current working directory’.

It could just as easily read:

find /path/to/fastqs [options]

ADD REPLY • link 5.5 years ago by Joe 21k

1

Thanks that makes sense.

Could you explain what '{//}' means in the output path? What do those brackets mean?

My bash script so far:

module load fastqc

cd /path/to/directory/lettuce_bot_timeseries/data/reads/

find . -name "*.fastq.gz" | parallel fastqc -o ../../fastqanalysis

ADD REPLY • link 5.5 years ago by rc1253 ▴ 20

0

The parallel program has quite unconventional syntax. It would be worth googling some beginners' tutorials and examples to really understand it, rather than just having us explain specifics (we will be happy to clarify things, of course).

parallel is an invaluable tool to have in your toolkit, so it is well worth investing an hour or so now to learn the basics, and save yourself dozens of hours in future.

ADD REPLY • link 5.5 years ago by Joe 21k

0

What have you tried so far?

ADD REPLY • link 5.5 years ago by Devon Ryan 104k

3

Hint:

find and its -exec option will be your friend here. Alternatively ls or find piped to parallel will also work nicely.

You needn’t loop the directories, there are better ways :)
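As a rough sketch of the find -exec route hinted at here: the function name qc_all and the DRY_RUN variable are invented for illustration, and fastqc is assumed to be on PATH. FastQC does not create the -o directory itself, so the sketch creates it first.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run fastqc on every *.fastq.gz below ROOT,
# collecting all reports in a single OUTDIR.
# Set DRY_RUN=1 to print the command instead of running it.
qc_all() {
    local root="$1" outdir="$2"
    mkdir -p "$outdir"   # fastqc requires the output directory to exist
    # "{} +" passes as many files as possible to one fastqc invocation
    find "$root" -type f -name '*.fastq.gz' \
        -exec ${DRY_RUN:+echo} fastqc -o "$outdir" {} +
}

# Example:
#   qc_all /path/to/fastqs /path/to/fastqs/qc
```

One caveat with a single output directory: if two replicates contain files with the same name, their reports would have the same name too, and the later one would overwrite the earlier.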

ADD REPLY • link 5.5 years ago by Joe 21k

score 5 · Answer 1 · 2021-03-19

5

3.1 years ago

DareDevil ★ 4.3k

FastQC can process several files in parallel. Its -t (--threads) option sets the number of files processed simultaneously, so if you have 13 samples to run in parallel, you can use the following command:

fastqc -t 13 *.fastq.gz
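To apply -t to files spread over several directories, the file list can come from a shell glob. The sketch below is an assumption-laden illustration: fastqc_threads and DRY_RUN are hypothetical names, and the glob in the example simply mirrors the T9/Infected/Rep1 layout described in the question.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: fastqc_threads THREADS OUTDIR FILE...
# Processes up to THREADS files at once and writes all reports to OUTDIR.
# Set DRY_RUN=1 to print the command instead of running it.
fastqc_threads() {
    local threads="$1" outdir="$2"
    shift 2
    mkdir -p "$outdir"   # fastqc will not create the output directory
    ${DRY_RUN:+echo} fastqc -t "$threads" -o "$outdir" "$@"
}

# Example, assuming a T*/<condition>/Rep*/ layout:
#   fastqc_threads 13 qc T*/*/Rep*/*.fastq.gz
```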

ADD COMMENT • link 3.1 years ago by DareDevil ★ 4.3k

0

Is that true?

Won't it then run each sample on 13 threads instead of 13 samples on 1 thread each?

ADD REPLY • link 3.1 years ago by lieven.sterck 15k

1

It will process 13 samples at a time.

ADD REPLY • link 3.1 years ago by DareDevil ★ 4.3k

0

so it does.

Quite confusing implementation though :)

ADD REPLY • link 3.1 years ago by lieven.sterck 15k

score 1 · Answer 2 · 2018-10-26

1

5.5 years ago

Paul ★ 1.5k

This could work: find -maxdepth 5 -name "*fastq.gz" | parallel fastqc {} -o qc/ (the qc/ directory must already exist). Then write a simple script to move each output to its corresponding directory.
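A possible version of that follow-up script is sketched below. The function name move_qc_outputs is hypothetical. It relies on FastQC naming its reports <name>_fastqc.html and <name>_fastqc.zip for an input called <name>.fastq.gz, and it assumes file names are unique across directories and contain no newlines.

```bash
#!/usr/bin/env bash
# Hypothetical helper: move_qc_outputs ROOT QCDIR
# For every *.fastq.gz below ROOT, move its FastQC report files
# from QCDIR into the directory that holds the fastq file.
move_qc_outputs() {
    local root="$1" qcdir="$2" fq base ext
    find "$root" -type f -name '*.fastq.gz' |
    while IFS= read -r fq; do
        base=$(basename "$fq" .fastq.gz)
        for ext in html zip; do
            # Skip quietly if a report is missing (e.g. fastqc failed on that file)
            if [ -e "$qcdir/${base}_fastqc.$ext" ]; then
                mv "$qcdir/${base}_fastqc.$ext" "$(dirname "$fq")/"
            fi
        done
    done
}

# Example:
#   move_qc_outputs . qc
```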

ADD COMMENT • link 5.5 years ago by Paul ★ 1.5k