Bash scripting FastQC for multiple fastq files in multiple directories
2
1
Entering edit mode
4.1 years ago
rc1253 ▴ 20

I am completely new to bioinformatics so I'm looking to learn how to do this.

I have multiple directories with fastq files: E.g; 10 Directories with each time series, each with Treatment and control directories, each with rep1 rep2 rep3.

For example: T9/Infected/Rep1/*.fastq.gz.

I'm looking to create a loop to run fastQC on each fastq file instead of having to submit a separate job for each directory.

Then to either output the fastQC data to a single directory or if possible a directory corresponding to each rep - e.g. rep1 results go into a folder called rep1 and so on

RNA-Seq bash fastQC • 10k views
3
Entering edit mode

with gnu-parallel and try this (For fastq):

find . -name "*.fastq" | parallel --dry-run fastqc -o {//}/ {}


Example dry-run output is (for fastq files):

fastqc -o ./test1/ ./test1/test1.fastq
fastqc -o ./test2/ ./test2/test1.fastq


this would search all .fastq files and create output in each corresponding directory. Remove --dry-run option once you validate dummy run

0
Entering edit mode

Thanks for the response. Can you just help me clarify what you said because I'm a rookie at this: so does the '.' after find dictate the directory that the find command will look in? So I could put this as for example T9 and it will look for the fastq files in all the subdirectories in this directory? Then it will pass these files into the fastqc job?

1
Entering edit mode

find . -name "*.fastq"

. represents current directory. In current directory, look for files with fastq extension.

parallel --dry-run fastqc -o {//}/ {}

parallel is a function from GNU-Parallel program. --dry-run tells the program not to execute the program, but do a dummy run i.e print what commands will be executed. -o is for output. {} denotes input (could be any thing, but in this case output from find command..fastq files with file path). {//} is a function parameter within gnu-parallel to print only the path of the file, not the name of file or it's extension. / is simply /. No special meaning. {} is input.

0
Entering edit mode

The first argument after a find command is the directory to start looking in. . is shorthand for ‘my current working directory’.

It could just as easily read:

find /path/to/fastqs [options]

1
Entering edit mode

Thanks that makes sense.

Could you explain what the '{ //}' means on the path to output? What do those brackets mean?

My bash script so far:

find . -name "*.fastq.gz" | parallel fastqc -o ../../fastqanalysis

0
Entering edit mode

The parallel program has quite unconventional syntax. It would be worth googling some beginners tutorials and examples to really understand it, rather than just have us explain specifics (we will be happy to clarify things of course).

parallel is an invaluable tool to have in your toolkit, so it is well worth investing an hour or so now to learn the basics, and save yourself dozens of hours in future.

0
Entering edit mode

What have you tried so far?

3
Entering edit mode

Hint:

find and its -exec option will be your friend here. Alternatively ls or find piped to parallel will also work nicely.

You needn’t loop the directories, there are better ways :)

3
Entering edit mode
20 months ago
dare_devil ★ 1.7k

fastqc supports parallel running. Say suppose you have 13 samples to run in parallel, you can use following command:

fastqc -t 13 *.fastq.gz

0
Entering edit mode

is that true?

will it then not run each sample on 13 threads in stead of 13 samples each on 1 thread?

0
Entering edit mode

It will run, at a time on 13 samples

0
Entering edit mode

so it does.

Quite confusing implementation though :)

1
Entering edit mode
4.1 years ago
Paul ★ 1.5k

This could work: find -maxdepth 5 -name "*fastq.gz" | parallel fastqc {} -o qc/ , then create easy script to move output to corresponding directory.