Question: FastQC on multiple paired end files
evelyn90 wrote:

Hello,

I want to run FastQC on multiple FASTQ files using a SLURM job array.

fastqc -o ${OUT_DIR}/${SAMPLE}.fastqc.out -f ${INPUT_DIR}/${SAMPLE}.fastq

However, I could not find out how to run fastqc on paired-end files. For example, can one fastqc report be created for file_1.fastq and file_2.fastq? If not, what are the options? I have 10 paired-end samples. I was planning to run fastqc on each file and then combine the reports into one using MultiQC. Is there a better way to do this? Thank you!

Tags: rna-seq
written 4 months ago by evelyn90
RamRS wrote:

AFAIK fastqc does not have a PE mode as the metrics it calculates are file-specific. You can read this post for a previous discussion on this topic.

I'd go the route you're thinking of, where you run fastqc on each individual FASTQ file and then multiqc the reports. BTW, fastqc -t 8 will process 8 files in parallel, so you may wish to use the -t option to get the job done quicker.
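
The per-file-then-aggregate route described above could look roughly like this. It is only a sketch under a few assumptions: the helper name run_qc and its arguments are made up for illustration, fastqc and multiqc are assumed to be on your PATH, and the inputs are assumed to end in .fastq.

```shell
# Hypothetical helper: run FastQC on every FASTQ file in a directory,
# then summarise the reports with MultiQC.
# Usage: run_qc INPUT_DIR OUT_DIR [THREADS]
run_qc() {
    in_dir=$1
    out_dir=$2
    threads=${3:-8}          # -t: number of files processed in parallel

    # fastqc -o requires an existing directory
    mkdir -p "$out_dir" || return 1

    # Input files are positional arguments; both mates of a pair
    # (sample_1.fastq, sample_2.fastq) simply get their own reports.
    fastqc -t "$threads" -o "$out_dir" "$in_dir"/*.fastq || return 1

    # MultiQC collects the FastQC results found under out_dir into one report
    multiqc -o "$out_dir" "$out_dir"
}
```

You would call it as run_qc /path/to/fastq /path/to/qc 8, with the paths replaced by your own.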

modified 4 months ago • written 4 months ago by RamRS

Thank you for the information. I ran fastqc on a single file and it went well, but I get an error when I use this array job:

INPUT_DIR=/path/
OUT_DIR=/path/
RUN=${SLURM_ARRAY_TASK_ID}
INPUT=$(ls -1 $INPUT_DIR/*.fastq | sed -n ${RUN}p)
SAMPLE=$(basename ${INPUT} | sed 's/.fastq//')
fastqc -o ${OUT_DIR}/${SAMPLE}.fastqc.out -f ${INPUT_DIR}/${SAMPLE}.fastq

Error is: Specified output directory '' does not exist

written 4 months ago by evelyn90

-o should be an existing directory, I think. Add an mkdir -p ${OUT_DIR}/${SAMPLE}.fastqc.out and you'll be all set.

I'd recommend against per-sample output directories though. fastqc writes an HTML file and a zip file per FASTQ file, and multiqc is easiest to run when all of those outputs sit in one directory. So, unless you want to manage a separate directory per sample, go with -o $OUT_DIR.

Also, you can just basename $INPUT .fastq and skip the sed.
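
Putting those three suggestions together, one array task might look something like the sketch below. The function names nth_fastq and run_array_task are made up for illustration, and it assumes the array is submitted with indices starting at 1 (e.g. --array=1-20) and that fastqc is on your PATH.

```shell
# Print the N-th (1-based) .fastq file in DIR, in sorted glob order.
nth_fastq() {
    dir=$1
    n=$2
    i=0
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue      # glob matched nothing
        i=$((i + 1))
        if [ "$i" -eq "$n" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1                          # fewer than N files
}

# Run FastQC for one array task.
# Usage: run_array_task INPUT_DIR OUT_DIR TASK_ID
run_array_task() {
    input=$(nth_fastq "$1" "$3") || {
        echo "No FASTQ file for task $3 in $1" >&2
        return 1
    }
    sample=$(basename "$input" .fastq)   # suffix stripped, no sed needed
    mkdir -p "$2" || return 1            # -o must point at an existing dir
    echo "Task $3: sample $sample" >&2
    fastqc -o "$2" "$input"              # input file is positional
}

# In the SLURM script the call would be something like:
#   run_array_task "$INPUT_DIR" "$OUT_DIR" "$SLURM_ARRAY_TASK_ID"
```

All tasks write into the same output directory here, so the reports end up in one place for multiqc.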

modified 4 months ago • written 4 months ago by RamRS

Thank you for the help. I made the changes but I still get an error:

Unrecognised sequence format 'file1_2.fastq', acceptable formats are bam,sam,bam_mapped,sam_mapped and fastq

These are FASTQ files for RNA-seq downloaded from SRA. I am not sure why it complains about the format.

written 4 months ago by evelyn90

Please read the manual. Your command line is wrong. Usage is as follows:

fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN
written 4 months ago by RamRS

I found a very simple way: instead of an array I just used fastqc -t 8 *.fastq -o /path/ and it worked. Thank you for the help!

modified 4 months ago • written 4 months ago by evelyn90

Glad it worked. Can you see the error in your previous command line? You were using -f (the parameter that accepts data format) for the input files. fastqc does not have named parameters for input files, just positional parameters.
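
To make the -f versus positional distinction concrete, here is a rough sketch for one paired-end sample. The helper name fastqc_pair is made up, and it assumes the mates are named SAMPLE_1.fastq and SAMPLE_2.fastq, as in the file1_2.fastq from the error message.

```shell
# Hypothetical helper: QC both mates of one paired-end sample.
# Assumes mates are named SAMPLE_1.fastq and SAMPLE_2.fastq.
# Usage: fastqc_pair INPUT_DIR OUT_DIR SAMPLE
fastqc_pair() {
    r1="$1/${3}_1.fastq"
    r2="$1/${3}_2.fastq"

    for f in "$r1" "$r2"; do
        [ -f "$f" ] || { echo "Missing mate: $f" >&2; return 1; }
    done

    mkdir -p "$2" || return 1

    # -f takes the FORMAT (fastq, bam, sam, ...); the files come last,
    # as positional arguments. One report is written per file.
    fastqc -f fastq -o "$2" "$r1" "$r2"
}
```

The -f fastq here is optional, since fastqc normally detects the format from the file name.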

written 4 months ago by RamRS

Yes, I got that. Thank you for the catch!

written 4 months ago by evelyn90