Hello,
I've used BBDuk in the past, although not with data sets this large (14-20GB). I've been following the preprocessing guide, and after quality trimming it initially appeared that some of my data sets had become unpaired, as their output files were different sizes. I reran the process and, sure enough, the same samples still had R1 and R2 files of different sizes, with R1 reported as 1GB larger than R2. I then ran repair.sh on the samples, which reported that the names appeared to be correctly paired, and then fastqc, which reported an equal number of reads in each file. This has happened to 14 sets of my paired-end reads, with initial file sizes ranging from 15-20GB. I found it strange that it had only happened to a few samples, so I changed my view from "ls -lh" to "ls -l --block-size=MB" (which, if I'm understanding correctly, shows my file sizes in MB). It now looks like all my files have been affected: every pair reports different file sizes for R1 and R2, with the pairs showing a 1GB difference simply differing more than the others.
So my question is, are my samples still correctly paired? I've pasted my bbduk command below:
for i in `ls -1 *CR_R1.fq | sed 's/CR_R1.fq//'`
do
bbduk.sh -Xmx28g in1=$i\CR_R1.fq in2=$i\CR_R2.fq out1=$i\QT_R1.fq out2=$i\QT_R2.fq k=31 tpe tbo qtrim=rl trimq=20 maq=20 maxns=0 minlen=50
done
Your command looks fine. Are you sure your input files are correctly paired to begin with? I would check on that first for the affected sets (i.e. run repair.sh on the original files). As long as the number of sequences is identical in both files (pre- and post-trimming), it does not matter what their sizes are. File size is never a good QC metric for anything.
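To make the read-count comparison concrete, here is a minimal sketch that counts records in each file of a pair and reports whether the counts match. It assumes standard four-line FASTQ records (no wrapped sequence lines); the function names count_reads and check_pair are made up for illustration and are not part of BBTools. Quality trimming usually removes a different number of bases from R1 and R2, so equal read counts with unequal file sizes is the expected outcome. Note that this checks counts only, not read names, so it complements repair.sh rather than replacing it.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of BBTools): compare read counts in a FASTQ pair.
# Assumption: every record is exactly 4 lines (header, sequence, '+', quality).

# Print the number of reads in a plain or gzipped FASTQ file.
count_reads() {
    local file=$1 lines
    if [[ $file == *.gz ]]; then
        lines=$(gzip -cd -- "$file" | wc -l)
    else
        lines=$(wc -l < "$file")
    fi
    echo $(( lines / 4 ))
}

# Compare the read counts of an R1/R2 pair; return non-zero on a mismatch.
check_pair() {
    local r1=$1 r2=$2 n1 n2
    n1=$(count_reads "$r1")
    n2=$(count_reads "$r2")
    if [[ $n1 -eq $n2 ]]; then
        echo "OK $r1 $r2 reads=$n1"
    else
        echo "MISMATCH $r1=$n1 $r2=$n2"
        return 1
    fi
}
```

With the file naming used in the command above, this could be run over the trimmed output as: for r1 in *QT_R1.fq; do check_pair "$r1" "${r1%R1.fq}R2.fq"; done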
Hello Genomax,
I apologize that it took so long for me to get back to you; it took a while to run them all. Repair.sh reports that all of the raw files are indeed paired. I also ran it against the output of the other preprocessing steps, and repair.sh reports they have remained correctly paired, even at the QTrim stages where the files are reported as different sizes. So, as you say above, I'm okay to continue with my analysis.
Many thanks for your help.
You are ok to continue with the analysis. There is no need to uncompress the fastq files (or leave the files uncompressed). You can keep the files gzipped through the entire process until alignment.