Question: BBDuk output different sized paired end reads
23 months ago, GLFrey wrote:


I've used BBDuk in the past, although not with data sets this large (14-20GB). I've been following the preprocessing guide, and after quality trimming it initially appeared that some of my data sets had become unpaired, as their output files were different sizes. I reran the process and, sure enough, the same samples still had R1 and R2 files of different sizes, with R1 reported as 1GB larger than R2. I then ran vpair on the samples, which reported that the names appeared to be correctly paired, and then fastqc, which reported an equal number of reads in each file. This has happened to 14 sets of my paired-end reads, with initial file sizes ranging from 15-20GB. I found it strange that it had only happened to a few samples, so I changed my view from "ls -lh" to "ls -l --block-size=MB" (which, if I'm understanding correctly, shows my file sizes in MB). It now looks like all my files have been affected: every sample reports different file sizes for R1 and R2, with those showing a 1GB difference simply being the most extreme.

So my question is, are my samples still correctly paired? I've pasted my bbduk command below:

for i in `ls -1 *CR_R1.fq | sed 's/CR_R1.fq//'`
do
bbduk.sh -Xmx28g in1=${i}CR_R1.fq in2=${i}CR_R2.fq out1=${i}QT_R1.fq out2=${i}QT_R2.fq k=31 tpe tbo qtrim=rl trimq=20 maq=20 maxns=0 minlen=50
done


Your command looks fine. Are you sure your input files are correctly paired to begin with? I would check on that first for the affected sets (i.e. run on original files).

As long as the number of sequences is identical in both files of a pair (pre- and post-trimming), it does not matter what their sizes are. File size is never a good QC metric for anything.
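The read-count check described above can be sketched in plain shell. This is a minimal sketch, not part of BBTools: the function names `count_reads` and `check_pair_counts` are made up for illustration, and it assumes standard 4-line FASTQ records (no wrapped sequence lines).

```shell
# Count reads in a FASTQ file (plain or gzipped).
# Assumption: every record is exactly 4 lines long.
count_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" | awk 'END { print NR / 4 }' ;;
        *)    awk 'END { print NR / 4 }' "$1" ;;
    esac
}

# Compare the read counts of an R1/R2 pair.
# Prints a one-line report and returns non-zero on a mismatch.
check_pair_counts() {
    r1_count=$(count_reads "$1")
    r2_count=$(count_reads "$2")
    if [ "$r1_count" = "$r2_count" ]; then
        echo "OK: $r1_count reads in each file"
    else
        echo "MISMATCH: R1=$r1_count R2=$r2_count"
        return 1
    fi
}
```

For example, `check_pair_counts sample_QT_R1.fq sample_QT_R2.fq` (hypothetical file names) reports whether the two files hold the same number of reads. Equal counts do not prove the read names match one to one, so a name-level check is still worthwhile.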

written 23 months ago by genomax

Hello Genomax,

I apologize that it took so long for me to get back to you; it took a while for me to run them all. The pairing check reports that all of the raw files are indeed paired. I also ran it against the output of the other preprocessing steps, and it reports that they have remained correctly paired, even at the QTrim stages where they are reported as different sizes. So, as you say above, I'm okay to continue with my analysis.

Many thanks for your help.

written 23 months ago by GLFrey

You are ok to continue with the analysis. There is no need to uncompress the fastq files (or leave the files uncompressed). You can keep the files gzipped through the entire process until alignment.
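As a rough illustration of keeping everything gzipped, the sketch below only prints a BBDuk command that uses `.fq.gz` file names for both input and output, relying on BBTools choosing compression from the file extension. The function name `print_bbduk_gz_cmd`, the sample prefix, and the `.fq.gz` file names are assumptions for illustration; the options are copied from the command in the question.

```shell
# Print (do not run) a BBDuk command that reads and writes gzipped FASTQ.
# Assumption: inputs are named <prefix>CR_R1.fq.gz and <prefix>CR_R2.fq.gz.
print_bbduk_gz_cmd() {
    prefix="$1"
    echo "bbduk.sh -Xmx28g in1=${prefix}CR_R1.fq.gz in2=${prefix}CR_R2.fq.gz out1=${prefix}QT_R1.fq.gz out2=${prefix}QT_R2.fq.gz k=31 tpe tbo qtrim=rl trimq=20 maq=20 maxns=0 minlen=50"
}
```

Calling `print_bbduk_gz_cmd sampleA_` shows the command for a hypothetical sample prefix, which can be inspected before being run for real.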

written 23 months ago by genomax
23 months ago, h.mon wrote:

Some causes for R1 and R2 fastq files with different sizes:

1) if qualities are very different (for example, R2 average read quality is much lower than R1), after quality trimming, R2 average read length will be shorter than R1, thus different file sizes. Check original read qualities and post-trimming read length distribution for R1 and R2.

2) compressed file size (not your case, according to your bbduk command, but worth pointing out anyway) may differ even for files with the same original size - just reordering the sequences inside a fastq file may reduce its compressed size by a significant amount.
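To check point 1, one option is to compare the total number of bases left in R1 and R2 after trimming. In an uncompressed fastq file, every trimmed base removes one sequence character and one quality character, so each base of difference accounts for roughly 2 bytes. The sketch below is an illustration only: the function name `fastq_length_summary` is made up, and it assumes standard 4-line records.

```shell
# Summarise read count, total bases and mean read length for an
# uncompressed FASTQ file. Assumption: 4-line records, so the
# sequence is always the 2nd line of each record.
fastq_length_summary() {
    awk 'NR % 4 == 2 { n++; total += length($0) }
         END { printf "reads=%d bases=%d mean_len=%.1f\n", n, total, (n ? total / n : 0) }' "$1"
}
```

Running `fastq_length_summary` on the trimmed R1 and R2 files of one sample (for example, hypothetical files `sample_QT_R1.fq` and `sample_QT_R2.fq`) should show equal `reads` values; twice the difference in `bases` gives an estimate of the expected file size difference in bytes.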

written 23 months ago by h.mon

Hello h.mon,

Thank you for the insight, that's really helpful. My reads are a little different in quality distribution across the reads (R1 vs R2), but nothing major (both are still above phred 28). Do you think that would account for the major size difference in the files?

Many thanks for your help.

written 23 months ago by GLFrey