Question

BBDuk output different sized paired end reads

1

6.0 years ago

GLFrey ▴ 20

Hello,

I've used BBDuk in the past, although not with data sets this large (14-20GB). I've been following the preprocessing guide, and after quality trimming it initially appeared that some of my data sets had become unpaired, as their output files were different sizes. I reran the process and, sure enough, the same samples still had R1 and R2 files of different sizes, with R1 reported as 1GB larger than R2. I then ran vpair repair.sh on the samples, which reported that the names appeared to be correctly paired, and then fastqc, which reported an equal number of reads in each file. This has happened to 14 sets of my paired end reads, with initial file sizes ranging from 15-20GB. I found it strange that it had only happened to a few samples, so I changed my view from "ls -lh" to "ls -l --block-size=MB" (which, if I'm understanding correctly, shows my file sizes in MB). It now looks like all my files have been affected: every sample reports different file sizes for R1 and R2, and the samples with a 1GB difference are simply the most extreme.

So my question is, are my samples still correctly paired? I've pasted my bbduk command below:

for i in `ls -1 *CR_R1.fq | sed 's/CR_R1.fq//'`
do
    bbduk.sh -Xmx28g in1=${i}CR_R1.fq in2=${i}CR_R2.fq out1=${i}QT_R1.fq out2=${i}QT_R2.fq k=31 tpe tbo qtrim=rl trimq=20 maq=20 maxns=0 minlen=50
done

bbmap bbduk preprocessing • 2.9k views

ADD COMMENT • link updated 6.0 years ago by h.mon 35k • written 6.0 years ago by GLFrey ▴ 20

0

Your command looks fine. Are you sure your input files are correctly paired to begin with? I would check on that first for the affected sets (i.e. run repair.sh on original files).

As long as the number of sequences is identical in files (pre- and post-trimming) it does not matter what their sizes are. File size is never a good QC metric for anything.
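
A quick way to compare sequence counts rather than file sizes is to count lines and divide by four. The following is only a minimal sketch: it assumes uncompressed FASTQ with the usual four lines per record (no wrapped sequence lines), and the function name `count_reads` and the demo files are made up for illustration.

```shell
# Count records in an uncompressed FASTQ file.
# Assumption: exactly 4 lines per record (no line-wrapped sequences).
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Tiny hypothetical demo: R2 reads are trimmed shorter than R1 reads,
# so the R2 file is smaller, but both files hold the same number of reads.
demo_dir=$(mktemp -d)
printf '@r1/1\nACGTACGT\n+\nIIIIIIII\n@r2/1\nACGTACGT\n+\nIIIIIIII\n' > "$demo_dir/demo_R1.fq"
printf '@r1/2\nACGT\n+\nIIII\n@r2/2\nACG\n+\nIII\n' > "$demo_dir/demo_R2.fq"

r1=$(count_reads "$demo_dir/demo_R1.fq")
r2=$(count_reads "$demo_dir/demo_R2.fq")
echo "R1 reads: $r1, R2 reads: $r2"

if [ "$r1" -eq "$r2" ]; then
    echo "read counts match"
else
    echo "read counts differ"
fi
```

On real data the same function could be pointed at the pre- and post-trimming files for each sample; matching counts between R1 and R2 are what matter, not matching byte sizes.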

ADD REPLY • link 6.0 years ago by GenoMax 141k

0

Hello Genomax,

I apologize that it took so long for me to get back to you; it took a while to run them all. repair.sh reports that all of the raw files are indeed paired. I also ran it against the output of the other preprocessing steps, and repair.sh reports they have remained correctly paired, even at the QTrim stages where the files are reported as different sizes. So, as you say above, I'm okay to continue with my analysis.

Many thanks for your help.

ADD REPLY • link 6.0 years ago by GLFrey ▴ 20

0

You are ok to continue with the analysis. There is no need to uncompress the fastq files (or leave the files uncompressed). You can keep the files gzipped through the entire process until alignment.

ADD REPLY • link 6.0 years ago by GenoMax 141k

score 1 · Answer 1 · 2018-05-05

1

6.0 years ago

h.mon 35k

Some causes for R1 and R2 fastq files with different sizes:

1) if qualities are very different (for example, R2 average read quality is much lower than R1), after quality trimming, R2 average read length will be shorter than R1, thus different file sizes. Check original read qualities and post-trimming read length distribution for R1 and R2.

2) compressed file size (not your case, according to your bbduk command, but worth pointing out anyway) may differ even for files with the same original size - just reordering the sequences inside a fastq file may reduce its compressed size by a significant amount.
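
The post-trimming read length check from point 1 can be sketched with awk. This is only an illustration under assumptions: uncompressed FASTQ with four lines per record, and a hypothetical helper name `length_hist` with made-up demo reads.

```shell
# Print a read length histogram (length <TAB> count) for a FASTQ file.
# Assumption: exactly 4 lines per record, so the sequence is line 2 of each record.
length_hist() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (l in n) print l "\t" n[l] }' "$1" | sort -n
}

# Hypothetical demo file standing in for a trimmed R2 file.
demo_dir=$(mktemp -d)
printf '@r1/2\nACGT\n+\nIIII\n@r2/2\nACG\n+\nIII\n@r3/2\nACGT\n+\nIIII\n' > "$demo_dir/demo_R2.fq"

hist=$(length_hist "$demo_dir/demo_R2.fq")
echo "$hist"
```

Running this on the trimmed R1 and R2 files of one sample and comparing the two histograms shows whether R2 reads ended up shorter on average, which would explain a smaller R2 file with an identical read count.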

ADD COMMENT • link 6.0 years ago by h.mon 35k

0

Hello h.mon,

Thank you for the insight, that's really helpful. My reads are a little different in quality distribution across the reads (R1 vs R2), but nothing major (both still above phred 28). Do you think that would account for the major size difference in the files?

Many thanks for your help.

ADD REPLY • link 6.0 years ago by GLFrey ▴ 20