Question: BBDuk outputs different-sized paired-end read files
17 months ago by
GLFrey20
Montana
GLFrey20 wrote:

Hello,

I've used BBDuk in the past, although not with data sets this large (14-20GB). I've been following the preprocessing guide, and after quality trimming it initially appeared that some of my data sets had become unpaired, because their output files were different sizes. I reran the process and, sure enough, the same samples still had R1 and R2 files of different sizes, with R1 reported as 1GB larger than R2. I then ran repair.sh on the samples, which reported that the names appeared to be correctly paired, and then fastqc, which reported an equal number of reads in each file. This has happened to 14 sets of my paired-end reads, with initial file sizes ranging from 15-20GB. I found it strange that it had only happened to a few samples, so I changed my view from "ls -lh" to "ls -l --block-size=MB" (which, if I'm understanding correctly, shows my file sizes in MB). It now looks like all my files have been affected: every sample reports different file sizes for R1 and R2, with the samples showing a 1GB difference simply being the most extreme.

So my question is, are my samples still correctly paired? I've pasted my bbduk command below:

for f in *CR_R1.fq
do
    i=${f%CR_R1.fq}   # sample prefix shared by the R1 and R2 files
    bbduk.sh -Xmx28g in1="${i}CR_R1.fq" in2="${i}CR_R2.fq" out1="${i}QT_R1.fq" out2="${i}QT_R2.fq" k=31 tpe tbo qtrim=rl trimq=20 maq=20 maxns=0 minlen=50
done
bbmap bbduk preprocessing • 778 views
ADD COMMENTlink modified 17 months ago by h.mon27k • written 17 months ago by GLFrey20

Your command looks fine. Are you sure your input files are correctly paired to begin with? I would check on that first for the affected sets (i.e. run repair.sh on original files).

As long as the number of sequences is identical in the R1 and R2 files (both pre- and post-trimming), it does not matter what their sizes are. File size is never a good QC metric for anything.
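A minimal sketch of that check, assuming strict 4-line FASTQ records (no wrapped sequence lines). The function names count_reads and same_read_count are hypothetical helpers, not part of BBTools:

```shell
# Count FASTQ records in a plain or gzip-compressed file.
# Assumption: every record is exactly 4 lines long.
count_reads() {
    cr_file=$1
    case "$cr_file" in
        *.gz) cr_lines=$(gzip -dc "$cr_file" | wc -l) ;;
        *)    cr_lines=$(wc -l < "$cr_file") ;;
    esac
    echo $(( cr_lines / 4 ))
}

# Print both counts; succeed only if R1 and R2 hold the same number of records.
same_read_count() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "$1: $n1 reads, $2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}
```

For one sample this would be called as something like same_read_count sampleQT_R1.fq sampleQT_R2.fq (file names illustrative). It only compares record counts; it does not verify that read names match, which is what repair.sh checks.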

ADD REPLYlink modified 17 months ago • written 17 months ago by genomax73k

Hello Genomax,

I apologize that it took so long to get back to you; it took a while to run them all. repair.sh reports that all of the raw files are indeed paired. I also ran it against the output of the other preprocessing steps, and repair.sh reports that they have remained correctly paired, even at the QTrim stages where the files are reported as different sizes. So, as you say above, I'm okay to continue with my analysis.

Many thanks for your help.

ADD REPLYlink written 17 months ago by GLFrey20

You are ok to continue with the analysis. There is no need to uncompress the fastq files (or leave the files uncompressed). You can keep the files gzipped through the entire process until alignment.

ADD REPLYlink modified 17 months ago • written 17 months ago by genomax73k
17 months ago by
h.mon27k
Brazil
h.mon27k wrote:

Some causes for R1 and R2 fastq files with different sizes:

1) If qualities are very different (for example, R2 average read quality is much lower than R1), then after quality trimming the R2 average read length will be shorter than R1, and so the file sizes will differ. Check the original read qualities and the post-trimming read length distribution for R1 and R2.

2) Compressed file size (not your case, according to your bbduk command, but worth pointing out anyway) may differ even for files with the same original size: just reordering the sequences inside a fastq file may reduce its compressed size by a significant amount.
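For point 1, a rough way to compare post-trimming lengths is to total the bases in each file. This is only a sketch, again assuming strict 4-line FASTQ records; fastq_length_stats is a hypothetical helper name:

```shell
# Print read count, total bases, and mean read length for an uncompressed FASTQ file.
# Assumption: 4-line records, so the sequence is line 2 of every record.
fastq_length_stats() {
    awk 'NR % 4 == 2 { n++; bases += length($0) }
         END {
             if (n > 0)
                 printf "%d reads, %d bases, mean length %.1f\n", n, bases, bases / n
             else
                 print "0 reads"
         }' "$1"
}
```

Running it on the trimmed R1 and R2 files of one sample should show equal read counts but different base totals if trimming is the cause. Each trimmed base removes two bytes from an uncompressed file (the base and its quality character), so a modest difference in mean length adds up across many reads.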

ADD COMMENTlink written 17 months ago by h.mon27k

Hello h.mon,

Thank you for the insight, that's really helpful. My reads differ a little in quality distribution between R1 and R2, but nothing major (both are still above phred 28). Do you think that would account for the large size difference between the files?

Many thanks for your help.

ADD REPLYlink written 17 months ago by GLFrey20