Question: BBDuk trimming concern: large discarded, too aggressive (?)
gravatar for umn_bist
5.0 years ago by
umn_bist380 wrote:

I am in piping PE RNA-Seq files for pre-processing QC (for somatic variant calling downstream).

I use exclusively BBDuk for adapter/quality trimming and prinseq for poly-A/T trimming.

My current configuration for BBDuk is:

#${BBDuk} in1="${file1}" in2="${file2}" out1="${file3}" out2="${file4}" ref="${adapter}" trimq=10 qtrim=r ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=40 outm=discarded.fastq

My discarded.fastq are close to 600-700 Mb in size and was curious if this is to be expected. My files are fairly high quality RNAseq files from TCGA.

rna-seq bbduk • 2.8k views
ADD COMMENTlink written 5.0 years ago by umn_bist380

What was your input size?

Try out instead of outm.

ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by apelin20480

@apelin20 My input size has been around 8-15 Gb.

ADD REPLYlink written 5.0 years ago by umn_bist380
gravatar for GenoMax
5.0 years ago by
United States
GenoMax96k wrote:

Are you sure these reads are ones that are discarded? The outm option on Brian's program catches anything that matched so it is possible that these are reads that were trimmed to some extent but may still be in the trimmed file (in trimmed form). Can you check?
UPDATE: I think I am right. See the original post by Brian.

The "outm" stream will catch reads that matched a reference kmers.

ADD COMMENTlink modified 5.0 years ago • written 5.0 years ago by GenoMax96k

Could you explain further

these are reads that were trimmed to some extent but may still be in the trimmed file (in trimmed form)

So these are essentially Kmer contaminants (of min length of 11 bp) that are being catched but remains in the trimmed output file? I've seen on Biostar that Kmers - atleast 7mers seen on FastQC - shouldn't be removed for RNAseq somatic variant calling. Very curious to hear your take on this issue.

If these discarded.fastq files are basically adapters/Kmers/lowquality reads then my command is working as expected.

Thank you very much for your help.

ADD REPLYlink written 5.0 years ago by umn_bist380

Your command is working correctly. But let me clarify the meaning of the output streams:

"out" is the normal output for things that pass your criteria; the "outm" stream catches anything that will not get sent to "out" because it fails the criteria.

When filtering, "out" gets anything without matching kmers, and "outm" gets anything with matching kmers.

When trimming, "out" gets anything at least minlen in length, and "outm" gets anything shorter than minlen. For adapter trimming, being shorter than minlen implies an adapter kmer match, though it could be caused by quality-trimming as well, since you are doing both.

In other words, "outm" is normally for junk reads that fail your quality-control procedure, but in some cases it's useful. For example: " in=reads.fq ref=ecoli.fa out=non-ecoli-reads.fq outm=ecoli-reads.fq" would allow you to analyze those read sets separately. But normally, for adapter-trimming, the outm would just be thrown away.

I don't think 700MB of 15GB is particularly bad, though it's hard to say because those reads were trimmed (700MB of very short reads are more reads than 700MB of untrimmed reads), so what's more important is how much good data remained. RNA-seq is not easy to size-select because doing so will bias your results against short isoforms, and so forth. It's nice to lose under 3% or so during adapter-trimming, but that doesn't always happen. You can look at the insert-size distribution with BBMerge: in1=r1.fq in2=r2.fq mininsert=15 ihist=ihist.txt

Reads with insert size shorter than read length have adapters. Reads with insert size under 40 would be discarded according to your settings, after adapter-trimming. It's possible that you had a lot of adapter-dimers (which will not show up in BBMerge since there is no genomic sequence to merge) or that you had a lot of short inserts, in which case you may want to revise your sequencing protocol slightly.

The warning about removing over-represented kmers seen in FastQC is slightly tangential to this. They may be caused by adapters, and may not. Your BBDuk command uses 23-mers for matching in the main portion of the read, which is safe. It only uses 11-mers for the last 11 bases of the read; trimming those would not cause the read to be discarded anyway because it would still be longer than 40bp.

ADD REPLYlink written 5.0 years ago by Brian Bushnell17k

This was really insightful, thank you. I'll have to fastQC my trimmed outputs - I am cautious that I lost >5% of my reads (after adapter and quality trimming).

I definitely will check my insert-size distribution. Is there a means to have BBMerge auto-relay avg insert-size for each of my trim commands?

Thank you again, Brian.

ADD REPLYlink written 5.0 years ago by umn_bist380

BBDuk internally uses BBMerge to trim adapters based on read insert size if you use the "tbo" flag, which you did. But it doesn't actually report the insert sizes (though the read lengths of the adapter-trimmed reads will be equal to their insert sizes). The insert sizes will not change as a result of these commands.

I'm not quite sure what you mean by auto-relay, but JGI runs BBMerge on all pair libraries to quantify the insert-size distribution and make sure the library meets QC parameters.

ADD REPLYlink written 5.0 years ago by Brian Bushnell17k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1632 users visited in the last hour