Question

Read trimming with BBDuk

3

7.0 years ago

seta ★ 1.9k

Hi everybody,

I have several files of Illumina paired-end reads from libraries prepared with the NEBNext kit (Prep Master Mix Set for Illumina, E6040, BioLabs) and sequenced on a HiSeq 2000. According to FastQC, in every sample one read of each pair is 100 bp long and the other is 80 bp. Could you please tell me why the two reads of a pair have different lengths? Is this normal, or is something wrong?

Anyway, for filtering and adapter trimming, I used bbduk from the bbmap package (version 37.17) with the following command:

./bbduk.sh in=file_1.fastq in2=file_2.fastq out=out1.fastq out2=out2.fastq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo qtrim=rl trimq=20 ftl=20 ftr=90 minlen=40

When I re-checked the output with FastQC, everything looked OK except for "per base sequence content" and "sequence length distribution". Please see the attached images. Even after removing the first and last bases, "per base sequence content" still fails (Image 1). The sequence length changed from 100 bp to a range of 41-70 bp (Image 2). Could you please tell me what is wrong with my command and how to fix it?

Also, 40% of the bases were removed by trimming and the reads became shorter, which is not what I want. Could you please advise me how to keep as much of each read as possible for successful downstream analysis?

Thanks in advance

bbduk read trimming bbmap • 7.6k views

ADD COMMENT • link updated 4.0 years ago by Biostar 20 • written 7.0 years ago by seta ★ 1.9k

0

Is there an inline barcode of some sort here that you are trying to remove by the aggressive front end trimming?

ADD REPLY • link 7.0 years ago by GenoMax 142k

GenoMax · Answer 1 · 2017-04-30

1

7.0 years ago

h.mon 35k

None of your images got linked.

You are trimming too aggressively here. Why trim about 30 bp from every 100 bp read? Remove the ftl=20 ftr=90 parameters.
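As a rough illustration of how much these two flags cost on their own, the sketch below models forced trimming only. It assumes the 0-based positions described in BBDuk's usage text (ftl removes positions below its value, ftr removes positions above its value). `force_trim_length` is a hypothetical helper written for this example, not part of BBTools, and adapter and quality trimming are ignored.

```python
def force_trim_length(read_len, ftl=0, ftr=None):
    """Bases left after forced trimming, using 0-based positions.

    Assumption: ftl drops positions below ftl, and ftr drops positions
    above ftr, as described in the BBDuk usage text.
    """
    start = min(ftl, read_len)
    end = read_len if ftr is None else min(ftr + 1, read_len)
    return max(end - start, 0)


# Read lengths reported in the question: 100 bp (read 1) and 80 bp (read 2).
original = {"read1": 100, "read2": 80}
kept = {name: force_trim_length(n, ftl=20, ftr=90) for name, n in original.items()}

# Fraction of bases in a pair lost to forced trimming alone.
removed_fraction = 1 - sum(kept.values()) / sum(original.values())

print(kept)                        # {'read1': 71, 'read2': 60}
print(round(removed_fraction, 3))  # 0.272
```

Under these assumptions, more than a quarter of the bases in each pair are discarded before adapter or quality trimming removes anything, which would account for a large share of the 40% loss reported in the question.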

It is not normal for read 1 to be 100 bp and read 2 to be 80 bp. Didn't you ask this already? This is publicly available data, so just live with it, or contact the authors if it bothers you that much. What is the original paper?

ADD COMMENT • link 7.0 years ago by h.mon 35k

0

Thanks for your response, and sorry about the images; they do display for me. Could you please try looking at them again? You could try right-clicking and selecting "open image in new tab". As you suggested, I removed the two flags (ftl=20 ftr=90), but when I re-checked the quality of the trimmed reads with FastQC, the GC graph looked odd, unlike before trimming (linked image). Also, the sequence length distribution has changed from 100 bp before trimming to 40-100 bp after trimming (linked image). Could you please help me with these issues?

Yes, I asked it, and you kindly advised me to use bbduk for read trimming. However, I still don't know whether the different read lengths would be a problem for downstream analysis. The original paper can be found at enter link description here.

Thanks

ADD REPLY • link updated 7.0 years ago by GenoMax 142k • written 7.0 years ago by seta ★ 1.9k

0

Do not get hung up on the FastQC results. If you feel that you have gotten rid of the extraneous sequence (sequence that does not belong to your sample), go on to the next set of analysis steps. If something there does not make sense, then come back to diagnose further.

As for your second image, since some of the reads were trimmed, they are no longer full length (i.e. 100 bp). As a result, the sequence length distribution now includes reads of various lengths. Reads of different lengths should be fine for downstream steps unless you want to filter out really small ones (e.g. < 10 bp).
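To look at the length distribution directly rather than through the FastQC plot, a minimal sketch along these lines could be used. It assumes plain 4-line FASTQ records (no wrapped sequence lines). The helper names `read_lengths` and `summarize` and the tiny in-memory example are made up for illustration.

```python
import io
from collections import Counter


def read_lengths(handle):
    """Yield the sequence length of each record in a 4-line FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record holds the bases
            yield len(line.strip())


def summarize(handle, min_len=40):
    """Return (length histogram, number of reads shorter than min_len)."""
    hist = Counter(read_lengths(handle))
    short = sum(count for length, count in hist.items() if length < min_len)
    return hist, short


# Tiny made-up example standing in for a trimmed FASTQ file.
example = io.StringIO(
    "@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r2\nACGTAC\n+\nIIIIII\n"
    "@r3\nACGTACGTAC\n+\nIIIIIIIIII\n"
)
hist, short = summarize(example, min_len=8)
print(sorted(hist.items()))  # [(6, 1), (10, 2)]
print(short)                 # 1
```

For a real file, the same functions can be given an open file handle (for example `open("out1.fastq")`) instead of the in-memory example. A spread of lengths after trimming is expected, and the count of short reads shows whether a minimum-length filter is discarding many of them.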

ADD REPLY • link 7.0 years ago by GenoMax 142k

0

Thank you, genomax2. OK, I'll go ahead with the same command for all samples. What I meant was the different lengths of the two reads in each pair: as I posted, one read is 100 bp and the other is 80 bp.

ADD REPLY • link 7.0 years ago by seta ★ 1.9k