Question: Minimum length of reads after trimming for Assembly
4.5 years ago, Ric (330) wrote:

Hello, my Illumina paired-end reads (version 1.9) vary in length between 35 and 151 bp. I noticed that the forward and reverse reads do not have the same length. Here are the FastQC quality plots for R1 and R2.

Is the following Trimmomatic command well suited to the above QC plots, and can I use the trimmed reads for assembly?

java -jar /programs/trimmomatic/trimmomatic-0.32.jar PE -phred33 paired_end_reads_1.fastq paired_end_reads_2.fastq kept_paired_end_reads_1.fastq kept_paired_end_reads_2.fastq unpaired_1.fastq unpaired_2.fastq  SLIDINGWINDOW:4:15 MINLEN:65

Thank you in advance.


modified 4.5 years ago • written 4.5 years ago by Ric (330)

If you want optimal results, I suggest you start over with the raw data. Your reads have probably already been adapter-trimmed by Illumina's software, which tends to be mediocre, and is the reason the reads have different lengths. Since your reads are all 2x151bp, correct adapter-trimming will always leave R1 and R2 exactly the same length. As for quality-trimming, sliding-window-based trimming is also not optimal. There exists an optimal quality-trimming algorithm, which I'll call the "Phred algorithm", and it is implemented in seqtk and BBDuk.
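
For illustration, here is a minimal Python sketch of one common way to formulate that kind of trimming: convert each quality score to an error probability, score every base as (threshold error minus base error), and keep the highest-scoring contiguous segment. This is a sketch under that assumption, not the actual seqtk or BBDuk code; the function name `phred_trim` and its `trimq` parameter are made up for the example.

```python
def phred_trim(quals, trimq=15):
    """Return the half-open interval (start, end) of bases to keep.

    quals: list of integer Phred quality scores for one read.
    trimq: quality threshold (hypothetical parameter name for this sketch).
    """
    limit = 10 ** (-trimq / 10)      # error probability at the threshold
    best_sum, best = 0.0, (0, 0)     # best segment found so far
    cur_sum, cur_start = 0.0, 0      # segment currently being extended
    for i, q in enumerate(quals):
        # Positive score for bases better than the threshold, negative otherwise.
        cur_sum += limit - 10 ** (-q / 10)
        if cur_sum <= 0:
            # The running segment is no longer worth keeping; restart after i.
            cur_sum, cur_start = 0.0, i + 1
        elif cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


# Low-quality ends are removed, while a small dip in the middle is tolerated.
ends = phred_trim([2, 2, 30, 30, 30, 30, 2, 2], trimq=15)
dip = phred_trim([30, 30, 14, 30, 30], trimq=15)
```

Unlike a fixed sliding window, this kind of scoring looks at the whole read at once, so a single mediocre base in the middle does not end the kept region.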

The minimum length after trimming is entirely at your discretion. I'd recommend setting it at the kmer length you plan to use for assembly.
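
As a rough sketch of that rule, a pair could be kept only when both mates are at least k bases long after trimming, so that each read can still contribute at least one k-mer. The function name `filter_pairs` and the example value k=31 are assumptions for illustration only; in Trimmomatic terms this corresponds to setting MINLEN to the same number.

```python
def filter_pairs(pairs, k):
    """Keep only (r1, r2) sequence pairs where both mates have length >= k."""
    return [(r1, r2) for r1, r2 in pairs if len(r1) >= k and len(r2) >= k]


# Two toy pairs: mates of 40 and 36 bases, then mates of 40 and 20 bases.
pairs = [("ACGT" * 10, "ACGT" * 9), ("ACGT" * 10, "ACGT" * 5)]
kept = filter_pairs(pairs, k=31)  # k=31 is an arbitrary example value
```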

If you can obtain the raw reads, I suggest you trim with BBDuk using this command:

bbduk.sh in1=r1.fq in2=r2.fq out1=trimmed1.fq out2=trimmed2.fq ktrim=r k=23 mink=11 hdist=1 tpe tbo ref=adapters.fa qtrim=rl trimq=15

That will do both adapter and quality trimming. "adapters.fa" is included with the BBMap package and contains all public Illumina adapter sequences.

written 4.5 years ago by Brian Bushnell (17k)

How do I determine the k-mer value for ABySS or SPAdes?

written 4.5 years ago by Ric (330)

Adapter trimming is almost always preferred, but don't apply quality trimming for assembly.

modified 4.5 years ago • written 4.5 years ago by lh3 (32k)

I agree in principle, but depending on the genome size, data quality, and assembler, quality trimming can sometimes make the difference between generating an assembly and running out of memory and crashing.

For small genomes that fit in memory with no problem, it's true that quality-trimming is unnecessary and can cause inferior assemblies. It depends on how the assembler processes quality scores, though.

modified 4.5 years ago • written 4.5 years ago by Brian Bushnell (17k)

There are a few papers on this topic. I have also tried quality trimming myself. In all these cases, quality trimming hurts de novo assembly. That said, it is in theory possible that some combination of trimmer/assembler may produce better results.

written 4.5 years ago by lh3 (32k)

There are also papers that state the opposite. For example (from ):

we observed that quality-based trimming of raw data gave ∼15-fold improvements in N50 statistics

It really depends on the data and the assembler. As with everything else assembly-related, it seems the best strategy is "try a bunch of options and see what works best for you".

written 4.5 years ago by igor (12k)

Thanks. I didn't know this paper. However, "15-fold" looks really suspicious. I wanted to know how this 15-fold was derived, but the paper gives me little context. On the trimming strategy, the paper cited a 2012 paper by the same group, where they only mentioned the CLC suite and FastQC without details. The paper also cited the GAGE paper, but GAGE does not discuss trimming as far as I remember. In addition, the paper did not say which assembler is so sensitive to quality-based trimming, let alone give detailed statistics. I don't know how much I should trust this paper.

written 4.5 years ago by lh3 (32k)

I agree it looks questionable. I didn't realize there were papers discussing trimming strategies (I don't have a lot of assembly experience, but the studies I saw generally focus on the assembly tools rather than pre-processing). After seeing your comment, I decided to investigate further and that just happened to be the first hit.

written 4.5 years ago by igor (12k)

I would like to try ABySS and SPAdes to assemble the above reads. Is it a good idea to use FastUniq to remove duplicates before assembly?

written 4.5 years ago by Ric (330)

Probably not. Duplicate-removal is only useful for amplified libraries, and is mainly for variant-calling when resequencing. If your library was not amplified, do not remove duplicates.

written 4.5 years ago by Brian Bushnell (17k)