Improving the coverage of raw fastq data with Trimmomatic CROP option
2
1
5.5 years ago
Ruf_M ▴ 10

I have a set of isolates sequenced with MiSeq (Nextera XT DNA, paired-end reads, 600 cycles). However, the MiSeq samples suffered a major quality drop from cycle 150 to 300 and from cycle 450 to 600, probably due to a problem with the MiSeq reagent kit v3 lot.

The coverage I get for these samples, using samtools, ranges from 28x to 80x. I would like to improve the coverage of these samples, so I tried to cut the raw reads after the first 150 bp using the Trimmomatic CROP option:

java -jar trimmomatic-0.36.jar PE -threads 8 -trimlog trim.log input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 CROP:150


I have also tried fastx_trimmer:

fastx_trimmer -l 150 -z -i INFILE -o OUTFILE


Unfortunately, I am not able to greatly improve the coverage. Do you think I should use other trimming options? Any suggestions?

Thanks!

trimmomatic crop coverage MiSeq • 2.9k views
0

Why do you think cutting the reads up is going to improve coverage? Do you suspect the reads are not aligning because of bad qualities towards the end?

0

Yes, I believe that could be the reason. This run used a MiSeq reagent kit v3 (600-cycle), which enables read lengths of 2 x 300 bp. We observed a big quality drop exactly from cycle 150 to 300 and from cycle 450 to 600 (I edited the question with this more accurate info). So I would suppose that from around 150 bp onward (forward and reverse), the sequence is of bad quality and therefore lowers the average coverage. Does this make sense?

The idea was then to trim the reads so this low quality region is excluded, see if it increases the average coverage of the sample, and ultimately avoid sequencing the samples again.

1

While 2x300 cycle kits are known to have "issues" with Q-scores, it also sounds to me like you probably have a classic case of a fraction of inserts being shorter than expected. This leads to "adapter read-through" and big quality drops in later sequencing cycles.

Have you tried to see if the reads can be merged (use bbmerge from BBMap with the original/untrimmed data)? If a large fraction of the reads is able to merge, then I would suggest going with the merged representation of the reads for further analysis.
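A merge attempt might look like this (a sketch, assuming BBMap is installed; the file names are placeholders matching the question):

```shell
# Merge overlapping read pairs; the insert-size histogram (ihist) shows
# how many inserts are shorter than the read length (read-through cases).
bbmerge.sh in1=input_forward.fq.gz in2=input_reverse.fq.gz \
    out=merged.fq.gz outu1=unmerged_1.fq.gz outu2=unmerged_2.fq.gz \
    ihist=insert_histogram.txt
```

bbmerge reports the percentage of pairs joined; a high merge rate would confirm the short-insert hypothesis.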

1
5.5 years ago
Coverage = read_length * number_mapped_reads / genome_length


If you want to improve coverage, I don't think removing 75% of your reads is a good strategy because coverage is proportional to read length.

An alternative could be to increase the number of mapped reads by relaxing the mapping criteria. This is of course very theoretical, and I don't recommend dramatically changing the mapping parameters.
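To make the trade-off concrete, the formula above can be evaluated directly in the shell (illustrative numbers only; the read count and genome size are assumed placeholders):

```shell
# Coverage = read_length * number_mapped_reads / genome_length
read_length=300
reads_mapped=1000000           # placeholder: mapped read count from your BAM
genome_length=5000000          # placeholder: e.g. a ~5 Mb bacterial genome

awk -v l="$read_length" -v n="$reads_mapped" -v g="$genome_length" \
    'BEGIN { printf "%.0fx\n", l * n / g }'
```

With these numbers the result is 60x; cropping the same reads to 150 bp halves it to 30x unless the mapped-read count rises enough to compensate.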

0

I apologize, maybe it was not clear in my question (I edited it with more accurate info). I would be excluding 50% of each read, since the reads are 300 bp long and I am keeping just the first 150 bp, as I believe the rest is of low quality. I agree, I am not really comfortable with changing the mapping criteria, but I can give it a try before deciding whether or not I should repeat the sequencing.

0

Thanks for clarifying, I read 600 cycles and forgot it was paired-end.

1
5.5 years ago

For low-quality data, you will get higher mapped coverage by mapping with BBMap compared to other aligners, so I'd suggest trying that first. If the mapping rate is still low, I'd recommend trimming by quality score rather than cropping at a fixed length (fixed-length cropping is usually a bad idea). For example, using BBDuk (from the BBMap package):

bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo qtrim=r trimq=12


Note that adapters.fa is in /bbmap/resources/
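To check whether a trimming strategy actually improved things, mean mapped coverage can be recomputed from the alignment (a sketch, assuming samtools is installed and the BAM is sorted and indexed; aln.sorted.bam is a placeholder name):

```shell
# -a emits depth for every position, including zero-coverage ones,
# so NR equals the reference length and sum/NR is the true mean depth.
samtools depth -a aln.sorted.bam \
    | awk '{ sum += $3 } END { if (NR) printf "mean coverage: %.1fx\n", sum / NR }'
```

Running this before and after trimming gives a direct comparison of the 28x-80x figures reported in the question.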