I am new to metagenomics and I am confused about the quality-control strategy for 16S paired-end sequences from the Illumina MiSeq platform. No doubt the QC strategy can affect the downstream analysis. I find that a 50 bp sliding window is commonly used, but since I assemble the paired-end reads into contigs with FLASH, I am not sure which QC strategy to apply. With sliding-window trimming, reads are truncated at the end of the last window before the average quality score falls below the threshold, even if downstream windows would rise back above the threshold. Unfortunately, about half of my reads were truncated too short. So I switched to FASTX with -p 60 -q 20, and few reads were trimmed. Is that not strict enough? Any suggestions? Thanks.
This is a good reason not to use arbitrary-sized sliding windows. BBDuk's quality-trimming gives optimal output for a given quality threshold and does not rely on specific window sizes; rather, "trimq=X" guarantees that the result will be the largest subsequence with average quality of at least X such that extending it in either direction would add a subsequence with average quality below X.
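For example, a minimal quality-trimming call would look something like this (file names are placeholders; qtrim=rl trims both ends at trimq=20, and minlen discards reads that end up shorter than 200 bp):

bbduk.sh in=R1.fastq in2=R2.fastq out=trimmed_R1.fastq out2=trimmed_R2.fastq qtrim=rl trimq=20 minlen=200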
I do not recommend FASTX anyway, as it defaults to the wrong quality encoding and is incapable of processing paired reads together, quite apart from the fact that it is slow and uses non-optimal algorithms. Trimmomatic also relies on windows, and is also slow, so I don't recommend it either (though at least it processes pairs together).
If you merge reads with BBMerge, though, I do not recommend trimming first; it performs trimming internally, and only if needed. Trimming first can reduce the merge rate by eliminating the overlapping parts of reads.
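As a sketch, merging the raw pairs directly would look something like this (file names are placeholders):

bbmerge.sh in1=R1.fastq in2=R2.fastq out=merged.fastq outu1=unmerged_R1.fastq outu2=unmerged_R2.fastq

If your version supports the qtrim2 flag, that gives the "trim only if needed" behavior: a pair is quality-trimmed and retried only when the untrimmed pair fails to merge.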
BBDuk really is excellent work. But since reads have lower-quality bases at their ends, after merging the lowest-quality region ends up in the middle of the merged read. Any kind of sliding window would then probably cut the read in the middle, so merging would be less useful. Is that right?
I find that many people use only the R1 reads, so the information contained in the R2 reads is lost.
So I prefer to run FASTX on the whole merged reads, but as Marina and Brian said, it also has some problems.
I wonder if there is a better choice.
Thanks for your patience and friendliness. I added an answer rather than a reply because I can't click the "ADD REPLY" button; maybe I have overlooked something.
Trimming is not necessary for its own sake; I just want the OTUs to be generated correctly.
After merging with FLASH (default parameters), I used split_libraries_fastq.py in QIIME with the recommended parameter -q 19 and found that about half of the reads were truncated to shorter than 200 bp.
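For reference, a sketch of that step on already-merged, already-demultiplexed reads (the sample ID and output directory are placeholders):

split_libraries_fastq.py -i out.extendedFrags.fastq -o split_out/ -q 19 --barcode_type 'not-barcoded' --sample_ids sample1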
So I changed to FASTX with -p 60 -q 20, and few reads were filtered. After chimera filtering and OTU clustering with the default pick_otus.py in QIIME, I ended up with 1,121,840 OTUs from about 4,188,862 reads going into chimera filtering. Is that too many?
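In case the details matter, here is a sketch of a chimera-removal step in QIIME; the usearch61 method is shown as one common choice, and the reference file and output file names are placeholders that may vary by QIIME version:

identify_chimeric_seqs.py -i seqs.fna -m usearch61 -r reference.fasta -o chimera_checked/
filter_fasta.py -f seqs.fna -o seqs_nochimera.fna -s chimera_checked/chimeras.txt -n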
Moreover, I used unique.seqs (mothur) for dereplication and got 4,019,141 unique reads. I have dealt with some 454 data in mothur before, but I am not sure whether this is normal, since the data come from a different platform.
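For reference, the mothur dereplication step in batch mode looks like this (the file name is a placeholder):

mothur "#unique.seqs(fasta=seqs.fasta)"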
So I wonder whether my merging and QC steps are right, which is why I am asking for help here.
My samples were collected from potting soil. I also think the OTU count looks really suspicious.
I am combining the QIIME and mothur pipelines, because mothur has a hard time dealing with so many reads.
Here is what I did:
# merge the read pairs, allowing overlaps of up to 300 bp
flash -M 300 R1.fastq R2.fastq
# the data were already demultiplexed by barcode; this step only checks the forward primer
cat out.extendedFrags.fastq | /fastx/fastx_barcode_splitter.pl --bcfile forward_primer --bol --mismatches 2 --prefix p1. --suffix .fastq
# keep reads in which at least 60% of bases have quality >= 20 (Phred+33 encoding)
fastx/fastq_quality_filter -i out.extendedFrags.fastq -q 20 -p 60 -l 200 -o out.fastq -Q 33
fastx/fastq_to_fasta -i out.fastq -o out.fasta
pick_otus.py -i seqs.fna -o picked_otus_default