Question: Confused about the quality control strategy for 16S paired-end sequences from the Illumina MiSeq platform
hua.peng1314 (China), 4.2 years ago, wrote:

I am new to metagenomics and I am confused about the quality control strategy for 16S paired-end sequences from the Illumina MiSeq platform. The quality control strategy will no doubt affect the downstream analysis. I find that a sliding window of 50 bp is commonly used. But for paired-end reads, which I merged into contigs with FLASH, I am not sure which QC strategy to use. With the sliding window, reads are truncated at the end of the last window before the average quality score falls below the threshold, even if downstream windows would again rise above the threshold. Unfortunately, about half of my reads were truncated too short. So I switched to FASTX with -p 60 -q 20, and very few reads were trimmed. Is that not strict enough? Any suggestions? Thanks.
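The truncation rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the code of QIIME or any other tool: the function name, the one-base step between windows, and the threshold of 25 are assumptions made for the example (the post only gives the 50 bp window size).

```python
def truncate_at_first_bad_window(quals, window=50, threshold=25):
    """Return how many leading bases to keep (hypothetical helper).

    Windows slide one base at a time (an assumption). The read is cut at
    the end of the last window that passed before the first failing one.
    """
    if len(quals) <= window:
        # Short read: treat the whole read as a single window.
        ok = bool(quals) and sum(quals) / len(quals) >= threshold
        return len(quals) if ok else 0
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < threshold:
            # End of the previous (last passing) window, or nothing at all.
            return start - 1 + window if start > 0 else 0
    return len(quals)


# Made-up read: good start, a low-quality dip, then good bases again.
read_quals = [35] * 120 + [10] * 60 + [35] * 60
keep = truncate_at_first_bad_window(read_quals)
print(keep, "of", len(read_quals), "bases kept")
```

In this made-up read the 60 high-quality bases after the dip are discarded along with the dip, which is the behaviour that can leave many reads too short.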

Tags: quality control, 16s
marina.v.yurieva (Farmington, CT), 4.2 years ago, wrote:

Use FLASH first and do quality trimming after that. FLASH uses quality scores when merging reads, and the joined read will have better quality after merging.

hua.peng1314 (China), 4.2 years ago, wrote:

Thanks for the reply. That is what I had already done. What confuses me is which QC strategy should be applied after FLASH.

Neither the 50 bp sliding window nor FASTX with -p 60 -q 20 seems suitable.
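To see why FASTX with -p 60 -q 20 lets most reads through, it helps to spell out what those settings ask for: a read is kept when at least 60% of its bases have a quality score of 20 or more. The Python sketch below illustrates that rule only; it is not FASTX code, and the function name and toy reads are made up.

```python
def passes_quality_filter(quals, min_qual=20, min_percent=60):
    """Keep a read if at least min_percent of its bases have quality >= min_qual."""
    if not quals:
        return False
    good = sum(1 for q in quals if q >= min_qual)
    return 100.0 * good / len(quals) >= min_percent


# Made-up reads of 100 bases each.
mostly_good = [30] * 60 + [5] * 40   # 60% of bases at Q30, 40% at Q5
slightly_worse = [30] * 59 + [5] * 41

print(passes_quality_filter(mostly_good))     # kept despite 40 very poor bases
print(passes_quality_filter(slightly_worse))  # removed
```

A read can therefore pass with up to 40% of its bases at very low quality, so this filter is much more permissive than a window-based truncation.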

marina.v.yurieva, 4.2 years ago, replied:

I see. Sorry, I didn't get that from your post. I had that problem with FASTX before and am not sure what the reason is. It seems other people have had that problem too: https://biostar.usegalaxy.org/p/7715/ In the end I used the quality trimmer built into Pipeline Pilot, but I haven't found a free equivalent. Have you tried other trimmers, like Trimmomatic?
Brian Bushnell (Walnut Creek, USA), 4.2 years ago, wrote:

This is a good reason to not use arbitrary-sized sliding windows.  BBDuk's quality-trimming gives optimal output for a given quality threshold and does not rely on specific window sizes; rather, "trimq=X" guarantees that the result will be the largest subsequence with average quality of at least X such that extending in either direction would add a subsequence with average quality below X.
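One well-known way to find a region with the property described above is a maximum-sum-segment scan: subtract X from every quality score and take the contiguous stretch with the largest total. The sketch below is an assumption-laden illustration of that idea, not BBDuk's code; the thread does not say how BBDuk is implemented, and the function name and averaging of raw Phred scores are choices made for the example.

```python
def best_region(quals, trimq):
    """Return (start, end) of the contiguous region with the largest
    sum of (quality - trimq); end is exclusive. Hypothetical helper."""
    best_sum, best_start, best_end = 0, 0, 0
    cur_sum, cur_start = 0, 0
    for i, q in enumerate(quals):
        cur_sum += q - trimq
        if cur_sum < 0:
            # A prefix with below-threshold average can never help; restart.
            cur_sum, cur_start = 0, i + 1
        elif cur_sum >= best_sum:
            # ">=" prefers the longer region when totals tie.
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end


# Made-up read: poor ends, good middle with one weak base inside it.
quals = [2, 2, 30, 30, 30, 10, 30, 30, 2, 2]
print(best_region(quals, trimq=15))  # -> (2, 8)
```

No window size is involved: the single Q10 base in the middle is kept because the region around it still averages well above 15, while both poor ends are removed.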

I do not recommend fastx anyway as it defaults to the wrong quality encoding and is incapable of processing paired reads together, even aside from the fact that it is slow and uses non-optimal algorithms.  Trimmomatic also relies on windows, and is also slow, so I don't recommend that either (though at least it processes pairs together).

If you merge reads with BBMerge, though, I do not recommend trimming first; it performs trimming internally only if needed.  Trimming first can reduce merge rate by eliminating the overlapping parts of reads.
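The effect of trimming on the overlap is simple arithmetic: the expected overlap is the sum of the two read lengths minus the amplicon length. The numbers below (250 bp reads, a 450 bp amplicon, 30 bases trimmed from each read) are hypothetical and chosen only to illustrate the point; they are not taken from the thread.

```python
def expected_overlap(read_len_1, read_len_2, amplicon_len):
    """Overlap in bases between two reads covering one amplicon.

    A value of zero or less means the reads no longer overlap.
    """
    return read_len_1 + read_len_2 - amplicon_len


untrimmed = expected_overlap(250, 250, 450)  # -> 50
trimmed = expected_overlap(220, 220, 450)    # -> -10
print(untrimmed, trimmed)
```

With these assumed numbers, trimming 30 low-quality bases from each 3' end removes the whole overlap, so the pair can no longer be merged.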

hua.peng1314 (China), 4.2 years ago, wrote:

BBDuk is really excellent work. But the reads have lower-quality bases at their ends, so after merging the lowest-quality region ends up in the middle of the merged read. Any kind of sliding window would then probably cut the read in the middle, which would make merging much less useful. Is that right?

I find that many people use only the R1 reads, so the information in the R2 reads is lost.

So I prefer to run FASTX on the whole merged reads. But, as Marina and Brian said, it also has some problems.

I wonder if there is a better choice.

Brian Bushnell, 4.2 years ago, replied:

20 is an extremely high threshold for quality trimming; too high for most purposes. And after merging, the overlapping bases will have their quality scores increased anyway if they match, to reflect the fact that 2 independent observations were made of the same base. Also, trimming tools in general don't trim middle bases and break sequences apart - they trim at the ends. For sliding windows, they generally start at one end, trim until the average inside the window is above some value, then stop.

Are you sure you need to do trimming? What are you doing with the data after you have trimmed and/or merged it?
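On the point that matching bases in the overlap end up with higher quality scores: a generic way to model this is to convert both Phred scores to error probabilities and ask how likely it is that two independent reads agree and are both wrong. The sketch below assumes independent errors and that a wrong call is equally likely to be any of the three other bases. How FLASH or BBMerge actually compute merged scores is not stated in the thread, so this shows the principle only; the function names are made up.

```python
import math


def phred_to_prob(q):
    """Phred score -> error probability (Q20 corresponds to 0.01)."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Error probability -> Phred score."""
    return -10 * math.log10(p)


def merged_quality_if_match(q1, q2):
    """Phred score of a base that two independent reads agree on."""
    p1, p2 = phred_to_prob(q1), phred_to_prob(q2)
    both_wrong_same_base = p1 * p2 / 3      # both wrong AND same wrong base
    both_right = (1 - p1) * (1 - p2)
    p_err = both_wrong_same_base / (both_right + both_wrong_same_base)
    return prob_to_phred(p_err)


print(round(merged_quality_if_match(15, 15), 1))
```

Under this model two agreeing Q15 calls give a merged score well above Q30, so low-quality read ends that fall inside the overlap need not stay low-quality after merging.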
hua.peng1314 (China), 4.2 years ago, wrote:

Thanks for your patience and friendliness. I am adding an answer rather than a reply because I can't click the "ADD REPLY" button. Maybe I have overlooked something.

Trimming in itself is not what I need. I just want to make sure the OTUs are generated correctly.

After merging with FLASH (default parameters), I used the split_library function in QIIME with the recommended parameter -q 19 and found that about half of the reads were truncated to shorter than 200 bp.

So I changed to FASTX with -p 60 -q 20, and few reads were filtered out. After chimera filtering and OTU clustering with the default pick_otu function in QIIME, I ended up with 1121840 OTUs from about 4188862 reads before chimera filtering. Is that too many?

Moreover, I used unique.seqs (mothur) for dereplication and got 4019141 unique reads. I have worked with some 454 data in mothur before, but I am not sure whether this is normal, since the data come from a different platform.

So I wonder whether my merging and QC steps are right, and I am asking for help here.
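For readers unfamiliar with dereplication, the idea can be sketched as collapsing reads that are identical as strings and counting them. This is a minimal sketch under that assumption, not mothur's implementation of unique.seqs; the function name and toy reads are made up.

```python
from collections import Counter


def dereplicate(seqs):
    """Count how often each exact sequence string occurs."""
    return Counter(seqs)


# Made-up reads: three identical, one with a single mismatch, one shorter.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "ACGTACG", "ACGTACGT"]
counts = dereplicate(reads)
unique_fraction = len(counts) / len(reads)
print(len(counts), "unique of", len(reads), f"({unique_fraction:.0%})")
```

Under exact matching, a single differing base or a different read length is enough to make a read unique, so sequencing errors and variable merged lengths both push the unique count up.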

 

marina.v.yurieva, 4.2 years ago, replied:

Well, in my opinion, the quality of your reads is very important for OTU picking. You don't want to pick a wrong OTU or miss one because of the quality of your data.

1121840 OTUs is quite a lot, but I don't know the nature of your data.

Are you combining the QIIME and mothur pipelines or comparing them?

4019141 unique reads out of 4188862 does look suspicious. I was getting a much lower number of unique reads with 16S MiSeq. But again, I have no idea about the nature of your data.
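The numbers quoted in the thread can be put into proportion with a little arithmetic. The counts are taken from the posts above; note that the read count is the one reported before chimera filtering, so the OTU ratios are approximate.

```python
reads_before_chimera_filtering = 4188862
unique_reads = 4019141
otus = 1121840

unique_fraction = unique_reads / reads_before_chimera_filtering
otu_fraction = otus / reads_before_chimera_filtering
reads_per_otu = reads_before_chimera_filtering / otus

print(f"unique reads: {unique_fraction:.1%} of all reads")
print(f"OTUs:         {otu_fraction:.1%} of all reads")
print(f"reads per OTU on average: {reads_per_otu:.1f}")
```

Roughly 96% of the reads are unique and there is about one OTU for every 3.7 reads, which is the pattern described as suspicious above.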
hua.peng1314 (China), 4.2 years ago, wrote:

My samples were collected from potting soil. I also think it looks really suspicious.

I am combining the QIIME and mothur pipelines, because mothur has a hard time dealing with so many reads.

Here is what I did:

flash -M 300 R1.fastq R2.fastq

less out.extendedFrags.fastq|/fastx/fastx_barcode_splitter.pl --bcfile forward_primer --bol --mismatches 2  --prefix p1. --suffix .fastq

(The data I received had already been split by barcode; I do this step to check the forward primer.)

fastx/fastq_quality_filter -i out.extendedFrags.fastq -q 20 -p 60 -l 200 -o out.fastq -Q 33

fastx/fastq_to_fasta -i out.fastq -o out.fasta

mothur "#chimera.uchime(fasta=out.fasta,reference=gold.fa,processors=20)"

mothur "#remove.seqs(fasta=out.fasta,accnos=meta.unique.uchime.accnos)"

pick_otus.py -i seqs.fna -o picked_otus_default

Are there any problems with this?

marina.v.yurieva, 4.2 years ago, replied:

pick_otus.py with the parameters you chose gives you just a file with your clustered reads, not OTUs. If you want to use a reference, either run pick_otus.py -i seqs.fna -r refseqs.fasta -m uclust_ref or use the pick_closed_reference_otus.py script (I would recommend the second, as it gives you the biom file and avoids all the pain of converting files or running multiple scripts).