Question: What Is A Typical Quality Cutoff When Trimming RNA-seq Data?
camelbbs (China), 7.0 years ago, wrote:

Hi, I have a question about trimming RNA-seq reads.

FastQC gave me the per-base sequence quality, but I don't know what the usual cutoff for trimming is. At the end of the reads, positions 100-101 fall below 20.

The plot has three score regions: 0-20, 20-28, and 28-40. Is there a general cutoff among these? Is it OK to cut the reads to 95 bp?

Also, my paired-end reads are not the same length: the left end is 101 and the right end is 121. Is that OK for analysis? Do I need to cut the 121 down to 95, or something else?

Thanks a lot!

Arun (Germany), 7.0 years ago, wrote:

Usually, for files in Casava >= 1.8 format, I look in each read for a consecutive run of bases with quality "#". This is similar to "B" in the older formats. For example:

@HWI-ST863:108:C0F9BACXX:8:1101:1235:1979
CTTACTAAAGACAATGGTGTTCTTCTCATCTTCGATGAAGTCATGACTGGATTTCGTCTAGCCTATGGTGGAGCCCAAGAATACTTTGGAATCACGC
+1:N:0:ATCACG
CCCFFFEFHHHHHJJJJFEIIJJJGHIJJHIJJJIJJIEHIIIJIIJIJJGIIIJHIGH######################################

I trim to obtain:

@HWI-ST863:108:C0F9BACXX:8:1101:1235:1979
CTTACTAAAGACAATGGTGTTCTTCTCATCTTCGATGAAGTCATGACTGGATTTCGTCT
+1:N:0:ATCACG
CCCFFFEFHHHHHJJJJFEIIJJJGHIJJHIJJJIJJIEHIIIJIIJIJJGIIIJHIGH

This has resulted in an average quality >= 30 for all my FASTQ files. By average I mean not the per-read average but the average quality at each base position over all reads. If the trimmed read is longer than a threshold (60 for 100-base reads in my case), I keep it; otherwise I discard it. That is just to give an idea.

This can be done separately for each mate of a pair. If one mate is discarded during quality trimming, I keep the other as a single-end read in a separate file. After quality trimming, I re-balance the pairs so that the two mates of each pair sit at the same position in the two FASTQ files.

However, I don't understand how or why your paired-end reads are of different lengths.
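The procedure described above could be sketched roughly as follows in Python. This is only an illustration, not the script actually used for this answer: the function names `trim_trailing_hash` and `filter_pair` and the `(name, seq, qual)` tuple layout are made up for the example, and the default `min_len=60` simply mirrors the threshold quoted for 100-base reads.

```python
def trim_trailing_hash(seq, qual, marker="#"):
    """Drop the run of `marker` quality characters at the 3' end of a read."""
    keep = len(qual.rstrip(marker))
    return seq[:keep], qual[:keep]


def filter_pair(read1, read2, min_len=60):
    """Trim both mates, then classify the pair.

    Each read is a hypothetical (name, seq, qual) tuple. A mate is kept only
    if its trimmed length is strictly greater than min_len.
    Returns ("paired" | "single" | "discarded", list_of_surviving_reads).
    """
    trimmed = []
    for name, seq, qual in (read1, read2):
        seq, qual = trim_trailing_hash(seq, qual)
        trimmed.append((name, seq, qual))

    ok = [len(seq) > min_len for _, seq, _ in trimmed]
    if all(ok):
        return "paired", trimmed          # both mates go to the paired files
    if any(ok):
        return "single", [trimmed[ok.index(True)]]  # orphan goes to a separate file
    return "discarded", []


# Mate 1 keeps 80 good bases; mate 2 keeps only 30, so mate 1 becomes an orphan.
r1 = ("read/1", "A" * 100, "I" * 80 + "#" * 20)
r2 = ("read/2", "C" * 100, "I" * 30 + "#" * 70)
status, survivors = filter_pair(r1, r2)
print(status, [name for name, _, _ in survivors])
```

Only the trailing run of "#" is removed here; a "#" in the middle of a read is left alone, which is one reason this is not equivalent to trimming on a quality threshold.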

Sean Davis commented, 7.0 years ago:

Note that trimming "#" bases from Illumina reads is not the same as trimming to an average quality >= 30.
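To make the distinction concrete, here is a small Python sketch that decodes quality characters, assuming the Phred+33 encoding used by Casava >= 1.8 output. The helper names (`phred33`, `error_probability`, `per_position_mean`) are invented for this example. Under that encoding "#" is Q2, so a read can contain no "#" at all and still have plenty of bases below Q30.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(c) - 33 for c in qual]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def per_position_mean(quals):
    """Mean score at each position across reads, which may differ in length."""
    longest = max(len(q) for q in quals)
    means = []
    for i in range(longest):
        column = [ord(q[i]) - 33 for q in quals if len(q) > i]
        means.append(sum(column) / len(column))
    return means


print(phred33("#"))            # the "#" marker decodes to Q2
print(phred33("IIII5555"))     # no "#" anywhere, yet the tail is only Q20
print(error_probability(20))   # Q20 corresponds to a 1% error rate
print(per_position_mean(["IIII", "II55", "I5"]))
```

The last call is the per-position average meant in the answer above: it is taken down each column across reads, not along a single read.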

Arun replied, 7.0 years ago:

True, I have modified my reply. Thank you, Sean.

camelbbs replied, 7.0 years ago:

Thanks! My paired-end reads really are of different lengths: 101 and 121. Does that affect the analysis?
Sean Davis (National Institutes of Health, Bethesda, MD), 7.0 years ago, wrote:

Trimming to a fixed length is probably going to be less than optimal, since you will likely end up cutting off some high-quality bases in some reads and leaving low-quality bases in others. I'd suggest using any one of the dozen or more trimming tools (cutadapt, fastx, sickle, trimmomatic) to do the quality trimming based on each read's individual base qualities.
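For a feel of what per-read quality trimming does, here is a minimal Python sketch of a running-sum rule for the 3' end. It is similar in spirit to the BWA-style quality trimming that some of these tools describe, but it is an illustration written for this thread, not the code of any of them. The function name `quality_trim_3prime`, the default `cutoff=20` and the Phred+33 `offset` are assumptions.

```python
def quality_trim_3prime(seq, qual, cutoff=20, offset=33):
    """Trim low-quality bases from the 3' end of one read.

    Walk backwards from the 3' end accumulating (cutoff - Q). Cut at the
    position where that running sum peaks, and stop as soon as it turns
    negative, i.e. once good bases outweigh the bad ones seen so far.
    """
    running = 0
    best = 0
    cut = len(qual)
    for i in range(len(qual) - 1, -1, -1):
        running += cutoff - (ord(qual[i]) - offset)
        if running < 0:
            break
        if running > best:
            best = running
            cut = i
    return seq[:cut], qual[:cut]


# A read whose tail degrades is shortened ...
print(quality_trim_3prime("ACGTACGTAC", "IIIIII5#+#"))
# ... while a read that is good to the end keeps its full length.
print(quality_trim_3prime("ACGTACGTAC", "IIIIIIIIII"))
```

Unlike a fixed cut at, say, 95 bp, each read is shortened only as far as its own qualities require. A minimum-length filter and pair re-balancing would still be applied afterwards.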
