Question: Trim The Low-Quality End Of 100Bp Read
4
9.2 years ago by
Bioscientist1.7k
Bioscientist1.7k wrote:

I usually first filtered the fastq read first, tossing out those reads with average phred score< say, 20. But recently I realized long read (100bp) may have the problem of decreasing quality towards the end of read, even if the average pass the criterion. Just like below:

@HWI-ST150_0130:3:64:18989:54871#0/2
AGACTCCCGGGTAGCAAGTACCTGGGACCACAGGTTTGTGCGACCATGCCTGACTAATTTTTGTATTTTTAGTAGTGATGGGGTTTCACTAGGTTGGCGAG
+HWI-ST150_0130:3:64:18989:54871#0/2
ed\dffffffbff^eeeabffffedafffcddbbe\ceacYddadbcdcYbRR][Yccc[_dddddP[ZMaBBBBBBBBBBBBBBBBBBBBBBBBBB


(here format is Illumina 1.5+) Just wondering is there any package to trim off these low-quality end? Also, after the trimming(say trim all bases encoded with "B", representing the lowest quality in Illumina 1.5+), each read will have variable length. Is this OK? What should I do to determined the edit distance parameter, which is dependent on the read length? (so calculate the average read length?)

Also, does BWA have such options?

THANKS

quality trimming bwa • 13k views
written 9.2 years ago by Bioscientist1.7k
3
9.2 years ago by
Andreas2.5k
Singapore
Andreas2.5k wrote:

In case of your Bs it's actually recommended by Illumina to remove them because trailing Bs in version 1.5+ are used as so called Read Segment Quality Control Indicator, i.e. there are not an error rate, but rather indicate that this region should not be used for downstream analysis,

See also the encoding section in the Wikipedia page for FastQ: "...the value 2, encoded by ASCII 66 "B", is used also at the end of reads as a Read Segment Quality Control Indicator [6]. The Illumina manual [7] (page 30) states the following: If a read ends with a segment of mostly low quality (Q15 or below), then all of the quality values in the segment are replaced with a value of 2 (encoded as the letter B in Illumina's text-based encoding of quality scores)... This Q2 indicator does not predict a specific error rate, but rather indicates that a specific final portion of the read should not be used in further analyses."

So remove all trailing B's. It's up to you if you want to keep the reads if they get really short. Most mappers can deal with varying read length, so no problem there. If your mapper and/or downstream analysis tools can deal with base call qualities, then there's no need to trim even further in my opinion. In fact other trimming choices are completely arbitrary.

Andreas

thx. learn a lot. Since this "B" has special meaning, do you think it's appropriate to use BWA aln -q option, suggested by Christ Penkett?

Not entirely sure, but I think this should do the trick as well.

2
9.2 years ago by
Gww2.7k
Gww2.7k wrote:

I think it's a great idea to trim off low quality ends on an aligner even if you are using a local aligner. Most aligners support variable length reads and there are numerous trimming tools out there. You may also want to consider trimming sequencing adapters from the 5'-ends of your reads.

1. FastX toolkit
2. SeqPrep (which I've had some luck using)
3. There are also implementations on galaxy and a range of scripting languages.
1
9.2 years ago by
Darked894.2k
Barcelona, Spain
Darked894.2k wrote:

Quality scores from the machine are not very reliable. I got 96bp reads with all Bs as quality matching >90bp to the genome. For at least few applications (i.e. novel transcript discovery) "map first, filter out crap later" may be better. Also if you do paired ends mappings discarding reads is cumbersome. Mapping is also quality control for the reads, and once I got a collaborator stating "we map X % of reads using tophat with following switches", but he forgot to mention how strict his data filtering was -> try to reproduce that.

If you look for unspliced mapper which can trim reads on the fly based on how good they map, I suggest taking a look at http://last.cbrc.jp/

1
9.2 years ago by
Chris Penkett480
Cambridge, UK
Chris Penkett480 wrote:

Use the -q flag in bwa aln to trim reads below that Q score according to

From the manual page:

-q INT Parameter for read trimming. BWA trims a read down to argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]


There is also the SolexaQA and prinseq packages that trim reads. SolexaQA has a BWA-like trimming option.

Chris

thx, but here is pipeline Illumina 1.5+. Do I need to first convert it to Sanger format? And how?

bwa aln -I should work: % bwa aln 2>&1 | grep "-I" -I the input is in the Illumina 1.3+ FASTQ-like format

I used the command like: bwa aln -I -q 20

1
8.9 years ago by
Erik10
Erik10 wrote:

fastq-mcf detects whether trimming, clipping are needed and removed low-quality ends while keeping mate-pairs intact. You can use "n/a" as the adapter file, if you don't want clipping done. But you probably do want it done.

0
9.2 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

Have a look here.

Also, there is something fishy about those B's - look here.

0
9.2 years ago by
Chris Penkett480
Cambridge, UK
Chris Penkett480 wrote:

This is too long to add as a comment.

To convert fastQ files, I have an alias in my bash startup file (.bashrc):

% grep ill2sanger .bash_aliases
alias ill2sanger="perl -lpe 'BEGIN {\$count = 0} chomp; tr/\x40-\xff\x00-\x3f/\x21-\xe0\x21/ if \$count++ % 4 == 3'"


Run it like this:

% ill2sanger tmp.fq
@HWI-ST150_0130:3:64:18989:54871#0/2
AGACTCCCGGGTAGCAAGTACCTGGGACCACAGGTTTGTGCGACCATGCCTGACTAATTTTTGTATTTTTAGTAGTGATGGGGTTTCACTAGGTTGGCGAG
+HWI-ST150_0130:3:64:18989:54871#0/2

% ill2sanger tmp.fq > tmp_sanger.fq


I notice you have the multiple B's at the end of your fastQ read - this is an illumina place-holder to say that cluster may have issues, so set all bases to a quality of 2. This is sometimes a temporary thing for only a few cycles in a cluster and can be turned off to get real qualities for bases in later cycles. This can be done in the RTA illumina base-calling pipeline if you don't want it.

Chris

Actually I now decide just to first trim off those "B".......yeah, because they are simply indicator of bad-quality bases

"B" means "quality unknown". When we did a error distribution of "B" aligned bases agains PhiX shows that they are roughly a quality of "13", in that a little less than 1 out of 10 or so are incorrect. Trimming them is very important however, because B's can have strong sequence-specific bias.