Question

Trim The Low-Quality End Of 100Bp Read

4

Entering edit mode

12.4 years ago

Bioscientist ★ 1.7k

I usually first filtered the fastq read first, tossing out those reads with average phred score< say, 20. But recently I realized long read (100bp) may have the problem of decreasing quality towards the end of read, even if the average pass the criterion. Just like below:

@HWI-ST150_0130:3:64:18989:54871#0/2
AGACTCCCGGGTAGCAAGTACCTGGGACCACAGGTTTGTGCGACCATGCCTGACTAATTTTTGTATTTTTAGTAGTGATGGGGTTTCACTAGGTTGGCGAG
+HWI-ST150_0130:3:64:18989:54871#0/2
ed\`dffffffbff^eeeabffffedafffcddbbe`\cea``cYddadbcdcYbRR][Yccc[_dddddP[ZMaBBBBBBBBBBBBBBBBBBBBBBBBBB

(here format is Illumina 1.5+) Just wondering is there any package to trim off these low-quality end? Also, after the trimming(say trim all bases encoded with "B", representing the lowest quality in Illumina 1.5+), each read will have variable length. Is this OK? What should I do to determined the edit distance parameter, which is dependent on the read length? (so calculate the average read length?)

Also, does BWA have such options?

THANKS

quality trimming bwa • 15k views

ADD COMMENT • link updated 12.4 years ago by Erik ▴ 10 • written 12.4 years ago by Bioscientist ★ 1.7k

Ram · Answer 1 · 2011-11-25

In case of your Bs it's actually recommended by Illumina to remove them because trailing Bs in version 1.5+ are used as so called Read Segment Quality Control Indicator, i.e. there are not an error rate, but rather indicate that this region should not be used for downstream analysis,

See also the encoding section in the Wikipedia page for FastQ: "...the value 2, encoded by ASCII 66 "B", is used also at the end of reads as a Read Segment Quality Control Indicator [6]. The Illumina manual [7] (page 30) states the following: If a read ends with a segment of mostly low quality (Q15 or below), then all of the quality values in the segment are replaced with a value of 2 (encoded as the letter B in Illumina's text-based encoding of quality scores)... This Q2 indicator does not predict a specific error rate, but rather indicates that a specific final portion of the read should not be used in further analyses."

See also this link on open-bio.

So remove all trailing B's. It's up to you if you want to keep the reads if they get really short. Most mappers can deal with varying read length, so no problem there. If your mapper and/or downstream analysis tools can deal with base call qualities, then there's no need to trim even further in my opinion. In fact other trimming choices are completely arbitrary.

Andreas

score 2 · Answer 2 · 2011-11-24

I think it's a great idea to trim off low quality ends on an aligner even if you are using a local aligner. Most aligners support variable length reads and there are numerous trimming tools out there. You may also want to consider trimming sequencing adapters from the 5'-ends of your reads.

FastX toolkit
SeqPrep (which I've had some luck using)
There are also implementations on galaxy and a range of scripting languages.

score 1 · Answer 3 · 2011-11-25

Quality scores from the machine are not very reliable. I got 96bp reads with all Bs as quality matching >90bp to the genome. For at least few applications (i.e. novel transcript discovery) "map first, filter out crap later" may be better. Also if you do paired ends mappings discarding reads is cumbersome. Mapping is also quality control for the reads, and once I got a collaborator stating "we map X % of reads using tophat with following switches", but he forgot to mention how strict his data filtering was -> try to reproduce that.

If you look for unspliced mapper which can trim reads on the fly based on how good they map, I suggest taking a look at http://last.cbrc.jp/

score 1 · Answer 4 · 2011-11-25

1

Entering edit mode

12.4 years ago

Chris Penkett ▴ 490

Use the -q flag in bwa aln to trim reads below that Q score according to

From the manual page:

-q INT Parameter for read trimming. BWA trims a read down to argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]

There is also the SolexaQA and prinseq packages that trim reads. SolexaQA has a BWA-like trimming option.

Chris

ADD COMMENT • link 12.4 years ago by Chris Penkett ▴ 490

0

Entering edit mode

thx, but here is pipeline Illumina 1.5+. Do I need to first convert it to Sanger format? And how?

ADD REPLY • link 12.4 years ago by Bioscientist ★ 1.7k

0

Entering edit mode

bwa aln -I should work: % bwa aln 2>&1 | grep "-I" -I the input is in the Illumina 1.3+ FASTQ-like format

ADD REPLY • link 12.4 years ago by Chris Penkett ▴ 490

0

Entering edit mode

I used the command like: bwa aln -I -q 20

ADD REPLY • link 12.4 years ago by Bioscientist ★ 1.7k

score 1 · Answer 5 · 2012-02-22

1

Entering edit mode

12.2 years ago

Erik ▴ 10

fastq-mcf detects whether trimming, clipping are needed and removed low-quality ends while keeping mate-pairs intact. You can use "n/a" as the adapter file, if you don't want clipping done. But you probably do want it done.

ADD COMMENT • link 12.2 years ago by Erik ▴ 10

Ram · Answer 6 · 2011-11-24

0

Entering edit mode

12.4 years ago

Martin A Hansen 3.0k

Have a look here.

Also, there is something fishy about those B's - look here.

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.4 years ago by Martin A Hansen 3.0k

score 0 · Answer 7 · 2011-11-28

0

Entering edit mode

12.4 years ago

Chris Penkett ▴ 490

This is too long to add as a comment.

To convert fastQ files, I have an alias in my bash startup file (.bashrc):

% grep ill2sanger .bash_aliases 
alias ill2sanger="perl -lpe 'BEGIN {\$count = 0} chomp; tr/\x40-\xff\x00-\x3f/\x21-\xe0\x21/ if \$count++ % 4 == 3'"

Run it like this:

% ill2sanger tmp.fq
@HWI-ST150_0130:3:64:18989:54871#0/2
AGACTCCCGGGTAGCAAGTACCTGGGACCACAGGTTTGTGCGACCATGCCTGACTAATTTTTGTATTTTTAGTAGTGATGGGGTTTCACTAGGTTGGCGAG
+HWI-ST150_0130:3:64:18989:54871#0/2
FE=AEGGGGGGCGG?FFFBCGGGGFEBGGGDEECCFA=DFBAAD:EEBECDED:C33><:DDD<@EEEEE1<;.B##########################

% ill2sanger tmp.fq > tmp_sanger.fq

I notice you have the multiple B's at the end of your fastQ read - this is an illumina place-holder to say that cluster may have issues, so set all bases to a quality of 2. This is sometimes a temporary thing for only a few cycles in a cluster and can be turned off to get real qualities for bases in later cycles. This can be done in the RTA illumina base-calling pipeline if you don't want it.

Chris

ADD COMMENT • link 12.4 years ago by Chris Penkett ▴ 490

0

Entering edit mode

Actually I now decide just to first trim off those "B".......yeah, because they are simply indicator of bad-quality bases

ADD REPLY • link 12.4 years ago by Bioscientist ★ 1.7k

0

Entering edit mode

"B" means "quality unknown". When we did a error distribution of "B" aligned bases agains PhiX shows that they are roughly a quality of "13", in that a little less than 1 out of 10 or so are incorrect. Trimming them is very important however, because B's can have strong sequence-specific bias.

ADD REPLY • link 12.2 years ago by Erik ▴ 10