Question: How To End Trim Sequences Using Prinseq?
3
gravatar for Panos
7.6 years ago by
Panos1.6k
Geneva, Switzerland
Panos1.6k wrote:

I ran the following command after reading prinseq's manual

prinseq-lite.pl -fastq seqs.fastq -trim_qual_left 5 -trim_qual_right 5 -out_format 1 -out_good seqs -out_bad null

I would expect from this command to trim low quality ends and give me sequences in which all bases have a quality of at least 5. Nevertheless, when I check the output, I can find lots of low quality bases in the output file and even bases corresponding to these long runs of Bs are not removed!

Am I missing something? Is the command right for what I want to do?

trimming quality • 6.6k views
ADD COMMENTlink written 7.6 years ago by Panos1.6k
6
gravatar for Jeremy Leipzig
7.6 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

I think the developer just took the equation from Heng Li's MAQ faq at http://maq.sourceforge.net/fastq.shtml

$Q = 10 * log(1 + 10 ** (ord($sq) - 64) / 10.0)) / log(10);

which you notice is missing a forward parenthesis somewhere, and which I think was originally for Solexa+64->Sanger conversions anyway. See FASTQ wiki

you see on line 1533 of prinseq-lite.pl:

return ((10 * log(1 + 10 ** (ord($qual) - 64) / 10.0)) / log(10));

he just tacked on an extra ( before the first 10, which means an Illumina phred-64 "B" returns a 10 and a "h" returns a whopping 390

I suspect the correct equation might be:

$Q = 10 *  log(1 + 10 ** ((ord($qual) - 64) /  10.0)) / log(10);

the sad thing is that incorrect equation has been copy-pasted all over the web and even in textbooks

at any rate

return (ord($qual)-64);

should suffice

ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by Jeremy Leipzig18k

Thanks Jeremy! I replaced the old equation with the (simple) one you said and it works fine! Other quality-dependent parameters were also affected by this ('min_qual_mean' for example which now also gives me the expected output)! Thanks again for your time!

ADD REPLYlink written 7.6 years ago by Panos1.6k

no problem seeing how this bug propogated was actually pretty interesting

ADD REPLYlink written 7.6 years ago by Jeremy Leipzig18k

no problem seeing how this bug propagated was actually pretty interesting

ADD REPLYlink written 7.6 years ago by Jeremy Leipzig18k
2
gravatar for Jeremy Leipzig
7.6 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

Is this Illumina 1.3 (phred+64) data? From your mention of B runs I suspect it is:

-si13

Quality data in FASTQ file is in Solexa/Illumina 1.3+ format and should be scaled to Phred quality scores ranging from 0 to 40. (Not required for Solexa/Illumina 1.5+, Sanger, Roche/454, Ion Torrent, PacBio data.)
ADD COMMENTlink written 7.6 years ago by Jeremy Leipzig18k

I have tried the '-si13' option but sequences with B runs are still there.

ADD REPLYlink written 7.6 years ago by Panos1.6k

I have tried the '-si13' option but sequences with B runs are still there. One explanation would be that since B qualities have this special meaning, they might not be transformed to a number (as happens with all other quality values)...

ADD REPLYlink written 7.6 years ago by Panos1.6k

i think it is a bug please see my other answer

ADD REPLYlink written 7.6 years ago by Jeremy Leipzig18k
1
gravatar for Bio_X2Y
7.6 years ago by
Bio_X2Y3.6k
Ireland
Bio_X2Y3.6k wrote:

I haven't used this program, but intuitively, I suspect the [?]trim_qual_left[?] and [?]trim_qual_right[?] work by clipping bases from each side until a base with quality 5 or higher is encountered. So you would end up with reads that might still contain low quality bases, but the two bases remaining at the edges of your reads would have a quality of 5 or higher?

Are you trying to remove all reads where any base has a quality of <5, even if the base is in the middle of the read and is flanked by high quality bases? This seems like an unusual thing to want to do. Normally you would want to limit yourself to trimming the ends since that's where you would expect the low quality bases to be (e.g. in an Illumina experiment, you might see the 5' ends of your reads have high quality, and the quality degrades as you move closer to the 3' end). Read aligners tend to be tolerant of one or two low quality bases in the middle of an otherwise good sequence (configured by parameters).

If you want to remove any read with a base with a quality < 5, you might want to try experimenting with the other trimming parameters. E.g. I might start with setting the [?]trim_qual_window[?] to the length of your reads. From what I understand after a quick look at the manual, this might implement the trimming rule by considering the whole read at once instead of a sliding window of 1 base.

ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by Bio_X2Y3.6k

I just want to trim the end of reads. The problem, though, is that "trim_qual_left" and "trim_qual_right" don't work as I expected; after running the above-mentioned command I still get reads with B runs in their 3' end, for example.

ADD REPLYlink written 7.6 years ago by Panos1.6k

Perhaps you can only run one rule at a time? Maybe try running with "trim_qual_left", and then run it again for "trim_qual_right"?

ADD REPLYlink written 7.6 years ago by Bio_X2Y3.6k

Perhaps the program can implement only one rule at a time? Maybe try running with "trim_qual_left", then run the output through the program again, the second time with "trim_qual_right"?

ADD REPLYlink written 7.6 years ago by Bio_X2Y3.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1264 users visited in the last hour