Before mapping RNA-seq reads, TopHat always performs a quality filtering step when preparing the reads. I would like to know what the cut-off is for discarding a read. It's difficult to find information about this, since most posts about quality and TopHat/Bowtie relate to mapping quality (naturally) and not read quality.
I cannot find much in the TopHat manual, but the Bowtie 2 manual says:
Some reads are skipped or "filtered out" by Bowtie 2. For example, reads may be filtered out because they are extremely short or have a high proportion of ambiguous nucleotides. Bowtie 2 will still print a SAM record for such a read, but no alignment will be reported and the YF:i SAM optional field will be set to indicate the reason the read was filtered.
YF:Z:LN: the read was filtered because it had length less than or equal to the number of seed mismatches set with the -N option.
YF:Z:NS: the read was filtered because it contains a number of ambiguous characters (usually N or .) greater than the ceiling specified with --n-ceil.
YF:Z:SC: the read was filtered because the read length and the match bonus (set with --ma) are such that the read can't possibly earn an alignment score greater than or equal to the threshold set with --score-min.
YF:Z:QC: the read was filtered because it was marked as failing quality control and the user specified the --qc-filter option. This only happens when the input is in Illumina's QSEQ format (i.e. when --qseq is specified) and the last (11th) field of the read's QSEQ record contains 1.
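One way to see which of these reasons actually occurs in a given run is to tally the YF:Z tags in the SAM output. The following is only a minimal sketch: the function name `count_filter_reasons` and the example records are made up for illustration, and it assumes plain-text SAM lines with the optional fields starting at column 12.

```python
from collections import Counter


def count_filter_reasons(sam_lines):
    """Tally YF:Z:<code> tags across SAM records, skipping header lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not a read record
        fields = line.rstrip("\n").split("\t")
        # Optional fields start at column 12 (index 11) of a SAM record.
        for tag in fields[11:]:
            if tag.startswith("YF:Z:"):
                counts[tag[len("YF:Z:"):]] += 1
    return counts


# Hypothetical records, not taken from a real run.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tNNNNNNNN\tIIIIIIII\tYT:Z:UU\tYF:Z:NS",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tAC\tII\tYT:Z:UU\tYF:Z:LN",
    "r3\t0\tchr1\t100\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tAS:i:0\tYT:Z:UU",
]
print(count_filter_reasons(example))
```

For a real file, the lines could be read with `open("alignments.sam")` (a hypothetical path) and passed in directly.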
My read length is definitely not shorter than the seed mismatches so this first option can be ruled out.
Regarding the second option about ambiguous characters, this is what the manual says about --n-ceil:
Sets a function governing the maximum number of ambiguous characters (usually Ns and/or .s) allowed in a read as a function of read length. For instance, specifying -L,0,0.15 sets the N-ceiling function f to f(x) = 0 + 0.15 * x, where x is the read length. See also: [setting function options]. Reads exceeding this ceiling are [filtered out]. Default: L,0,0.15.
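So by default the ceiling is 0.15 * read length; for example, a 100 bp read would only be filtered if it contained more than 15 ambiguous characters. Pretty straightforward. No questions here.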
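As a sanity check on that reading, the default ceiling can be sketched in a few lines. This is an illustration under assumptions, not Bowtie 2's actual code: the function names are hypothetical, and "ambiguous" is taken here to mean any character other than A, C, G or T.

```python
def n_ceil(read_length, constant=0.0, coefficient=0.15):
    """Default N-ceiling L,0,0.15, i.e. f(x) = 0 + 0.15 * x."""
    return constant + coefficient * read_length


def exceeds_n_ceil(seq):
    """True if the read has more ambiguous characters than the ceiling allows."""
    # Assumption: anything that is not A/C/G/T counts as ambiguous.
    ambiguous = sum(1 for base in seq.upper() if base not in "ACGT")
    return ambiguous > n_ceil(len(seq))


# Hypothetical 100 bp reads with 10 and 20 ambiguous characters.
print(exceeds_n_ceil("N" * 10 + "ACGT" * 22 + "AC"))
print(exceeds_n_ceil("N" * 20 + "ACGT" * 20))
```

Because the number of ambiguous characters is an integer, it makes no difference to the comparison whether a fractional ceiling such as 7.5 is rounded down first.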
The third option, regarding --ma, I would assume does not apply with default settings, since the default is end-to-end alignment? --ma always equals 0 in this default mode.
The fourth option applies to users specifying the --qseq option, so for someone like me who uses FASTQ, I would assume it's not relevant.
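To test that assumption, the condition described for YF:Z:SC can be sketched as "best possible score (match bonus times read length) is below the --score-min threshold". The defaults used below are my reading of the Bowtie 2 manual and should be treated as assumptions: --ma 0 with --score-min L,-0.6,-0.6 in end-to-end mode, and --ma 2 with --score-min G,20,8 in --local mode. The function names are hypothetical, and any rounding Bowtie 2 applies internally is ignored.

```python
import math


def eval_function_option(spec, x):
    """Evaluate a function option such as 'L,-0.6,-0.6' or 'G,20,8' at x."""
    kind, constant, coefficient = spec.split(",")
    transforms = {"L": lambda v: v, "S": math.sqrt, "G": math.log}
    if kind not in transforms:
        raise ValueError(f"unsupported function type in this sketch: {kind}")
    return float(constant) + float(coefficient) * transforms[kind](x)


def filtered_by_score_min(read_length, match_bonus, score_min_spec):
    """True if even a perfect alignment could not reach the minimum score."""
    best_possible = match_bonus * read_length
    return best_possible < eval_function_option(score_min_spec, read_length)


# End-to-end defaults (assumed): best score 0, threshold always negative.
print(filtered_by_score_min(100, 0, "L,-0.6,-0.6"))
# Local-mode defaults (assumed): a very short read cannot reach the threshold.
print(filtered_by_score_min(10, 2, "G,20,8"))
print(filtered_by_score_min(100, 2, "G,20,8"))
```

Under these assumed defaults, the end-to-end threshold is negative for every read length while the best possible score is 0, so this filter would never trigger there. It would only matter for very short reads in --local mode.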
Does that mean that, with default settings, Bowtie/TopHat only take ambiguous characters into consideration? What about FASTQ base quality/read quality: is this not taken into consideration when filtering?
Appreciate any help and thoughts.