Before mapping RNA-seq reads, TopHat always performs a quality filtering step when preparing them. What is the cut-off for discarding a read? It's difficult to find information about this, since most posts about quality and TopHat/Bowtie relate to mapping quality (naturally) and not read quality.
I cannot find much in the TopHat manual, but the Bowtie 2 manual says:
Some reads are skipped or "filtered out" by Bowtie 2. For example, reads may be filtered out because they are extremely short or have a high proportion of ambiguous nucleotides. Bowtie 2 will still print a SAM record for such a read, but no alignment will be reported and the `YF:i` SAM optional field will be set to indicate the reason the read was filtered.

* `YF:Z:LN`: the read was filtered because it had length less than or equal to the number of seed mismatches set with the `-N` option.
* `YF:Z:NS`: the read was filtered because it contains a number of ambiguous characters (usually `N` or `.`) greater than the ceiling specified with `--n-ceil`.
* `YF:Z:SC`: the read was filtered because the read length and the match bonus (set with `--ma`) are such that the read can't possibly earn an alignment score greater than or equal to the threshold set with `--score-min`.
* `YF:Z:QC`: the read was filtered because it was marked as failing quality control and the user specified the `--qc-filter` option. This only happens when the input is in Illumina's QSEQ format (i.e. when `--qseq` is specified) and the last (11th) field of the read's QSEQ record contains `1`.
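One way to see which of these filters actually fired is to tally the `YF:Z` values in the SAM output. The following is only a minimal sketch: it assumes plain-text SAM lines as Bowtie 2 itself would print them (TopHat's own output may not carry this field), and the function name `count_filter_reasons` and the two example records are made up for illustration.

```python
from collections import Counter


def count_filter_reasons(sam_lines):
    """Tally YF:Z values found in the optional fields of SAM records."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # The first 11 SAM columns are mandatory; optional tags follow.
        for tag in fields[11:]:
            if tag.startswith("YF:Z:"):
                counts[tag[len("YF:Z:"):]] += 1
    return counts


# Hypothetical records: one unaligned read filtered for Ns, one aligned read.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tNNNNNNNN\tIIIIIIII\tYT:Z:UU\tYF:Z:NS",
    "r2\t0\tchr1\t100\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tAS:i:0\tYT:Z:UU",
]

if __name__ == "__main__":
    print(dict(count_filter_reasons(example)))
```

With real data, the same function could be fed the lines of a SAM file opened with `open(path)`.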
My reads are definitely longer than the number of seed mismatches, so the first option can be ruled out.
Regarding the second option, about ambiguous characters, this is what the manual says for `--n-ceil`:
Sets a function governing the maximum number of ambiguous characters (usually `N`s and/or `.`s) allowed in a read as a function of read length. For instance, specifying `-L,0,0.15` sets the N-ceiling function `f` to `f(x) = 0 + 0.15 * x`, where x is the read length. See also: [setting function options]. Reads exceeding this ceiling are [filtered out]. Default: `L,0,0.15`.
So by default the ceiling is 0.15 * read length. Pretty straightforward. No questions here.
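To make the default ceiling concrete, here is a small sketch of the check as described in the manual text quoted above. It is an illustration only: the function names are hypothetical, and exactly how Bowtie 2 rounds or compares the value internally is an assumption here (a strict "greater than" is used, following the `YF:Z:NS` description).

```python
def n_ceiling(read_length, constant=0.0, coefficient=0.15):
    """Linear N-ceiling f(x) = constant + coefficient * x (default L,0,0.15)."""
    return constant + coefficient * read_length


def exceeds_n_ceiling(read, ambiguous="N."):
    """Return True if the read has more ambiguous characters than the ceiling."""
    n_count = sum(1 for base in read.upper() if base in ambiguous)
    return n_count > n_ceiling(len(read))


if __name__ == "__main__":
    # A 100 bp read has a ceiling of about 15 ambiguous characters.
    print(exceeds_n_ceiling("A" * 84 + "N" * 16))  # 16 Ns: over the ceiling
    print(exceeds_n_ceiling("A" * 90 + "N" * 10))  # 10 Ns: under the ceiling
```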
I would assume the third option, involving `--ma`, does not apply with default settings, since the default is end-to-end alignment and the match bonus (`--ma`) is always 0 in that mode?
The fourth option applies only to users specifying the `--qseq` option, so for someone like me who uses FASTQ input, I assume it's not relevant.
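The arithmetic behind that assumption can be sketched as follows. This relies on my reading of the Bowtie 2 manual that the end-to-end default for `--score-min` is `L,-0.6,-0.6`, which should be verified against the installed version; TopHat may also pass its own settings to Bowtie. The function names are hypothetical.

```python
def min_score_end_to_end(read_length, constant=-0.6, coefficient=-0.6):
    """Assumed end-to-end default --score-min L,-0.6,-0.6: f(x) = -0.6 - 0.6 * x."""
    return constant + coefficient * read_length


def could_pass_score_filter(read_length, match_bonus=0):
    """Best possible score is match_bonus * length; compare it to the threshold."""
    best_possible = match_bonus * read_length
    return best_possible >= min_score_end_to_end(read_length)


if __name__ == "__main__":
    for length in (20, 50, 100, 250):
        print(length, min_score_end_to_end(length), could_pass_score_filter(length))
```

Under these assumptions the threshold is always negative while the best possible score is 0, so the `YF:Z:SC` filter would never trigger for any read length.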
Does that mean that, with default settings, Bowtie/TopHat only takes ambiguous characters into consideration? What about FASTQ base quality/read quality: is this not taken into account when filtering?
Appreciate any help and thoughts.