I have always faced a problem while analyzing small RNAseq data, at the step of adapter trimming.
I use fastx_clipper to clip adapters. The problem is that you can't be really sure of very small alignments between adapter and reads because they may not really be originating from the adapters, which means that you should specify a lower limit of alignment for clipping. I usually set it as 5 (intuitively).
But if a small piece of sequence really came from the adapter then it will remain and there is no way to clip it without any doubt.
The real problem comes during aligning the reads to the reference sequence. Aligners like bowtie (which i prefer to use), generally have user defined argument for number of allowed mismatches, which generally shouldn't exceed 4 both for performance and stringency.
Subsequently, you might lose a really valuable read.
An example: A certain unknown miRNA (annotated as one in mirbase) is 28 nt long. If I clip adapters with minimum alignment of 5 then I end up having 2-3 nucleotides of adapter in one of the highly expressed miRNA in that tissue, which then goes undetected because of mismatch in alignment.
To avoid this problem I sometimes trim the reads to around 25nt (for miRNA profiling). This creates a new problem: You can't really distinguish whether the read came from a pre-miRNA (a longer RNA) or from mature-miRNA (smaller RNA that arises by processing of pre-miRNA)
If I already have some idea of what I am going to get after analysis, I can tweak it somehow (for e.g by grep-ing the first 2/3/4 letters of the adapter sequence in the miRNA database to find out the chance of random occurrence of adapter like sequences). But if I am looking for something new it wont help.
Does anyone have an experience or an idea about how to solve this problem?