5.3 years ago by
For what it's worth, I always clean my raw data, at the very least for poor-quality base calls or poor-quality reads in general. So do my colleagues. The real (and somewhat longer) answer to your question, however, is rooted in what type of data you're generating and what you want to ultimately do with it.
If you have shotgun libraries and are looking to assemble a whole genome, keeping low-quality reads and bases in the data set increases the complexity of the assembly and can dramatically increase run time. It's pretty striking; I've seen assemblers get 'hung up' trying to sort out the k-mer graph, and the problem can disappear once poor reads are removed. The same applies to duplicates.
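To make "removing poor reads" concrete, here is a minimal sketch of 3' quality trimming plus a read-level filter in plain Python. The function names and the thresholds (`min_q=20`, `min_len=30`, `min_mean_q=20`) are my own illustrative choices and Phred+33 encoding is assumed; in practice you would use a dedicated trimmer rather than this.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, qual, min_q=20):
    """Drop bases from the 3' end until the last base has quality >= min_q."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def clean_read(seq, qual, min_q=20, min_len=30, min_mean_q=20):
    """Return the trimmed (seq, qual), or None if the read should be discarded.

    Thresholds are illustrative assumptions, not recommended values.
    """
    seq, qual = trim_3prime(seq, qual, min_q)
    # Discard reads that are empty or too short after trimming.
    if not seq or len(seq) < min_len:
        return None
    scores = phred_scores(qual)
    # Discard reads whose remaining bases are poor on average.
    if sum(scores) / len(scores) < min_mean_q:
        return None
    return seq, qual
```

A read with a good body and a bad tail is shortened, while a read that is poor throughout is dropped entirely, which is the case that tends to clutter the assembly graph.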
If you're just mapping to a reference and calling variants, it's less of a big deal nowadays than it was a few years ago. BWA's MEM algorithm can soft-clip reads to improve mapping quality, which is useful if you have residual adapters or low-quality spans at the ends of your reads. I see this as a secondary bonus of sorts, and I would still trim my reads.
Also, you might have a non-random distribution of k-mers or subsequences simply because of your library prep. Imagine that you PCR a single locus and then make a library out of it. You will definitely see the k-mers from that locus overrepresented. Scaling up, say you targeted and sequenced the exome. Again, your distribution of k-mers might be non-random, because you would expect to see certain motifs overrepresented (start/stop codons, for example).
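As a rough illustration of the single-locus case, the sketch below counts k-mers across a set of reads and reports the most common ones as a fraction of all k-mers. The function names and the choice of `k` are assumptions for the example; this is not how any particular QC tool computes its overrepresentation report.

```python
from collections import Counter


def kmer_counts(reads, k=5):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def top_kmer_fractions(reads, k=5, n=3):
    """Return the n most common k-mers with their share of all k-mers."""
    counts = kmer_counts(reads, k)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(kmer, count / total) for kmer, count in counts.most_common(n)]


# Toy "amplicon" library: the same locus sequenced ten times.
amplicon_reads = ["ACGTACGT"] * 10
print(top_kmer_fractions(amplicon_reads, k=4, n=1))
```

Here a single k-mer dominates purely because of how the library was made, not because anything is wrong with the sequencing, which is why a flag like this should not by itself trigger trimming.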
So, I would clean my raw data using a series of best practices (remove low-quality bases/reads/adapters, identify overlaps, dedup - but the dedup doesn't apply to your expression data). I would also be a bit leery of chopping bases off for no reason other than that a summary report suggests overrepresentation. The question to ask is, "Will this affect the biological interpretation of my data systematically?"
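For the dedup step, the simplest possible version is dropping reads whose sequence is an exact copy of one already seen. This is only a sketch under that assumption (the `dedup_exact` name and the `(name, sequence)` tuple layout are mine); duplicate marking after alignment usually works from mapping coordinates instead, and none of this is appropriate for expression data, as noted above.

```python
def dedup_exact(reads):
    """Keep the first occurrence of each distinct sequence.

    `reads` is an iterable of (name, sequence) pairs; order is preserved.
    """
    seen = set()
    unique = []
    for name, seq in reads:
        if seq not in seen:
            seen.add(seq)
            unique.append((name, seq))
    return unique


reads = [("r1", "ACGTAC"), ("r2", "ACGTAC"), ("r3", "TTGACA")]
print(dedup_exact(reads))
```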
I would love to hear others' thoughts.