Question: low mapping rate, finding possible source of contamination
Hello, I have received small RNA-seq data from a pathogenic bacterium for bioinformatic analysis. After trimming the adaptors with trim_galore, the FASTQ file (almost 1 million reads) was mapped to the reference genome using bowtie, giving an overall mapping rate of 26% (uniquely mapped + multi-mapped). The mapping rate does not change much even if I allow two mismatches. As a negative control, the reads were also mapped to the mouse genome, which again gives a 25% mapping rate, similar to what I get when I align to the bacterial genome. Most of these mapped reads map to rRNAs and tRNAs, which is why mapping to the bacterial genome and the mouse genome gives similar results. Now, I do not know what the remaining ~75% of unmapped reads are. It is possible that there was contamination during library preparation, etc. How can I find the source of the contamination? Is there a way to BLAST the unmapped reads to find out which genome/strain they are probably coming from?


There are a lot of possibilities, with contamination being just one; another is incomplete adapter trimming. Sometimes FastQC is helpful in this kind of situation (bowtie cannot map low-quality reads, for example); sometimes using a different aligner helps; and sometimes BLASTing for contaminant organisms is useful. However, a statement like "After trimming the adaptors with trim_galore" is not informative on its own: you need to describe the command used, the results, and perhaps the read length distribution afterward.
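
To make two of the checks above concrete (the length distribution after trimming, and BLASTing for contaminants), here is a minimal Python sketch using only the standard library. It assumes an uncompressed FASTQ with exactly four lines per record; the function names and the file name in the usage comment are made up for illustration and are not part of trim_galore, bowtie, or BLAST.

```python
import random
from collections import Counter
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence) tuples from an iterable of FASTQ lines.

    Assumes the simple 4-lines-per-record layout (no wrapped sequences).
    """
    lines = iter(handle)
    while True:
        record = list(islice(lines, 4))
        if len(record) < 4:
            break
        # Drop the leading '@' from the header line.
        yield record[0][1:].strip(), record[1].strip()


def length_distribution(records):
    """Count how many reads have each length."""
    return Counter(len(seq) for _, seq in records)


def sample_to_fasta(records, n, seed=0):
    """Reservoir-sample up to n reads and return them as FASTA text."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append(rec)
        else:
            # Replace an existing entry with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = rec
    return "".join(f">{name}\n{seq}\n" for name, seq in reservoir)


# Example usage (hypothetical file name):
# with open("unmapped.fastq") as fh:
#     reads = list(read_fastq(fh))
# for length, count in sorted(length_distribution(reads).items()):
#     print(length, count)
# print(sample_to_fasta(reads, 100), end="")
```

If most trimmed reads turn out to be very short, or longer than expected for small RNAs, that points at the trimming step rather than at contamination. The unmapped reads themselves can be written out by bowtie with its `--un` option, and a FASTA sample of a hundred or so of them is usually enough to paste into BLAST and see which organisms the top hits come from.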

