Hello,
In my current job, I am working with perturbation RNA-seq data from cancer biopsies. The sequencing library is prepared using heat lysis and poly(A) tail enrichment to select only mRNA.
Here is my problem: after trimming and assessing sequencing quality (which is good), I see a very low mapping rate on my data (between 10 and 30%). I use kallisto for mapping with a k-mer size of 31. I checked the read length distribution, and most reads are longer than the required 31 nt.
I checked for contamination using fastq_screen on a subset of 2M reads, and most of the reads (>90%) mapped to the human genome.
Do you have any idea where these non-mapping reads come from? Is it possible that my reads are mostly from intronic regions, even with poly(A) purification? That would explain why we see a high mapping rate with bowtie2 in fastq_screen.
Could be genomic DNA contamination. Enrichment is just that, enrichment, not perfect selection without noise. Poor RNA quality usually increases noise. In the end it does not really matter, since in silico magic cannot save library issues. If on-target counts are too low, you have to sequence deeper.
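To put a rough number on "sequence deeper", here is a minimal sketch of the arithmetic. The function name and the 10M on-target read goal are hypothetical, chosen only for illustration; the 10% and 30% rates are the mapping rates reported in the question.

```python
import math

def reads_needed(target_on_target_reads, on_target_fraction):
    """Total reads to sequence so that the expected number of
    on-target (transcriptome-mapping) reads reaches the target."""
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    return math.ceil(target_on_target_reads / on_target_fraction)

# Hypothetical goal: 10M on-target reads per sample.
goal = 10_000_000
for rate in (0.10, 0.30):  # mapping rates reported in the question
    print(f"mapping rate {rate:.0%}: ~{reads_needed(goal, rate):,} total reads")
```

This assumes the on-target fraction stays the same at higher depth, which may not hold if the library has low complexity.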
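Kallisto is a pseudoaligner that works against a transcriptome index, so reads from outside annotated transcripts (introns, intergenic regions) will not map. As ATpoint suggested, it could be DNA contamination. You can confirm this using an aligner and a reference genome.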
Looks like OP has done that, assuming that the samples are human.
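The fastq_screen result shows the reads are human, but not whether they fall in exons, introns, or intergenic space. A minimal sketch of that second check is below, assuming you already have genome alignment coordinates and gene/exon intervals. The coordinates are toy values and all function names are hypothetical; real input would come from a BAM file and a GTF annotation, and a dedicated tool would be faster than this linear scan.

```python
from collections import Counter

def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals [start, end) overlap if each starts before the other ends."""
    return a_start < b_end and b_start < a_end

def classify_read(chrom, start, end, genes, exons):
    """Label one alignment as exonic, intronic, or intergenic."""
    if any(overlaps(start, end, s, e) for s, e in exons.get(chrom, [])):
        return "exonic"
    if any(overlaps(start, end, s, e) for s, e in genes.get(chrom, [])):
        return "intronic"  # inside a gene body but touching no exon
    return "intergenic"

def summarize(reads, genes, exons):
    """Fraction of alignments in each class."""
    labels = ("exonic", "intronic", "intergenic")
    if not reads:
        return {k: 0.0 for k in labels}
    counts = Counter(classify_read(c, s, e, genes, exons) for c, s, e in reads)
    return {k: counts[k] / len(reads) for k in labels}

# Toy annotation: one gene with two exons (made-up coordinates).
genes = {"chr1": [(100, 1000)]}
exons = {"chr1": [(100, 200), (800, 1000)]}

# Toy alignments as (chrom, start, end).
reads = [
    ("chr1", 120, 170),    # inside the first exon
    ("chr1", 400, 450),    # inside the gene, between exons
    ("chr1", 5000, 5050),  # outside any gene
    ("chr1", 190, 240),    # spans an exon boundary, counted as exonic
]

print(summarize(reads, genes, exons))
```

A large intronic plus intergenic fraction would be consistent with genomic DNA contamination, whereas a mostly intronic signal would point more toward unspliced pre-mRNA.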