Hello,
In my current job, I am dealing with perturbation rna-seq data from cancer biopsy. The sequencing library is prepared using heat lysis and polyA tail enrichment to only select mRNA data.
Here is my problem, after trimming and assessing sequencing quality (which is good), I notice a very low mapping ratio on my data (between 10 to 30%). I use kallisto for mapping with kmer=31. I check for my reads size distribution, and they are most of the time over the required 31 n lenght.
I checked for contamination using fastq_screen, and most of the reads (>90%) mapped to human genome on a subset of 2M reads.
Do you have any idea where these non-mapping reads comes from? Is it possible that my reads are mostly from intronic region (even with polyA purification, explaining why we see high mapping ratio with bowtie2 from fastq_screen)?
Could be genomic DNA contamination. Enrichment is just that, enrichment, not perfect selection without noise. Poor RNA quality usually increases noise. In the end it does not really matter, since in silico magic cannot save library issues. If on-target counts are too low you have to sequence deeper.
Kallisto is a pseudoaligner, so, as ATpoint suggested, it could be DNA contamination. You can confirm this using an aligner and a reference genome.
Looks like OP has done that, assuming that the samples are human.
I did contamination analysis, and I found nothing. I also run another aligner, STAR, and I could align "only" 49% of my reads with default parameters. So I am quiet lost. When I blast the unmapped reads, they ofter map to BAC clone of different chromosomes for human genoem (my interest) with very high confidence....
It’s possibly due to ribosomal RNA contaminating — those are regions that are difficult to assemble.
I recommend using bowtie2 or kallisto to map solely against ribosomal RNA and see the percentage of reads that map.
It's worth noting that the k-mer size is different from your insert / read size. The value of k _should_ be smaller than your read size, since the k-mer spectrum is used to identify the gene/transcript a read came from.
If you have a lot of reads mapping to the human _genome_ but not the _transcriptome_, this suggests that you have some kind of DNA contamination, which might come from an issue in the polyA enrichment step, or even reverse transcription.
What reference transcriptome did you use for kallisto? Does it contain decoy sequences? I'm wondering what would happen if you add a set of decoy sequences like the salmon docs suggest and try again. If you do this decoy-aware quantification with the human genome as the set of decoys and you find a much higher percentage of reads successfully mapping, then you likely have some human DNA that has made its way through the RNA protocol.
But your follow-up comment about BLASTing the unmapped reads suggests you might have bacterial contamination in your samples, not just DNA contamination in your RNA samples. It might be worth talking with your local wet lab biologist or lab tech to get some advice on this.
Those aren’t bacteria; they are BAC clones — often used for storing fragments of larger genomes.
There’s no reason to re-run pseudoalignment; the STAR BAM results can be used to infer how many reads originate from exon, intron, or intergenic regions (RSeqc or subread should be able to do this).