Question

Low Mapping Rate with Kallisto on RNA-seq Data

0

Entering edit mode

8 weeks ago

meetmet ▴ 10

Hello,

In my current job, I am dealing with perturbation rna-seq data from cancer biopsy. The sequencing library is prepared using heat lysis and polyA tail enrichment to only select mRNA data.

Here is my problem, after trimming and assessing sequencing quality (which is good), I notice a very low mapping ratio on my data (between 10 to 30%). I use kallisto for mapping with kmer=31. I check for my reads size distribution, and they are most of the time over the required 31 n lenght.

I checked for contamination using fastq_screen, and most of the reads (>90%) mapped to human genome on a subset of 2M reads.

Do you have any idea where these non-mapping reads comes from? Is it possible that my reads are mostly from intronic region (even with polyA purification, explaining why we see high mapping ratio with bowtie2 from fastq_screen)?

RNAseq mapping ratio kallisto low • 12k views

ADD COMMENT • link updated 8 weeks ago by dsull ★ 7.8k • written 8 weeks ago by meetmet ▴ 10

0

Entering edit mode

Could be genomic DNA contamination. Enrichment is just that, enrichment, not perfect selection without noise. Poor RNA quality usually increases noise. In the end it does not really matter, since in silico magic cannot save library issues. If on-target counts are too low you have to sequence deeper.

ADD REPLY • link 8 weeks ago by ATpoint 90k

0

Entering edit mode

Kallisto is a pseudoaligner, so, as ATpoint suggested, it could be DNA contamination. You can confirm this using an aligner and a reference genome.

ADD REPLY • link 8 weeks ago by Buffo ★ 2.5k

0

Entering edit mode

I checked for contamination using fastq_screen, and most of the reads (>90%) mapped to human genome on a subset of 2M reads.

Looks like OP has done that, assuming that the samples are human.

ADD REPLY • link 8 weeks ago by GenoMax 154k

0

Entering edit mode

I did contamination analysis, and I found nothing. I also run another aligner, STAR, and I could align "only" 49% of my reads with default parameters. So I am quiet lost. When I blast the unmapped reads, they ofter map to BAC clone of different chromosomes for human genoem (my interest) with very high confidence....

ADD REPLY • link 8 weeks ago by meetmet ▴ 10

1

Entering edit mode

It’s possibly due to ribosomal RNA contaminating — those are regions that are difficult to assemble.

I recommend using bowtie2 or kallisto to map solely against ribosomal RNA and see the percentage of reads that map.

ADD REPLY • link 8 weeks ago by dsull ★ 7.8k

0

Entering edit mode

I use kallisto for mapping with kmer=31. I check for my reads size distribution, and they are most of the time over the required 31 n lenght.

It's worth noting that the k-mer size is different from your insert / read size. The value of k _should_ be smaller than your read size, since the k-mer spectrum is used to identify the gene/transcript a read came from.

If you have a lot of reads mapping to the human _genome_ but not the _transcriptome_, this suggests that you have some kind of DNA contamination, which might come from an issue in the polyA enrichment step, or even reverse transcription.

What reference transcriptome did you use for kallisto? Does it contain decoy sequences? I'm wondering what would happen if you add a set of decoy sequences like the salmon docs suggest and try again. If you do this decoy-aware quantification with the human genome as the set of decoys and you find a much higher percentage of reads successfully mapping, then you likely have some human DNA that has made its way through the RNA protocol.

But your follow-up comment about BLASTing the unmapped reads suggests you might have bacterial contamination in your samples, not just DNA contamination in your RNA samples. It might be worth talking with your local wet lab biologist or lab tech to get some advice on this.

ADD REPLY • link 8 weeks ago by james.hawley ▴ 100

0

Entering edit mode

Those aren’t bacteria; they are BAC clones — often used for storing fragments of larger genomes.

There’s no reason to re-run pseudoalignment; the STAR BAM results can be used to infer how many reads originate from exon, intron, or intergenic regions (RSeqc or subread should be able to do this).

ADD REPLY • link 8 weeks ago by dsull ★ 7.8k