Discrepancy in number of pseudoaligned reads for read quantification with Kallisto
4.0 years ago

Hello,

I want to quantify the abundance of reads mapping downstream of genes with kallisto. My RNA-seq data contain reads arising from read-through transcription (transcription downstream of transcript 3' ends).

I use two different transcriptome files:

1. A reference transcriptome, containing only the real, genic transcript sequences.
2. A modified transcriptome, containing the exact same sequences plus sequences of downstream regions.

This means that the second, modified transcriptome has the same genic target sequences plus a number of intergenic target sequences.

My problem is:

For some samples, the number of pseudo-aligned reads is higher when I use the non-modified transcriptome, even though the modified transcriptome contains the EXACT same sequences plus a few additional target sequences. I wonder how this is possible: since both de Bruijn graphs contain the same genic target sequences, the number of pseudo-aligned reads with the modified transcriptome should be equal or higher, not lower. I did expect some reads that map to genic target sequences with the non-modified transcriptome to be assigned to intergenic regions instead, as the equivalence class of transcripts for such a read might be extended with intergenic target sequences.

I double-checked that my transcriptome files really contain the same sequences. I would be glad if someone could explain to me how it is possible that some reads cannot be aligned with my modified transcriptome, even though it contains the same target sequences.

Thank you!

Tags: RNA-Seq, kallisto, pseudo-alignment
4.0 years ago
ATpoint 82k

What you are seeing is a consequence of the way k-mer-based pseudoalignment works. A read is k-compatible with a target if all of the mappable k-mers from the read occur in that target. When you add the intergenic sequences, some k-mers that were not originally mappable may become mappable to the new intergenic sequences. Therefore, when kallisto takes the strict intersection over the targets of all mappable k-mers, for some reads there is no target that is k-compatible with all of the mappable k-mers, hence such reads go un-pseudoaligned. This behavior is inherent to the decision rule used in pseudoalignment, and part of what makes it fast. However, one would expect that alignment-based approaches (e.g. STAR, HISAT, etc.) or selective alignment (as used in recent versions of salmon) would not exhibit this behavior.

See for salmon: https://salmon.readthedocs.io/en/latest/salmon.html
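
To make the intersection rule above concrete, here is a toy sketch in Python. It is not kallisto's implementation: the sequences, the target names "gene" and "downstream", the tiny k=4, and the helper functions are all made up for illustration. The read spans the junction between the end of a gene and its downstream region, which is the kind of read you would get from read-through transcription.

```python
def kmers(seq, k):
    """Return all k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(targets, k):
    """Map each k-mer to the set of target names that contain it."""
    index = {}
    for name, seq in targets.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def pseudoalign(read, index, k):
    """Intersect the target sets of every indexed k-mer in the read.

    k-mers absent from the index are skipped. An empty result means the
    read is not pseudoaligned.
    """
    compatible = None
    for kmer in kmers(read, k):
        if kmer not in index:
            continue  # unmappable k-mer: ignored by the decision rule
        hits = index[kmer]
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


K = 4  # toy value; real k-mers are much longer
# Hypothetical targets: the "modified" set only ADDS a downstream region.
reference = {"gene": "AAACCCGGG"}
modified = {"gene": "AAACCCGGG", "downstream": "TTTACAGT"}
# Hypothetical read covering the gene end (CCGGG) and the downstream start (TTTAC).
read = "CCGGGTTTAC"

ref_hits = pseudoalign(read, build_index(reference, K), K)
mod_hits = pseudoalign(read, build_index(modified, K), K)
print(ref_hits)  # {'gene'}
print(mod_hits)  # set()
```

With the reference targets only, the downstream k-mers of the read are unmappable and get skipped, so the read pseudoaligns to "gene". With the modified targets, those same k-mers now map to "downstream" only, the intersection with "gene" is empty, and the read is lost.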

You are right! Thank you very much! :) Somehow I did not think about this.

"there is no target that is k-compatible with all of the mappable k-mers"

Now that more k-mers are mappable, these are no longer ignored by kallisto. The intersection of all k-compatibility classes is therefore no longer guaranteed to be non-empty for reads that previously pseudo-aligned.

I will have a look at Salmon's selective alignment approach!

4.0 years ago
bruce.moran ▴ 960

Reads that align to genic targets with the non-modified transcriptome may align better to the extra regions included in the modified transcriptome, so you get fewer reads aligned at genic targets than with the non-modified transcriptome.

Yes, that's what I also expected: some reads that previously mapped to genic regions with the normal transcriptome might map to intergenic regions when quantification is done with the modified one.

But the total number of pseudo-aligned reads should not be lower when I use the modified transcriptome. Magically, some reads are no longer pseudo-alignable, even though the de Bruijn graph contains the same sequences.

The k-compatibility class of each k-mer of the reads still contains the same transcripts, possibly some more, but not fewer.

Ah, didn't think you meant the total reads, I thought it was comparative.

I agree it is strange: as you say, the targets are identical apart from the extra ones in the modified transcriptome, so you shouldn't really lose any in total.

How are you counting total reads?

When there are multimappers kallisto holds all positions for a read and arbitrarily outputs to one in sleuth, so if you're using a BAM and counting lines this could give you an overestimate in the modified transcriptome.

For the total number of reads, I'm referring to n_pseudoaligned in the run_info.json of each sample.
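
For anyone who wants to reproduce the comparison, a minimal sketch of how the two runs of one sample could be compared. It assumes each kallisto output directory contains a run_info.json with the n_processed and n_pseudoaligned fields; the function names and directory names are made up for illustration.

```python
import json
from pathlib import Path


def read_counts(run_dir):
    """Return (n_processed, n_pseudoaligned) from <run_dir>/run_info.json."""
    info = json.loads((Path(run_dir) / "run_info.json").read_text())
    return info["n_processed"], info["n_pseudoaligned"]


def compare_runs(reference_dir, modified_dir):
    """Pseudoaligned reads in the modified run minus the reference run.

    A negative value means reads were lost when the extra targets were added.
    """
    ref_total, ref_aligned = read_counts(reference_dir)
    mod_total, mod_aligned = read_counts(modified_dir)
    if ref_total != mod_total:
        # Both runs should have seen the same input reads.
        raise ValueError("runs processed different numbers of reads")
    return mod_aligned - ref_aligned
```

For example, compare_runs("sample1_reference", "sample1_modified") with two hypothetical output directories of the same sample.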

"When there are multimappers kallisto holds all positions for a read and arbitrarily outputs to one in sleuth"

As I understood it, for multimappers kallisto assigns the read to one of the compatible transcripts using the EM algorithm. Is it possible that, for some multimapping reads, kallisto won't assign the read to any compatible transcript for some reason?

"As I understood it, for multimappers kallisto assigns the read to one of the compatible transcripts using the EM algorithm. Is it possible that, for some multimapping reads, kallisto won't assign the read to any compatible transcript for some reason?"

I was taking info from the link I provided and meant explicitly for the BAM output.
