Question: Predicted gene differentially expressed in Kallisto, not in FeatureCounts
0
gravatar for Bastien Hervé
20 months ago by
Bastien Hervé4.5k
Limoges, CBRS, France
Bastien Hervé4.5k wrote:

I'm currently trying to output differentially expressed (DE) genes from mm10 RNA-seq data. I did in parallel a genomic approche (Star -> featureCounts -> DESeq2 -> Heatmap) and a transcriptomic approche (Kallisto -> DESeq2 -> Heatmap) to intersect my results.

My results tend to overlap, except a gene which is highly expressed in Kallisto but not in featureCounts.

This gene is a predicted gene, Gm4737. The counts for this gene, after librairies size normalization are :

dds <- DESeq(dds)
dds_norm <- counts(dds, normalized=TRUE)

Approche/Condition    A1    B1    C1    A2    B2    C2    A3    B3    C3
     Kallisto         0     1019  0     0     1416  1226  0     1209  34
   FeatureCounts      306   581   295   230   531   501   138   457   249

My reference genome :

url='ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M16';
axel -q $url/GRCm38.p5.genome.fa.gz;

My reference transcriptome :

url='ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M16';
axel -q $url/gencode.vM16.transcripts.fa.gz;

My annotation file :

url='ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M16';
axel -q $url/gencode.vM16.chr_patch_hapl_scaff.annotation.gtf.gz;

I used the same annotation file in both approche.

Any ideas about the huge counts difference between the two approches ? Does kallisto has problems with predicted genes ?

Thanks !

kallisto featurecounts • 997 views
ADD COMMENTlink modified 20 months ago by h.mon28k • written 20 months ago by Bastien Hervé4.5k
2

It is interesting that you have a specific example but it may not be surprising due to fundamental difference between methodologies, alignment (STAR) and mapping (kallisto) (See A: Alignment and mapping ).

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax74k

You seems to be right, according to my IGV visualization. I checked at my pseudoalignments from kallisto 0.44.0, on Gm4737 positions, for A1 condition, there is 0 read mapped, whereas, for aligned reads (Star), on Gm4737 positions, for A1 condition, I got some reads aligned. And, for pseudoalignments, on Gm4737 positions, for B1 condition, there are a lot of reads mapped, which correlates with my counts table above.

Even if the two approches are different, I am still stuned by the colossal difference of counts that are generated on some genes while others are quite similar...

Thanks for the link !

ADD REPLYlink modified 20 months ago • written 20 months ago by Bastien Hervé4.5k
1

Forgot about this post: Mapping to a transcriptome can incorrectly report reads as mapping uniquely.

ADD REPLYlink written 20 months ago by h.mon28k
1
gravatar for h.mon
20 months ago by
h.mon28k
Brazil
h.mon28k wrote:

I suppose this is a mouse predicted gene, right? If you blast Gm4737 (chr6), there is an almost identical gene, Ahcyl1 (chr3), and an almost identical pseudogene, Gm9826 (chr1). Blast says they are 99% identical over 99% Gm4737 length. This probably leads to several reads being multi-mapped, and featureCounts discards them (unless you used -M), while kallisto uses expectation-maximization to optimally apportion reads to genes.

My initial intuition was featureCounts counts would always be lower than kallisto counts, as featureCounts throws away reads that kallisto doesn't, but this is not exactly what happens. You should check these other genes, and see if the reads mapping on one location are also mapping on the other.

ADD COMMENTlink written 20 months ago by h.mon28k

Nice hint ! So yeah this is a mouse predicted gene and I don't allow multimapping. I used featureCounts in R so I don't have the -M option, but I set this one :

allowMultiOverlap=FALSE


"while kallisto uses expectation-maximization to optimally apportion reads to genes."

So you mean that kallisto can't multi-mapped, it will select the "best" gene for each read ? For featureCounts, with my option, every read multimapped will be discard, like you said Kallisto should have higher counts.

I got the results for identical gene and pseudogene (Is there a way to make a simple table in Biostars ?) :

Approche/Condition    A1      B1      C1      A2      B2      C2      A3      B3      C3
       **Gm4737**
        Kallisto      0       1019    0       0       1416    1226    0       1209    34
   FeatureCounts      306     581     295     230     531     501     138     457     249
       **Gm9826**
        Kallisto      0       0       1       0       1       3       0       0       0
   FeatureCounts      54      70      68      54      81      88      42      56      33
       **Ahcyl1**
        Kallisto      3608    3379    3543    3390    3712    3801    3422    3749    2782
   FeatureCounts      3767    3441    3729    3488    3534    3793    3195    3216    2876

My opinion is that Gm4737 is overexpressed because Ahcyl1 and Gm4737 are too similar to be specificaly identified. In both approche some reads are affected to Gm4737 (and Gm9826).

What's bothering me is, for kallisto approche, in condition A1, I got full reads on Ahcyl1 and for condition B1, I got 1/4 on Gm4737 and 3/4 on Ahcyl1. What's happended here ? Why suddenly for condition B kallisto selected Gm4737 over Ahcyl1 ?

ADD REPLYlink modified 20 months ago • written 20 months ago by Bastien Hervé4.5k
1

So you mean that kallisto can't multi-mapped, it will select the "best" gene for each read ?

No, it will selected the best distribution of ambiguously mapped reads based on the distribution of unambiguously mapped reads - that is, it may add fractions of counts to several transcripts / genes. That is the reason you have counts like "ENSMUST00000059524.6 : 1105.28" for kallisto, but not for featureCounts.

ADD REPLYlink written 20 months ago by h.mon28k

Oh, by exemple let say that a read can map on transcript1 and transcript2.

With 100 unambiguously mapped reads.

transcript1 has 75 unambiguously mapped reads and transcript2 has 25 unambiguously mapped reads.

So kallisto will split the read fairly according to the unambiguously mapped reads, like 0.75 to transcript1 and 0.25 to transcript2 ?

ADD REPLYlink written 20 months ago by Bastien Hervé4.5k

Do you get the same answer if you re-run kallisto for cond B?

ADD REPLYlink written 20 months ago by genomax74k

I forgot to mention that all the counts above are from DESeq2 counts, so after the sizeFactor normalization. That is not the raw counts.

I retrieve the raw counts for condition B1 below. I re-run kallisto on B1 and I got the same results.

Condition B1

Kallisto

Gm4737

  • ENSMUST00000059524.6 : 1105.28

Gm9826

  • ENSMUST00000029124.7 : 0

Ahcyl1

  • ENSMUST00000029490.14 : 3335.94

  • ENSMUST00000153623.1 : 48.3669

  • ENSMUST00000138091.7 : 401.973

  • ENSMUST00000153530.1 : 560.809

  • ENSMUST00000138116.1 : 58.8604

  • ENSMUST00000144864.7 : 4.40848

  • ENSMUST00000137583.1 : 15.4503

  • ENSMUST00000151935.1 : 58.0011

FeatureCounts

Gm4737

  • ENSMUSG00000048087.6 : 666

Gm9826

  • ENSMUSG00000048538.7 : 81

Ahcyl1

  • ENSMUSG00000027893.14 : 3942

Hope this will help !

ADD REPLYlink modified 20 months ago • written 20 months ago by Bastien Hervé4.5k

Do you know how to deal with this kind of pseudogenes in kallisto ? Gm4737 counts in featureCounts are broadly equal through my conditions. i was expected more or less the same thing using kallisto, even if (like said above) these are different methodologies. What would you do in this case ? "Silence" pseudogenes, keep them in DE analysis...?

ADD REPLYlink written 20 months ago by Bastien Hervé4.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1742 users visited in the last hour