Question: Why does Salmon produce different quantification results compared with featureCounts for lncRNA genes?
biock wrote (29 days ago):

Note: The original title was "Why Salmon produces very different quantification results compared with featureCounts for lncRNA genes?" I later found that if I run salmon against all transcripts, rather than against only protein-coding or only lncRNA transcripts, the correlation between featureCounts and salmon becomes higher.


Hello, recently I analyzed about 20 RNA-seq samples. I adopted two approaches to quantify the expression level of genes.

  1. STAR mapping -> featureCounts (using only uniquely mapping reads)
  2. salmon quantification -> summarize isoform-level expression to gene level with tximport

I compared the quantification results from the two methods and calculated the correlation for coding genes and lncRNA genes separately. The table shows the correlation for each sample (only 5 samples are listed; NOTE: salmon was run separately on the coding-transcript FASTA and the lncRNA FASTA):

ddhyaS.png

The quantification results of salmon and featureCounts correlate very well for coding genes, but for lncRNA genes the correlation between them is extremely low.
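For anyone who wants to reproduce this kind of comparison, here is a minimal Python sketch of a per-sample Pearson correlation between two gene-level tables. It is only an illustration: the function names, the dictionary layout, and the toy numbers are all made up, and the post does not say whether raw or log-transformed values were compared, so the log transform is an optional assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return sxy / math.sqrt(sxx * syy)

def per_sample_correlation(table_a, table_b, genes, log_transform=False):
    """Correlate two {sample: {gene_id: value}} tables sample by sample.

    Only genes listed in `genes` are compared (e.g. only lncRNA genes).
    A gene missing from a table is treated as 0 (an assumption of this sketch).
    """
    result = {}
    for sample in table_a:
        if sample not in table_b:
            continue
        a = [float(table_a[sample].get(g, 0.0)) for g in genes]
        b = [float(table_b[sample].get(g, 0.0)) for g in genes]
        if log_transform:
            # log2(x + 1) dampens the influence of a few highly expressed genes
            a = [math.log2(v + 1.0) for v in a]
            b = [math.log2(v + 1.0) for v in b]
        result[sample] = pearson(a, b)
    return result

# Hypothetical toy data, not values from this thread.
salmon_gene = {
    "sample1": {"g1": 10.0, "g2": 20.0, "g3": 30.0},
    "sample2": {"g1": 5.0, "g2": 50.0, "g3": 1.0},
}
fc_gene = {
    "sample1": {"g1": 20, "g2": 40, "g3": 60},
    "sample2": {"g1": 30, "g2": 45, "g3": 2},
}
genes = ["g1", "g2", "g3"]

for sample, r in per_sample_correlation(salmon_gene, fc_gene, genes).items():
    print(sample, round(r, 3))
```

Running it once on the coding gene IDs and once on the lncRNA gene IDs gives the two columns of a table like the one above.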

Table Update: As mentioned, I quantified lncRNA and protein-coding genes using salmon, but I may have used inappropriate transcript FASTA files: protein-coding genes and lncRNA genes were quantified separately, with gencode.v34lift37.pc_transcripts.fa and gencode.v34lift37.lncRNA_transcripts.fa, respectively. If I use all transcripts (gencode.v34lift37.transcripts.fa), the results become quite different:

d2GrBV.png

Though the Pearson correlation for lncRNA genes is still lower than that for protein-coding genes, the gap is no longer so large.


Sorry for the simple question, but did you perhaps use different GTF file versions for the two runs?

written 28 days ago by caggtaagtat

Thank you for reminding me of this. I think the reason is that I quantified protein-coding transcripts and lncRNA transcripts separately. If I run salmon with all transcripts, the difference between coding and lncRNA genes becomes smaller.

written 27 days ago by biock

jordi.planells wrote (28 days ago):

As far as I know, salmon takes multimappers into account; maybe that is where the difference comes from. Could you align with STAR allowing multimappers, quantify with featureCounts using the -M flag, and compute the correlation again? I am very curious how this will behave.


This is almost certainly it: salmon was designed to be smart about handling reads that have an ambiguous feature assignment; featureCounts was not. Also, mapping uniquely to the genome is not the same as mapping uniquely to one and only one gene feature. So including reads that multimap might not make much of a difference; the issue is likely that features overlap.
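To make the contrast concrete, here is a toy Python sketch. It is not salmon's actual model (which works at transcript level and accounts for effective length and several biases); it is just a bare expectation-maximization loop that splits ambiguous reads in proportion to the current abundance estimates, next to a counter that simply drops ambiguous reads. All names and numbers are hypothetical.

```python
def count_unambiguous(reads, features):
    """Count only reads compatible with exactly one feature; drop the rest."""
    counts = {f: 0.0 for f in features}
    for compatible in reads:
        if len(compatible) == 1:
            (feature,) = compatible
            counts[feature] += 1.0
    return counts

def em_counts(reads, features, n_iter=100):
    """Toy EM: share each ambiguous read according to current abundances.

    Simplifying assumptions: all features have equal length, there is no
    bias model, and every read is compatible with at least one feature.
    """
    theta = {f: 1.0 / len(features) for f in features}  # uniform start
    counts = {f: 0.0 for f in features}
    for _ in range(n_iter):
        # E-step: distribute each read over its compatible features
        counts = {f: 0.0 for f in features}
        for compatible in reads:
            total = sum(theta[f] for f in compatible)
            if total == 0:
                continue
            for f in compatible:
                counts[f] += theta[f] / total
        # M-step: re-estimate relative abundances from the expected counts
        n = sum(counts.values())
        if n == 0:
            break
        theta = {f: counts[f] / n for f in features}
    return counts

features = ["coding", "lncRNA"]
reads = (
    [frozenset({"coding"})] * 8                # unambiguous coding reads
    + [frozenset({"lncRNA"})] * 2              # unambiguous lncRNA reads
    + [frozenset({"coding", "lncRNA"})] * 10   # reads in the overlap
)

print(count_unambiguous(reads, features))
print(em_counts(reads, features))
```

In this toy setup the unambiguous-only counter keeps 10 of the 20 reads, while the EM estimate shares the 10 overlapping reads 4:1 in favour of the coding gene. A gene that sits mostly inside an overlap can therefore end up with very different values under the two schemes.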

written 28 days ago by swbarnes28.1k

Thank you. I ran featureCounts as you advised, but the correlation didn't change much compared with the updated quantification.

d2hmgV.png

written 27 days ago by biock

Have you realigned allowing multimappers? Running featureCounts on the same BAM file (from which you have filtered out the multimappers) won't make any difference. I have used featureCounts with and without the -M flag, and I definitely see a difference in the quantification. I suggest also using the --fraction flag together with the -O flag: -O assigns a read to all the features it overlaps, and --fraction counts each such read as a fraction instead of a full count.
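As a rough illustration of what these overlap options do, here is a small Python sketch with made-up gene coordinates on a single chromosome. It only mimics the counting rules as described above (ambiguous reads dropped by default, counted for every overlapping gene with -O, and split as 1/n with -O --fraction). It is not featureCounts code, it ignores strand and multi-mapping (-M), and the function and variable names are hypothetical.

```python
def overlaps(read, feature):
    """True if two half-open intervals (start, end) share at least one base."""
    return read[0] < feature[1] and feature[0] < read[1]

def assign_reads(reads, genes, allow_overlap=False, fraction=False):
    """Toy gene-level counting.

    allow_overlap mimics -O: a read overlapping several genes counts for each.
    fraction mimics --fraction used with -O: such a read adds 1/n to each gene.
    In this sketch `fraction` has no effect unless allow_overlap is True.
    """
    counts = {name: 0.0 for name in genes}
    for read in reads:
        hits = [name for name, interval in genes.items() if overlaps(read, interval)]
        if not hits:
            continue                      # read falls outside every gene
        if len(hits) == 1:
            counts[hits[0]] += 1.0
        elif allow_overlap:
            weight = 1.0 / len(hits) if fraction else 1.0
            for name in hits:
                counts[name] += weight
        # otherwise the read is ambiguous and is not counted
    return counts

# Hypothetical annotation: a lncRNA whose locus overlaps the end of a coding gene.
genes = {"coding": (100, 500), "lncRNA": (400, 800)}
reads = [(120, 170), (420, 470), (450, 500), (600, 650)]

default_counts = assign_reads(reads, genes)
overlap_counts = assign_reads(reads, genes, allow_overlap=True)
fraction_counts = assign_reads(reads, genes, allow_overlap=True, fraction=True)
print(default_counts, overlap_counts, fraction_counts)
```

Two of the four toy reads fall in the shared region, so the three modes give 1, 3, and 2 counts per gene respectively, even though every read maps to a single place in the genome.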

Additionally, if you are working with human data, you could give this tool a look. Disclaimer: I haven't used it personally because I don't work with human samples.

Hope it helps!!

written 26 days ago by jordi.planells

caggtaagtat wrote (28 days ago):

Apparently, it is not uncommon for lncRNA transcripts to overlap with coding genes. Maybe differences in how salmon and featureCounts handle reads that overlap two transcripts led to this discrepancy.
