Question

Why does Salmon produce different quantification results compared with featureCounts for lncRNA genes?

0

3.8 years ago

biock ▴ 60

Note: The original title was "Why Salmon produces very different quantification results compared with featureCounts for lncRNA genes?" I later found that if I run salmon with all transcripts, rather than with only the protein-coding or only the lncRNA transcripts, the correlation between featureCounts and salmon becomes higher.

Hello, recently I analyzed about 20 RNA-seq samples. I adopted two approaches to quantify the expression level of genes.

1. STAR mapping -> featureCounts (uniquely mapped reads only)
2. salmon quantification -> isoform-level expression summarized to gene level with tximport

I compared the quantification results from the two methods and calculated the correlation for coding genes and lncRNA genes separately. The table shows the correlation for each sample (only 5 samples are listed; NOTE: salmon quantified against the coding-transcript FASTA and the lncRNA FASTA separately):

The quantification results of salmon and featureCounts correlate very well for coding genes, but for lncRNA genes the correlation is extremely low.

Update: I mentioned that I quantified lncRNA and protein-coding genes using salmon, but I may have used inappropriate transcript FASTA files: the protein-coding genes and the lncRNA genes were quantified separately, with gencode.v34lift37.pc_transcripts.fa and gencode.v34lift37.lncRNA_transcripts.fa respectively. If I use all transcripts (gencode.v34lift37.transcripts.fa), the results become quite different:

Though the Pearson correlation for lncRNA genes is still lower than that for protein-coding genes, the gap is no longer so large.
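For readers who want to reproduce this kind of comparison, here is a minimal sketch of a per-biotype Pearson correlation between two gene-level count tables. The gene IDs and counts are made-up toy values, not data from this thread, and the log2(count + 1) transform is an assumption; the post does not say whether raw or transformed values were correlated.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(vx * vy)

def correlation_by_gene_type(counts_a, counts_b, gene_type):
    """Correlate log2(count + 1) between two methods, separately per gene type."""
    result = {}
    for biotype in sorted(set(gene_type.values())):
        genes = [g for g in counts_a
                 if g in counts_b and gene_type.get(g) == biotype]
        xs = [math.log2(counts_a[g] + 1) for g in genes]
        ys = [math.log2(counts_b[g] + 1) for g in genes]
        result[biotype] = pearson(xs, ys)
    return result

# Toy values for one sample (hypothetical gene IDs and counts).
featurecounts = {"PC1": 100, "PC2": 2000, "PC3": 50,
                 "LNC1": 5, "LNC2": 40, "LNC3": 0}
salmon = {"PC1": 110.0, "PC2": 1900.0, "PC3": 55.0,
          "LNC1": 30.0, "LNC2": 12.0, "LNC3": 9.0}
gene_type = {"PC1": "protein_coding", "PC2": "protein_coding",
             "PC3": "protein_coding", "LNC1": "lncRNA",
             "LNC2": "lncRNA", "LNC3": "lncRNA"}

for biotype, r in correlation_by_gene_type(featurecounts, salmon, gene_type).items():
    print(biotype, round(r, 3))
```

In a real analysis the two dictionaries would come from the featureCounts output and the tximport gene-level summary, restricted to genes present in both.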

RNA-Seq next-gen • 4.4k views

ADD COMMENT • link 3.8 years ago by biock ▴ 60

0

Sorry for the simple question, but did you maybe use different GTF file versions for the two runs?

ADD REPLY • link 3.8 years ago by caggtaagtat ★ 1.9k

2

Thank you for reminding me of this. I think the reason is that I quantified against protein-coding transcripts and lncRNA transcripts separately. If I run salmon with all transcripts, the difference between coding and lncRNA genes becomes smaller.

ADD REPLY • link 3.8 years ago by biock ▴ 60

score 2 · Answer 1 · 2020-07-10

2

3.8 years ago

jordi.planells ▴ 480

As far as I know, salmon takes multimappers into account, so maybe that is where the difference comes from. Could you align with STAR allowing multimappers, quantify with featureCounts using the -M flag, and compute the correlation again? I am very curious about how this will behave.

ADD COMMENT • link 3.8 years ago by jordi.planells ▴ 480

0

This is almost certainly it: salmon was designed to be smart about handling reads with an ambiguous feature assignment; featureCounts was not. And mapping uniquely to a genome is not the same as mapping uniquely to one and only one gene feature. So including multimapping reads might not make much of a difference; the issue is likely that features overlap.
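The distinction between "uniquely mapped" and "uniquely assigned" can be illustrated with a toy sketch. The gene names and coordinates below are hypothetical, and the overlap test is deliberately simplified: it works on whole-gene intervals and ignores exons and strand, which a real counter would take into account.

```python
def overlapping_genes(read, genes):
    """Return the names of genes whose interval overlaps the read.

    Coordinates are 1-based and inclusive; strand and exon structure
    are ignored in this simplified sketch.
    """
    chrom, start, end = read
    return [name for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start]

# Hypothetical annotation: a lncRNA located inside a coding gene.
genes = {
    "CODING_A": ("chr1", 1000, 5000),
    "LNC_B": ("chr1", 4000, 4800),
}

# Each read has exactly one genomic alignment, so all are "uniquely mapped".
reads = {
    "read1": ("chr1", 1200, 1300),
    "read2": ("chr1", 4100, 4200),
    "read3": ("chr1", 6000, 6100),
}

for name, read in reads.items():
    hits = overlapping_genes(read, genes)
    if len(hits) == 1:
        status = "assigned"
    elif hits:
        status = "ambiguous"
    else:
        status = "no feature"
    print(name, hits, status)
```

Here read2 aligns to a single genomic position yet overlaps both genes, which is the kind of read a default featureCounts run leaves uncounted.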

ADD REPLY • link 3.8 years ago by swbarnes2 14k

0

Thank you. I ran featureCounts as you advised, but the correlation didn't change much compared with the updated quantification.

ADD REPLY • link 3.8 years ago by biock ▴ 60

0

Have you realigned allowing multimappers? Running featureCounts on the same BAM file (from which the multimappers have been filtered out) won't make any difference. I have used featureCounts with and without the -M flag, and I definitely see a difference in the quantification. I also suggest using the --fraction flag together with the -O flag: -O assigns a read to every feature it overlaps, and --fraction counts such a read fractionally instead of as a full count for each feature.
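To make the effect of these options concrete, here is a toy sketch of the counting idea. It is not featureCounts itself: `count_reads` and its parameters are hypothetical names, and multi-mapping reads (the -M case) are not modeled, only reads that overlap more than one feature.

```python
from collections import defaultdict

def count_reads(read_hits, allow_overlap=False, fraction=False):
    """Toy gene-level counting.

    read_hits: one list per read, holding the genes that read overlaps.
    allow_overlap: mimics the idea behind -O (count the read for every
        overlapping gene).
    fraction: mimics the idea behind --fraction used with -O (each of the
        n overlapping genes receives 1/n instead of 1).
    """
    counts = defaultdict(float)
    for hits in read_hits:
        if not hits:
            continue  # read overlaps no feature
        if len(hits) == 1:
            counts[hits[0]] += 1.0
        elif allow_overlap:
            weight = 1.0 / len(hits) if fraction else 1.0
            for gene in hits:
                counts[gene] += weight
        # otherwise the ambiguous read is left uncounted
    return dict(counts)

# One read unique to gene A, two reads overlapping both A and B, one read in no gene.
read_hits = [["A"], ["A", "B"], ["A", "B"], []]

print(count_reads(read_hits))
print(count_reads(read_hits, allow_overlap=True))
print(count_reads(read_hits, allow_overlap=True, fraction=True))
```

In the default mode gene B gets no counts at all even though two reads overlap it, which is how a lncRNA sitting inside a coding gene can lose most of its signal.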

Additionally, if you are working with human data, you could give this tool a look. Disclaimer: I haven't used it personally because I'm not working with human samples.

Hope it helps!!

ADD REPLY • link 3.8 years ago by jordi.planells ▴ 480

score 0 · Answer 2 · 2020-07-10

0

3.8 years ago

caggtaagtat ★ 1.9k

Apparently, it is not uncommon for lncRNA transcripts to overlap with coding genes. Maybe salmon and featureCounts use different settings for counting reads that overlap two transcripts, and this led to the difference.

ADD COMMENT • link 3.8 years ago by caggtaagtat ★ 1.9k