How to get read counts on transcript level using featurecounts?
3
1
Entering edit mode
2.4 years ago
newbie ▴ 100

I have used featureCounts to get the read counts one a gene level with the following command:

featureCounts -a gencode.gtf -F GTF -p -s 2 -T 8 -o sample.txt sample.sorted.bam


I'm interested in extracting the counts on a transcript level. My data is paired end and reverse stranded. I'm a bit confused with command I have to use to get the counts on transcript level featureCounts manual

Can anyone please tell me how to get the counts on transcript level? thanq

RNA-Seq featurecounts geneexpression • 6.2k views
16
Entering edit mode
2.4 years ago
Gordon Smyth ★ 4.7k

I am one of the featureCounts authors.

Do you want do conduct analyses for genes, or do you need individual results for all the isoform variations of each gene? The "isoforms" tend to be called "transcripts" in the RNA-seq literature. The featureCounts help page you link to is designed to generate gene-level counts. We have never recommended the use of featureCounts to produce transcript-level counts.

A gene level analysis is easiest to interpret and is appropriate for most analyses where you want to interpret the results in terms of pathways, GO terms, KEGG and so on. featureCounts will give you gene level counts very quickly and easily. Alternatively, featureCounts can give you counts for each exon in each gene.

If you do really want to quantify all the different transcripts (isoforms) for each gene, then this cannot be done by featureCounts or by any other read counting software. While featureCounts can be run with transcript annotation, the results won't be particularly useful. Since the transcripts are heavily overlapping, with many transcripts sharing the same exons, featureCounts cannot tell which transcript to assign each read to. It therefore has to either discard all the multi-assigned reads or alternatively assign them multiple times to all the overlapping transcripts -- and either choice is bad.

If you really do want the (considerable) extra complexity of transcript-level analysis, then I would run Salmon and read the results into edgeR using catchSalmon. Salmon can assign reads to transcripts on a probabilistic basis. Or alternatively, do an exon-level analysis. That allows you see what is going on, although the results are not assembled into transcripts. Exon-level may be a safer approach if your data has expressed transcripts that have not been previously annotated.

1
Entering edit mode

The estimation of transcript-level counts is necessary to obtain accurate gene-level abundance estimates, so it should be done even if only interested in genes and not transcripts per se. See, e.g.

3
Entering edit mode
2.4 years ago
Asaf 8.9k

You'll probably want to use RSEM or stringTie

0
Entering edit mode

you mean I can't get transcript level read counts with featureCounts?

1
Entering edit mode

If I read their paper right then no, you can't. You'll have to use methods that makes the best guess for a transcript abundance. Take a look at RSEM paper

0
Entering edit mode

small question. I have the sorted.bam files from Hisat2 alignment. Can I use those bams for RSEM to calculate transcript expression?

1
Entering edit mode

You will have needed to align to the transcriptome, which one can certainly do with hisat2 but is rather non-standard. You will also have needed to heavily tweak the settings to prevent splicing and aligning with an InDel, since RSEM can't handle these. Also, you'd need to have Hisat2 produce more alignments than you'll get by default. In other words, while it might be theoretically possible to use RSEM have Hisat2, it'd require a fair bit of work and it'd be easier to just start over from the fastq files.

If you'd like to use RSEM, follow the instructions there for using STAR upstream.

0
Entering edit mode
2
Entering edit mode
2.4 years ago
ATpoint 62k

I vote for quantification of reads against a transcriptome with salmon, followed by analysis e.g. with swish. Salmon is part of the extended "ecosystem" that is maintained by the labs involving Rob Patro (salmon developer) and Mike Love (tximport, DESeq2, FishPond/swish developer/PI), so I guess it is most streamlined to use this one. In any case tools like salmon that use pseudoalignment generally outperform traditional alignment when it comres to RNA-seq quantification accuracy, here is another very recent atricle benchmarking this. https://academic.oup.com/gigascience/article/8/12/giz145/5663671

Here is the salmon paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5600148/

0
Entering edit mode

The paper linked to concludes that:

"Because both Kallisto and Salmon performed highly concordantly with the ground truth, they were also highly concordant with each other, as previously reported [34,35]. They cluster together in the method similarity matrix before clustering with the ground truth (Fig. 4). However, Kallisto was faster than Salmon, used less memory (Fig. 7), and performed better at sample- and gene-level comparison when examining Spearman’s correlation, Euclidean distance, mean squared error, and adjusted R2 values as compared with the ground truth, especially for the 2 reverse-stranded datasets (Additional File 7)."

0
Entering edit mode

Hey Lior, what is the 'ground truth'?

0
Entering edit mode

Your question is answered in detail in the paper (by the way, to be clear, it is not my paper and I had nothing to do with it).

0
Entering edit mode

Thanks - I will read it on my next travel itinerary