Convert absolute counts into TPMs: merge exons or average gene length?
3.1 years ago
cyntsc10 • 0

Hi everyone

I am trying to convert absolute counts into TPMs, and I am wondering whether a tool already exists that does this, taking as input an annotation file (GFF/GFF3) and the feature of interest. Does this exist? I browsed for a while and did not find anything practical.

I'm also wondering whether, in your experience, it is better to merge the exons of each gene and use the resulting length, or just to take the average length for each gene?

Thanks,

Cynthia

RNA-Seq gene • 1.0k views
3.1 years ago
Mensur Dlakic ★ 27k

I don't think you need a tool to convert absolute counts to TPM, as the formula is simple. If you want a tool that takes the reads and GFF files and goes straight to TPM, you may want to try kallisto.
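For reference, here is a minimal sketch of that formula in Python: divide each count by the feature length in kilobases, then rescale so the values sum to one million. The function name and the toy numbers are made up for illustration and do not come from any package.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw counts to TPM.

    counts:     dict mapping gene ID -> read count
    lengths_bp: dict mapping gene ID -> feature length in base pairs
    """
    # Step 1: reads per kilobase (RPK) for each gene.
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    # Step 2: per-million scaling factor computed from the RPK values.
    scale = sum(rpk.values()) / 1e6
    if scale == 0:
        # No reads at all: every TPM is zero.
        return {gene: 0.0 for gene in counts}
    # Step 3: divide each RPK by the scaling factor.
    return {gene: value / scale for gene, value in rpk.items()}


# Toy input (hypothetical genes and numbers).
counts = {"geneA": 100, "geneB": 300}
lengths = {"geneA": 1000, "geneB": 3000}
tpm = counts_to_tpm(counts, lengths)
```

Both toy genes have the same read density (100 reads per kb), so they end up with equal TPMs even though geneB has three times as many reads.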


Hi @Mensur, I know there is software that outputs absolute counts, FPKM or TPM values; perhaps I did not give enough information in my question. I estimated gene expression by quantifying reads aligned to the genome, so I used HTSeq-count, and this tool does not output TPMs. What I want now is to normalize my count files to TPMs. For that I need the gene length, as the formula requires, so I built a vector of lengths from the annotation for my feature of interest (CDS). The issue is that some genes have several variants, so I need a criterion to keep one length per gene. So which of these variants should I choose to extract the length from? The first match? The average length across variants? Something else? And, most important for me to be clear on: why?

Thanks in advance...


Well, therein lies your problem: You only have gene-level counts.

You can try to get gene-level FPKMs (and then convert to TPMs) via something like: https://rdrr.io/bioc/DESeq2/man/fpkm.html but, without transcript-level quantification, the "gene length" is a rough estimate, at best (as discussed in that link).

This is why I (and many others) strongly prefer transcript-level quantification (e.g. STAR->RSEM, kallisto, etc.) -- which also, by the way, are more accurate in handling ambiguous alignments. From transcript-level estimates, you can get gene-level TPMs and packages like tximport can even give you an estimate of gene-level counts. But when you have only gene-level counts, you lose the transcript-level information that would enable you to figure out "gene length".
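As a rough illustration of the "transcript-level to gene-level" step: because TPM is already length-normalized, a gene-level TPM can be obtained by summing the TPMs of the gene's transcripts. The sketch below assumes you already have transcript TPMs and a transcript-to-gene map; the IDs and values are hypothetical.

```python
from collections import defaultdict


def gene_level_tpm(transcript_tpm, tx2gene):
    """Sum transcript-level TPMs into gene-level TPMs.

    transcript_tpm: dict mapping transcript ID -> TPM
    tx2gene:        dict mapping transcript ID -> gene ID
    """
    gene_tpm = defaultdict(float)
    for transcript, tpm in transcript_tpm.items():
        gene_tpm[tx2gene[transcript]] += tpm
    return dict(gene_tpm)


# Hypothetical transcript-level estimates and transcript-to-gene map.
transcript_tpm = {"G1.1": 10.0, "G1.2": 30.0, "G2.1": 5.0}
tx2gene = {"G1.1": "G1", "G1.2": "G1", "G2.1": "G2"}
gene_tpm = gene_level_tpm(transcript_tpm, tx2gene)
```

Nothing is lost by quantifying per transcript first: the gene total is recovered by the sum.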

Thought experiment: A gene has a short isoform (1000 bp) and a long isoform (20000 bp). Which one should you pick as your gene length? You don't know, because you don't know whether and to what extent each isoform is expressed.
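To put numbers on the thought experiment, one common way to define a per-sample gene length is the average of the isoform lengths weighted by their relative abundance. The sketch below uses the two isoform lengths from above; the abundance fractions are invented purely for illustration.

```python
def weighted_gene_length(isoform_lengths, isoform_fractions):
    """Average isoform length weighted by relative isoform abundance.

    isoform_fractions are assumed to sum to 1 for the gene.
    """
    return sum(
        length * fraction
        for length, fraction in zip(isoform_lengths, isoform_fractions)
    )


isoform_lengths = [1000, 20000]  # short and long isoform, in bp

# Hypothetical sample 1: mostly the short isoform is expressed.
mostly_short = weighted_gene_length(isoform_lengths, [0.9, 0.1])

# Hypothetical sample 2: mostly the long isoform is expressed.
mostly_long = weighted_gene_length(isoform_lengths, [0.1, 0.9])
```

The two scenarios give lengths of roughly 2900 bp and 18100 bp, so the same gene-level count would be divided by quite different lengths depending on isoform usage, which gene-level counts alone cannot tell you.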


Hi again @dsull

Reflecting a bit on this, I first want to let you know that I totally agree with you: like many others, I prefer quantification at the transcript level for its multiple benefits. But let's say that my study is exploratory and I want to get information about the collective response for a biological trait x across independent datasets. As it is a meta-analysis, the detail provided by the gene variants at the transcript level could dilute the signal of interest because the counts will be split among the transcripts. In addition, my study organism has a well-annotated genome (Arabidopsis). Thus, I think that for an exploratory meta-analysis it is somewhat useful to keep the estimates at the gene level.

So my question is how to convert HTSeq counts into FPKM or TPM. Both need the gene length, so the question is really how to extract these gene lengths in an effective way from a GFF or GTF file.
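For the "merge the exons" option, a minimal sketch could look like the following. It assumes a GTF file where each exon line carries a `gene_id "..."` attribute and where all exons of a gene lie on the same chromosome. GFF3 files link exons to transcripts through `Parent=` and would need an extra mapping step that is not shown here. The function name and the example lines are hypothetical.

```python
import re
from collections import defaultdict

# Assumes GTF-style attributes such as: gene_id "AT1G01010";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def merged_exon_lengths(gtf_lines, feature="exon"):
    """Return {gene_id: length in bp of the union of its features}."""
    intervals = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        # GTF coordinates are 1-based and inclusive.
        intervals[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene, spans in intervals.items():
        spans.sort()
        total = 0
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:
                # Overlapping or adjacent: extend the current block.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene] = total
    return lengths


# Hypothetical gene with two transcripts sharing part of an exon.
example = [
    'Chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";',
    'Chr1\tsrc\texon\t150\t300\t.\t+\t.\tgene_id "G1"; transcript_id "G1.2";',
    'Chr1\tsrc\texon\t500\t600\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";',
]
lengths = merged_exon_lengths(example)
```

Passing `feature="CDS"` would apply the same merging to CDS records instead of exons. For a real file, pass an open file handle in place of the list.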

Digging deeper, I found two posts that may be useful. I am going to test them and will let you know my experience later. I leave the links here in case someone else is stuck on the same issue: "A: Normalization Of Rna Sequencing Counts (By Ercc / Gene Length)" and "Convert HTSeq count table to RPKM value using GFF/GTF".

I appreciate your advice. I'll keep reading if there is anything else. Cheers.


Thanks for your response. However, everything I said still stands about loss of GENE-LEVEL accuracy when you discard transcript-level information, and your comment about dilution of signal because the counts will be split among transcripts is not correct. You get more accurate GENE-LEVEL estimates by summarizing transcript-level quantifications.

That said, regardless, please comment on what you decide to do and post your experience.
