Question

Estimated gene counts with tximport

1


2.3 years ago

bioinfo ▴ 150

Hello,

I am analyzing bulk RNA-seq data and used kallisto to pseudoalign my reads to the transcriptome. I then used tximport to assign Ensembl gene names to my counts. Comparing my current results with data that were run 4 years ago, I noticed that the older estimated gene counts table has ~50,000 genes, while the new one has about half that number. Is it possible to see which version of the gene annotation I am using? Could the difference in the total number of genes be due to an update of the Ensembl dataset I am using?

I access the Ensembl dataset with the code below:

mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl", host = "uswest.ensembl.org", ensemblRedirect = FALSE)

I also noticed that the estimated gene counts from 4 years ago contain thousands of gene names similar to AC253536.2 (they all start with AC), but the version I am using now does not output any gene names like this. Does anyone know why those were removed?

Thank you

RNA-seq ensembl tximport kallisto • 926 views

updated 22 months ago by Ram 43k • written 2.3 years ago by bioinfo ▴ 150

score 2 · Answer 1 · 2022-01-05

2


2.3 years ago

Ben_Ensembl ★ 2.4k

Ensembl retired clone-based gene names at the beginning of last year. More information can be found in the following blog post: https://www.ensembl.info/2021/03/15/retirement-of-clone-based-gene-names/
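To gauge how much of the older table consists of such names, a rough count can help. The sketch below is only an illustration: the regular expression (one or two capital letters, digits, a dot, and a version number) is an assumption based on the example AC253536.2 in the question, and `is_clone_based` and `old_names` are hypothetical names, not part of tximport or biomaRt.

```r
# Flag gene names that look like clone-based identifiers (accession.version).
# ASSUMPTION: the pattern is inferred from names such as "AC253536.2" and may
# miss other clone-based naming styles or match unrelated names.
is_clone_based <- function(gene_names) {
  grepl("^[A-Z]{1,2}[0-9]+\\.[0-9]+$", gene_names)
}

# Hypothetical example names; in practice use the row names of the old table.
old_names <- c("AC253536.2", "TP53", "AL627309.1", "BRCA1", "MT-CO1")

# Tabulate how many names match the clone-based pattern.
print(table(clone_based = is_clone_based(old_names)))
```

If a large share of the rows in the old table match, the retirement of these names would plausibly account for part of the difference in gene numbers.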


score 1 · Answer 2 · 2022-01-05

The only way to find out which version of Ensembl you used for the quantification with kallisto is to know the source of the transcript reference FASTA file that you used for your analysis.
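The annotation used on the biomaRt side (the query that supplies the gene names) can be checked separately from the kallisto reference. The following is a minimal sketch, assuming the Bioconductor package biomaRt is installed and the Ensembl servers are reachable; the wrapper names `show_ensembl_releases` and `connect_to_release` and the release number 105 are hypothetical choices for illustration.

```r
# Sketch only: needs the biomaRt package and network access when called.

# List the Ensembl releases that biomaRt knows about, with dates and archive
# URLs; one row is flagged as the current release.
show_ensembl_releases <- function() {
  biomaRt::listEnsemblArchives()
}

# Connect to a fixed Ensembl release so repeated analyses use the same gene set.
# ASSUMPTION: `release = 105` is an arbitrary example, not taken from the post.
connect_to_release <- function(release = 105) {
  biomaRt::useEnsembl(
    biomart = "genes",
    dataset = "hsapiens_gene_ensembl",
    version = release
  )
}
```

Calling `show_ensembl_releases()` shows which release a default connection currently points to, and `mart <- connect_to_release(100)` would pin the query to an older release. Pinning the release in the script also records which annotation was used, which makes later comparisons easier.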

However, if your old analysis had ~50k genes, and now you have ~20k genes, it seems likely that the old analysis used a gene set that included both coding and non-coding genes, and your new analysis used only coding genes.
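One way to test this is to tabulate gene biotypes from the header lines of the transcript FASTA file used to build the index. Ensembl distributes cDNA and ncRNA transcript files separately, and their headers carry `gene:` and `gene_biotype:` fields. The sketch below assumes that header layout; `count_gene_biotypes` is a hypothetical helper, and the example headers use made-up identifiers.

```r
# Count distinct genes per biotype from Ensembl-style FASTA header lines.
# ASSUMPTION: headers contain " gene:<id>" and " gene_biotype:<type>" fields;
# lines without both fields are ignored.
count_gene_biotypes <- function(headers) {
  headers <- headers[startsWith(headers, ">")]
  has_fields <- grepl(" gene:", headers, fixed = TRUE) &
    grepl(" gene_biotype:", headers, fixed = TRUE)
  headers <- headers[has_fields]
  gene <- sub(".* gene:([^ ]+).*", "\\1", headers)
  biotype <- sub(".* gene_biotype:([^ ]+).*", "\\1", headers)
  # Several transcripts can belong to one gene, so count each gene once.
  genes <- unique(data.frame(gene = gene, biotype = biotype))
  table(genes$biotype)
}

# Made-up headers for illustration (two transcripts of one coding gene, one lncRNA).
example_headers <- c(
  ">ENST00000000001.1 cdna chromosome:GRCh38:1:100:200:1 gene:ENSG00000000001.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:GENEA",
  ">ENST00000000002.1 cdna chromosome:GRCh38:1:100:200:1 gene:ENSG00000000001.1 gene_biotype:protein_coding transcript_biotype:retained_intron gene_symbol:GENEA",
  ">ENST00000000003.1 ncrna chromosome:GRCh38:2:300:400:-1 gene:ENSG00000000002.1 gene_biotype:lncRNA transcript_biotype:lncRNA"
)
biotype_counts <- count_gene_biotypes(example_headers)
print(biotype_counts)
```

For a real reference, the headers could be read with something like `grep("^>", readLines(gzfile("transcripts.fa.gz")), value = TRUE)`, where the file name is a placeholder. If the new reference shows few or no non-coding biotypes while the old one has many, that would support the explanation above.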