The best way to get gene lengths for 15K+ genes?
0
0
Entering edit mode
20 months ago
JACKY ▴ 140

Say I have a dataframe of RNA-seq counts, with more than 15,000 genes (ensemble ID). I want to normalize this data from counts to TPM. To do so, I need to get the gene lengths.

I have tried multiple ways, the first is this, which seems to me kind of slumply, and also gave some worrying warnings:

exons = exonsBy(EnsDb.Hsapiens.v86, by="gene")
exons = reduce(exons)
len = sum(width(exons))
insect = intersect(rownames(counts_df),names(len))
geneLengths =  len[insect]
counts_df = counts_df[insect,]

When I perform the normalization, I get a list of warnings:

Warning messages:
1: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
2: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
3: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
4: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
5: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length

The second way does not seem to work for such a big number of genes:

ensembl_list <- rownames(counts_df)
getGeneLengthAndGCContent(ensembl_list, "hsa")

It crashed:

Connecting to BioMart ...
Downloading sequences ...
This may take a few minutes ...
Error in curl::curl_fetch_memory(url, handle = handle) :                                                                 
  Timeout was reached: [www.ensembl.org:80] Operation timed out after 300000 milliseconds with 88120121 bytes received

How do I get trust worthy gene lengths of counts RNA-seq data?

r EDAseq RNA-seq • 2.5k views
ADD COMMENT
2
Entering edit mode

You should first ensure that calculating TPMs like this is valid for your analysis. Proper TPMs generally need to be calculated from transcript abundances generated from software such as Salmon. This is because using one gene length value ignores that isoforms of various lengths can be expressed by the gene, and the abundance of these isoforms can change between conditions.

ADD REPLY
0
Entering edit mode

What do you think of this code:

  rpkm <- apply(X = subset(counts),
                MARGIN = 2,
                FUN = function(x) {
                  10^9 * x / genelength / sum(as.numeric(x))
                })

  TPM <- apply(rpkm, 2, function(x) x / sum(as.numeric(x)) * 10^6) %>% as.data.frame()

I don't think salmon would fit to my research question.

ADD REPLY
0
Entering edit mode

From my previous experiences in retrieving data in bulk from biomaRt/Ensembl - it is always better to partition your data into batches before retrieving data. So I would rather chop those 15000 genes in batches of 500 and retrieve data per batch. This way you could also put your code to sleep if you overshoot your hourly request quota (in R with Sys.sleep('time in seconds')) Also, I prefer using biomaRt's getBM() function to retrieve such kind of data, in your case it would look something like this:

library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")
gene_lengths =  getBM(attributes=c('ensembl_gene_id','transcript_length','cds_length'), filters =  'ensembl_gene_id', values =c(*gene ids go here*), mart = ensembl)
ADD REPLY
0
Entering edit mode

Hello manaswwm , only now I noticed that you replied. The gene length is still an issue, I'm very skeptical about the gene lengths I got, so I tried the code you provided for a dataset that I have. Originaly the dataset came with ENTREZ IDs, I switched to ensemble and about 20,000 genes were left. Your code produced a very long list of genes with two columns, transcript_length and cds_length, and some genes accure more than once. The list is very long.. 175293 rows, and I don't have this much genes.

Am I using it wrong?

My dataset does not have duplicated genes..

ADD REPLY
0
Entering edit mode

Very probably what you have now is the following: every row is a specific transcript that belongs to a particular gene, hence you have one gene displayed on multiple rows. You are not able to see this as the table that you got back from biomaRt does not have the transcript id, but it is my fault, as I did not include the transcript id within the code snippet so here is the code that you could try instead:

gene_lengths =  getBM(attributes=c('ensembl_gene_id','ensembl_transcript_id', 'transcript_length','cds_length'), filters =  'ensembl_gene_id', values = c(*gene ids go here*), mart = ensembl, useCache = FALSE)

The output table will still be 175293 rows long, but now you will see the transcript per gene column. Here you will need to decide on a "representative" transcript per gene to get the "representative" gene length - which could mean "transcript_length" if you do not care about the UTRs when calculating the gene length. One of the many ways of doing this is checking for the canonical_transcript per gene, an alternative is to check for the MANE_transcript per gene (more on both here).

For example, if you decide on choosing the canonical transcript then you could ask biomaRt to also mark the canonical transcript within the table with this code:

gene_canonical_transcript =  getBM(attributes=c('ensembl_gene_id','ensembl_transcript_id','transcript_is_canonical'), filters =  'ensembl_gene_id', values =c(*list of genes go here*), mart = ensembl, useCache = FALSE)

Again, you will get 175293 rows, but you could subset the rows to contain only the canonical transcripts and their corresponding genes. One of the ways to do is as follows:

gene_canonical_transcript_subset = gene_canonical_transcript[!is.na(gene_canonical_transcript$transcript_is_canonical),]

You can use this gene-->transcript table to subset your original table (that contained all the transcripts) to get a gene-specific length. One way to do this is as follows:

gene_lengths = merge(gene_canonical_transcript_subset, gene_lengths, by = c("ensembl_gene_id", "ensembl_transcript_id"))
ADD REPLY
0
Entering edit mode

First of all I would like to thank for your elaborated answer. I really appreciate it.

Let me show you what each code line produces.. in order:

enter image description here

enter image description here

enter image description here

and the fourth line produced an error : Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column

And this is the genelengths that I prodcued, the ones that I'm skeptical about, I would be glad to hear what you think of this:

enter image description here

ADD REPLY
0
Entering edit mode

That is a bit weird, when I run this code on my machine I get something like this after running the fourth line which I imagine is something that you were looking for (having a single "gene length" per gene): biomart image

Can you paste the column names that you have for gene_canonical_transcript_subset and gene_lengths? I see in the fourth figure the only column name is genelength1 which is technically not a column name of any of the two data frames. Could you also paste the exact line of code that you used to get the last figure?

Optionally, to avoid R altogether, you could also use the web interface of biomart hosted here -https://www.ensembl.org/biomart/martview. Select the same attributes and filters from the code above, upload your list of gene IDs (in batches of 500) and you should have the same results

ADD REPLY
0
Entering edit mode

The column names of gene_lengths are :

[1] "ensembl_gene_id"       "ensembl_transcript_id" "transcript_length"     "cds_length"  

and for gene_canonical_transcript_subset:

[1] "ensembl_gene_id"         "ensembl_transcript_id"   "transcript_is_canonical"

I produced the genelengths1 (shown in the last figure) like this:

exons = exonsBy(EnsDb.Hsapiens.v86, by="gene")
len = sum(width(exons))
INDEX = intersect(rownames(dataset1),names(len))
geneLengths1 = len[INDEX ]
counts_data1 = counts_data1[INDEX ,]
ADD REPLY
0
Entering edit mode
  1. It is still a bit unclear to me why this merge command does not work for you, but maybe I am overlooking some small typo:

gene_lengths = merge(gene_canonical_transcript_subset, gene_lengths, by = c("ensembl_gene_id", "ensembl_transcript_id"))

  1. I see, so the genelength1 was produced by you. What exactly are you sceptical about in the dataframe? Are there duplicating genes here too? If yes, then it very likely means that you have multiple transcripts per gene. I have not used ensembldb package before, so it is also not clear to me what genelength1 here is, I tried cross-referencing the first gene with the entry page for that gene in Ensembl and I could not find the number 6979.

Accessing biomart through the weblink that I shared could be may be useful?

ADD REPLY
0
Entering edit mode

manaswwm So it worked, turns out I'm the one who overlooked a small typo, apologies. This is the result. Looking at the transcript_length column, as you can see it is very different from the gene lengths that i originaly have. Which should I use?

Perhaps my doubts were right afterall and the gene lengths that I have are not accurate?

enter image description here

ADD REPLY
0
Entering edit mode

It is tough to say which one you should use as I am not sure what you have in the genelength1 data frame. What you have from biomaRt are the lengths extracted directly from Ensembl, which I would personally prefer. Maybe you could manually check a couple of genes as an example, go to their corresponding Ensembl pages and check which of the two data frames have the correct length listed against the gene (or transcript for the data frame from biomaRt).

ADD REPLY
0
Entering edit mode

manaswwm I appreciate your help sir, thank you so much. I will run the analysis using the new values, and see how the results will change. Maybe I'll update you if there are any good news :)

ADD REPLY

Login before adding your answer.

Traffic: 2559 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6