How can I get the gene length from a list of Ensembl IDs using biomaRt, for instance (or any R-based method that does not require downloading a separate annotation file first)?
Since a gene_length attribute does not exist in biomaRt, is there a better alternative than using the start_position and end_position attributes and then subtracting the two values, as follows:
library(biomaRt)
ensembl_list <- c("ENSG00000000003","ENSG00000000419","ENSG00000000457","ENSG00000000460")
human <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
# Look up start and end coordinates for each gene ID (filter name: "ensembl_gene_id")
start_pos <- getLDS(attributes = "ensembl_gene_id", filters = "ensembl_gene_id", values = ensembl_list, mart = human, attributesL = "start_position", martL = human, uniqueRows = TRUE)
end_pos <- getLDS(attributes = "ensembl_gene_id", filters = "ensembl_gene_id", values = ensembl_list, mart = human, attributesL = "end_position", martL = human, uniqueRows = TRUE)
gene_L <- merge(start_pos, end_pos, by = "Gene.stable.ID")
# Genomic span, introns included; Ensembl coordinates are 1-based and inclusive, hence the + 1
gene_L$Length <- gene_L$Gene.end..bp. - gene_L$Gene.start..bp. + 1
end_position - start_position is potentially wrong due to splicing. Introns should probably not get counted, although you did not explain your application.
You are right. I have a numeric gene expression matrix (in CPM) that I want to convert into FPKM. That's why I was looking for a way to get gene length. So do you think taking introns into account would matter here?
Absolutely.
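To illustrate why the length definition matters here, below is a minimal base R sketch of the CPM to FPKM conversion, assuming the CPM values were computed from the same library sizes you would use for FPKM. The matrix, the gene names, the lengths, and the helper cpm_to_fpkm are all hypothetical and only for illustration. Since FPKM = CPM / (length in kb), any introns included in the length directly shrink the result.

```r
# Hypothetical toy data: CPM matrix (genes x samples) and gene lengths in bp.
cpm <- matrix(c(10, 20, 5, 40), nrow = 2,
              dimnames = list(c("geneA", "geneB"), c("s1", "s2")))
gene_length_bp <- c(geneB = 500, geneA = 2000)  # deliberately in a different order

# Hypothetical helper: convert CPM to FPKM given per-gene lengths in bp.
cpm_to_fpkm <- function(cpm, length_bp) {
  # Align the lengths to the matrix rows by gene ID before dividing.
  length_bp <- length_bp[rownames(cpm)]
  stopifnot(!anyNA(length_bp))
  # FPKM = CPM / (length in kb); the vector is recycled down each column,
  # so every row is divided by its own gene length.
  cpm / (length_bp / 1000)
}

fpkm <- cpm_to_fpkm(cpm, gene_length_bp)
print(fpkm)
```

With these toy numbers, geneA (2 kb) is halved and geneB (0.5 kb) is doubled relative to its CPM values.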
The EDASeq getGeneLengthAndGCContent indeed takes exons (see line 109 of the code here). Because the CPM matrix was generated with HTSeq-count, I think I should use EDASeq and skip the introns, no? Just to stay consistent with the same method.
My understanding of a gene is represented by this picture: https://upload.wikimedia.org/wikipedia/commons/5/54/Gene_structure_eukaryote_2_annotated.svg, especially the DNA part. I guess the OP's requirement is the total length of exons (all possible exons). This is different from gene length (IMO). Gene length from NCBI/Ensembl at least covers all known transcripts of the gene. The gene length calculated by EDASeq doesn't make sense to me, especially calling it gene length. So please take whichever is suitable for your analysis. Code is provided for either case.
EDASeq is overkill for this task if you are only interested in gene lengths; here are a few alternatives:
https://bioinformatics.stackexchange.com/questions/4942/finding-gene-length-using-ensembl-id
See especially the answer that uses the GenomicFeatures library.
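As a rough illustration of the two length definitions discussed in this thread, here is a small sketch on made-up exon coordinates using GenomicRanges (a Bioconductor package, assumed to be installed). The exon table and gene names are hypothetical; with a real annotation you would start from a per-gene exon list such as the one returned by exonsBy() in GenomicFeatures. Overlapping exons from different transcripts are merged with reduce() so that no base is counted twice.

```r
library(GenomicRanges)

# Hypothetical exons: geneA has two overlapping exons plus a third one,
# geneB has a single exon.
exons <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = c(100, 150, 400, 1000),
                   end   = c(200, 300, 500, 1100)),
  gene_id = c("geneA", "geneA", "geneA", "geneB")
)

# Group the exons by gene (GRangesList, one element per gene).
exons_by_gene <- split(exons, exons$gene_id)

# Union exon length: merge overlapping exons, then sum their widths per gene.
exonic_length <- sum(width(reduce(exons_by_gene)))

# Genomic span: first exon start to last exon end, introns included.
genomic_span <- sum(width(range(exons_by_gene)))

print(exonic_length)  # geneA = 302, geneB = 101
print(genomic_span)   # geneA = 401, geneB = 101
```

For geneA the span (401 bp) is larger than the union of exons (302 bp) because the 99 bp intron is included, which is exactly the difference that matters when the lengths feed into FPKM.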