Question: Get gene length with R
1
gravatar for user31888
2.4 years ago by
user3188890
United States
user3188890 wrote:

How to get the gene length from a list of Ensembl IDs using biomaRt for instance (or any R-based method without having to download a separate annotation file first)?

Since gene_length attribute does not exist in biomaRt, is there a better alternative than using start_position and end_position attributes, then substracting the 2 values like as follows:

library(biomaRt)
ensembl_list <- c("ENSG00000000003","ENSG00000000419","ENSG00000000457","ENSG00000000460")
human <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
start_pos = getLDS(attributes = "ensembl_gene_id", filters = "ensembl_gene_idl", values = ensembl_list , mart = human, attributesL = "start_position", martL = human, uniqueRows=T)
end_pos = getLDS(attributes = "ensembl_gene_id", filters = "ensembl_gene_idl", values = ensembl_list , mart = human, attributesL = "end_position", martL = human, uniqueRows=T)
gene_L <- merge(start_pos, end_pos, by.x="Gene.stable.ID", by.y="Gene.stable.ID")
gene_L$Length <- gene_L$Gene.end..bp. - gene_L$Gene.start..bp.
R • 7.5k views
ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by user3188890

end_position - start_position is potentially wrong due to splicing. Introns should probably not get counted, although you did not explain your application.

ADD REPLYlink written 2.4 years ago by WouterDeCoster44k

You are right. I have a numeric gene expression matrix (in CPM) that I want to convert into FPKM. That's why I was looking for a way to get gene length. So do you think taking introns into account would matter here?

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by user3188890

Absolutely.

ADD REPLYlink written 2.4 years ago by WouterDeCoster44k

The EDASeq getGeneLengthAndGCContent indeed takes exons (see line 109 of the code here). Because the CPM matrix was generated with HTSeq-count, I think I should use EDASeq and skip the intron, no? Just to fit with the same method.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by user3188890

My understanding of gene is represented by this pic: https://upload.wikimedia.org/wikipedia/commons/5/54/Gene_structure_eukaryote_2_annotated.svg. esp DNA part. I guess OP requirement is total length of exons (all possible exons). This is different from gene length (IMO). Gene length from NCBI/Ensembl atleast cover all known transcripts (for gene). Gene length calculated by EDAseq doesn't make sense to me esp calling it gene length. So please take whatever is suitable for analysis. Code is provided for either case.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by cpad011214k
2
gravatar for cpad0112
2.4 years ago by
cpad011214k
India
cpad011214k wrote:
library (EDASeq)
ensembl_list <- c("ENSG00000000003","ENSG00000000419","ENSG00000000457","ENSG00000000460")
getGeneLengthAndGCContent(ensembl_list, "hsa")

Connecting to BioMart ...
Downloading sequences ...
                length        gc
ENSG00000000003   4535 0.3995590
ENSG00000000419   1207 0.4159072
ENSG00000000457   6883 0.4117391
ENSG00000000460   5967 0.4298643

@OP: I think code is unnecessarily complex (IMO) for the logic in OP and it can be shortened further by:

> library(biomaRt)
> ensembl_list <- c("ENSG00000000003","ENSG00000000419","ENSG00000000457","ENSG00000000460")
> human <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
> gene_coords=getBM(attributes=c("hgnc_symbol","ensembl_gene_id", "start_position","end_position"), filters="ensembl_gene_id", values=ensembl_list, mart=human)
> gene_coords$size=gene_coords$end_position - gene_coords$start_position
> gene_coords

  hgnc_symbol ensembl_gene_id start_position end_position   size
1      TSPAN6 ENSG00000000003      100627109    100639991  12882
2        DPM1 ENSG00000000419       50934867     50958555  23688
3       SCYL3 ENSG00000000457      169849631    169894267  44636
4    C1orf112 ENSG00000000460      169662007    169854080 192073

However these number are not matching with those from EDASeq. I checked the coordinates of gene TSPAN6 ( ENSG00000000003) on NCBI. In NCBI gene (7105) start and stop respectively are: 100627108 and 100637104. Start coordinate on NCBI is off by 1 base from biomart results and stop coordinate is farther away (on chromosome x) compared to end position on NCBI. When viewed (the same gene) in IGV (hg38), gene start and stop positions are: chrX:100,625,108-100,639,104. In general, for a given gene, IGV display coordinates are 2kb up and down stream of gene start and end. Thus hg38 coordinates (from IGV) and GRCh 38 match (quick way). EDAseq probably is doing sum of exon lengths and calculating GC content based on that.

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by cpad011214k

Since I am looking for a way to convert CPM to FPKM (the gene length would be scaled to 10^6), few bases difference between EDASeq and NCBI/IGV may not be a big change.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by user3188890
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1312 users visited in the last hour