Question

mapping gene type or biotype to ENSEMBL ID


3.3 years ago

basuanubhav ▴ 140

Hi all,

I have a list of Ensembl IDs of human lncRNAs for which I am trying to determine the gene type (or biotype), e.g. lincRNA, processed transcript, antisense, sense_overlapping, etc. I have a GTF file (GENCODE v30) that contains the gene_id and gene_type attributes in the 9th column. I could write a script to map my IDs using this GTF file, but I was wondering whether there is an easier way to do it with online tools. I tried biomaRt, but the current Ensembl release collapses the various types of lncRNA into a single type, i.e. lncRNA. I really want to get the subtype for each lncRNA.

P.S. The last GENCODE version with the lncRNA types 'split up' is v30.

Thanks in advance :)

Annotation ensembl Org.Hs.eg.db biomaRt • 1.7k views

updated 3.3 years ago by rpolicastro 13k • written 3.3 years ago by basuanubhav ▴ 140


Since you have not provided any example IDs, I can't check, but I suggest taking a look at RNAcentral.

3.3 years ago by GenoMax 141k


Thanks a lot!! I'll check it out :)

3.3 years ago by basuanubhav ▴ 140

score 1 · Answer 1 · 2021-01-05

I'll use the latest GENCODE GTF for humans (release 36) as an example.

curl ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.annotation.gtf.gz | \
gunzip > human_gencode_36.gtf

You can import the GTF file into R using rtracklayer::import and keep only the columns you need.

library("tidyverse")
library("rtracklayer")

# Read the GTF, then keep one row per gene with its ID, biotype, and name.
gtf <- import("human_gencode_36.gtf") %>%
  as_tibble() %>%
  distinct(gene_id, gene_type, gene_name)

> gtf
# A tibble: 60,660 x 3
   gene_id           gene_type                          gene_name  
   <chr>             <chr>                              <chr>      
 1 ENSG00000223972.5 transcribed_unprocessed_pseudogene DDX11L1    
 2 ENSG00000227232.5 unprocessed_pseudogene             WASH7P     
 3 ENSG00000278267.1 miRNA                              MIR6859-1  
 4 ENSG00000243485.5 lncRNA                             MIR1302-2HG
 5 ENSG00000284332.1 miRNA                              MIR1302-2  
 6 ENSG00000237613.2 lncRNA                             FAM138A    
 7 ENSG00000268020.3 unprocessed_pseudogene             OR4G4P     
 8 ENSG00000240361.2 transcribed_unprocessed_pseudogene OR4G11P    
 9 ENSG00000186092.6 protein_coding                     OR4F5      
10 ENSG00000238009.6 lncRNA                             AL627309.1 
# … with 60,650 more rows
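
If R is not available, the same gene-level table can be pulled straight out of the GTF on the command line. The following is only a minimal sketch: the function name gtf_gene_table is hypothetical, and it assumes GENCODE-style attributes (gene_id, gene_type, gene_name) with double-quoted values in the 9th column. GTF files from Ensembl itself name the biotype attribute gene_biotype instead, so the key would need adjusting there.

```shell
# Hypothetical helper (not from the original answer): print one
# gene_id <TAB> gene_type <TAB> gene_name row per gene record in a GTF.
gtf_gene_table() {
  awk -F '\t' '
    # Return the quoted value of attribute `key` from the 9th GTF column.
    function attr(s, key,    n, parts, i, kv) {
      n = split(s, parts, ";")
      for (i = 1; i <= n; i++) {
        kv = parts[i]
        sub(/^ +/, "", kv)
        if (index(kv, key " ") == 1) {
          sub(/^[^"]*"/, "", kv)   # drop everything up to the opening quote
          sub(/".*$/, "", kv)      # drop the closing quote and anything after
          return kv
        }
      }
      return "NA"
    }
    # Skip header comments; keep only gene-level records.
    !/^#/ && $3 == "gene" {
      print attr($9, "gene_id") "\t" attr($9, "gene_type") "\t" attr($9, "gene_name")
    }
  ' "$@"
}
```

Something like `gtf_gene_table human_gencode_36.gtf > gene_types.tsv` would then write the table to a file. Pointing it at a v30 GTF instead should return whatever gene_type values that release records.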

Let's say that you have a vector of gene_ids that you want to get the information for.

genes <- sample(gtf$gene_id, 5)

> genes
[1] "ENSG00000287105.1"  "ENSG00000254060.1"  "ENSG00000271538.6" 
[4] "ENSG00000148399.13" "ENSG00000234648.1"

You can simply filter the imported data using this vector.

> filter(gtf, gene_id %in% genes)
# A tibble: 5 x 3
  gene_id            gene_type            gene_name 
  <chr>              <chr>                <chr>     
1 ENSG00000271538.6  lncRNA               LINC02427 
2 ENSG00000287105.1  lncRNA               AC090577.1
3 ENSG00000254060.1  lncRNA               AC022778.1
4 ENSG00000148399.13 protein_coding       DPH7      
5 ENSG00000234648.1  processed_pseudogene AL162151.2
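
One caveat with exact matching: GENCODE gene_ids carry a version suffix (e.g. ".6"), so a list of unversioned Ensembl IDs, or IDs taken from a different release, may not match. A rough command-line sketch of a version-insensitive filter follows. It assumes the gene table has been saved as a tab-separated file with gene_id in the first column (for example with readr::write_tsv), and the function name filter_by_gene_id and both file names are hypothetical.

```shell
# Hypothetical helper: keep rows of a tab-separated gene table (gene_id in
# column 1) whose ID, ignoring any ".N" version suffix, appears in an ID file.
# Usage: filter_by_gene_id ids.txt gene_types.tsv
filter_by_gene_id() {
  awk -F '\t' '
    # Strip a trailing version such as ".5" from an Ensembl ID.
    function base(id) { sub(/\.[0-9]+$/, "", id); return id }
    # First file: remember every requested ID without its version.
    NR == FNR { if ($1 != "") wanted[base($1)] = 1; next }
    # Second file: print rows whose unversioned ID was requested.
    { id = base($1); if (id in wanted) print }
  ' "$1" "$2"
}
```

The ID file is expected to hold one ID per line and must not be empty, since the NR == FNR idiom relies on the first file having at least one line. IDs with extra suffixes such as _PAR_Y are not handled by this sketch.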