mapping gene type or biotype to ENSEMBL ID
5 months ago
basuanubhav ▴ 70

Hi all,

I have a list of Ensembl IDs of human lncRNAs for which I am trying to figure out the gene type (or biotype), e.g. lincRNA, processed transcript, antisense, sense_overlapping, etc. I have a GTF file (GENCODE v30) that contains the gene_id and gene_type attributes in the 9th column. I could write a script to map my IDs using this GTF file, but I was wondering whether there is an easier way to do it using online tools. I tried biomaRt, but the current Ensembl release collapses the various types of lncRNAs into a single type, i.e. lncRNA. I really want to get the subtype for each lncRNA.

P.S. The last GENCODE version with the lncRNA type 'split up' is the v30.

Thanks in advance :)

Annotation ensembl biomaRt • 266 views

Since you have not provided any example IDs I can't check, but I suggest taking a look at RNAcentral.


Thanks a lot!! I'll check it out :)

5 months ago

I'll use the latest GENCODE GTF for humans (v36) as an example; the same steps work with the v30 file.

curl | \
gunzip > human_gencode_36.gtf

You can import the GTF file into R using rtracklayer::import and keep only the data from the GTF you want.


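library(rtracklayer)  # provides import()
library(dplyr)        # provides %>%, as_tibble(), distinct(), filter()

# Read the GTF and keep one row per gene with its ID, type, and name.
gtf <- import("human_gencode_36.gtf") %>%
  as_tibble %>%
  distinct(gene_id, gene_type, gene_name)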

> gtf
# A tibble: 60,660 x 3
   gene_id           gene_type                          gene_name  
   <chr>             <chr>                              <chr>      
 1 ENSG00000223972.5 transcribed_unprocessed_pseudogene DDX11L1    
 2 ENSG00000227232.5 unprocessed_pseudogene             WASH7P     
 3 ENSG00000278267.1 miRNA                              MIR6859-1  
 4 ENSG00000243485.5 lncRNA                             MIR1302-2HG
 5 ENSG00000284332.1 miRNA                              MIR1302-2  
 6 ENSG00000237613.2 lncRNA                             FAM138A    
 7 ENSG00000268020.3 unprocessed_pseudogene             OR4G4P     
 8 ENSG00000240361.2 transcribed_unprocessed_pseudogene OR4G11P    
 9 ENSG00000186092.6 protein_coding                     OR4F5      
10 ENSG00000238009.6 lncRNA                             AL627309.1 
# … with 60,650 more rows
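
If rtracklayer is not available, the gene_id and gene_type values can also be pulled straight from the attribute column (the 9th column of the GTF) with a regular expression. This is a minimal base R sketch, assuming the usual GENCODE `key "value";` layout. `get_attr` is a hypothetical helper, and the two attribute strings are toy examples built from rows in the table above, not lines read from the real file.

```r
# Toy attribute strings in the GENCODE style (values taken from the table above).
attrs <- c(
  'gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1";',
  'gene_id "ENSG00000243485.5"; gene_type "lncRNA"; gene_name "MIR1302-2HG";'
)

# Hypothetical helper: return the quoted value that follows `key`,
# or NA when the key is absent from the attribute string.
get_attr <- function(x, key) {
  pattern <- paste0('.*\\b', key, ' "([^"]*)".*')
  ifelse(grepl(pattern, x, perl = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         NA_character_)
}

parsed <- data.frame(
  gene_id   = get_attr(attrs, "gene_id"),
  gene_type = get_attr(attrs, "gene_type"),
  gene_name = get_attr(attrs, "gene_name"),
  stringsAsFactors = FALSE
)
print(parsed)
```

On a real file, the attribute strings would come from the 9th column of the rows whose feature type (3rd column) is "gene".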

Let's say that you have a vector of gene_ids that you wanted to get the information for.

genes <- sample(gtf$gene_id, 5)

> genes
[1] "ENSG00000287105.1"  "ENSG00000254060.1"  "ENSG00000271538.6" 
[4] "ENSG00000148399.13" "ENSG00000234648.1"

You can simply filter the imported data using this vector.

> filter(gtf, gene_id %in% genes)
# A tibble: 5 x 3
  gene_id            gene_type            gene_name 
  <chr>              <chr>                <chr>     
1 ENSG00000271538.6  lncRNA               LINC02427 
2 ENSG00000287105.1  lncRNA               AC090577.1
3 ENSG00000254060.1  lncRNA               AC022778.1
4 ENSG00000148399.13 protein_coding       DPH7      
5 ENSG00000234648.1  processed_pseudogene AL162151.2
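
One thing to watch for: GENCODE gene IDs carry a version suffix (the ".5" in ENSG00000223972.5), while ID lists from other sources often do not, in which case the filter above returns nothing. A minimal base R sketch of one way to handle this, assuming your IDs are unversioned. The small data frame stands in for the imported table, and `my_ids` and `gene_id_base` are hypothetical names.

```r
# Toy annotation table; rows copied from the gene table printed earlier.
gtf <- data.frame(
  gene_id   = c("ENSG00000243485.5", "ENSG00000186092.6", "ENSG00000238009.6"),
  gene_type = c("lncRNA", "protein_coding", "lncRNA"),
  gene_name = c("MIR1302-2HG", "OR4F5", "AL627309.1"),
  stringsAsFactors = FALSE
)

# Hypothetical query list without version suffixes.
my_ids <- c("ENSG00000238009", "ENSG00000243485")

# Drop the version suffix (everything from the first dot onward).
gtf$gene_id_base <- sub("\\..*$", "", gtf$gene_id)

# Keep the rows whose unversioned ID is in the query list.
hits <- gtf[gtf$gene_id_base %in% my_ids,
            c("gene_id_base", "gene_type", "gene_name")]
print(hits)
```

The rows come back in the order of the annotation table, not the order of `my_ids`.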

Ah, thanks a lot for the prompt and clear answer!! Actually, after posting the question I tried the rtracklayer::import function and saved the GTF as a data frame in which one column was the gene type. After that (since I'm not very confident with dplyr), I just ran a loop over my Ensembl IDs and used the 'match' function to get the corresponding gene type from the data frame. So I think we approached it the same way, but I am sure the data frame can be manipulated better with dplyr to get more fine-tuned information.

So, thanks again for the answer :)
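
For reference, the 'match' approach does not need a loop, since match() is vectorized. This is a minimal base R sketch rather than the exact code I ran: the data frame is a toy stand-in built from rows shown in the answer, and "ENSG00000000000.0" is a made-up ID included only to show what happens when an ID is missing from the annotation.

```r
# Toy stand-in for the data frame obtained from the GTF.
gtf_df <- data.frame(
  gene_id   = c("ENSG00000271538.6", "ENSG00000148399.13", "ENSG00000234648.1"),
  gene_type = c("lncRNA", "protein_coding", "processed_pseudogene"),
  stringsAsFactors = FALSE
)

# Query IDs; the second one is a made-up ID that is not in the table.
my_ids <- c("ENSG00000234648.1", "ENSG00000000000.0", "ENSG00000271538.6")

# match() returns, for each query ID, its row position in the table (NA if absent).
idx <- match(my_ids, gtf_df$gene_id)

# Indexing with those positions keeps the query order and yields NA for missing IDs.
result <- data.frame(
  gene_id   = my_ids,
  gene_type = gtf_df$gene_type[idx],
  stringsAsFactors = FALSE
)
print(result)
```

Unlike filtering with %in%, this keeps one row per query ID in the original order, which makes unmatched IDs easy to spot.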

