Question

How to create tx2gene object with gene names from annotation

4.2 years ago

poecile.pal ▴ 50

Hello all,

I would like to have a tx2gene object with 3 columns, where the third column holds gene names such as PLA2G4A. The annotation I am using looks like this:

##description: evidence-based annotation of the human genome (GRCh38), version 32 (Ensembl 98)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2019-09-05
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "lncRNA"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

First, I created a tx2gene object with 2 columns:

> txdb <- makeTxDbFromGFF(file="gencode.v32.annotation.gtf")
> saveDb(x=txdb, file = "gencode.v32.annotation.TxDb")
> k <- keys(txdb, keytype = "TXNAME")
> tx2gene <- select(txdb, k, "GENEID", "TXNAME") 
> head(tx2gene)
             TXNAME            GENEID
1 ENST00000456328.2 ENSG00000223972.5
2 ENST00000450305.2 ENSG00000223972.5
3 ENST00000473358.1 ENSG00000243485.5
4 ENST00000469289.1 ENSG00000243485.5
5 ENST00000607096.1 ENSG00000284332.1
6 ENST00000606857.1 ENSG00000268020.3

But how do I add the third column with gene names? There are 2 problems: 1) The main problem: I can't find gene names in the output of columns(txdb):

> columns(txdb)
 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSPHASE"   "CDSSTART"  
 [7] "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"   "EXONRANK"  
[13] "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"    "TXEND"      "TXID"      
[19] "TXNAME"     "TXSTART"    "TXSTRAND"   "TXTYPE"

2) Even if they were in txdb, how can I add a third column?

Thank you very much!

Best regards, Poecile

RNA-Seq R software error • 7.3k views

updated 4.2 years ago by Papyrus ★ 2.9k • written 4.2 years ago by poecile.pal ▴ 50

score 0 · Answer 1 · 2020-02-18

You can try the org.Hs.eg.db package, which has "SYMBOL" (gene names), "ENSEMBL" (gene IDs), and "ENSEMBLTRANS" (transcript IDs):

> library(org.Hs.eg.db)
> columns(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
 [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
[17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[25] "UNIGENE"      "UNIPROT"     
> keytypes(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
 [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
[17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[25] "UNIGENE"      "UNIPROT"

For example, with the first 5 gene ID keys:

> select(org.Hs.eg.db, keys= keys(org.Hs.eg.db, keytype="ENSEMBL")[1:5] , columns=c("SYMBOL","ENSEMBL","ENSEMBLTRANS"), keytype="ENSEMBL")
'select()' returned 1:many mapping between keys and columns
          ENSEMBL SYMBOL    ENSEMBLTRANS
1 ENSG00000121410   A1BG            <NA>
2 ENSG00000175899    A2M            <NA>
3 ENSG00000256069  A2MP1 ENST00000543404
4 ENSG00000256069  A2MP1 ENST00000566278
5 ENSG00000256069  A2MP1 ENST00000545343
6 ENSG00000256069  A2MP1 ENST00000544183
7 ENSG00000171428   NAT1            <NA>
8 ENSG00000156006   NAT2 ENST00000286479
9 ENSG00000156006   NAT2 ENST00000520116
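
To turn a lookup like this into the third column, one option is to match gene IDs between the two tables. The following is only a minimal base R sketch with toy data: the rows are copied from the outputs above, the toy `tx2gene` is assumed to already hold unversioned gene IDs, and `symbols` is a hypothetical stand-in for a `select()` result reduced to one row per gene.

```r
# Toy two-column tx2gene (IDs taken from the outputs above, versions omitted).
tx2gene <- data.frame(
  TXNAME = c("ENST00000543404", "ENST00000286479", "ENST00000456328"),
  GENEID = c("ENSG00000256069", "ENSG00000156006", "ENSG00000223972"),
  stringsAsFactors = FALSE
)

# Hypothetical gene ID -> symbol lookup, one row per gene.
symbols <- data.frame(
  ENSEMBL = c("ENSG00000121410", "ENSG00000256069", "ENSG00000156006"),
  SYMBOL  = c("A1BG", "A2MP1", "NAT2"),
  stringsAsFactors = FALSE
)

# match() returns, for each GENEID, the row of its first hit in the lookup
# (NA when there is none), so the row order of tx2gene is preserved.
tx2gene$SYMBOL <- symbols$SYMBOL[match(tx2gene$GENEID, symbols$ENSEMBL)]

print(tx2gene)
```

Gene IDs that are missing from the lookup end up with `NA` in the new column rather than being dropped, which keeps the table aligned with the original transcript list.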

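By the way, if you're going to use your own Ensembl IDs as keys, note that the GENCODE IDs in your tx2gene carry a version suffix (e.g. the ".5" in ENSG00000223972.5), while the ENSEMBL keys in org.Hs.eg.db shown above are unversioned, so you will usually need to remove the last part of the name (the version) first. This can be easily done with stringr::str_split_fixed.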
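
As a rough illustration of that version trimming, here is a small base R sketch. `strip_version` is a hypothetical helper name introduced only for this example, and the commented line shows how the stringr function named above could be used for IDs of this form.

```r
# Hypothetical helper: drop a trailing ".<digits>" version from Ensembl IDs.
# IDs without a version suffix are returned unchanged.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

ids <- c("ENSG00000223972.5", "ENST00000456328.2", "ENSG00000121410")
print(strip_version(ids))

# With stringr, keeping only the part before the first dot:
# stringr::str_split_fixed(ids, "\\.", 2)[, 1]
```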