How to create tx2gene object with gene names from annotation
4.7 years ago
poecile.pal ▴ 50

Hello all,

I would like to build a tx2gene object with 3 columns, where the third column holds gene names such as PLA2G4A. The annotation I am using looks like this:

##description: evidence-based annotation of the human genome (GRCh38), version 32 (Ensembl 98)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2019-09-05
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "lncRNA"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

First, I created a tx2gene object with 2 columns:

> library(GenomicFeatures)
> txdb <- makeTxDbFromGFF(file = "gencode.v32.annotation.gtf")
> saveDb(x = txdb, file = "gencode.v32.annotation.TxDb")
> k <- keys(txdb, keytype = "TXNAME")
> tx2gene <- select(txdb, keys = k, columns = "GENEID", keytype = "TXNAME")
> head(tx2gene)
             TXNAME            GENEID
1 ENST00000456328.2 ENSG00000223972.5
2 ENST00000450305.2 ENSG00000223972.5
3 ENST00000473358.1 ENSG00000243485.5
4 ENST00000469289.1 ENSG00000243485.5
5 ENST00000607096.1 ENSG00000284332.1
6 ENST00000606857.1 ENSG00000268020.3

But how do I add the third column with gene names? There are 2 problems: 1) The main problem: I can't find gene names in the output of columns(txdb):

> columns(txdb)
 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSPHASE"   "CDSSTART"  
 [7] "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"   "EXONRANK"  
[13] "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"    "TXEND"      "TXID"      
[19] "TXNAME"     "TXSTART"    "TXSTRAND"   "TXTYPE"

2) Even if they were in txdb, how can I add a third column?

Thank you very much!

Best regards, Poecile

RNA-Seq R software error • 8.0k views
4.7 years ago
Papyrus ★ 3.0k

You can try the org.Hs.eg.db package, which has "SYMBOL" (gene names), "ENSEMBL" (gene IDs), and "ENSEMBLTRANS" (transcript IDs):

> library(org.Hs.eg.db)
> columns(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
 [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
[17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[25] "UNIGENE"      "UNIPROT"     
> keytypes(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
 [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
[17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[25] "UNIGENE"      "UNIPROT"

For example, with the first 5 gene ID keys:

> select(org.Hs.eg.db, keys = keys(org.Hs.eg.db, keytype = "ENSEMBL")[1:5], columns = c("SYMBOL", "ENSEMBL", "ENSEMBLTRANS"), keytype = "ENSEMBL")
'select()' returned 1:many mapping between keys and columns
          ENSEMBL SYMBOL    ENSEMBLTRANS
1 ENSG00000121410   A1BG            <NA>
2 ENSG00000175899    A2M            <NA>
3 ENSG00000256069  A2MP1 ENST00000543404
4 ENSG00000256069  A2MP1 ENST00000566278
5 ENSG00000256069  A2MP1 ENST00000545343
6 ENSG00000256069  A2MP1 ENST00000544183
7 ENSG00000171428   NAT1            <NA>
8 ENSG00000156006   NAT2 ENST00000286479
9 ENSG00000156006   NAT2 ENST00000520116
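
To turn a result like this into a three-column tx2gene table, one option is to drop the rows that have no transcript ID and reorder the columns so the transcript ID comes first. This is only a minimal sketch in base R: `res` and `tx2gene_sym` are hypothetical names, and `res` is rebuilt by hand from a subset of the rows above so the snippet runs without the annotation package.

```r
# A subset of the rows from the select() output above, rebuilt by hand.
# `res` is a hypothetical name for wherever that result was stored.
res <- data.frame(
  ENSEMBL = c("ENSG00000121410", "ENSG00000175899", "ENSG00000256069",
              "ENSG00000256069", "ENSG00000171428", "ENSG00000156006"),
  SYMBOL = c("A1BG", "A2M", "A2MP1", "A2MP1", "NAT1", "NAT2"),
  ENSEMBLTRANS = c(NA, NA, "ENST00000543404", "ENST00000566278",
                   NA, "ENST00000286479"),
  stringsAsFactors = FALSE
)

# Keep only rows that map to a transcript, then put the transcript ID first.
has_tx <- !is.na(res$ENSEMBLTRANS)
tx2gene_sym <- res[has_tx, c("ENSEMBLTRANS", "ENSEMBL", "SYMBOL")]

# Rename to match the TXNAME/GENEID convention used in the question.
names(tx2gene_sym) <- c("TXNAME", "GENEID", "SYMBOL")
rownames(tx2gene_sym) <- NULL
print(tx2gene_sym)
```

Genes with `<NA>` in ENSEMBLTRANS have no transcript mapping in this package, so their transcripts would be missing from a table built this way.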

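By the way, if you are going to use your own Ensembl IDs as keys, you will probably need to remove the version suffix first (e.g. ENSG00000223972.5 becomes ENSG00000223972), because the IDs in the output above carry no version. This can be easily done with stringr::str_split_fixed.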
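
As a rough sketch of that version-stripping step, base R's sub() can remove the trailing version (used here instead of stringr::str_split_fixed, purely so the snippet has no dependencies). The unversioned ID can then act as a merge key for adding a gene-name column. The `symbols` table is a hypothetical stand-in for a lookup result; its single entry, DDX11L1, comes from the GTF excerpt in the question.

```r
# First rows of the two-column tx2gene table from the question.
tx2gene <- data.frame(
  TXNAME = c("ENST00000456328.2", "ENST00000450305.2", "ENST00000473358.1"),
  GENEID = c("ENSG00000223972.5", "ENSG00000223972.5", "ENSG00000243485.5"),
  stringsAsFactors = FALSE
)

# Strip the version: drop a literal dot followed by digits at the end of the ID.
tx2gene$GENEID_NOVERSION <- sub("\\.[0-9]+$", "", tx2gene$GENEID)

# Hypothetical lookup table keyed by unversioned gene ID.
symbols <- data.frame(
  GENEID_NOVERSION = "ENSG00000223972",
  SYMBOL = "DDX11L1",
  stringsAsFactors = FALSE
)

# Left join: all.x = TRUE keeps transcripts whose gene has no symbol (SYMBOL is NA).
tx2gene_sym <- merge(tx2gene, symbols, by = "GENEID_NOVERSION", all.x = TRUE)
tx2gene_sym <- tx2gene_sym[, c("TXNAME", "GENEID", "SYMBOL")]
print(tx2gene_sym)
```

Keeping the versioned GENEID in the final table and using the unversioned copy only as the join key means the first two columns still match the IDs in the GTF.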

Thanks a lot, I had never heard of this package.
