Question: How to create tx2gene object with gene names from annotation
poecile.pal wrote (3 months ago):

Hello all,

I would like to build a tx2gene object with three columns, where the third column holds gene names such as PLA2G4A. The annotation I am using looks like this:

##description: evidence-based annotation of the human genome (GRCh38), version 32 (Ensembl 98)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2019-09-05
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "lncRNA"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

First, I created a tx2gene object with two columns:

> library(GenomicFeatures)  # provides makeTxDbFromGFF(); also attaches AnnotationDbi for select()
> txdb <- makeTxDbFromGFF(file="gencode.v32.annotation.gtf")
> saveDb(x=txdb, file = "gencode.v32.annotation.TxDb")
> k <- keys(txdb, keytype = "TXNAME")
> tx2gene <- select(txdb, k, "GENEID", "TXNAME") 
> head(tx2gene)
             TXNAME            GENEID
1 ENST00000456328.2 ENSG00000223972.5
2 ENST00000450305.2 ENSG00000223972.5
3 ENST00000473358.1 ENSG00000243485.5
4 ENST00000469289.1 ENSG00000243485.5
5 ENST00000607096.1 ENSG00000284332.1
6 ENST00000606857.1 ENSG00000268020.3

But how can I add a third column with gene names? There are two problems. 1) The main problem: I can't find gene names in the output of columns(txdb):

> columns(txdb)
 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSPHASE"   "CDSSTART"  
 [7] "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"   "EXONRANK"  
[13] "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"    "TXEND"      "TXID"      
[19] "TXNAME"     "TXSTART"    "TXSTRAND"   "TXTYPE"

2) Even if they were in txdb, how can I add a third column?

Thank you very much!

Best regards, Poecile

Tags: rna-seq, R, software error
Papyrus wrote (3 months ago):

You can try the org.Hs.eg.db package, which has the columns "SYMBOL" (gene names), "ENSEMBL" (gene IDs) and "ENSEMBLTRANS" (transcript IDs):

> library(org.Hs.eg.db)
> columns(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
 [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
[17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[25] "UNIGENE"      "UNIPROT"     
> keytypes(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
 [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
[17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[25] "UNIGENE"      "UNIPROT"

For example, using the first 5 gene ID keys:

> select(org.Hs.eg.db, keys= keys(org.Hs.eg.db, keytype="ENSEMBL")[1:5] , columns=c("SYMBOL","ENSEMBL","ENSEMBLTRANS"), keytype="ENSEMBL")
'select()' returned 1:many mapping between keys and columns
          ENSEMBL SYMBOL    ENSEMBLTRANS
1 ENSG00000121410   A1BG            <NA>
2 ENSG00000175899    A2M            <NA>
3 ENSG00000256069  A2MP1 ENST00000543404
4 ENSG00000256069  A2MP1 ENST00000566278
5 ENSG00000256069  A2MP1 ENST00000545343
6 ENSG00000256069  A2MP1 ENST00000544183
7 ENSG00000171428   NAT1            <NA>
8 ENSG00000156006   NAT2 ENST00000286479
9 ENSG00000156006   NAT2 ENST00000520116
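
To get from this lookup to a three-column tx2gene, one option is a left join on the gene ID. The following is only a minimal sketch under some assumptions: the two-column table is retyped from the head(tx2gene) output in the question, the `symbols` data frame is a hand-written stand-in for what select() on org.Hs.eg.db would return (it contains only the one gene name visible in the question's GTF excerpt), and the `ENSEMBL` helper column and the name `tx2gene3` are made up for illustration.

```r
# Two-column table from the question (first rows of head(tx2gene)).
tx2gene <- data.frame(
  TXNAME = c("ENST00000456328.2", "ENST00000450305.2", "ENST00000473358.1"),
  GENEID = c("ENSG00000223972.5", "ENSG00000223972.5", "ENSG00000243485.5"),
  stringsAsFactors = FALSE
)

# Unversioned gene ID, used only as the join key (hypothetical helper column).
tx2gene$ENSEMBL <- sub("\\..*$", "", tx2gene$GENEID)

# Stand-in for the real lookup, which would come from something like:
#   select(org.Hs.eg.db, keys = unique(tx2gene$ENSEMBL),
#          columns = "SYMBOL", keytype = "ENSEMBL")
# Only the symbol shown in the question's GTF excerpt is filled in here.
symbols <- data.frame(
  ENSEMBL = "ENSG00000223972",
  SYMBOL  = "DDX11L1",
  stringsAsFactors = FALSE
)

# Left join: keep every transcript; genes without a symbol get NA.
merged <- merge(tx2gene, symbols, by = "ENSEMBL", all.x = TRUE)
tx2gene3 <- merged[, c("TXNAME", "GENEID", "SYMBOL")]
print(tx2gene3)
```

Note that select() can return several rows per key (the 1:many message above), which would duplicate transcripts after the merge. If one symbol per gene is enough, mapIds() from AnnotationDbi, which returns the first match per key by default, may be the simpler choice.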

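By the way, if you use your own Ensembl IDs as keys, you will need to remove the version suffix first (e.g. ENSG00000223972.5 becomes ENSG00000223972), because the ENSEMBL keys in org.Hs.eg.db are unversioned, as in the output above. This is easily done with stringr::str_split_fixed.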
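
A small sketch of the version stripping, using the IDs from the question. It uses base R sub() rather than stringr so that it runs without extra packages; the stringr call in the comment should be equivalent. The variable names are made up for illustration.

```r
# Versioned GENCODE IDs as they appear in the question's tx2gene table.
ids <- c("ENSG00000223972.5", "ENST00000456328.2")

# Drop everything from the first dot onward.
# Equivalent with stringr: stringr::str_split_fixed(ids, "\\.", 2)[, 1]
unversioned <- sub("\\..*$", "", ids)
print(unversioned)
```

Cutting at the first dot, rather than removing only trailing digits, also handles IDs that carry extra text after the version number.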


poecile.pal replied (3 months ago):

Thank you very much, I had never heard of this package.