Question

Convert between RefSeq and Ensembl Transcript?

5

9.8 years ago

pwg46 ▴ 540

Hello, I am looking for a good data file to convert between RefSeq accessions (specifically mRNA accessions such as NM_...) and Ensembl transcript IDs (ENST...). I would like a file that can be downloaded easily via a simple script using FTP. I would also prefer a reasonably formatted text file, as I would essentially be parsing the ENST and NM_ parts and inserting them into a MySQL table.

refseq ensembl • 35k views

updated 9 months ago by Ram 43k • written 9.8 years ago by pwg46 ▴ 540

0

Are you looking to store ENS and NM ids for the same sequence? Or do you wanna store GenBank and ENSEMBL entries?

9.8 years ago by Ram 43k

0

Hmm, I guess the latter. I've been looking through the GenBank data files, and they have large data files for each chromosome. These files do have ENST -> NM_ mappings for every transcript on each chromosome; however, I feel that using them would not be efficient. Not only are they large and slow to download, but parsing them would also take quite a while, even though I simply want to create a tab-delimited text file with the ENST ID in the first column and its corresponding NM_ ID in the second.

updated 2.4 years ago by Ram 43k • written 9.8 years ago by pwg46 ▴ 540
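
A minimal Python sketch of the loading step described in this comment, assuming a two-column tab-delimited file with the ENST ID first and the NM_ ID second. The table name `enst_refseq` and the helper names are hypothetical, and `sqlite3` stands in for MySQL so that the example runs without a server; a MySQL driver that follows the Python DB-API should fit the same pattern, usually with `%s` placeholders and `INSERT IGNORE`.

```python
import csv
import io
import sqlite3


def parse_mapping(handle):
    """Yield (enst, refseq) pairs from a two-column tab-delimited file."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and rows with fewer than two columns.
        if len(row) < 2:
            continue
        enst, refseq = row[0].strip(), row[1].strip()
        # Keep only ENST -> NM_ pairs; other accessions (e.g. NR_) are dropped.
        if enst.startswith("ENST") and refseq.startswith("NM_"):
            yield enst, refseq


def load_mapping(conn, pairs):
    """Create the (hypothetical) enst_refseq table and insert the pairs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS enst_refseq ("
        "enst TEXT NOT NULL, refseq TEXT NOT NULL, "
        "PRIMARY KEY (enst, refseq))"
    )
    # INSERT OR IGNORE is SQLite syntax; it skips duplicate pairs.
    conn.executemany(
        "INSERT OR IGNORE INTO enst_refseq (enst, refseq) VALUES (?, ?)", pairs
    )
    conn.commit()


# Sample rows copied from the Ensembl query output posted in this thread.
sample = io.StringIO(
    "ENST00000261847\tNM_014701.3\n"
    "ENST00000439682\tNM_001277304.1\n"
    "ENST00000439682\tNM_207355.2\n"
    "ENST00000517143\tNR_046932.1\n"
)
conn = sqlite3.connect(":memory:")
load_mapping(conn, parse_mapping(sample))
rows = conn.execute(
    "SELECT enst, refseq FROM enst_refseq ORDER BY enst, refseq"
).fetchall()
```

The composite primary key allows one transcript to map to several accessions, which does happen in the data.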

0

I'd suggest tinkering with UCSC Genome Browser's mysql database. You should be able to write a query/script that, given ID1, does a bunch of SELECTs for ID2.

9.8 years ago by Ram 43k

Ram · Answer 1 · 2014-07-15

mysql -u anonymous -h ensembldb.ensembl.org

mysql> use homo_sapiens_core_75_37;

mysql> SELECT transcript.stable_id, xref.display_label
    -> FROM transcript, object_xref, xref, external_db
    -> WHERE transcript.transcript_id = object_xref.ensembl_id
    ->   AND object_xref.ensembl_object_type = 'Transcript'
    ->   AND object_xref.xref_id = xref.xref_id
    ->   AND xref.external_db_id = external_db.external_db_id
    ->   AND external_db.db_name = 'RefSeq_mRNA';

+-----------------+----------------+
| stable_id       | display_label  |
+-----------------+----------------+
| ENST00000517143 | NR_046932.1    |
| ENST00000362897 | NR_046944.1    |
| ENST00000384568 | NR_046948.1    |
| ENST00000384769 | NR_002574.1    |
| ENST00000384323 | NR_002575.1    |
| ENST00000516690 | NR_046934.1    |
| ENST00000363970 | NR_046947.1    |
| ENST00000365367 | NR_046928.1    |
| ENST00000517119 | NR_046935.1    |
| ENST00000427390 | NM_001145004.1 |
| ENST00000384474 | NR_046940.1    |
| ENST00000454856 | NM_001277303.1 |
| ENST00000516986 | NR_046930.1    |
| ENST00000559471 | NM_001193489.1 |
| ENST00000261847 | NM_014701.3    |
| ENST00000439682 | NM_001277304.1 |
| ENST00000439682 | NM_207355.2    |
| ENST00000411348 | NR_046931.1    |
| ENST00000298232 | NM_199259.2    |
| ENST00000361285 | NM_199261.2    |
| ENST00000342420 | NM_199260.2    |
| ENST00000569541 | NM_031421.2    |
| ENST00000299443 | NM_174981.3    |
| ENST00000399848 | NM_181482.4    |
| ENST00000359446 | NM_181481.4    |
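
Note that the result carries RefSeq version suffixes (e.g. `NM_014701.3`) and is not one-to-one (`ENST00000439682` appears with two NM_ accessions). If unversioned accessions are wanted, a small Python sketch along these lines could read rows in the boxed layout shown above, drop the version suffix, and group accessions per transcript. The helper names are hypothetical.

```python
from collections import defaultdict


def parse_mysql_rows(text):
    """Yield (stable_id, display_label) from mysql's boxed table output."""
    for line in text.splitlines():
        line = line.strip()
        # Data rows start with '|'; border rows start with '+'.
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        # The header row is skipped because it does not start with ENST.
        if len(cells) == 2 and cells[0].startswith("ENST"):
            yield cells[0], cells[1]


def strip_version(accession):
    """Return the accession without its '.N' version suffix, if any."""
    return accession.split(".", 1)[0]


def group_by_transcript(pairs):
    """Map each ENST ID to the sorted list of its unversioned accessions."""
    grouped = defaultdict(set)
    for enst, acc in pairs:
        grouped[enst].add(strip_version(acc))
    return {enst: sorted(accs) for enst, accs in grouped.items()}


output = """
+-----------------+----------------+
| stable_id       | display_label  |
+-----------------+----------------+
| ENST00000261847 | NM_014701.3    |
| ENST00000439682 | NM_001277304.1 |
| ENST00000439682 | NM_207355.2    |
"""
mapping = group_by_transcript(parse_mysql_rows(output))
```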

Ram · Answer 2 · 2014-07-14

To query Ensembl BioMart programmatically, you can use the R package biomaRt:

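library("biomaRt")
# Connect to the Ensembl mart, human gene dataset
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# RefSeq mRNA accessions to look up
values <- c("NM_001101", "NM_001256799", "NM_000594")

# Add "ensembl_transcript_id" to attributes to also return ENST IDs
getBM(attributes = c("refseq_mrna", "ensembl_gene_id", "hgnc_symbol"), filters = "refseq_mrna", values = values, mart = ensembl)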

Results:

   refseq_mrna ensembl_gene_id hgnc_symbol
1    NM_000594 ENSG00000232810         TNF
2    NM_001101 ENSG00000075624        ACTB
3 NM_001256799 ENSG00000111640       GAPDH

The output of getBM() can be written to a file in tabular format using write.table().

To see which datasets BioMart offers and which attributes and filters a dataset has:

listDatasets(useMart("ensembl"))
listFilters(ensembl)
listAttributes(ensembl)

I would also look into the various annotation databases maintained in Bioconductor, such as org.Hs.eg.db.

Ram · Answer 3 · 2015-12-08

4

8.4 years ago

ashbigdeli ▴ 40

I found this in another post, but I'll echo it here. For all ID conversions I have found this tool very useful, and it's updated regularly.

http://biodbnet.abcc.ncifcrf.gov/db/db2db.php

updated 4.4 years ago by Ram 43k • written 8.4 years ago by ashbigdeli ▴ 40

Ram · Answer 4 · 2014-07-14

2

9.8 years ago

David Westergaard ★ 1.5k

Why not just use BioMart? E.g. http://www.ensembl.org/biomart/martview/49b845b7dcb16ac15fcd8cd7d1461c6c gives you all RefSeq IDs and Ensembl transcript IDs in a tab-separated format. You can get the Perl code by hitting the "Perl" button.

updated 2.4 years ago by Ram 43k • written 9.8 years ago by David Westergaard ★ 1.5k

0

I prefer Perl to R, but the Perl Biomart API is pretty rough. I wouldn't rely on it for more than very simple bulk downloads. Also, depending on the species, some biotypes (e.g. mRNA) do not have NCBI RefSeq annotations available to them.

9.8 years ago by pld 5.1k

0

That might be true. Honestly, I have never used the BioMart Perl API myself, as I prefer Python to Perl. I just take the XML and submit it as a query to BioMart, or simply download the full translation table. There is no need to automate what only needs to be done once or twice. An example of automated queries can be found at Automating Database Searches.

updated 2.4 years ago by Ram 43k • written 9.8 years ago by David Westergaard ★ 1.5k
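
The XML approach mentioned in this reply can be sketched in Python. This is an illustration under assumptions, not a tested recipe: it assumes the BioMart `martservice` endpoint on www.ensembl.org and the attribute name `ensembl_transcript_id` (`refseq_mrna` and the dataset `hsapiens_gene_ensembl` come from the biomaRt answer above). The function names are hypothetical, and the endpoint and attribute names should be checked against the current BioMart before relying on them.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint; verify against the current Ensembl BioMart documentation.
MARTSERVICE_URL = "http://www.ensembl.org/biomart/martservice"
XML_PROLOG = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'


def build_query(dataset, attributes):
    """Return a BioMart XML query string requesting TSV output."""
    query = ET.Element(
        "Query",
        virtualSchemaName="default",
        formatter="TSV",
        header="0",
        uniqueRows="1",
        datasetConfigVersion="0.6",
    )
    ds = ET.SubElement(query, "Dataset", name=dataset, interface="default")
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    return XML_PROLOG + ET.tostring(query, encoding="unicode")


def build_url(xml_query):
    """URL-encode the XML into the 'query' parameter of martservice."""
    return MARTSERVICE_URL + "?" + urllib.parse.urlencode({"query": xml_query})


def fetch_pairs(url):
    """Download the TSV and yield (enst, refseq) rows with both IDs present."""
    with urllib.request.urlopen(url) as response:
        for raw in response:
            fields = raw.decode("utf-8").rstrip("\r\n").split("\t")
            if len(fields) == 2 and all(fields):
                yield fields[0], fields[1]


xml_query = build_query(
    "hsapiens_gene_ensembl", ["ensembl_transcript_id", "refseq_mrna"]
)
url = build_url(xml_query)
# Network call left commented out so the sketch runs offline:
# for enst, refseq in fetch_pairs(url):
#     print(enst, refseq, sep="\t")
```

Rows with an empty second column (transcripts without a RefSeq mRNA cross-reference) are skipped by `fetch_pairs`, so the output is already the two-column ENST/NM_ table the question asks for.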