Question: get biotype for refseq NM transcripts
Pablo Marin-Garcia (Spain) wrote 3.3 years ago:

I would like to have biotypes (as for Ensembl transcripts) for the NCBI RefSeq transcripts downloaded from the UCSC MySQL server. Which table should I join, and which column should I use?

My current query without the biotype is:

       mysql --user=genome --host=genome-mysql.cse.ucsc.edu -D hg19 -N -A -e 'select bin,name,chrom,strand,txStart,txEnd,cdsStart,cdsEnd,name2 from refGene'

 

Tag: refseq
modified 3.3 years ago by Dan Gaston • written 3.3 years ago by Pablo Marin-Garcia

Does it have to be via UCSC? If you use Ensembl, you could definitely get the biotypes.

written 3.3 years ago by igor

For some applications I use RefSeq and HGMD, both of which use NM IDs. Has Ensembl already solved the problems of mapping/integrating RefSeq IDs?

written 3.3 years ago by Pablo Marin-Garcia

Using either BioMart or UCSC you should be able to generate a table with the mappings between IDs. I have updated my answer below to outline how I did this with the UCSC Table Browser.

written 3.3 years ago by Dan Gaston

Yes, you could get mapping between different IDs from UCSC. However, as far as I know, you cannot get the biotypes from UCSC. That was where I was going with my initial question, which probably wasn't very clear.

written 3.3 years ago by igor

Of course, I understood that. I was just pointing out that you will most likely need to use both. You may be able to do something similar to what I posted below and get the RefSeq IDs in a table directly through BioMart, all in one step.

modified 3.3 years ago • written 3.3 years ago by Dan Gaston
Dan Gaston (Canada) wrote 3.3 years ago:

Technically the biotype is for a transcript, and not a gene. While in many cases the biotype of all transcripts for a gene will be the same, you get a few that aren't. That said, I'm not sure off the top of my head whether you can do a join within UCSC between the two tables. You might need to output two datasets (RefSeq and Ensembl) and cross-reference them yourself with a little script. You can output alternative IDs in both tables, and use whichever you prefer to link the two tables.

 

Update: This is a step-by-step outline of how I generated a table linking Ensembl transcript IDs to RefSeq IDs:

UCSC Table Browser: https://genome.ucsc.edu/cgi-bin/hgTables

Group: Genes and Gene Predictions; Track: Ensembl Genes; Table: ensGene

Output format: selected fields from primary and related tables

1) Click "get output".

2) On the next page, under the linked tables, check the box beside ccdsInfo and then click "allow selection from checked tables".

3) More tables come up for linking. Check "ccds id" under the fields of the CCDS info table. Also check the knownToRefSeq table and click "allow selection from checked tables" again.

4) Under knownToRefSeq, check both fields (the primary ID and the value).

5) Click "get output" near the top of the page, under the fields for the Ensembl table.

 

You'll end up with Ensembl transcript IDs in the first column and a list of NM IDs in the final column. You can then process this however you like to get a mapping of RefSeq IDs to Ensembl IDs. I didn't poke around enough to see whether there is a linked table that provides biotypes; those may be easier to get from BioMart on the Ensembl website itself. With those two files you should be able to parse them as tab-delimited data and create a mapping file, associate biotypes, etc. with a fairly simple script in Perl, Python, or your scripting language of choice.
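A minimal Python sketch of the kind of script described above, not the one actually used for this answer. It assumes two tab-delimited inputs: a BioMart export with the Ensembl transcript ID in the first column and the biotype in the second, and the Table Browser output with the Ensembl transcript ID in the first column and the RefSeq IDs in the last. The function names, the comma-separated packing of several NM IDs in one cell, and the "n/a" placeholder for missing values are all assumptions to check against your real files.

```python
import csv
import io
from collections import defaultdict


def load_biotypes(handle):
    """Read a TSV of (Ensembl transcript ID, biotype) rows into a dict."""
    biotypes = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, short rows, and '#'-prefixed header lines.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        biotypes[row[0]] = row[1]
    return biotypes


def refseq_biotypes(mapping_handle, biotypes):
    """Return {RefSeq ID: set of biotypes of the Ensembl transcripts linked to it}."""
    result = defaultdict(set)
    for row in csv.reader(mapping_handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue
        enst, refseq_field = row[0], row[-1]
        if enst not in biotypes:
            continue  # no biotype known for this transcript
        # Assumption: several RefSeq IDs may share one comma-separated cell.
        for refseq_id in refseq_field.split(","):
            refseq_id = refseq_id.strip()
            if refseq_id and refseq_id != "n/a":
                result[refseq_id].add(biotypes[enst])
    return dict(result)


if __name__ == "__main__":
    # Made-up rows standing in for the two exported files.
    biomart = io.StringIO("ENST_A\tprotein_coding\nENST_B\tretained_intron\n")
    ucsc = io.StringIO(
        "#name\tccds\tvalue\n"
        "ENST_A\tCCDS_X\tNM_1,NM_2\n"
        "ENST_B\tn/a\tNM_1\n"
    )
    mapping = refseq_biotypes(ucsc, load_biotypes(biomart))
    for refseq_id, types in sorted(mapping.items()):
        print(refseq_id, ",".join(sorted(types)), sep="\t")
```

A set of biotypes is kept per RefSeq ID because one NM ID can be linked to several Ensembl transcripts whose biotypes differ, so the caller has to decide how to resolve conflicts.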

modified 3.3 years ago • written 3.3 years ago by Dan Gaston

"Protein coding genes" was a mistype; I meant to say "transcript". I wanted to double-check how to filter the NM_ list for a gene down to the ones that are real, 'functional' transcripts. I would also like to query in which tissue, if any, each one is the primary transcript. Expression Atlas would probably be the place to go for this last question.

written 3.3 years ago by Pablo Marin-Garcia
poisonAlien (Asgard) wrote 3.3 years ago:

The simplest way is to look at the RefSeq IDs.

Protein-coding transcripts (mRNAs) start with 'NM' and non-coding ones start with 'NR'.

There is a whole nomenclature of such prefixes.

written 3.3 years ago by poisonAlien

That would only give you two biotypes. Ensembl/GENCODE provides a much richer classification: http://www.gencodegenes.org/gencode_biotypes.html

written 3.3 years ago by igor

Do NM transcripts have biotype labels as diverse as the ones Ensembl has?

modified 3.3 years ago • written 3.3 years ago by Pablo Marin-Garcia

NM* is just an NCBI ID and there is no biotype associated with it as far as I know.

written 3.3 years ago by igor

I saw on Wikipedia that NR is for non-coding, but I read somewhere else that NR was for predicted transcripts, so I was not sure.

written 3.3 years ago by Pablo Marin-Garcia

N is known, X is predicted.

NM: known mRNA

XM: predicted mRNA

NR: known ncRNA

XR: predicted ncRNA
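
As an illustration only, the four prefixes listed above can be turned into a small lookup in Python. The function name `classify_refseq` and the example accessions are made up for this sketch, and any prefix outside these four (for example the protein prefixes NP_ and XP_) is simply reported as unknown.

```python
# Prefix table taken from the list above: N = known, X = predicted.
REFSEQ_PREFIXES = {
    "NM": ("known", "mRNA"),
    "XM": ("predicted", "mRNA"),
    "NR": ("known", "ncRNA"),
    "XR": ("predicted", "ncRNA"),
}


def classify_refseq(accession):
    """Return (status, molecule) for a RefSeq transcript accession, or None."""
    prefix, sep, _ = accession.strip().partition("_")
    if not sep:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


if __name__ == "__main__":
    # Hypothetical accessions used purely as input strings.
    for acc in ["NM_000001.1", "XR_000002", "NP_000003", "ENST_A"]:
        print(acc, classify_refseq(acc))
```

This only separates coding from non-coding and known from predicted; it does not reproduce the richer Ensembl/GENCODE biotypes.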

written 3.3 years ago by Emily_Ensembl

Thanks Emily, it took me some time to find that out after posting the question. I had never come across XR, so thanks for that one. I wish this nomenclature were easier to find when you google for RefSeq transcript nomenclature. You go to

http://www.ncbi.nlm.nih.gov/refseq/about/

and you can read only about the X and the N:

Definitions:

  • Model RefSeq:  RNA and protein products that are generated by the eukaryotic genome annotation pipeline. These records use accession prefixes XM_, XR_, and XP_.
  • Known RefSeq: RNA and protein products that are mainly derived from GenBank cDNA and EST data and are supported by the RefSeq eukaryotic curation group. These records use accession prefixes NM_, NR_, and NP_.

I think the word 'predicted' is a bit more meaningful than 'generated' when you skim the phrase 'generated by the eukaryotic genome annotation pipeline'. 'Generated' is fine, but its real meaning is not easy to grasp when you are in a hurry :-(

written 3.3 years ago by Pablo Marin-Garcia

NM = mRNA, NR = RNA, according to the NCBI tutorials.

When you think about it twice, it makes sense to read NR as "not mRNA".

http://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly

written 3.3 years ago by Pablo Marin-Garcia