Transcript_id to gene_name
6.4 years ago

I have tried bioDBnet and BioMart to retrieve the gene name/gene ID from coding transcript IDs, but about 22000 transcripts could not be mapped. Is there any other database or resource I could use to map these transcripts? Thanks in advance.

gene • 3.8k views
Please elaborate on the IDs that you have. There are many different types. Even paste some here, if you can. 'Coding transcript id' does not inform us if you have ENSEMBL transcript IDs, RefSeq IDs, or something else. Also, do you want HGNC / HUGO gene symbols? Thanks, Kevin.

Hi, the IDs look like this: uc001aab.3, uc001aai.1, uc001aam.3, uc001aav.3, uc001aaz.2, uc001aba.1, uc001abc.2. I want HGNC gene symbols.

Thanks all for your replies. I am referring to the link above from toralmanvar: https://webshare.bioinf.unc.edu/public/mRNAseq_TCGA/rsem_ref/unc_knownToLocus.txt There are about 73000 transcripts in it. But are they coding or non-coding transcripts?

I think the non-coding RNAs were sequenced by a different TCGA center, so these should be coding, AFAIK.

I downloaded the non-coding RNA list from https://www.genenames.org/cgi-bin/statistics and matched it against the entries in this file: https://webshare.bioinf.unc.edu/public/mRNAseq_TCGA/rsem_ref/unc_knownToLocus.txt (around 2000 matched). Is there a database with a complete set of non-coding transcripts, so that I can download all of them and remove them from my list?

I checked with some TCGA folks and was told that there were some non-coding sequences present but the list was not comprehensive.

AFAIK GDC redid the entire RNA-seq data analysis and must have used recent IDs. Is there a reason you are still using this old data?

They do appear to have re-analysed all RNA-seq. I have been downloading all TCGA RSEM count data from the GDC Legacy Archive. This was not previously available for all cancers. The gene names in these files are HGNC IDs, which helps.

I have a list of transcript IDs (coding and non-coding), approx. 27000 in total, like uc011kvo.1, uc001aaa.3, uc001aab.3, uc001aai.1, uc001aak.2, uc001aal.1, uc001aam.3, uc001aau.2, uc001aav.3, uc001aaz.2. I just want to convert them to gene symbols and separate them into coding and non-coding.

You cannot distinguish coding from non-coding going by HGNC symbols, but you can do this by converting to RefSeq. In RefSeq, a 'NM' prefix indicates a coding gene, whilst 'NR' indicates non-coding. See Table 1 - RefSeq accession numbers and molecule types.

Pierre manages to convert to both HGNC and RefSeq, here: How to convert UCSC ID to gene symbol

An alternative is to convert to HGNC symbols and then look up each gene's biotype.

This may be a good case for submitting a ticket to UCSC genome browser support (genome at soe.ucsc.edu, they sometimes participate here but not frequently). They may be able to tell you how to do this classification.

Hi Pierre

Your code works and runs on my Linux machine. But I have a list of 27000 IDs, so how do I run it for all of them?

Download the file from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/kgXrefOld5.txt.gz , sort it, and use Linux 'join' against your sorted ID list.
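
The same download-and-join idea could be sketched in Python roughly as follows. This is only a minimal sketch, not the sort/join command itself. It assumes the table is tab-separated, with the UCSC ID in the first column and the gene symbol in the fifth (the usual kgXref layout, so check the schema for kgXrefOld5). The function names `load_symbols` and `map_ids` are made up for illustration. The two sample rows reuse IDs and symbols quoted in this thread, with the other columns left empty as placeholders, and `uc999zzz.1` is an invented ID to show what an unmapped entry looks like.

```python
import csv
import io


def load_symbols(handle, id_col=0, symbol_col=4):
    """Read a tab-separated kgXref-style table into {ucsc_id: gene_symbol}.

    The column positions are assumptions; check them against the table schema.
    """
    mapping = {}
    # QUOTE_NONE: treat quote characters in description fields as plain text.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) > max(id_col, symbol_col):
            mapping[row[id_col]] = row[symbol_col]
    return mapping


def map_ids(ids, mapping):
    """Return (id, symbol) pairs; symbol is None when the ID is not in the table."""
    return [(ucsc_id, mapping.get(ucsc_id)) for ucsc_id in ids]


# Placeholder rows: only columns 1 and 5 carry values taken from this thread.
sample = io.StringIO(
    "uc001aab.3\t\t\t\tDKFZp434K1323\n"
    "uc001aav.3\t\t\t\tLOC388312\n"
)
mapping = load_symbols(sample)

# "uc999zzz.1" is an invented ID used to show an unmapped entry.
pairs = map_ids(["uc001aab.3", "uc001aav.3", "uc999zzz.1"], mapping)
for ucsc_id, symbol in pairs:
    print(ucsc_id, symbol, sep="\t")

# For the downloaded file (local path is hypothetical):
# import gzip
# with gzip.open("kgXrefOld5.txt.gz", "rt", newline="") as fh:
#     mapping = load_symbols(fh)
```

Loading the whole table into a dictionary avoids the sorting step that 'join' needs, and IDs that come back as None are the ones still missing from the table.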

Hi, I have extracted all 24000 transcripts with their gene names, but the data contains coding genes, non-coding genes and pseudogenes. How can I distinguish among them? I want only the coding genes. The data looks like this:

uc001aaa.3  BC032353    DDX11L1,
uc010nxr.1  AM992878    DDX11L1,
uc001aal.1  NM_001005484    Q8NH21,
uc001aav.3  NR_028327   LOC388312,
uc009vjk.2  AK293878    B4DF06,
uc001aaz.2  BC018860    BC018860,
uc001aba.1  X64709  X64709,
uc001abc.2  CR615613    CR615613,
uc010nya.1  CU692293    Q86XA8,

Also, as Kevin said above, an 'NM' prefix indicates a coding gene, whilst 'NR' indicates non-coding, but the data also contains other prefixes: BC, AM, X6, CR, AK, AB, FJ, etc.
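
As a rough illustration of the NM/NR rule applied to rows like those above, here is a minimal sketch. It assumes the second column holds the accession, and that any accession without an `NM_` or `NR_` prefix (BC, AK, CR and so on) cannot be classified by this rule alone and needs another source of annotation. The function name `classify_accession` and the label strings are made up for illustration.

```python
def classify_accession(accession):
    """Label an accession using the RefSeq NM_/NR_ prefix rule only."""
    if accession.startswith("NM_"):
        return "coding"
    if accession.startswith("NR_"):
        return "non-coding"
    # Other prefixes say nothing about coding status under this rule.
    return "unclassified"


# Three rows copied from the data shown above: (UCSC ID, accession, symbol).
rows = [
    ("uc001aaa.3", "BC032353", "DDX11L1"),
    ("uc001aal.1", "NM_001005484", "Q8NH21"),
    ("uc001aav.3", "NR_028327", "LOC388312"),
]

labels = {ucsc_id: classify_accession(acc) for ucsc_id, acc, _ in rows}
coding_ids = [ucsc_id for ucsc_id, label in labels.items() if label == "coding"]
print(coding_ids)
```

Keeping an explicit "unclassified" label, rather than treating everything that is not NM as non-coding, makes it clear how many transcripts remain undecided.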

6.4 years ago

These are old/deprecated UCSC transcript identifiers:

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -P 3306 -e 'select kgID,geneSymbol from kgXrefOld5 where kgId in ("uc001aab.3","uc001aai.1", "uc001aam.3", "uc001aav.3", "uc001aaz.2", "uc001aba.1","uc001abc.2")' 
+------------+---------------+
| kgID       | geneSymbol    |
+------------+---------------+
| uc001aab.3 | DKFZp434K1323 |
| uc001aai.1 | DKFZp434K1323 |
| uc001aam.3 | DQ595736      |
| uc001aav.3 | LOC388312     |
| uc001aaz.2 | BC018860      |
| uc001aba.1 | X64709        |
| uc001abc.2 | CR615613      |
+------------+---------------+

Nice, Pierre!

6.4 years ago
Tm ★ 1.1k

A similar issue is also discussed and addressed here.
