How to create a transcript-gene association matrix using RefSeq IDs?
8.1 years ago
jack ▴ 950

Hi all,

I have transcript and gene IDs in RefSeq format, like this:

Gene IDs:

ZNF498   IL11RA    KIF2A    NCOA3 ....

Transcript IDs:

NM_152486 NM_015658 NM_198317     NM_032129

I want to create a matrix that associates each transcript with its gene. I looked at the RefSeq database, but I couldn't find a file that lists genes and their transcripts in RefSeq ID format. I don't want to convert my IDs to another format, because I lose some of them in the conversion.

Could someone help me with how to do this?

gene next-gen RNA-Seq R
8.1 years ago
komal.rathi ★ 4.0k

You could use biomaRt to get a table of RefSeq transcript IDs and gene symbols:

library(biomaRt)
# Connect to the Ensembl mart for human genes
ensembl = useMart("ensembl", dataset="hsapiens_gene_ensembl")
results = getBM(attributes = c('refseq_mrna','hgnc_symbol'), mart = ensembl)

or if you have a list of Refseq Transcript IDs, say refseq_transcript_ID, then you can use:

results = getBM(attributes = c('refseq_mrna','hgnc_symbol'), filters = 'refseq_mrna', values = refseq_transcript_ID, mart = ensembl)

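Alternatively, if you want a ready-made file with transcript IDs and gene symbols, you can use gene2refseq.gz from the NCBI Gene FTP site. The fields you are interested in are named RNA_nucleotide_accession.version and Symbol. Note that the accessions in that file carry a version suffix (a dot followed by a number), so strip it before matching against unversioned IDs such as NM_152486.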
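As a rough sketch of how the gene2refseq route could look in R: the snippet below writes a tiny stand-in file with made-up rows (the IDs and symbols are hypothetical, and the real file has more columns than the four shown), then extracts a transcript-to-symbol table for human. The taxonomy filter (9606) and the "-" placeholder for rows without an RNA accession are my assumptions about how the file is laid out, so check them against the real header before relying on this.

```r
# Toy stand-in for gene2refseq.gz: real column names, made-up rows.
# The real file has more columns; only the ones used here are included.
toy <- c(
  "#tax_id\tGeneID\tRNA_nucleotide_accession.version\tSymbol",
  "9606\t1\tNM_900001.2\tGENE_A",
  "9606\t1\tNM_900002.1\tGENE_A",
  "9606\t2\tNM_900003.4\tGENE_B",
  "9606\t2\t-\tGENE_B",
  "10090\t3\tNM_900004.1\tGene_c"
)
path <- tempfile(fileext = ".tsv")
writeLines(toy, path)

# check.names = FALSE keeps "#tax_id" and the ".version" names unchanged;
# read.delim does not treat "#" as a comment character by default.
g <- read.delim(path, check.names = FALSE, stringsAsFactors = FALSE)

# Keep human rows (taxonomy ID 9606) that actually have an RNA accession.
keep <- g[["#tax_id"]] == 9606 & g[["RNA_nucleotide_accession.version"]] != "-"
g <- g[keep, ]

# Drop the ".N" version suffix so IDs match unversioned ones like NM_152486.
tx2gene <- unique(data.frame(
  transcript = sub("\\.[0-9]+$", "", g[["RNA_nucleotide_accession.version"]]),
  gene = g[["Symbol"]],
  stringsAsFactors = FALSE
))
print(tx2gene)
```

For the real download you would point read.delim at the file itself instead of the temporary toy file; it is large, so filtering to your organism early keeps memory use down.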

And yes, RefSeq transcript ID to gene symbol is a many-to-one relationship: a gene can have several transcripts, but each transcript maps to one gene.
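Once you have a two-column transcript/gene table from either route, one way to turn it into the association matrix asked about is a simple cross-tabulation. This is only a sketch: the data frame below is a made-up stand-in using the same column names as the getBM call (the IDs and symbols are hypothetical), and I am assuming unmapped entries show up as empty strings or NA.

```r
# Toy mapping table with the column names used in the getBM() call;
# the IDs and symbols are made up for illustration.
results <- data.frame(
  refseq_mrna = c("NM_900001", "NM_900002", "NM_900003", ""),
  hgnc_symbol = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
  stringsAsFactors = FALSE
)

# Drop rows with a missing transcript ID or symbol, then de-duplicate.
ok <- !is.na(results$refseq_mrna) & results$refseq_mrna != "" &
      !is.na(results$hgnc_symbol) & results$hgnc_symbol != ""
results <- unique(results[ok, ])

# Cross-tabulate: rows are transcripts, columns are genes, 1 marks membership.
assoc <- unclass(table(results$refseq_mrna, results$hgnc_symbol))

# With a many-to-one mapping, every transcript row should sum to exactly 1.
stopifnot(all(rowSums(assoc) == 1))
print(assoc)
```

For a genome-wide table a dense matrix like this gets big quickly, so a sparse representation (or simply keeping the two-column table) may be the more practical choice.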


