Question: Mapping refseq transcripts to encoded proteins (NM_ to NP_)
0 votes • 4.7 years ago • Endre Bakken Stovner (Norway) wrote:

I'd like to find the protein encoded by each of my refGene transcripts.

I have the refGene.txt file where each line looks like the following:

138    NM_016166    chr15    +    68346571    68480404    68346664    68480173    14    68346571,68378643,68434283,68434627,68438153,68438903,68445927,68457068,68466069,68467974,68468811,68473549,68475967,68479879,    68346688,68379088,68434368,68434675,68438244,68439038,68446033,68457142,68466230,68468105,68468992,68473692,68476005,68480404,    0    PIAS1    cmpl    cmpl    0,0,1,2,2,0,0,1,0,2,1,2,1,0,

I'd like to know how I can get the NP_XXXX name for the transcript NM_016166.

Preferably, I'd like to just get a text file mapping the two in some way, but I'll accept any answer that does this automatically, e.g. with BioPython (having to look it up by hand in a browser or some such doesn't cut it - I need to do this for 50K transcripts).

I'm working with hg38, but I'm guessing the procedure is the same for all major genome versions, so I left the genome version out of the title to keep the question as general as possible.

refseq • 2.8k views
modified 3.7 years ago • written 4.7 years ago by Endre Bakken Stovner
2 votes • 4.7 years ago • 5heikki (Finland) wrote:

Here's one way with Entrez Direct:

 

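IFS=$'\n'
# Look up each transcript accession (column 2 of refGene.txt) and print it next to its linked protein
for next in $(cut -f2 refGene.txt); do
    NP=$(esearch -db nuccore -query "$next" | elink -target protein | efetch -format docsum | xtract -element Caption)
    echo "$next $NP"
done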

modified 4.7 years ago • written 4.7 years ago by 5heikki

Thanks, I need to install some tools so will get back to you. Upvote.

Ps. other answers still welcome. Is there a file that contains these mappings somewhere?

Reply written 4.7 years ago by Endre Bakken Stovner

Your previous, non-while version was better. This one includes an error (and only reads the first line anyways, see http://stackoverflow.com/questions/13800225/shell-script-while-read-line-loop-stops-after-the-first-line)

Reply written 4.7 years ago by Endre Bakken Stovner

My bad. I edited the answer again.

Reply written 4.7 years ago by 5heikki
2 votes • 3.7 years ago • Endre Bakken Stovner (Norway) wrote:

Hello myself from the past! After having done similar tasks umpteen gazillion times you realized you should make them easier for yourself. Therefore you wrote biomartian (pip install biomartian):

echo "138 NM_016166 chr15 + 68346571 68480404 68346664 68480173 14" | sed 's/ /\t/g' | biomartian --noheader -c 1 -i refseq_mrna -o refseq_peptide -

0    1    2    3    4    5    6    7    8    refseq_peptide

138    NM_016166    chr15    +    68346571    68480404    68346664    68480173    14    NP_057250

(The code before the biomartian call is just used to replace spaces with tabs).

See https://github.com/endrebak/biomartian for more

modified 3.6 years ago • written 3.7 years ago by Endre Bakken Stovner

This example does not work in my hands. The refseq_peptide column comes back blank.

Reply written 2.7 years ago by Malcolm.Cook
1 vote • 4.7 years ago • Endre Bakken Stovner (Norway) wrote:

The UCSC table to map transcripts to proteins is called refLink.

You can get it as a plaintext file by going to the UCSC table browser, selecting your genome of interest and then setting the group to "all tables". Next choose the table refLink, enter a filename and then press the "get output"-button.

Example output:

#name    product    mrnaAcc    protAcc    geneName    prodName    locusLinkId    omimId
PAX2    paired box 2    NM_001282819    NP_001269748    133420    110035    102094402    0
MLPH    melanophilin    NM_001282836    NP_001269765    65090    92499    102093006    0
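
Once the refLink dump is on disk, joining it to refGene.txt takes only a few lines. Below is a minimal Python sketch, assuming the dump is tab-separated with its columns in the order of the header line in the example output. The function names load_reflink and annotate_refgene, the "NA" placeholder, and the second sample refGene line are made up for illustration; nothing here comes from UCSC or from a library.

```python
import io

# Column order of refLink, taken from the header line in the example output.
REFLINK_COLUMNS = ["name", "product", "mrnaAcc", "protAcc",
                   "geneName", "prodName", "locusLinkId", "omimId"]


def load_reflink(handle):
    """Return {mRNA accession: protein accession} from a tab-separated refLink dump."""
    mapping = {}
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        record = dict(zip(REFLINK_COLUMNS, line.rstrip("\n").split("\t")))
        # Rows without a protein accession are left out of the mapping.
        if record.get("protAcc"):
            mapping[record["mrnaAcc"]] = record["protAcc"]
    return mapping


def annotate_refgene(handle, mapping, missing="NA"):
    """Yield each refGene line with the protein accession appended as a last column."""
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        accession = fields[1]  # second column of refGene.txt is the transcript accession
        yield "\t".join(fields + [mapping.get(accession, missing)])


# Two refLink rows copied from the example output, joined with tabs.
SAMPLE_REFLINK = (
    "#name\tproduct\tmrnaAcc\tprotAcc\tgeneName\tprodName\tlocusLinkId\tomimId\n"
    "PAX2\tpaired box 2\tNM_001282819\tNP_001269748\t133420\t110035\t102094402\t0\n"
    "MLPH\tmelanophilin\tNM_001282836\tNP_001269765\t65090\t92499\t102093006\t0\n"
)

# Shortened refGene-style lines; the second one is a made-up placeholder.
SAMPLE_REFGENE = (
    "138\tNM_016166\tchr15\t+\n"
    "0\tNM_001282819\tchrN\t+\n"
)

mapping = load_reflink(io.StringIO(SAMPLE_REFLINK))
for annotated_line in annotate_refgene(io.StringIO(SAMPLE_REFGENE), mapping):
    print(annotated_line)
```

For real data, pass open file handles instead of the StringIO samples, using whatever filename you entered in the table browser. NM_016166 gets the "NA" placeholder here only because the two sample refLink rows do not include it.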

 

modified 4.7 years ago • written 4.7 years ago by Endre Bakken Stovner

Great! Your answer helped me a lot!

Reply written 4.4 years ago by Li Hsing