Question: Conversion of full gene description to gene symbols
13 months ago by
shripathiacademics10 wrote:

Hi, I have searched Biostars but could not find an answer to my question. I am trying to convert 2000 gene descriptions (full gene names) to gene symbols (acronyms). Because I am working with a non-model species, Gene IDs did not help much. I also tried bioDBnet and DAVID, but they were not much help either. Are there any other online or offline tools I can use? Thanks in advance.

sequence gene • 774 views
modified 13 months ago by genomax75k • written 13 months ago by shripathiacademics10

Can you give some examples? And which organism is this?

written 13 months ago by Benn7.9k

For example, I have the full gene name "sulfatase 1"; the gene symbol for that is SULF1. I have 2000 such full names. The organism I am working on is salmon.

written 13 months ago by shripathiacademics10
13 months ago by
genomax75k
United States
genomax75k wrote:

You could use the NCBI Unix utilities (Entrez Direct). With the example you posted above (assuming you are working with Atlantic salmon):

$ esearch -db gene -query "sulfatase 1 [TITLE] AND Salmo salar [ORGN]" | esummary | grep -w "Name"
        <Name>sulf1</Name>

$ esearch -db gene -query "monooxygenase [TITLE] AND Salmo salar [ORGN]" | esummary | grep -w "Name"
        <Name>ywhaz</Name>
        <Name>coq6</Name>
        <Name>ywhah</Name>
        <Name>fmo5</Name>
        <Name>moxd1</Name>
        <Name>agmo</Name>
        <Name>coq6</Name>
        <Name>msmo1</Name>
        <Name>pam</Name>
        <Name>bcmo1</Name>

As long as the titles you have are specific, each should return a single name. Otherwise you may get more than one gene (example #2 above).
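The second example above returns several names, and coq6 appears twice. A minimal sketch for collapsing such output and flagging ambiguous titles is shown here; the function names unique_names and hit_status are made up for illustration, and the sed pattern assumes one <Name> element per line, as in the esummary output shown above. Entrez Direct's own xtract tool may be a more robust way to pull out the field.

```shell
# Read esummary XML on stdin and print each distinct <Name> value once.
unique_names() {
    sed -n 's:.*<Name>\(.*\)</Name>.*:\1:p' | sort -u
}

# Read esummary XML on stdin and report whether the title matched
# no gene, exactly one gene symbol, or several.
hit_status() {
    n=$(unique_names | wc -l)
    case $((n)) in
        0) echo none ;;
        1) echo unique ;;
        *) echo ambiguous ;;
    esac
}
```

Either function can be placed at the end of the pipeline instead of grep, for example `esearch ... | esummary | hit_status`.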

modified 13 months ago • written 13 months ago by genomax75k

Hi again, thanks for the suggestion. When I tried to run this in a loop over the list of 1900 genes, it did not work the way I wanted.

#!/bin/bash
cat /home/softwares/genelist.txt |
while read line
do
   esearch -db gene -query "$line [TITLE] AND Salmo salar [ORGN]" | esummary | grep -w "Name" 2>&1 | tee log-gene.txt
done

It gave me the gene symbol for the first line only. But if I remove the TITLE and ORGN information, it seems to work. What do you think is wrong here?

modified 13 months ago by Benn7.9k • written 13 months ago by shripathiacademics10

Please use code formatting for code; I have changed it for you this time.

written 13 months ago by Benn7.9k

If you provide additional examples I can take a look.

One thing to check: the example is for Atlantic salmon. If you are working with a different species, you need to replace the Latin name in the command. You should also add a sleep step in your loop so NCBI does not flag your IP. You will also want to sign up for an NCBI API key if you are running that many queries.
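A rough sketch of a throttled loop follows. The function names lookup_symbol and lookup_all, the default one-second delay, and the third argument that lets you swap in a different lookup command are all assumptions made for illustration. Entrez Direct reads an API key from the NCBI_API_KEY environment variable; the value in the comment is a placeholder.

```shell
# export NCBI_API_KEY="your_key_here"   # placeholder, use your own key

# Look up one description. The query mirrors the command used in this thread.
lookup_symbol() {
    esearch -db gene -query "$1 [TITLE] AND Salmo salar [ORGN]" < /dev/null |
        esummary | grep -w "Name"
}

# Run a lookup for every line of a list, pausing between requests.
# $1: list file, $2: delay in seconds (default 1),
# $3: lookup command (default lookup_symbol; this hook is illustrative).
lookup_all() {
    list=$1
    delay=${2:-1}
    cmd=${3:-lookup_symbol}
    while read -r name; do
        [ -n "$name" ] || continue      # skip blank lines
        printf '%s\n' "$name"
        "$cmd" "$name" < /dev/null      # keep the lookup away from the list on stdin
        sleep "$delay"
    done < "$list"
}
```

Calling `lookup_all genelist.txt 1 > log-gene.txt` would write every description and its hits to a single log file.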

modified 13 months ago • written 13 months ago by genomax75k

Thanks again.

Here is a small example:

solute carrier family 28 member 3 
ran-binding protein 3 
DNA polymerase zeta catalytic subunit 
dynein heavy chain 8, axonemal 
activin A receptor type 1 
regulator of nonsense transcripts 1 
target of EGR1, member 1 (nuclear) 
methyltransferase like 9 
natural resistance-associated macrophage protein 2 
lymphoid-restricted membrane protein 
transmembrane protein 39B 
spindlin-Z

Let me know if this list helps.

modified 13 months ago by RamRS25k • written 13 months ago by shripathiacademics10

Try the following:

$ while read -r i; do echo "$i"; esearch -db gene -query "$i [TITLE] AND Salmo salar [ORGN]" < /dev/null | esummary | grep -w "Name"; done < list
solute carrier family 28 member 3
        <Name>slc28a3</Name>
ran-binding protein 3
        <Name>ranb3</Name>
        <Name>LOC106600517</Name>
        <Name>LOC106573555</Name>
        <Name>LOC106569622</Name>
        <Name>LOC106568171</Name>
        <Name>LOC106563114</Name>
DNA polymerase zeta catalytic subunit
dynein heavy chain 8, axonemal
activin A receptor type 1
        <Name>acvr1</Name>
regulator of nonsense transcripts 1
        <Name>LOC106613627</Name>
        <Name>LOC106599024</Name>
        <Name>LOC106573910</Name>
        <Name>LOC106568944</Name>
target of EGR1, member 1 (nuclear)
methyltransferase like 9
        <Name>mettl9</Name>
natural resistance-associated macrophage protein 2
        <Name>LOC106583265</Name>
        <Name>LOC106572643</Name>
        <Name>LOC106567082</Name>
        <Name>LOC106565428</Name>
lymphoid-restricted membrane protein
        <Name>LOC106609163</Name>
        <Name>LOC106576138</Name>
transmembrane protein 39B
        <Name>tmem39b</Name>
spindlin-Z
        <Name>spinz</Name>

Thanks to @RamRS for the pointer, which was essential.
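The `< /dev/null` in the command above appears to be the key change: it suggests that esearch reads from standard input when input is available, which inside a `while read` loop means the rest of the list. The effect can be reproduced without any NCBI tools. In this sketch, count_iterations, greedy and polite are made-up names, and cat stands in for a program that reads standard input.

```shell
# Count how many times a `while read` loop iterates over a file when the
# loop body runs a given command. $1: file; remaining arguments: the command.
count_iterations() {
    file=$1
    shift
    n=0
    while IFS= read -r line; do
        "$@"
        n=$((n + 1))
    done < "$file"
    echo "$n"
}

# Stand-in for a program that reads whatever is on standard input.
greedy() { cat > /dev/null; }

# The same program with standard input detached, as with `< /dev/null` above.
polite() { cat < /dev/null > /dev/null; }
```

With a three-line file, `count_iterations list greedy` prints 1 because the first call swallows the remaining lines, while `count_iterations list polite` prints 3.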

written 13 months ago by genomax75k

Addendum (also something I learned today, thank you @GenoMax):

while IFS= read line
do
    # commands here
done < in_file

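attaches the file to the standard input of the whole loop: read takes one line per iteration, but any command in the #commands here part that also reads standard input can consume all the remaining lines, whereas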

for line in $(cat in_file)
do
    # commands here
done

makes the shell split the file contents on whitespace and run the #commands here part once per chunk, without attaching the file to the standard input of those commands.

If you need the for loop to split only on newlines, use:

OLD_IFS=$IFS
IFS=$'\n'
for line in $(cat in_file)
do
     #commands here
done
IFS=$OLD_IFS
unset OLD_IFS

If you don't wish to use a temporary variable to store $IFS, you can check out other options here: https://unix.stackexchange.com/a/92190/135331
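One possible alternative to the temporary variable, sketched under the assumption that the loop does not need to set variables for the caller, is to run the loop in a subshell so the IFS change never leaves it. The function names count_words and count_lines are made up; they only count iterations so the difference in splitting is visible.

```shell
# Count loop iterations with the default splitting on spaces, tabs and newlines.
count_words() {
    n=0
    for item in $(cat "$1"); do n=$((n + 1)); done
    echo "$n"
}

# Count loop iterations when splitting only on newlines. The parentheses
# around the body run it in a subshell, so IFS is unchanged for the caller.
count_lines() (
    IFS='
'
    set -f      # turn off filename expansion for the unquoted $(cat ...)
    n=0
    for item in $(cat "$1"); do n=$((n + 1)); done
    echo "$n"
)
```

For a file containing the two lines "sulfatase 1" and "spindlin-Z", count_words prints 3 and count_lines prints 2.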

modified 13 months ago • written 13 months ago by RamRS25k
13 months ago by
Anima Mundi2.5k
Italy
Anima Mundi2.5k wrote:

Hello, in your shoes I probably would:

a) retrieve all NCBI sequences for your taxon in GenBank format

b) search for each gene description in your GenBank database (e.g. in the "Official Full Name" field)

c) fetch the corresponding gene symbol (i.e. in the "Official Symbol" field)
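Steps b) and c) could be sketched as a local table lookup. Note that this swaps GenBank-format records for a tab-separated table, assuming a layout like NCBI's gene_info file with the symbol in column 3 and the full description in column 9; check the header line of whatever table you download before relying on those column numbers. The function name symbols_from_table is made up for illustration.

```shell
# Print "description<TAB>symbol" for every table row whose description
# matches a line of the names list (case-insensitive, exact match).
# $1: list of gene descriptions, one per line (must not be empty)
# $2: tab-separated table, symbol in column 3, description in column 9
symbols_from_table() {
    awk -F'\t' '
        # First file: remember each wanted description, trimmed and lowercased.
        NR == FNR { sub(/[ \t\r]+$/, ""); wanted[tolower($0)] = 1; next }
        # Second file: print the rows whose description was requested.
        (tolower($9) in wanted) { print $9 "\t" $3 }
    ' "$1" "$2"
}
```

Descriptions that match several rows are printed once per row, so ambiguous names remain visible in the output.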

Hope this helps!

modified 13 months ago • written 13 months ago by Anima Mundi2.5k