Question

Conversion of full gene description to gene symbols

2

5.5 years ago

shripathiacademics ▴ 30

Hi, I have gone through the Biostars site but could not find an answer to my question. I am trying to convert 2000 gene descriptions (full gene names) to gene symbols (acronyms). As I am working with a non-model species, Gene IDs did not help much. I also tried bioDBnet and DAVID, but they were not much help either. Are there any other online or offline tools I can use? Thanks in advance.

gene sequence • 4.3k views

ADD COMMENT • link updated 5.5 years ago by GenoMax 141k • written 5.5 years ago by shripathiacademics ▴ 30

0

Can you give some examples? And which organism is this?

ADD REPLY • link 5.5 years ago by Benn 8.3k

0

For example, I have a full gene name, "sulfatase 1", whose gene symbol is SULF1. I have 2000 such full names. The organism I am working on is salmon.

ADD REPLY • link 5.5 years ago by shripathiacademics ▴ 30

Benn · Answer 1 · 2018-10-23

2

5.5 years ago

GenoMax 141k

You could use NCBI's EDirect unix utilities. With the example you posted above (assuming you are working with Atlantic salmon):

$ esearch -db gene -query "sulfatase 1 [TITLE] AND Salmo salar [ORGN]" | esummary | grep -w "Name"
        <Name>sulf1</Name>

$ esearch -db gene -query "monooxygenase [TITLE] AND Salmo salar [ORGN]" | esummary | grep -w "Name"
        <Name>ywhaz</Name>
        <Name>coq6</Name>
        <Name>ywhah</Name>
        <Name>fmo5</Name>
        <Name>moxd1</Name>
        <Name>agmo</Name>
        <Name>coq6</Name>
        <Name>msmo1</Name>
        <Name>pam</Name>
        <Name>bcmo1</Name>

As long as the titles you have are specific they should result in a single name. Otherwise you may get more than one gene (example #2 above).
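
A minimal sketch of one way to post-process that grep output: it strips the XML tags and reports how many distinct symbols a query returned, so ambiguous titles are easy to spot. The function name summarize_names is made up for illustration, and the sample input is copied from the output above rather than fetched live.

```shell
# Reads esummary-style <Name> lines on stdin and prints:
#   <number of distinct symbols><TAB><comma-separated symbols>
# summarize_names is a hypothetical helper, not part of EDirect.
summarize_names() {
    sed -n 's:.*<Name>\(.*\)</Name>.*:\1:p' |   # keep only the tag contents
        LC_ALL=C sort -u |                       # collapse repeated symbols
        awk '{ names = names (NR > 1 ? "," : "") $0 }
             END { print NR "\t" names }'
}

# Offline demo with three lines taken from example #2 above.
printf '        <Name>coq6</Name>\n        <Name>agmo</Name>\n        <Name>coq6</Name>\n' |
    summarize_names
# prints: 2<TAB>agmo,coq6
```

In a live run the function would sit at the end of the esearch | esummary pipeline in place of grep. Note that sort -u merges repeated symbols such as coq6, which may hide distinct gene records that share a name; drop it if you need to see every record.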

ADD COMMENT • link 5.5 years ago by GenoMax 141k

0

Hi again, thanks for the suggestion. When I tried to run this in a loop over my list of 1900 genes, it did not work the way I wanted.

#!/bin/bash
cat /home/softwares/genelist.txt |
while read line
do
   esearch -db gene -query "$line [TITLE] AND Salmo salar [ORGN]" | esummary | grep -w "Name" 2>&1 | tee log-gene.txt
done

It gave me the gene symbol for the first line only. But if I remove the TITLE and ORGN qualifiers, it seems to work. What do you think is wrong here?

ADD REPLY • link updated 5.5 years ago by Benn 8.3k • written 5.5 years ago by shripathiacademics ▴ 30

1

Please use code formatting for code; I have changed it for you this time.

ADD REPLY • link 5.5 years ago by Benn 8.3k

0

If you provide additional examples I can take a look.

Things to check: the example is for Atlantic salmon. If you are working with a different species, you need to replace the Latin name in the command accordingly. You should also add a sleep step to your loop so NCBI does not flag your IP. You will also want to sign up for an NCBI API key if you are running that many queries.
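
A rough sketch of what a loop with a sleep step could look like. The function names lookup_symbol and throttled_lookup, the one-second default pause, and the argument order are all assumptions made for illustration; only the esearch | esummary | grep pipeline comes from this thread.

```shell
# Look up one gene description with EDirect (pipeline taken from this thread).
# stdin is detached so esearch cannot read from the list the loop is using.
lookup_symbol() {
    esearch -db gene -query "$1 [TITLE] AND Salmo salar [ORGN]" < /dev/null |
        esummary | grep -w "Name"
}

# throttled_lookup LIST [DELAY] [COMMAND]   (hypothetical helper)
#   LIST    file with one gene description per line
#   DELAY   seconds to pause after each query (assumed default: 1)
#   COMMAND command run for each line (default: lookup_symbol)
throttled_lookup() {
    while IFS= read -r line; do
        [ -n "$line" ] || continue          # skip empty lines
        printf '%s\n' "$line"               # echo the description first
        "${3:-lookup_symbol}" "$line" < /dev/null
        sleep "${2:-1}"                     # be gentle with NCBI servers
    done < "$1"
}
```

It could then be called as throttled_lookup genelist.txt 1 > log-gene.txt. As far as I know, EDirect picks up an API key from the NCBI_API_KEY environment variable, so exporting that variable before the call should be enough; check the EDirect documentation to confirm.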

ADD REPLY • link 5.5 years ago by GenoMax 141k

0

Thanks again.

Here is a small example:

solute carrier family 28 member 3 
ran-binding protein 3 
DNA polymerase zeta catalytic subunit 
dynein heavy chain 8, axonemal 
activin A receptor type 1 
regulator of nonsense transcripts 1 
target of EGR1, member 1 (nuclear) 
methyltransferase like 9 
natural resistance-associated macrophage protein 2 
lymphoid-restricted membrane protein 
transmembrane protein 39B 
spindlin-Z

let me know if this list helps?

ADD REPLY • link updated 5.5 years ago by Ram 43k • written 5.5 years ago by shripathiacademics ▴ 30

2

Try the following:

$ while read i; do echo $i; esearch -db gene -query "$i [TITLE] AND Salmo salar [ORGN]" < /dev/null | esummary | grep -w "Name"; done < list
solute carrier family 28 member 3
        <Name>slc28a3</Name>
ran-binding protein 3
        <Name>ranb3</Name>
        <Name>LOC106600517</Name>
        <Name>LOC106573555</Name>
        <Name>LOC106569622</Name>
        <Name>LOC106568171</Name>
        <Name>LOC106563114</Name>
DNA polymerase zeta catalytic subunit
dynein heavy chain 8, axonemal
activin A receptor type 1
        <Name>acvr1</Name>
regulator of nonsense transcripts 1
        <Name>LOC106613627</Name>
        <Name>LOC106599024</Name>
        <Name>LOC106573910</Name>
        <Name>LOC106568944</Name>
target of EGR1, member 1 (nuclear)
methyltransferase like 9
        <Name>mettl9</Name>
natural resistance-associated macrophage protein 2
        <Name>LOC106583265</Name>
        <Name>LOC106572643</Name>
        <Name>LOC106567082</Name>
        <Name>LOC106565428</Name>
lymphoid-restricted membrane protein
        <Name>LOC106609163</Name>
        <Name>LOC106576138</Name>
transmembrane protein 39B
        <Name>tmem39b</Name>
spindlin-Z
        <Name>spinz</Name>

Thanks to @RamRS for the pointer that was essential.
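
The key difference from the earlier loop appears to be the < /dev/null after esearch: a command inside a while read loop shares the loop's standard input, so if it reads stdin it can swallow the rest of the list. The sketch below reproduces that effect offline; greedy is a made-up stand-in for any stdin-reading command, not a real tool.

```shell
list=$(mktemp)
printf 'sulfatase 1\nmethyltransferase like 9\nspindlin-Z\n' > "$list"

# Hypothetical stand-in for a command that reads all of standard input.
greedy() { cat > /dev/null; }

# Without a redirect, greedy drains the list during the first iteration,
# so read finds nothing left and the loop stops after one line.
broken=$(while read -r i; do echo "$i"; greedy; done < "$list")

# With < /dev/null, greedy sees an empty stdin and the loop reads every line.
fixed=$(while read -r i; do echo "$i"; greedy < /dev/null; done < "$list")

printf 'without redirect: %s line(s)\n' "$(printf '%s\n' "$broken" | wc -l | tr -d ' ')"
printf 'with redirect:    %s line(s)\n' "$(printf '%s\n' "$fixed" | wc -l | tr -d ' ')"
rm -f "$list"
```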

ADD REPLY • link 5.5 years ago by GenoMax 141k

2

Addendum (also something I learned today, thank you @GenoMax):

while IFS= read line
do
    # commands here
done < in_file

hands the whole file to the loop as standard input, so a command inside the loop (the # commands here part) that reads standard input can consume all the remaining lines, whereas

for line in $(cat in_file)
do
    # commands here
done

makes the shell split the file's contents on whitespace and feeds them word by word to the loop body (the # commands here part), leaving standard input untouched.
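
A small illustration of the difference, using a throwaway two-line file (the file contents and counter names are only for demonstration):

```shell
in_file=$(mktemp)
printf 'sulfatase 1\nspindlin-Z\n' > "$in_file"

# while read: one iteration per line
n_while=0
while IFS= read -r line; do n_while=$((n_while + 1)); done < "$in_file"

# for over $(cat ...): one iteration per whitespace-separated word
n_for=0
for word in $(cat "$in_file"); do n_for=$((n_for + 1)); done

echo "while read: $n_while iterations"
echo "for cat:    $n_for iterations"
rm -f "$in_file"
```

Here the while loop runs twice, while the for loop runs three times because "sulfatase 1" is split into two words.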

If you need the for loop to split only on newlines, use:

OLD_IFS=$IFS
IFS=$'\n'
for line in $(cat in_file)
do
     #commands here
done
IFS=$OLD_IFS
unset OLD_IFS

If you don't wish to use a temporary variable to store $IFS, you can check out other options here: https://unix.stackexchange.com/a/92190/135331
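
One such option, sketched here under the assumption that the loop's results can be captured as text: run the loop inside a command substitution. That is a subshell, so the changed IFS never reaches the calling shell and nothing needs to be saved or restored.

```shell
in_file=$(mktemp)
printf 'sulfatase 1\nactivin A receptor type 1\n' > "$in_file"

# $( ... ) runs in a subshell: the IFS change and set -f (which turns off
# glob expansion of the split words) vanish when the subshell exits.
out=$(
    IFS='
'
    set -f
    for line in $(cat "$in_file"); do
        printf '[%s]\n' "$line"
    done
)
printf '%s\n' "$out"
rm -f "$in_file"
```

The trade-off is that variables assigned inside the subshell are also lost, so this only suits loops whose useful result is their output.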

ADD REPLY • link 5.5 years ago by Ram 43k

score 0 · Answer 2 · 2018-10-23

0

5.5 years ago

Anima Mundi ★ 2.9k

Hello, in your shoes I probably would:

a) retrieve all NCBI sequences for your taxon in GenBank format

b) search for each gene description in your GenBank database (e.g. in the "Official Full Name" field)

c) fetch the corresponding gene symbol (i.e. in the "Official Symbol" field)
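
Steps b) and c) could be sketched as a local table lookup, assuming the downloaded records have first been reduced to a two-column tab-separated file (symbol, then full name). That layout, the function name symbols_from_table, and the NA placeholder are assumptions for illustration; the sample rows reuse names and symbols that appear elsewhere in this thread.

```shell
# symbols_from_table TABLE LIST   (hypothetical helper)
#   TABLE: tab-separated, symbol in column 1, full name in column 2 (assumed layout)
#   LIST:  one gene description per line
symbols_from_table() {
    awk -F'\t' '
        NR == FNR {                      # first file: build name -> symbol map
            k = tolower($2)
            if (k in sym) sym[k] = sym[k] "," $1
            else sym[k] = $1
            next
        }
        {                                # second file: look each description up
            name = $0
            sub(/[ \t]+$/, "", name)     # drop trailing whitespace
            k = tolower(name)
            print name "\t" ((k in sym) ? sym[k] : "NA")
        }
    ' "$1" "$2"
}

# Tiny sample inputs, for illustration only.
tmp=$(mktemp -d)
printf 'sulf1\tsulfatase 1\nacvr1\tactivin A receptor type 1\n' > "$tmp/table.tsv"
printf 'Sulfatase 1 \nspindlin-Z\n' > "$tmp/list.txt"
result=$(symbols_from_table "$tmp/table.tsv" "$tmp/list.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Matching is exact apart from case and trailing whitespace, so descriptions that differ in punctuation or wording from the reference names would come back as NA and need manual review.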

Hope this helps!

ADD COMMENT • link 5.5 years ago by Anima Mundi ★ 2.9k