How to get protein ID from gene ID (batch entrez)
7.0 years ago
alansoffan • 0

Hi

Can someone suggest how to get protein IDs from gene IDs (Batch Entrez)?

I have hundreds of gene names like AaeL_AAEL004207, with gene ID 5564359. We can look up the protein IDs manually, one by one, but with hundreds of genes that is obviously not a good idea. Can anyone suggest a better way?

Thanks

gene • 5.0k views

Thanks a lot for the suggestions. I haven't tried that yet, but hopefully it will work.

7.0 years ago
5heikki 10k

With Entrez Direct:

epost -db gene -id 5564359 | elink -target protein | efetch -format uid
157105044


You can include multiple gene IDs (at least 500) in the -id part, separated by commas. Here's a script:

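#!/bin/bash
# Check that Entrez Direct is installed (epost stands in for the whole toolkit).
exist=$(which epost)
if [ $? -ne 0 ]
then
    echo "Entrez Direct not in \$PATH"
    exit 1
fi

if [ -n "$1" ]
then
    # Split the input into chunks of 500 IDs: input.aa, input.ab, ...
    split -l 500 "$1" input.

    for f in input.*
    do
        # Join the IDs of this chunk with commas (no trailing comma).
        ids=$(paste -sd, "$f")
        epost -db gene -id "$ids" | elink -target protein | efetch -format uid > "$f.output"
        # paste pairs lines by position, so this assumes one protein UID per gene ID, in input order.
        paste "$f" "$f.output" > "$f.result"
        rm "$f" "$f.output"
    done
    cat input.*.result > "$1.output"
    rm input.*.result
else
    printf 'Usage: sh convertGeneIDs listOfGeneIDs\nOutput: geneID\tproteinID\n'
fi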
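
One thing to watch with the script above: paste lines up gene IDs and protein UIDs purely by position, so the output is only right if every gene returns exactly one protein, in the same order. A gene can be linked to several proteins, or to none. Here is a minimal sketch that does one lookup per gene instead, using the same epost | elink | efetch pipeline. It is slower than batching, and the function names and the "NA" placeholder are my own choices for illustration, not part of Entrez Direct.

```shell
#!/bin/bash
# Sketch: one lookup per gene ID, so every output row stays tied to its gene.

# Print "geneID<TAB>proteinUID" rows for a single gene ID.
# A gene with no linked protein gets "NA" (placeholder chosen for this sketch).
gene_to_proteins() {
    local gene="$1" uids uid
    # </dev/null keeps epost from consuming the caller's stdin inside a read loop.
    uids=$(epost -db gene -id "$gene" </dev/null | elink -target protein | efetch -format uid)
    if [ -z "$uids" ]; then
        printf '%s\tNA\n' "$gene"
        return 0
    fi
    for uid in $uids; do
        printf '%s\t%s\n' "$gene" "$uid"
    done
}

# Read gene IDs (one per line) from the file given as $1 and convert each in turn.
convert_gene_list() {
    local gene
    while IFS= read -r gene; do
        if [ -n "$gene" ]; then
            gene_to_proteins "$gene"
        fi
    done < "$1"
}
```

You would call it as convert_gene_list listOfGeneIDs > listOfGeneIDs.output. For hundreds of IDs this makes hundreds of requests, so it trades speed for a mapping you can trust.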

I was puzzled by the if [ -n "$1" ] line, which turns out to mean "if non-empty string".

Non-empty first argument ;)

Hi Heikki, thank you for writing this script! May I ask you for more details about it? Thank you very much! Best wishes, Xia

This was 7 years ago, I certainly wouldn't write it the same way now. Anyway, sure, ask away.

Thank you very much, Heikki. I have a large CSV file containing protein IDs from 30 samples, with counts for each protein ID in the individual samples. I would like to use Entrez Direct to search each protein ID for the specific bacterial species. My supervisor mentioned that she adapted a script she found online and gave it to me to use. Then I found your script on Biostar. I am new to this field, so my questions may seem very silly to you. I hope you won't mind.

exist=$(which epost): Shall I define the CSV file instead of using which epost?

for f in input.*: Shall I define the input file?

Many thanks again, Heikki. If it is possible, may I contact you by email? My email address is x.yu2@leeds.ac.uk.

Best wishes, Xia

exist=$(which epost) checks that the entrez tools have been installed (well epost, but here it's assumed that if epost is in $PATH, then so is everything else that the script needs)
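
If you would rather not rely on that assumption, here is a small sketch that checks every tool the script calls. The function name require_tools is made up for this example; command -v is the standard shell way to test whether a command can be found.

```shell
#!/bin/bash
# Sketch: return non-zero if any of the named tools is missing from $PATH.
require_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "Missing from \$PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

At the top of the script this would be used as: require_tools epost elink efetch || exit 1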

As for defining the input file: you just save the script to a file, e.g. convertGeneIds, make it executable with chmod +x convertGeneIds, and then run it as ./convertGeneIds inputFile

The script assumes that the input file is a list of ids, one on each line, no commas or anything like that
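
A quick way to check that before running anything is sketched below. It assumes the IDs are purely numeric NCBI Gene IDs such as 5564359; the function name check_id_list is made up for this example, and a list of other identifiers would need a different pattern.

```shell
#!/bin/bash
# Sketch: print every line that is not a bare numeric ID, prefixed with its line number.
# Returns 0 if the file is clean, 1 if offending lines were found, 2 if it cannot be read.
check_id_list() {
    if [ ! -r "$1" ]; then
        echo "Cannot read: $1" >&2
        return 2
    fi
    # -v inverts the match, -n adds line numbers, -E enables the extended pattern.
    if grep -nvE '^[0-9]+$' "$1"; then
        return 1
    fi
    return 0
}
```

Lines with commas, extra columns, blank lines, or Windows line endings all show up in the output, which usually explains a failed run quickly.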

Thank you very much for your detailed reply, Heikki. I understand better now. I ran the script, but it reported "command not found" for each ID.

The first column of our input file is a list of IDs, followed by sample names, delimited by commas, as shown below:

The numbers in the input files are counts for each ID in individual samples. The script used is:

Could you give me some suggestions to improve it? Many thanks again.

You need to isolate just the IDs. The script expects only IDs to be present, nothing else. You can do the following to extract the IDs from column 1:

cut -d "," -f1 yourfile > new_file

Then use the new_file with @5heikki's script as input.

Note: Don't post screenshots of code in comments. Always copy and paste the actual code.
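
The cut command above keeps everything in column 1, including a header cell if the file has one. Here is a slightly more defensive sketch. It assumes the first row is a header, that no field contains a quoted comma, and that duplicate IDs should be dropped; extract_ids is a made-up name for this example.

```shell
#!/bin/bash
# Sketch: extract the unique first-column IDs from a comma-separated file.
extract_ids() {
    tail -n +2 "$1" |            # skip the header row (assumed to be present)
        tr -d '\r' |             # drop Windows carriage returns, if any
        cut -d ',' -f1 |         # keep only the first column
        awk 'NF && !seen[$0]++'  # drop blank lines and repeated IDs, keep input order
}
```

Usage would be extract_ids yourfile > new_file, and new_file then goes to the script as before.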

Thank you very much for your suggestions!