Question

Extracting Gene/Protein Name Using Gi Number

0

Entering edit mode

11.6 years ago

bioinfo ▴ 840

I have a vmatch output below. Here, column 1 = illumina read id, column 2 = gene id matched in database. Now I want to get gene/protein name (e.g. tetR) using the gi number (e.g. 273547854) in column 2 and create another column. I know I can grab fasta sequences using -fastacmd but was wondering how to grab gene name easily after reading line by line. One complicated way I can do it by getting the fasta sequences then grab the gene/protein name from the description line but looking for other easy solutions. Thanks.

HWI-ST1018:3:1103:19370:10307#0/1       gi|273547854|gb|ACZ98166.1|             8.85e-122       132     100.00  33
HWI-ST1018:3:1103:11337:72779#0/1       gi|359330039|emb|CCE72361.1|            8.85e-122       132     100.00  33
HWI-ST1018:3:1104:2450:59529#0/1        gi|60280761|gb|AAX18265.1|              8.85e-122       132     100.00  33
HWI-ST1018:3:1105:19974:14323#0/1       gi|290510942|ref|ZP                     4.17e-110       120     100.00  30

• 5.7k views

ADD COMMENT • link updated 11.6 years ago by arnstrm ★ 1.9k • written 11.6 years ago by bioinfo ▴ 840

score 2 · Answer 1 · 2013-04-22

2

Entering edit mode

11.6 years ago

arnstrm ★ 1.9k

If you need the gene name (that is not present in the original file) then you need to fetch it from other database. I would use 'blastdbcmd' (previously fastacmd) using 'target_only' and '%t' (title only) options. Later you can pipe through awk/cut to extract the gene name from the title and then paste it to the existing file (as mentioned in previous answers).

blastdbcmd -entry_batch ID_file.txt -db DBname -target_only -outfmt %t

ADD COMMENT • link 11.6 years ago by arnstrm ★ 1.9k

0

Entering edit mode

It worked when I tried to get the gene names from nr database. I just tried to use my own protein database instead of nr but I got an error (BLAST Database error: No alias or index file found for nucleotide database). I'm not sure how to build an index of my protein database that can be used in blastdbcmd function though i already have .phr, .phn, .prj etc. kinds of indexed files of the database in the same directory created by vmatch. I am a bit suspicious about the next step as the error message says nucleotide database but here I want to deal with indexed protein database. Any suggestions?

ADD REPLY • link 11.6 years ago by bioinfo ▴ 840

1

Entering edit mode

If you're using own sequences, make sure you follow the gene name format as specified by NCBI (eg header line: >lcl|unique_identifier gene description). While formatting, make sure you have the parse_seqids option enabled. I normally use this: makeblastdb -in seq_file.fasta -dbtype prot -parse_seqids -title DBname -out Dbname and it works correctly with blastdbcmd.

ADD REPLY • link 11.6 years ago by arnstrm ★ 1.9k

0

Entering edit mode

This is great. It works. Thanks

ADD REPLY • link 11.6 years ago by bioinfo ▴ 840

score 0 · Answer 2 · 2013-04-22

try this to get the gi number:

 > perl -wn -e 'print "$1\n" if m{[|](.*?)[|]};' input.txt > id.txt
 273547854
 359330039
 60280761
 290510942

edit: To append this column to the existing file:

> paste -d" " input.txt id.txt 
HWI-ST1018:3:1103:19370:10307#0/1       gi|273547854|gb|ACZ98166.1|             8.85e-122       132     100.00  33 273547854
HWI-ST1018:3:1103:11337:72779#0/1       gi|359330039|emb|CCE72361.1|            8.85e-122       132     100.00  33 359330039
HWI-ST1018:3:1104:2450:59529#0/1        gi|60280761|gb|AAX18265.1|              8.85e-122       132     100.00  33 60280761
HWI-ST1018:3:1105:19974:14323#0/1       gi|290510942|ref|ZP                     4.17e-110       120     100.00  30 290510942

score 0 · Answer 3 · 2013-04-22

using 'sed' and the substitution groups

$ cat input.txt | sed 's/^\(.*\tgi|\)\([1-9]*\)\(\|.*\)/\1\2\3\t\2/'

example:

$ echo "HWI-ST1018:3:1103:19370:10307#0/1        gi|273547854|gb|ACZ98166.1|     8.85e-122       132     100.00  33" | sed 's/^\(.*\tgi|\)\([1-9]*\)\(\|.*\)/\1\2\3\t\2
HWI-ST1018:3:1103:19370:10307#0/1    gi|273547854|gb|ACZ98166.1|    8.85e-122    132    100.00  33    273547854

or awk if the lines always look the same:

$ echo "HWI-ST1018:3:1103:19370:10307#0/1        gi|273547854|gb|ACZ98166.1|     8.85e-122       132     100.00  33" | awk -F '|'   '{print $0"\t",$2; }'
HWI-ST1018:3:1103:19370:10307#0/1    gi|273547854|gb|ACZ98166.1|    8.85e-122    132    100.00  33     273547854