Extracting Gene/Protein Name Using Gi Number
3
0
Entering edit mode
11.0 years ago
bioinfo ▴ 830

I have a vmatch output below. Here, column 1 = illumina read id, column 2 = gene id matched in database. Now I want to get gene/protein name (e.g. tetR) using the gi number (e.g. 273547854) in column 2 and create another column. I know I can grab fasta sequences using -fastacmd but was wondering how to grab gene name easily after reading line by line. One complicated way I can do it by getting the fasta sequences then grab the gene/protein name from the description line but looking for other easy solutions. Thanks.

HWI-ST1018:3:1103:19370:10307#0/1       gi|273547854|gb|ACZ98166.1|             8.85e-122       132     100.00  33
HWI-ST1018:3:1103:11337:72779#0/1       gi|359330039|emb|CCE72361.1|            8.85e-122       132     100.00  33
HWI-ST1018:3:1104:2450:59529#0/1        gi|60280761|gb|AAX18265.1|              8.85e-122       132     100.00  33
HWI-ST1018:3:1105:19974:14323#0/1       gi|290510942|ref|ZP                     4.17e-110       120     100.00  30
• 5.5k views
ADD COMMENT
2
Entering edit mode
11.0 years ago
arnstrm ★ 1.8k

If you need the gene name (that is not present in the original file) then you need to fetch it from other database. I would use 'blastdbcmd' (previously fastacmd) using 'target_only' and '%t' (title only) options. Later you can pipe through awk/cut to extract the gene name from the title and then paste it to the existing file (as mentioned in previous answers).

blastdbcmd -entry_batch ID_file.txt -db DBname -target_only -outfmt %t
ADD COMMENT
0
Entering edit mode

It worked when I tried to get the gene names from nr database. I just tried to use my own protein database instead of nr but I got an error (BLAST Database error: No alias or index file found for nucleotide database). I'm not sure how to build an index of my protein database that can be used in blastdbcmd function though i already have .phr, .phn, .prj etc. kinds of indexed files of the database in the same directory created by vmatch. I am a bit suspicious about the next step as the error message says nucleotide database but here I want to deal with indexed protein database. Any suggestions?

ADD REPLY
1
Entering edit mode

If you're using own sequences, make sure you follow the gene name format as specified by NCBI (eg header line: >lcl|unique_identifier gene description). While formatting, make sure you have the parse_seqids option enabled. I normally use this: makeblastdb -in seq_file.fasta -dbtype prot -parse_seqids -title DBname -out Dbname and it works correctly with blastdbcmd.

ADD REPLY
0
Entering edit mode

This is great. It works. Thanks

ADD REPLY
0
Entering edit mode
11.0 years ago
munch ▴ 310

try this to get the gi number:

 > perl -wn -e 'print "$1\n" if m{[|](.*?)[|]};' input.txt > id.txt
 273547854
 359330039
 60280761
 290510942

edit: To append this column to the existing file:

> paste -d" " input.txt id.txt 
HWI-ST1018:3:1103:19370:10307#0/1       gi|273547854|gb|ACZ98166.1|             8.85e-122       132     100.00  33 273547854
HWI-ST1018:3:1103:11337:72779#0/1       gi|359330039|emb|CCE72361.1|            8.85e-122       132     100.00  33 359330039
HWI-ST1018:3:1104:2450:59529#0/1        gi|60280761|gb|AAX18265.1|              8.85e-122       132     100.00  33 60280761
HWI-ST1018:3:1105:19974:14323#0/1       gi|290510942|ref|ZP                     4.17e-110       120     100.00  30 290510942
ADD COMMENT
0
Entering edit mode

I am actually planning to grab the gene/protein names using these gi numbers and make a new column.

ADD REPLY
0
Entering edit mode
11.0 years ago

using 'sed' and the substitution groups

$ cat input.txt | sed 's/^\(.*\tgi|\)\([1-9]*\)\(\|.*\)/\1\2\3\t\2/'

example:

$ echo "HWI-ST1018:3:1103:19370:10307#0/1        gi|273547854|gb|ACZ98166.1|     8.85e-122       132     100.00  33" | sed 's/^\(.*\tgi|\)\([1-9]*\)\(\|.*\)/\1\2\3\t\2
HWI-ST1018:3:1103:19370:10307#0/1    gi|273547854|gb|ACZ98166.1|    8.85e-122    132    100.00  33    273547854

or awk if the lines always look the same:

$ echo "HWI-ST1018:3:1103:19370:10307#0/1        gi|273547854|gb|ACZ98166.1|     8.85e-122       132     100.00  33" | awk -F '|'   '{print $0"\t",$2; }'
HWI-ST1018:3:1103:19370:10307#0/1    gi|273547854|gb|ACZ98166.1|    8.85e-122    132    100.00  33     273547854
ADD COMMENT
0
Entering edit mode

Pierre & munch both are right to make a new coulmn with gi number but as I want gene name using the gi, dont you think that if I match that new gi column to a new txt file containing all the gene name with corresponding gi and extract the gene name from the new file and create another column with gene name?

ADD REPLY

Login before adding your answer.

Traffic: 2276 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6