I have a vmatch output below. Here, column 1 = illumina read id, column 2 = gene id matched in database. Now I want to get gene/protein name (e.g. tetR) using the gi number (e.g. 273547854) in column 2 and create another column. I know I can grab fasta sequences using -fastacmd
but was wondering how to grab gene name easily after reading line by line. One complicated way I can do it by getting the fasta sequences then grab the gene/protein name from the description line but looking for other easy solutions. Thanks.
HWI-ST1018:3:1103:19370:10307#0/1 gi|273547854|gb|ACZ98166.1| 8.85e-122 132 100.00 33
HWI-ST1018:3:1103:11337:72779#0/1 gi|359330039|emb|CCE72361.1| 8.85e-122 132 100.00 33
HWI-ST1018:3:1104:2450:59529#0/1 gi|60280761|gb|AAX18265.1| 8.85e-122 132 100.00 33
HWI-ST1018:3:1105:19974:14323#0/1 gi|290510942|ref|ZP 4.17e-110 120 100.00 30
It worked when I tried to get the gene names from nr database. I just tried to use my own protein database instead of nr but I got an error (BLAST Database error: No alias or index file found for nucleotide database). I'm not sure how to build an index of my protein database that can be used in
blastdbcmd
function though i already have .phr, .phn, .prj etc. kinds of indexed files of the database in the same directory created by vmatch. I am a bit suspicious about the next step as the error message says nucleotide database but here I want to deal with indexed protein database. Any suggestions?If you're using own sequences, make sure you follow the gene name format as specified by NCBI (eg header line:
>lcl|unique_identifier gene description
). While formatting, make sure you have the parse_seqids option enabled. I normally use this:makeblastdb -in seq_file.fasta -dbtype prot -parse_seqids -title DBname -out Dbname
and it works correctly withblastdbcmd
.This is great. It works. Thanks