I have to extract organism and host name from an NCBI .gp file, but my grep command produces no output
3.6 years ago
oncelmiran6 ▴ 10

Hello everyone,

I have to extract the accession number and the related host name from a .gp file that contains the metadata for a FASTA file. I tried this command, but it outputs nothing.

 grep -E 'VERSION.*/host' NCBITaxonomy_LassaVirus_\ 71720.gp -> miran.txt

The accession number comes after VERSION and the host name comes after /host. I couldn't figure out what the problem is. I would be very happy if someone could help. Thank you so much.

fasta NCBI grep gp meta data • 1.0k views

Please post a few lines from the input file.


Hello Sir, thank you for your response. This is a brief excerpt from my input file:

VERSION     QKI86563.1

                 /organism="Lassa mammarenavirus"
                 /strain="Yar-217"
                 /isolation_source="blood"
                 /host="Mastomys natalensis"
                 /db_xref="taxon:11620"
                 /country="Guinea: Yarawalia

This is a GenBank file, yes? If so, the accession and organism information will likely be in the header lines.

You also haven't told us why your grep command fails - in what way is it incorrect?

3.6 years ago
JC 13k

The command prints nothing because grep matches line by line, and VERSION and /host never appear on the same line. An alternative is to use two regexes, for example in Perl:

$ perl -lane 'print $1 if(/VERSION\s+(.+)/ or /host="(.+)"/)' < file.gbk
QKI86563.1
Mastomys natalensis
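
If you would rather stay with grep, the same line-by-line idea should work by matching either pattern instead of both on one line. This is only a minimal sketch: the function name `extract_lines` is made up for illustration, and the input is a shortened copy of the lines posted in the question.

```shell
# Hypothetical helper: keep any line that starts with VERSION or contains /host=.
# The alternation (|) means "either pattern", so the two fields no longer
# have to sit on the same line.
extract_lines() {
    grep -E '^VERSION|/host=' "$@"
}

# Demo on a shortened copy of the sample from the question.
extract_lines <<'EOF'
VERSION     QKI86563.1
                 /organism="Lassa mammarenavirus"
                 /host="Mastomys natalensis"
EOF
```

Unlike the Perl one-liner, this prints the whole matching lines (keyword, padding and quotes included), so some extra trimming would still be needed to get the bare values.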

@JC The OP is using a GenPept file, which should have a similar format. The /host entry must be specific to the OP's file. For example, take the header of this record:

LOCUS       AAA40590                 109 aa            linear   ROD 27-APR-1993
DEFINITION  insulin [Octodon degus].
ACCESSION   AAA40590
VERSION     AAA40590.1
DBSOURCE    locus OCOINS accession M57671.1
KEYWORDS    .
SOURCE      Octodon degus (degu)
  ORGANISM  Octodon degus

Your code produces:

$ perl -lane 'print $1 if(/VERSION\s+(.+)/ or /host="(.+)"/)' < sequence.gp
AAA40590.1
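
Because some records have no /host line, printing the two fields independently makes it hard to tell which host belongs to which accession. One possible way around that is to pair them per record, sketched below with awk. The function name `pair_hosts` and the `NA` placeholder are my own choices, and the sketch assumes every record has a VERSION line and that the /host value fits on a single line.

```shell
# Hypothetical helper: print "accession<TAB>host" once per record.
# Records without a /host qualifier get the placeholder NA.
pair_hosts() {
    awk '
        /^VERSION/ {
            # A new record starts: flush the previous one first.
            if (acc != "") print acc "\t" host
            acc = $2
            host = "NA"
        }
        /\/host="/ {
            host = $0
            sub(/.*\/host="/, "", host)   # drop everything up to the opening quote
            sub(/".*/, "", host)          # drop the closing quote and the rest
        }
        END { if (acc != "") print acc "\t" host }
    ' "$@"
}

# Demo: one record with a host, one without.
pair_hosts <<'EOF'
VERSION     QKI86563.1
                 /host="Mastomys natalensis"
//
VERSION     AAA40590.1
  ORGANISM  Octodon degus
//
EOF
```

On the real file this would be called as `pair_hosts sequence.gp > miran.txt`, using the file names from the posts above.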