Question

Curating a Viridiplantae database

0


8.1 years ago

Biogeek ▴ 470

Dear all,

I have downloaded the latest nr.gz file from NCBI and unzipped it. I now want to extract only the Viridiplantae sequences from this nr FASTA file. I tried downloading all the GI numbers for the plant protein sequences and running grep as follows:

grep -wFf GIsequences-list NR > viridiplantae.fasta

However, the output file contains no protein sequences, only the header lines with GI numbers and annotations.

Is there a script that does this better, or a command I can use to build the Viridiplantae nr database I am after? I am using RAPSearch for speed rather than BLASTX, so I can't rely on blastx options to restrict the search to a specific taxon.

Thanks.

blast sequences annotation • 2.3k views

updated 8.1 years ago by untitpoi ▴ 30 • written 8.1 years ago by Biogeek ▴ 470

score 0 · Answer 1 · 2016-03-25

0


8.1 years ago

untitpoi ▴ 30

Hi, I think grep's -A option could help you. It prints not only the lines that match your pattern but also a given number of lines after each match. Though it is not the best solution if your FASTA is not single-line (one sequence per line), and wrapped sequences are often the case.
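If the sequences are wrapped over several lines, a small script that works per record may be easier than tuning grep. Below is a minimal Python sketch, not a tested pipeline. It assumes the GI list holds one bare GI number per line and that the nr headers use the old `>gi|12345|ref|...` style, with merged entries separated by Ctrl-A. The function names and the command-line argument order are only illustrative.

```python
import re
import sys
from typing import Iterable, Iterator, Set

# Assumed header style: ">gi|12345|ref|XP_...| description".
# Merged nr entries are assumed to be joined by Ctrl-A (\x01),
# each carrying its own "gi|N" token.
GI_PATTERN = re.compile(r"(?:^>|\x01)gi\|(\d+)")


def filter_fasta_by_gi(lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield the header and all sequence lines of every record
    whose header contains at least one wanted GI number."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts here: decide once whether to keep it.
            keep = any(gi in wanted for gi in GI_PATTERN.findall(line))
        if keep:
            yield line


def main(gi_path: str, fasta_path: str, out_path: str) -> None:
    # Load the GI numbers into a set so each lookup is constant time.
    with open(gi_path) as handle:
        wanted = {line.strip() for line in handle if line.strip()}
    # Stream the FASTA file so the whole of nr never sits in memory.
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        out.writelines(filter_fasta_by_gi(fasta, wanted))


if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```

Saved under a hypothetical name such as filter_nr.py, it could be run as `python filter_nr.py GIsequences-list NR viridiplantae.fasta`. Because the keep/skip decision is made only on header lines, it should not matter whether the sequences are on one line or wrapped.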
