Question: Curating a Viridiplantae database
I have downloaded the latest Nr.gz file from NCBI and unzipped it. Now I want to extract only the Viridiplantae sequences from this Nr FASTA file. I tried downloading all the GI numbers for the plant protein sequences and running grep as follows.

grep -wFf GIsequences-list NR > viridiplantae.fasta

However, the output file contains no protein sequences, only GI numbers and annotations.

Is there a script that can do better, or a command I can use to get the Viridiplantae Nr database I want? I am using RAPSearch for speed rather than BLASTX, so I can't supply a blastx command to search for taxon-specific annotations.


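Hi, I think the grep -A option could help you. grep prints only the lines that match your pattern, and the GI numbers occur only in the header lines, which is why you get no sequences. With -A you get not only each matching line but also a given number of lines after it (for example, -A 1 prints one extra line). Though it is not the best solution if your FASTA is not single-line (one sequence line per record), and multi-line FASTA is often the case.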
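
If the FASTA is multi-line, a small script that tracks which record it is in may be more robust than grep -A. Below is a minimal Python sketch, not a tested pipeline. It assumes the GI list has one bare number per line and that headers contain identifiers in the form gi|12345|, possibly several per header line when identical sequences are merged. The function names and the command-line layout are my own, chosen for illustration.

```python
import re
import sys

# Assumption: GI identifiers appear in headers as "gi|<digits>".
GI_PATTERN = re.compile(r"gi\|(\d+)")


def load_ids(path):
    """Read one GI number per line into a set for fast lookup."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_fasta(lines, wanted):
    """Yield header and sequence lines of records whose header has a wanted GI."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A header may list several GIs; keep the record if any is wanted.
            keep = any(gi in wanted for gi in GI_PATTERN.findall(line))
        if keep:
            yield line


# Runs only when called as: python filter_fasta.py GI_LIST INPUT_FASTA OUTPUT_FASTA
if __name__ == "__main__" and len(sys.argv) == 4:
    ids = load_ids(sys.argv[1])
    with open(sys.argv[2]) as fasta_in, open(sys.argv[3], "w") as fasta_out:
        fasta_out.writelines(filter_fasta(fasta_in, ids))
```

Because the GI numbers are held in a set and the FASTA is streamed line by line, the whole Nr file never has to fit in memory. Something like "python filter_fasta.py GIsequences-list NR viridiplantae.fasta" would be the equivalent of the grep command above (filter_fasta.py is just a hypothetical file name).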

