Is It Possible To Extract Plant Protein Sequences From The NCBI NR Database?
11.1 years ago
Lhl ▴ 730

Hi all,

I am annotating the de novo sequenced genome of a non-model plant. Specifically, I am annotating the genome by running blastx against the NCBI NR database. Obviously, it would be much faster if I could extract only the plant protein sequences from the large NR database, so I am wondering whether there is a way to do that. I hope it is possible.

Elzed

database • 12k views

Why did you paste your text as Unicode? It's unreadable to me.


Sorry about that. I just changed it.

11.1 years ago
Jan Kosinski ★ 1.6k
1. Go to the NCBI Entrez Protein search.
2. Search with the query all [filter]. This will give you all protein entries.
3. Locate the "Taxonomic Groups" box on the right. Display the tree and locate "Green plants". Click it and wait a moment. You should now see only proteins from green plants. The query in the search box should change to (all [filter]) AND "green plants"[porgn:__txid33090].

Now you can:

2. Using a custom script, select from the nr database only those entries that have a GI from the "Plant GI list".

3. Create the final plant_nr using formatdb
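
For step 2, a minimal sketch of such a custom script in shell and awk could look like the following. It assumes old-style nr deflines of the form ">gi|12345|ref|NP_000001.1| description" (possibly with several IDs merged into one header line) and a GI list file with one GI per line. The function name and all file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print only the FASTA records whose header carries
# at least one GI found in the GI list (one GI per line).
# Usage: filter_fasta_by_gi plant.gi nr.fasta > plant_nr.fasta
filter_fasta_by_gi() {
    awk -v gifile="$1" '
        BEGIN {
            # Load the wanted GIs into a lookup table.
            while ((getline line < gifile) > 0)
                if (line != "") wanted[line] = 1
            close(gifile)
        }
        /^>/ {
            # Header line: split on "|" and test every field that follows
            # a field ending in "gi" (assumed defline layout, see above).
            keep = 0
            n = split($0, parts, "|")
            for (i = 1; i < n; i++)
                if (parts[i] ~ /gi$/ && (parts[i + 1] in wanted)) { keep = 1; break }
        }
        keep { print }   # header and sequence lines of a kept record
    ' "$2"
}
```

The filtered file can then be formatted for step 3, for example with "formatdb -i plant_nr.fasta -p T -n plant_nr" in legacy BLAST, or "makeblastdb -in plant_nr.fasta -dbtype prot -out plant_nr" in BLAST+.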

OR:

Use the newer BLAST+, where apparently you can filter the nr database by a GI list using the '-gilist' option of BLAST itself (http://www.ncbi.nlm.nih.gov/books/NBK1763/). But I haven't used that yet.
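
A hedged sketch of what such a call might look like with BLAST+. The options -query, -db, -gilist, -evalue, -outfmt and -out exist in BLAST+; the file names, the E-value cutoff and the wrapper function itself are placeholders, and a locally installed, preformatted nr database is assumed.

```shell
#!/bin/sh
# Hypothetical wrapper: blastx against nr, restricted to the GIs in a list.
# Usage: plant_blastx contigs.fasta plant.gi hits.tsv
# Set DRY_RUN=1 to print the command instead of running it.
plant_blastx() {
    query=$1 gilist=$2 out=$3
    # Build the command line as positional parameters so it can be
    # either printed or executed unchanged.
    set -- blastx -query "$query" -db nr -gilist "$gilist" \
        -evalue 1e-5 -outfmt 6 -out "$out"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

With this approach no separate plant_nr database has to be built; the GI list saved from the Entrez search is applied at search time.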


Thanks Jan. I think both of your suggestions are doable.

11.1 years ago
Goldbear ▴ 130

Alternatively, PlantGDB maintains a UniProt-curated list of plant protein sequences.

This will extract A LOT of sequence files, which I wanted to join into one big file. I had to use xargs to get around the 'argument list too long' error that cat was giving me.

$ find . -type f | xargs cat > out.txt
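
A slightly more defensive variant of the same idea, as a sketch only: it passes null-delimited names so that paths containing spaces survive, and it skips the output file so that it is not fed back into cat. The -print0 and -0 options are available in GNU and BSD find/xargs; the function name and paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: concatenate every regular file under DIR into OUT.
# Usage: join_files DIR OUT
join_files() {
    dir=$1 out=$2
    # "! -name" excludes any file sharing the output's base name,
    # in case OUT lives inside DIR.
    find "$dir" -type f ! -name "$(basename "$out")" -print0 |
        xargs -0 cat > "$out"
}
```

The order in which find returns files is not guaranteed, which does not matter for a FASTA collection that is going to be formatted as a BLAST database anyway.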


Good. But it seems that they do not have GI and accession numbers.