Is It Possible To Extract Plant Protein Sequences From The NCBI nr Database?
11.0 years ago
Lhl ▴ 730

Hi all,

I am annotating the de novo sequenced genome of a non-model plant. Specifically, I am annotating the genome by running blastx against the NCBI nr database. Obviously, it would be much faster if I could pull only the plant protein sequences out of the large nr database, so I am wondering whether there is a way to do that. I hope it is possible.

Elzed

database • 12k views

Why did you paste your text as Unicode? It's unreadable to me.

Sorry about that. I just changed it.

11.0 years ago
Jan Kosinski ★ 1.6k
  1. Go to the NCBI Entrez Protein search.
  2. Search with the query all [filter]. This returns all protein entries.
  3. Locate the "Taxonomic Groups" box on the right. Display the tree and find "Green plants". Click it and wait a moment. You should now see only proteins from green plants, and the query in the search box should change to (all [filter]) AND "green plants"[porgn:__txid33090].
  4. Download everything as a GI list. This is your "Plant GI list".

Now you can:

  1. Download the full nr database in FASTA format.

  2. Using a custom script, select from the nr database only those entries whose GI appears in the "Plant GI list".

  3. Create the final plant_nr database using formatdb.
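The "custom script" in step 2 could look roughly like the sketch below. It is only an illustration, under two assumptions that are not stated above: the GI list is a plain text file with one GI per line, and the nr FASTA headers carry identifiers of the form `gi|NUMBER|...` (possibly several per header for merged identical sequences). The function names `load_gi_list`, `filter_fasta` and `write_plant_subset` are made up for this example.

```python
import re

# Assumption: every identifier in a header looks like "gi|12345|...".
GI_RE = re.compile(r"gi\|(\d+)")


def load_gi_list(path):
    """Read one GI per line into a set of strings, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_fasta(lines, wanted_gis):
    """Yield the lines of every record whose header holds a wanted GI."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A merged nr header may list several GIs; one match is enough.
            keep = any(gi in wanted_gis for gi in GI_RE.findall(line))
        if keep:
            yield line


def write_plant_subset(gi_path, fasta_path, out_path):
    """Stream the FASTA file once and write only the matching records."""
    wanted = load_gi_list(gi_path)
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        out.writelines(filter_fasta(fasta, wanted))
```

Something like `write_plant_subset("plant_gi_list.txt", "nr.fasta", "plant_nr.fasta")` (hypothetical file names) would then produce the file to pass to formatdb in step 3. The FASTA file is read line by line, so the full nr database never has to fit in memory; only the GI set does.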

OR:

Use the new BLAST (BLAST+), where apparently you can restrict a search of the nr database to a GI list with the '-gilist' option of BLAST itself (http://www.ncbi.nlm.nih.gov/books/NBK1763/). I haven't used that yet, though.
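For the '-gilist' route, a minimal sketch of how the call might be assembled is shown below. It assumes BLAST+ is installed, that a formatted nr database is available locally, and that the GI list is a text file with one GI per line. The file names and the helper name `build_blastx_command` are hypothetical, and the command has not been run against a real database here.

```python
import subprocess


def build_blastx_command(query_fasta, gi_list, out_path, db="nr"):
    """Return a blastx argument list that limits the search to a GI list."""
    return [
        "blastx",
        "-query", query_fasta,  # nucleotide sequences to annotate
        "-db", db,              # assumed: a locally formatted nr database
        "-gilist", gi_list,     # restrict the search to these GIs
        "-out", out_path,
    ]


def run_blastx(query_fasta, gi_list, out_path, db="nr"):
    """Run blastx and raise CalledProcessError if it exits with an error."""
    cmd = build_blastx_command(query_fasta, gi_list, out_path, db)
    subprocess.run(cmd, check=True)
```

With this route there is no need to build a separate plant_nr database; the full nr stays as it is and the GI list does the filtering at search time.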

Thanks Jan. I think both of your suggestions are doable.

11.0 years ago
Goldbear ▴ 130

Alternatively, PlantGDB maintains a UniProt-curated set of plant protein sequences.

It should be the UniProt_Protein.tar.bz2 file in ftp://ftp.plantgdb.org/download/FASTA/

The archive extracts to A LOT of sequence files, which I wanted to join into one big file. I had to use xargs to get around the 'argument list too long' error that cat was giving me.

$ find . -type f ! -name out.txt -print0 | xargs -0 cat > out.txt

Good, but it seems that those sequences do not carry GI or accession numbers.
