Question: Is it possible to extract plant protein sequences from the NCBI nr database?
3
9.8 years ago by
Lhl730
United States
Lhl730 wrote:

Hi all,

I am annotating the de novo sequenced genome of a non-model plant. Specifically, I am annotating the genome by running blastx against the NCBI nr database. The search would be much faster if I could use only the plant protein sequences from the large nr database, so I am wondering whether there is a way to extract them. I hope it is possible.

Elzed

4

Why did you paste your text as Unicode? It's unreadable to me.

Reply written 9.8 years ago by Pierre Lindenbaum

Sorry about that. I just changed it.

Reply written 9.8 years ago by Lhl
10
9.8 years ago by
Jan Kosinski1.6k
Jan Kosinski1.6k wrote:
  1. Go to the NCBI Entrez Protein search.
  2. Search with the query all [filter]. This returns all protein entries.
  3. Locate the "Taxonomic Groups" box on the right. Display the tree and locate "Green plants". Click it and wait a moment. You should now see only proteins from green plants, and the query in the Search box should change to (all [filter]) AND "green plants"[porgn:__txid33090].
  4. Download everything as a GI list. This is your "Plant GI list".

Now you can:

  1. Download the full nr database in FASTA format.

  2. Using a custom script, select from the nr database only those entries that have a GI from the "Plant GI list".

  3. Create the final plant_nr database using formatdb.
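
The "custom script" in step 2 could look roughly like the Python sketch below. It is only an illustration, not a tested pipeline: it assumes the legacy nr header format, where each defline starts with gi|NUMBER| and merged identical sequences are joined by Ctrl-A (\x01) characters, and it assumes the GI list has one GI per line. Newer nr releases may not carry GIs in their headers, in which case the pattern would need adapting. The function names and the file names in the usage note are made up for this example.

```python
import re
from typing import Iterable, Iterator, Set

# Assumption: legacy nr deflines such as
#   >gi|111|ref|NP_1.1| desc^Agi|222|gb|AAA.1| desc
# where ^A (\x01) separates the entries merged into one record.
GI_PATTERN = re.compile(r"(?:^|\x01)gi\|(\d+)\|")


def load_gi_list(lines: Iterable[str]) -> Set[str]:
    """Read one GI per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def header_gis(header: str) -> Set[str]:
    """Return every GI found in a defline (without the leading '>')."""
    return set(GI_PATTERN.findall(header))


def filter_fasta(fasta_lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield the lines of records whose header contains at least one wanted GI."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Keep the record if any of its GIs is in the plant GI list.
            keep = not wanted.isdisjoint(header_gis(line[1:]))
        if keep:
            yield line


def filter_file(gi_path: str, fasta_path: str, out_path: str) -> None:
    """Stream a FASTA file to out_path, keeping only records with wanted GIs."""
    with open(gi_path) as gi_file:
        wanted = load_gi_list(gi_file)
    with open(fasta_path) as fasta_file, open(out_path, "w") as out_file:
        out_file.writelines(filter_fasta(fasta_file, wanted))
```

With hypothetical file names, filter_file("plant_gi_list.txt", "nr.fasta", "plant_nr.fasta") would write the plant subset. The FASTA file is streamed line by line, so only the GI set has to fit in memory.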

OR:

Use the newer BLAST, where you can apparently filter the nr database by a GI list with the '-gilist' option of BLAST itself (http://www.ncbi.nlm.nih.gov/books/NBK1763/). I haven't used that yet, though.


Thanks Jan. I think both of your suggestions are doable.

Reply written 9.8 years ago by Lhl
3
9.8 years ago by
Goldbear130
Goldbear130 wrote:

Alternatively, PlantGDB maintains a UniProt-curated list of plant protein sequences.

It should be the UniProt_Protein.tar.bz2 file in ftp://ftp.plantgdb.org/download/FASTA/

The archive extracts into A LOT of sequence files, which I wanted to join into one big file. I had to use xargs to get around the 'argument list too long' error that cat was giving me.

$ find . -type f ! -name out.txt -print0 | xargs -0 cat > out.txt


Good, but it seems that those sequences do not have GI or accession numbers.

Reply written 9.8 years ago by Lhl