Question

Which refseq_protein db to choose for Zingiberaceae

8 months ago

Nilo • 0

Hello everyone,

I am trying to BLAST the protein sequences I extracted from a .gff file against a protein database from NCBI. The crop I have data on belongs to the Zingiberaceae, and I was wondering which database from this link I should use: https://ftp.ncbi.nlm.nih.gov/blast/db/ Hopefully someone can tell me or, even better, explain how to find the correct database so that I can do it myself in the future haha :)

They look all the same regarding the update date and file size. Which of the 35 options should I choose, or should I blast against all??

Thanks in advance!

Niels

local blastp blast • 539 views

ADD COMMENT • link updated 8 months ago by GenoMax 141k • written 8 months ago by Nilo • 0

So you have protein sequence data from one species (and you know which species it is)? What are you trying to do with that sequence data? Are you looking to identify the proteins, or to find homologs/orthologs in other related species?

"They look all the same regarding the update date and file size."

No, they don't. The files share the same dates because NCBI refreshes all databases at the same time, but their sizes are wildly different. Check this readme file to understand what the different databases include: https://ftp.ncbi.nih.gov/blast/blastftp.txt

ADD REPLY • link 8 months ago by GenoMax 141k

Thanks a lot!! So I have a fully assembled genome. I used AUGUSTUS with the arabidopsis database to annotate the genome, and now I want to functionally annotate it, i.e. predict the function of the genes that AUGUSTUS predicted.

Thank you for the list, but I still do not fully get how I should select a database. There seems to be one for H. sapiens, so I assume I should find a similar one for something closely related to my crop. At the link in my opening post there are databases named refseq_protein.1 through refseq_protein.35.

I think I am not understanding the whole picture of how to select a database or what these 35 different files are.

Could you explain it please?

Niels:)

ADD REPLY • link 8 months ago by Nilo • 0

If a relatively complete proteome is available in UniProt for a species close to the one you are working with, you could download that proteome, make a BLAST database out of it, and then get an idea of what your proteins may be doing functionally.
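A minimal sketch of that step, written as Python helpers that only assemble the command lines. It assumes BLAST+ is installed so that `makeblastdb` and `blastp` are on the PATH; the file names and the database name `ginger_db` are placeholders, not something from this thread.

```python
import shlex

def makeblastdb_cmd(fasta, db_name):
    """Command that formats a protein FASTA file as a local BLAST database."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "prot", "-out", db_name]

def blastp_cmd(query, db_name, out_xml, evalue=1e-5, threads=4):
    """Command that searches the query proteins against db_name.

    -outfmt 5 asks for XML output; the e-value cutoff and thread count
    are illustrative defaults, not recommendations.
    """
    return [
        "blastp", "-query", query, "-db", db_name,
        "-outfmt", "5", "-evalue", str(evalue),
        "-num_threads", str(threads), "-out", out_xml,
    ]

if __name__ == "__main__":
    # Placeholder file names: a downloaded UniProt proteome and your own proteins.
    steps = [
        makeblastdb_cmd("ginger_proteome.fasta", "ginger_db"),
        blastp_cmd("predicted_proteins.fasta", "ginger_db", "hits.xml"),
    ]
    for cmd in steps:
        # Print the commands; pass each list to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check them before running anything that takes hours.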

You could use the swissprot or refseq_protein database as the next larger superset before going on to nr. Since these pre-formatted databases are large, they are split into multiple files. You will need to download all pieces of a specific database (e.g. refseq_protein) and uncompress them all in a single directory. You will then use the basename of the files (e.g. refseq_protein, or nr for the nr database) with the -db option in blastp.
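To make the "all pieces, one basename" idea concrete, here is a small sketch that picks out the volumes of one database from a directory listing. It assumes the archives are named `<basename>.<number>.tar.gz`; check the actual listing on the FTP site before relying on that pattern. BLAST+ also ships an `update_blastdb.pl` script that can fetch and unpack all volumes of a named database for you.

```python
import re

def volumes_for(basename, listing):
    """Return the archive names belonging to one split database, in volume order.

    Assumes volumes are named '<basename>.<number>.tar.gz'; checksum files
    and other databases in the listing are ignored.
    """
    pattern = re.compile(r"^" + re.escape(basename) + r"\.(\d+)\.tar\.gz$")
    found = []
    for name in listing:
        match = pattern.match(name)
        if match:
            found.append((int(match.group(1)), name))
    return [name for _, name in sorted(found)]

if __name__ == "__main__":
    # Hypothetical excerpt of a listing, for illustration only.
    listing = [
        "nr.00.tar.gz",
        "refseq_protein.01.tar.gz",
        "refseq_protein.00.tar.gz",
        "refseq_protein.00.tar.gz.md5",
    ]
    for name in volumes_for("refseq_protein", listing):
        print(name)
    # After unpacking every volume into one directory, blastp is pointed at
    # the shared basename, not at an individual piece: -db refseq_protein
```

So the numbered files are not 35 different databases to choose from; they are pieces of the same one.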

There are other packages, like MAKER and OrthoFinder, that can help with annotation.

ADD REPLY • link 8 months ago by GenoMax 141k

Thanks a lot!! I downloaded the proteome of the normal ginger that we buy in every supermarket. I find that there are still a lot of uncharacterized proteins, but I managed to go from a .gff file to an .xml output file containing the BLAST results, and I merged that into a .bed file with the positional information from the original .gff file.

I noticed that a lot of the proteins are uncharacterized proteins.

It now looks like this in IGV: [two screenshots, not preserved]
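For anyone wanting to reproduce the .gff + .xml to .bed merge, a rough sketch of one way to do it follows. This is not necessarily how it was done here. It assumes GFF3-style attributes (`ID=...`), that each BLAST query name starts with the same ID as the GFF feature, and standard BLAST XML (`-outfmt 5`) element names.

```python
import xml.etree.ElementTree as ET

def gene_coords(gff_text, feature="gene"):
    """Map feature ID -> (chrom, start, end, strand) from GFF3 text."""
    coords = {}
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != feature:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if "ID" in attrs:
            coords[attrs["ID"]] = (cols[0], int(cols[3]), int(cols[4]), cols[6])
    return coords

def top_hits(blast_xml):
    """Map query ID -> description of the first hit listed for that query."""
    hits = {}
    root = ET.fromstring(blast_xml)
    for iteration in root.iter("Iteration"):
        parts = iteration.findtext("Iteration_query-def", default="").split()
        if not parts:
            continue
        hit = iteration.find("Iteration_hits/Hit")
        if hit is not None:
            hits[parts[0]] = hit.findtext("Hit_def", default="unknown")
    return hits

def to_bed(coords, hits):
    """Build 6-column BED lines; GFF starts are 1-based, BED starts are 0-based."""
    lines = []
    for gene_id, (chrom, start, end, strand) in coords.items():
        label = hits.get(gene_id, "no_hit").replace(" ", "_")
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{gene_id}|{label}\t0\t{strand}")
    return lines
```

Putting the hit description in the BED name column is what makes it show up as the feature label in IGV; genes without any hit are kept and labelled `no_hit` so they stay visible.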

ADD REPLY • link 8 months ago by Nilo • 0

They say "$1000 genome" but fail to note the $1M annotation that is required afterwards. It is easy to sequence but much more difficult to annotate. If your aim is to fill in this information, then a lot of additional work is going to be required.

ADD REPLY • link 8 months ago by GenoMax 141k