Question: NCBI vs Ensembl - which one to choose - for downloading protein FASTA files
0
21 months ago by
Idit0
Israel
Idit0 wrote:

Hi

[I'm a newbie in bioinformatics, so my apologies for any misuse of terms.]

I need to decide which resource to use to download complete protein FASTA files for many species, so that I can run blastp queries for all human proteins against each species. I would like to download files for most of the available eukaryotic species. I checked some species in the latest releases of both Ensembl and NCBI, and saw that there are big differences between them.

For example, when I downloaded the protein FASTA file for "Otolemur garnettii", the Ensembl file had 19986 proteins, whereas the NCBI file had 26925. When I run a sample blastp of a human protein sequence against each of these protein files (after running makeblastdb, of course), the highest bit score is very different between Ensembl and NCBI.
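For readers who want to reproduce this kind of count, here is a minimal sketch in Python. It assumes a plain, uncompressed FASTA file in which every record starts with a `>` header line; the function name and the filename in the comment are made up for illustration.

```python
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count sequences by counting header lines, which start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Tiny in-memory example. For a real file, pass an open file handle instead,
# e.g. `with open("ensembl_proteins.fa") as handle:` (hypothetical filename).
example = [">prot1 some description", "MKV", "LLA", ">prot2", "MST"]
print(count_fasta_records(example))  # prints 2
```

Running this on the files from two resources only shows how many records each contains; it does not say whether the extra records are additional genes, additional isoforms, or something else.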

Also, when I run blastp for the same species, Ensembl vs NCBI and vice versa, I get more than 1000 proteins with %identity < 30, which I take to mean proteins that exist in one resource and not in the other (?)
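One way to make this comparison explicit is sketched below. It assumes the blastp results were written in the default tabular format (`-outfmt 6`), whose twelve columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The function names, the 30% cutoff as a parameter, and the example rows are illustrative, not taken from the actual run.

```python
from typing import Dict, Iterable, List, Tuple


def best_hits(lines: Iterable[str]) -> Dict[str, Tuple[float, float]]:
    """Map each query ID to (pident, bitscore) of its highest-bitscore hit."""
    best: Dict[str, Tuple[float, float]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query = fields[0]
        pident = float(fields[2])     # column 3: percent identity
        bitscore = float(fields[11])  # column 12: bit score
        if query not in best or bitscore > best[query][1]:
            best[query] = (pident, bitscore)
    return best


def poorly_matched(lines: Iterable[str],
                   all_query_ids: Iterable[str],
                   min_identity: float = 30.0) -> List[str]:
    """Return queries with no hit at all, or whose best hit is below min_identity."""
    best = best_hits(lines)
    return sorted(q for q in all_query_ids
                  if q not in best or best[q][0] < min_identity)


# Made-up rows: P1 has a strong hit, P2 only a weak one, P3 has no hit.
rows = [
    "P1\tS1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t400",
    "P1\tS2\t25.0\t50\t37\t1\t10\t59\t5\t54\t2.0\t28.5",
    "P2\tS3\t27.0\t80\t55\t2\t1\t80\t5\t84\t0.5\t30.1",
]
print(poorly_matched(rows, ["P1", "P2", "P3"]))  # prints ['P2', 'P3']
```

Queries without any hit do not appear in tabular output, which is why the full list of query IDs is passed in separately. Percent identity is also computed only over the aligned region, so a low value for the best hit suggests, but does not prove, that the protein is missing from the other resource.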

I know they use different gene annotation methods, so it makes sense that there are differences. My question is: do you have experience working with both resources, and can you recommend which one to choose?

Thanks a lot,

Idit

ensembl blast ncbi fasta • 1.6k views
modified 21 months ago by Jean-Karim Heriche18k • written 21 months ago by Idit0
2

UniProt has built a database of reference proteomes for most organisms sequenced to date: http://www.uniprot.org/proteomes/

written 21 months ago by a.zielezinski8.5k

Thanks, I downloaded all the eukaryotes I needed from the UniProt FTP site.

written 20 months ago by Idit0
4
21 months ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche18k wrote:

Resources differ because they do not have the same focus. For example, Ensembl is about annotating genomes, whereas UniProt is about collecting and annotating proteins and thus has no notion of an underlying genome. If you need data integration at the genome level, e.g. you need to refer to genes at some point, then you're better off working with a well-organized genome annotation resource like Ensembl, which has already integrated plenty of information. Whichever resource you choose, make sure you understand what it is about and how this impacts your work. Also, don't try a mix-and-match approach between resources; this is asking for trouble.

written 21 months ago by Jean-Karim Heriche18k

For this project I only need the protein sequences and not the genomic annotation, so it looks like I will go for UniProt. Thanks for the mix-and-match warning, I almost did that.

written 20 months ago by Idit0
1
21 months ago by
Whoknows670
Tehran,Iran
Whoknows670 wrote:

Neither!

It is better to download from UniProt. You could also download RefSeq proteins from the NCBI website, but in my experience UniProt gives more information and is more up to date than NCBI and Ensembl.

Another advantage of UniProt is that you can obtain either the manually curated Swiss-Prot entries or the TrEMBL entries for in silico predicted proteins.
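As a rough illustration of that distinction: in UniProt FASTA downloads, Swiss-Prot headers begin with `>sp|` and TrEMBL headers with `>tr|`, so the two sets can be tallied from the header lines alone. The sketch below assumes that header convention; the function name is invented for this example and the second header is a made-up entry.

```python
from collections import Counter
from typing import Iterable


def count_by_review_status(lines: Iterable[str]) -> Counter:
    """Tally FASTA headers by UniProt section, based on the header prefix."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith(">sp|"):
            counts["Swiss-Prot"] += 1  # reviewed, manually curated
        elif line.startswith(">tr|"):
            counts["TrEMBL"] += 1      # unreviewed, computationally annotated
        elif line.startswith(">"):
            counts["other"] += 1       # header not in UniProt style
    return counts


example = [
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2",
    "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF",
    # The next header is a made-up TrEMBL-style entry, for illustration only.
    ">tr|X0X0X0|X0X0X0_HUMAN Made-up entry OS=Homo sapiens OX=9606 PE=4 SV=1",
    "MSTNPKPQRK",
]
counts = count_by_review_status(example)
print(counts["Swiss-Prot"], counts["TrEMBL"])  # prints 1 1
```

For many non-model species most entries are likely to fall in the TrEMBL group, so restricting a search to Swiss-Prot alone could shrink the proteome considerably.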

written 21 months ago by Whoknows670

Thanks, this is what I did. It is still interesting to see that for some species UniProt has almost the same set of proteins as NCBI, while for other species it is closer to Ensembl.

written 20 months ago by Idit0