Question: Ncbi Non-Redundant Dataset (Nr) In Protein-Blast To Look For Homologs?
1
7.2 years ago by
Barcelona
miquelduranfrigola760 wrote:

Hi all,

this must be very basic, but still. I have a protein sequence for which I want to find homologs. I go to BLAST and do, for simplicity here, a regular BLASTp.

I know that blasting against refseq_protein or swissprot is common practice, but how about nr (non-redundant protein sequences)? This includes "All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects", and as far as I've seen, it includes not only hypothetical proteins, but also different instances of the same protein (e.g. different combinations of PDB chains, etc.)

Would you guys consider a BLAST search against nr a proper "finding-homologs" exercise?

Thanks!

Miquel

blast • 10k views
modified 4.1 years ago by David Managadze40 • written 7.2 years ago by miquelduranfrigola760
3
7.2 years ago by
cdsouthan1.8k
cdsouthan1.8k wrote:

Miquel, the easiest way to start off your homologue collection is via Ensembl (orthologues and paralogues) and TreeFam (orthologues). This will save you a lot of BLASTing around. You are right that "nr" is actually highly redundant, for many reasons, so a BLAST against UniRef90 is much cleaner. If you really want "all", you will also have to TBLASTN against the EST and TSA divisions... a tough job.

modified 7.2 years ago • written 7.2 years ago by cdsouthan1.8k
2
7.2 years ago by
John Van Dam90
Netherlands
John Van Dam90 wrote:

Hi Miquel,

The answer depends a bit on what exactly you want. Do you just want to see if there are "any" homologs? Or are you looking for specific homologs (e.g. homologs in C. elegans)?

If you want to find "any" homologs, nr is fine. If you are looking for more specific homologs, other databases and settings may be more suitable. You could, for instance, blastp against the protein set (e.g. RefSeq) of a specific organism. Please remember that E-values depend on database size: a hit whose E-value falls just below your threshold in a small database can become insignificant in a large database such as nr.
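To make the database-size effect concrete, here is a minimal sketch based on the standard BLAST relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length in residues. The query length, bit score and the two database sizes below are made-up illustrative numbers, not measurements of any real database, and real BLAST uses effective (edge-corrected) lengths, so actual values will differ somewhat.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate BLAST E-value from a bit score: E = m * n * 2^(-S').

    query_len (m) and db_len (n) are in residues. BLAST itself uses
    effective lengths, so this is only an approximation.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers, chosen only for illustration.
QUERY_LEN = 300          # residues in the query protein
BIT_SCORE = 40.0         # the same alignment, hence the same bit score
SMALL_DB = 2e7           # residues, roughly a single-proteome-sized database
LARGE_DB = 1e11          # residues, a very large nr-like database

e_small = expected_hits(BIT_SCORE, QUERY_LEN, SMALL_DB)
e_large = expected_hits(BIT_SCORE, QUERY_LEN, LARGE_DB)

print(f"E-value in small database: {e_small:.3g}")
print(f"E-value in large database: {e_large:.3g}")
print(f"Ratio (equals the ratio of database sizes): {e_large / e_small:.0f}")
```

With these assumed numbers the same alignment passes a 0.01 threshold in the small database but not in the large one, because the E-value scales linearly with the search space.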

Cheers, John

written 7.2 years ago by John Van Dam90

Thanks John. I want to see if there are "any" homologs (and, ideally, I'd like to find as many as possible). The problem I've found with "nr" is that I sometimes retrieve several instances of the same protein, perhaps with different lengths for whatever reason, which makes me doubt whether it is a valid way to build a proper collection of homologs.

modified 7.2 years ago • written 7.2 years ago by miquelduranfrigola760

Any idea about the command-line options for blasting against the protein database of a specific organism (e.g. Homo sapiens)?

Thanks

modified 5.0 years ago • written 5.0 years ago by Anushka20
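One possible way to restrict a local search to a single organism is sketched below. It assumes a recent NCBI BLAST+ release (2.8.1 or later) and a locally installed version 5 database that carries taxonomy information, since the `-taxids` option depends on both. The file names and the helper `build_blastp_command` are hypothetical; 9606 is the NCBI taxonomy ID for Homo sapiens. The sketch only assembles the command and does not run BLAST.

```python
import shlex


def build_blastp_command(query_fasta, db, taxid, evalue=1e-5, out="hits.tsv"):
    """Assemble a blastp call limited to one taxon (hypothetical helper).

    Assumes BLAST+ >= 2.8.1 and a version 5 database, which the
    -taxids option requires.
    """
    return [
        "blastp",
        "-query", query_fasta,      # FASTA file with the query protein
        "-db", db,                  # name of the local protein database
        "-taxids", str(taxid),      # restrict subjects to this NCBI taxonomy ID
        "-evalue", str(evalue),     # E-value cutoff for reported hits
        "-outfmt", "6",             # tabular output
        "-out", out,                # where to write the hits
    ]


# Placeholder file and database names; 9606 = Homo sapiens.
cmd = build_blastp_command("query.fasta", "refseq_protein", 9606)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually run the search, the list can be passed to `subprocess.run(cmd, check=True)`. When searching on NCBI's servers instead of locally, `-remote` combined with `-entrez_query "Homo sapiens[Organism]"` is an alternative that does not need a local database.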
1
7.2 years ago by
Biojl1.7k
Barcelona
Biojl1.7k wrote:

I agree with cdsouthan, Ensembl might be a good choice for you... as long as you are interested mainly in vertebrates.

You might want to take a look at the new Ensembl REST API, where you can programmatically retrieve all the homologs for a given ID (comparative genomics section). It can be used from several programming languages.

http://beta.rest.ensembl.org/

In addition, you could check the algorithm Ensembl uses to infer orthology and homology relations, which is partially based on BLAST. It might give you some ideas. http://useast.ensembl.org/info/docs/compara/homology_method.html

written 7.2 years ago by Biojl1.7k
0
4.1 years ago by
United States
David Managadze40 wrote:

BLAST tells you about sequence similarity, but similarity alone is not enough to conclude that two genes are homologs. If you have RefSeq protein accessions, you could simply go to the protein's page at NCBI, e.g.

http://www.ncbi.nlm.nih.gov/protein/NP_000005

Then, in the right sidebar, in the section called "Related information", find and click on "HomoloGene". You could also simply go to the HomoloGene service and search for your proteins directly there, e.g.

http://www.ncbi.nlm.nih.gov/homologene/?term=NP_000005

HomoloGene also provides its data via FTP, where you can download a file listing all the proteins and genes in the current dataset.

modified 22 days ago by RamRS25k • written 4.1 years ago by David Managadze40