Question

How To Find 50 Homologous Sequences That Are Not Too Closely Related?

0

10.7 years ago

onpelikan • 0

Hi, I'm searching for about 50 sequences in the non-redundant (nr) BLAST database. I want to test a program for protein mutation prediction; the program tries to estimate whether a mutation is deleterious or neutral.

The example sequence being analyzed is the well-known lacI repressor. BLAST finds a lot of sequences, but they are too similar: the first 50 are almost identical, so the prediction program has no heterogeneity for its prediction model.

How can I find sequences that are homologous but not identical (I want orthologs)? For example, sequences from other species that differ somewhat from the query LacI protein.

I tried classic blastp. I also tried another approach: run blastp to collect 2000 sequences, align them, and pass that alignment to psiblast (-in_msa parameter) so that it builds a PSSM from it. Is there another automatic approach, or a parameter setting in the BLAST+ package, that finds more distant sequences?

EDIT: Constraint: the search has to be automatic. It is one component of a bigger tool.

blast • 3.7k views

ADD COMMENT • link updated 10.7 years ago by Spitshine ▴ 660 • written 10.7 years ago by onpelikan • 0

0

I would guess you need to define some constraints, e.g. (1) bitscore thresholds, (2) a species subset (or a distance) and (3) conserved domain(s), and then see which BLAST hits satisfy them.
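
As a rough sketch of constraint (1), assuming the hits are in the default 12-column tabular format (`-outfmt 6`, where column 3 is percent identity and column 12 is the bit score), an awk filter can keep hits that are clearly homologous but not near-identical. The file names, the sample rows and both cut-offs (bit score of at least 100, identity of at most 90%) are hypothetical placeholders, not recommended values:

```shell
# Hypothetical sample of tabular BLAST output (-outfmt 6), tab-separated:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
printf 'lacI\tsubjA\t99.4\t360\t2\t0\t1\t360\t1\t360\t0.0\t720\n'      >  hits.tsv
printf 'lacI\tsubjB\t62.1\t355\t130\t3\t2\t356\t4\t355\t1e-120\t410\n' >> hits.tsv
printf 'lacI\tsubjC\t31.0\t120\t80\t2\t5\t124\t9\t126\t2e-04\t45\n'    >> hits.tsv

# Keep hits with bit score >= min_bits (still confidently homologous)
# and percent identity <= max_ident (not near-duplicates of the query).
awk -F'\t' -v min_bits=100 -v max_ident=90 \
    '$12 >= min_bits && $3 <= max_ident' hits.tsv > hits_filtered.tsv

cat hits_filtered.tsv
```

Constraints (2) and (3) need extra information (taxonomy and domain annotation) that is not in this table, so they are left out of the sketch.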

ADD REPLY • link 10.7 years ago by Pavel Senin ★ 1.9k

Ram · Answer 1 · 2013-12-29

2

10.7 years ago

5heikki 11k

You could filter tabular BLAST output with e.g. awk to keep only hits whose percent identity (column 3 of the tabular output) falls within a chosen range, for example between 85% and 95%:

awk '$3 >= 85 && $3 <= 95' tabularBlastOutputFile > hitsBetween85And95SimilarityPercentage

ADD COMMENT • link updated 4.7 years ago by Ram 44k • written 10.7 years ago by 5heikki 11k

score 2 · Answer 2 · 2013-12-31

2

10.7 years ago

Manu Prestat 4.1k

You're looking for a search with improved sensitivity. Try a profile-based search, e.g. HMMER with Pfam.

ADD COMMENT • link 10.7 years ago by Manu Prestat 4.1k

0

HMMER returns a lot of sequences, so I clustered them with cd-hit; this process gave the best results for mutation analysis with the MAPP program.

ADD REPLY • link 10.6 years ago by onpelikan • 0

score 1 · Answer 3 · 2013-12-30

1

10.7 years ago

jackuser1979 ▴ 890

You can do this with BLASTO, a BLAST variant designed for orthologue searches. You could also try searching the eggNOG database or the DRSC tool.

ADD COMMENT • link 10.7 years ago by jackuser1979 ▴ 890

0

Thank you. These are really interesting projects/tools, but I need a command-line program (such as the BLAST+ programs).

ADD REPLY • link 10.7 years ago by onpelikan • 0

0

Is there any way to download all the sequences in FASTA format? I can't find one.

ADD REPLY • link 10.7 years ago by onpelikan • 0

score 1 · Answer 4 · 2013-12-31

1

10.7 years ago

Asaf 10k

You can run PSI-BLAST and choose the proteins you get in the second or third iteration.
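
One way this could be automated, as a sketch rather than a tested pipeline: run `psiblast` once with `-num_iterations 1` and once with `-num_iterations 3`, both with `-outfmt 6`, then keep the subjects that only show up in the later rounds. The psiblast calls appear as comments because they need BLAST+ and a local database; the file names and sample rows are hypothetical. The multi-iteration output may contain blank or non-tabular lines between rounds, so the awk filter only looks at lines with at least 12 tab-separated fields.

```shell
# The two searches would look roughly like this (needs BLAST+ and a local db):
#   psiblast -query lacI.fasta -db nr -num_iterations 1 -outfmt 6 -out iter1.tsv
#   psiblast -query lacI.fasta -db nr -num_iterations 3 -outfmt 6 -out iter3.tsv

# Hypothetical stand-ins for the two result files.
printf 'lacI\tsubjA\t98.0\t360\t7\t0\t1\t360\t1\t360\t0.0\t700\n'      >  iter1.tsv
printf 'lacI\tsubjB\t71.5\t358\t100\t1\t1\t358\t1\t358\t1e-150\t480\n' >> iter1.tsv
{
    cat iter1.tsv
    printf '\n'
    printf 'lacI\tsubjA\t98.0\t360\t7\t0\t1\t360\t1\t360\t0.0\t705\n'
    printf 'lacI\tsubjC\t38.2\t340\t200\t6\t3\t350\t5\t349\t1e-60\t210\n'
    printf 'lacI\tsubjD\t29.7\t330\t220\t8\t4\t348\t2\t338\t3e-41\t150\n'
    printf '\nSearch has CONVERGED!\n'
} > iter3.tsv

# First file: remember every subject ID (column 2) found in iteration 1.
# Second file: print each subject ID not seen in iteration 1, once only.
awk -F'\t' '
    NR == FNR { if (NF >= 12) seen[$2] = 1; next }
    NF >= 12 && !($2 in seen) && !printed[$2]++ { print $2 }
' iter1.tsv iter3.tsv > distant_ids.txt

cat distant_ids.txt
```

The resulting ID list could then be passed to a sequence-retrieval step; that step is not shown here.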

ADD COMMENT • link 10.7 years ago by Asaf 10k

1

And by the way, your question reminds me of the construction of BLOSUM; you may find interesting insights in the original paper.

ADD REPLY • link 10.7 years ago by Asaf 10k

0

This is another good piece of advice.

ADD REPLY • link 10.7 years ago by Manu Prestat 4.1k

0

1) I need the BLAST search to be an automatic process, without manual work.

2) I will check the original paper. Thank you.

ADD REPLY • link 10.7 years ago by onpelikan • 0

score 1 · Answer 5 · 2014-01-03

1

10.7 years ago

Spitshine ▴ 660

If you do not want to rely on an orthologous groups database, use cd-hit (http://weizhong-lab.ucsd.edu/cd-hit/) to reduce your input set to a diverse set of sequences.

This is how protein families were built in the olden days of biocomputing.

ADD COMMENT • link 10.7 years ago by Spitshine ▴ 660

0

This is probably one of the best solutions. One possibility is to let blastp return e.g. 3000 sequences and then obtain 50 representative sequences by cd-hit clustering.
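
A sketch of how that could be automated, assuming the blastp hits are already saved as FASTA in a hypothetical file `hits.fasta`: rerun cd-hit at progressively lower identity thresholds until at most 50 representatives remain. The function names, file names and threshold ladder are illustrative only. The word sizes follow the pairing recommended in the cd-hit user guide (`-n 5` for 0.7 and above, `-n 4` for 0.6, `-n 3` for 0.5, `-n 2` for 0.4).

```shell
# Count sequences in a FASTA file (one '>' header line per record).
count_fasta() {
    grep -c '^>' "$1"
}

# Cluster $1 with cd-hit at decreasing identity thresholds until the
# representative set written to $2 has at most $3 sequences.
# Returns non-zero if even the lowest threshold leaves too many.
pick_representatives() {
    in_fasta=$1; out_fasta=$2; max_seqs=$3
    for pair in 0.9:5 0.8:5 0.7:5 0.6:4 0.5:3 0.4:2; do
        ident=${pair%:*}   # identity threshold passed to -c
        word=${pair#*:}    # matching word size passed to -n
        cd-hit -i "$in_fasta" -o "$out_fasta" -c "$ident" -n "$word" \
            > /dev/null || return 1
        if [ "$(count_fasta "$out_fasta")" -le "$max_seqs" ]; then
            echo "stopped at identity threshold $ident"
            return 0
        fi
    done
    return 1
}

# Tiny demo of the counting helper on a made-up FASTA file.
printf '>seq1\nMKPVTLYDVA\n>seq2\nMKPVTLHDVA\n' > demo.fasta
count_fasta demo.fasta

# Only attempt the clustering when cd-hit and the input file are present.
if command -v cd-hit > /dev/null 2>&1 && [ -s hits.fasta ]; then
    pick_representatives hits.fasta reps.fasta 50
fi
```

If the 0.4 threshold still leaves more than 50 clusters, the function reports failure instead of truncating the list; how to handle that case is a design choice left open here.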

ADD REPLY • link 10.7 years ago by onpelikan • 0