Question

By What Criterion Can We Screen BLAST Results?

13.0 years ago

Zhizhong ▴ 270

I used to take the first few hits as the sequences I wanted. But in some cases a BLAST search returns a lot of sequences with high scores and low E-values. When that happens, I was advised to pick some sequences at random and to exclude sequences from the same organism. Is that OK? Or is there a better solution?

blast • 2.4k views

ADD COMMENT • link updated 13.0 years ago by Jan Kosinski ★ 1.6k • written 13.0 years ago by Zhizhong ▴ 270

score 6 · Answer 1 · 2011-05-16

13.0 years ago

Jan Kosinski ★ 1.6k

It depends on the biological question you want to answer with your BLAST search, and probably on whether you are dealing with DNA or protein sequences.

I can tell you what I do to create multiple sequence alignments of a query protein family based on BLAST results:

1) Retrieve all BLAST hits, down to a very poor (high) E-value.
2) Cluster the hits using CLANS. IMPORTANT: use only the regions of the hits that are aligned to the query; otherwise the clustering can go wrong because of other domains in the hits.
3) Select the hits that appear to belong to my family based on the clustering.
4) Retrieve the full sequences of those hits.
5) Align them using MAFFT or another alignment program.
6) For further analysis, such as phylogenetic analysis, use filtered versions of the above alignment. The appropriate filtering method depends on your problem.
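The first two steps above (collecting hits down to a permissive E-value and keeping only the region of each hit that aligns to the query) could be sketched roughly as follows. This is only an illustration, assuming the search was run with BLAST+ tabular output (`-outfmt 6`, default columns) and that the full hit sequences are already in memory as a dictionary. The function names `parse_hits` and `aligned_regions` are made up for this example, and the cut-out regions would still need to be written to FASTA before clustering.

```python
from typing import Dict, Iterable, List, Tuple

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

Hit = Tuple[str, int, int, float]  # (subject id, subject start, subject end, E-value)


def parse_hits(lines: Iterable[str], max_evalue: float = 10.0) -> List[Hit]:
    """Keep every hit whose E-value is at most max_evalue (a permissive cutoff)."""
    hits: List[Hit] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, line.split("\t")))
        evalue = float(row["evalue"])
        if evalue <= max_evalue:
            hits.append((row["sseqid"], int(row["sstart"]), int(row["send"]), evalue))
    return hits


def aligned_regions(hits: Iterable[Hit], sequences: Dict[str, str]) -> Dict[str, str]:
    """Cut out only the part of each hit that aligned to the query.

    BLAST coordinates are 1-based and inclusive. This sketch assumes protein
    hits; a nucleotide hit on the reverse strand would also need to be
    reverse-complemented, which is not done here.
    """
    regions: Dict[str, str] = {}
    for sseqid, sstart, send, _evalue in hits:
        start, end = sorted((sstart, send))
        regions[f"{sseqid}/{start}-{end}"] = sequences[sseqid][start - 1:end]
    return regions


if __name__ == "__main__":
    # Toy data, invented for the example.
    blast_lines = ["q1\thitA\t45.0\t5\t0\t0\t1\t5\t3\t7\t1e-20\t80.0"]
    full_sequences = {"hitA": "MKTAYIAKQR"}
    print(aligned_regions(parse_hits(blast_lines), full_sequences))
```

Keying each region by hit ID plus coordinates keeps multiple HSPs from the same hit apart, which matters when a hit contains more than one copy of the domain.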

Blindly excluding sequences from the same organism is dangerous: they may be paralogs. But again, it depends on what you want to answer.

Thanks for your useful answer. When BLASTing short sequences such as bacterial 16S rRNA, it can happen that all of the retrieved sequences have identical E-values and scores. If I want to build phylogenetic trees, what should I do with these sequences?

Reply • 12.9 years ago by Zhizhong ▴ 270

I would analyze the whole set manually, for example:

1) Align ALL sequences.
2) Group them by sequence similarity; most sequence alignment programs do that by default.
3) For every group of sequences referring to the same GeneID, take one: the one whose length is closest to the average sequence length in the alignment and that has no regions clearly dissimilar to the other sequences in the alignment.
4) For every group of nearly identical sequences from the same species, take only one, from a representative strain (e.g. of all E. coli strains, take only the sequence from the K-12 strain).
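The length criterion for picking one sequence per group can be automated as a first pass. The sketch below is only an illustration: it assumes the aligned sequences and a group label for each (a GeneID or a species name) are already available as dictionaries, and the function name `pick_representatives` is invented for this example. It does not check for regions that are clearly dissimilar to the rest of the alignment; that part still needs a manual look.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def pick_representatives(sequences: Dict[str, str],
                         group_of: Dict[str, str]) -> Dict[str, str]:
    """Return one sequence ID per group.

    sequences: sequence ID -> aligned sequence (gaps written as '-')
    group_of:  sequence ID -> group label (e.g. GeneID or species)

    Within each group, the sequence whose ungapped length is closest to the
    mean ungapped length of the whole alignment is chosen. Ties are broken
    by sequence ID so that the result is reproducible.
    """
    lengths = {sid: len(seq.replace("-", "")) for sid, seq in sequences.items()}
    average = mean(lengths.values())

    groups: Dict[str, List[str]] = defaultdict(list)
    for sid in sequences:
        groups[group_of[sid]].append(sid)

    return {
        label: min(members, key=lambda sid: (abs(lengths[sid] - average), sid))
        for label, members in groups.items()
    }


if __name__ == "__main__":
    # Toy alignment, invented for the example.
    aln = {"a1": "ACGTACGT--", "a2": "ACGT------", "b1": "ACGTACGTAC"}
    labels = {"a1": "gene1", "a2": "gene1", "b1": "gene2"}
    print(pick_representatives(aln, labels))
```

The same function covers both the per-GeneID and the per-species selection, depending on which labels are passed in. Preferring a specific reference strain would need an extra rule on top of the length criterion.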

Reply • 12.9 years ago by Jan Kosinski ★ 1.6k