By What Criteria Can We Screen BLAST Results?
13.0 years ago
Zhizhong ▴ 270

I used to take the first few hits as the sequences I wanted. But sometimes a BLAST search returns a large number of hits with high scores and low E-values. When that happens, I was advised to pick some sequences at random and to exclude sequences from the same organism. Is that OK? Or is there a better solution?

blast • 2.4k views
13.0 years ago
Jan Kosinski ★ 1.6k

It depends on the biological question you want to answer with your BLAST search, and probably on whether you are dealing with DNA or protein sequences.

I can tell you what I do to create a multiple sequence alignment of a query protein family based on BLAST results.

  1. Retrieve all BLAST hits up to a very permissive (poor) E-value.
  2. Cluster the hits using CLANS. IMPORTANT: use only the regions of the hits that are aligned to the query; otherwise the clustering can go wrong because of other domains in the hits.
  3. Select the hits that appear to belong to my family based on the clustering.
  4. Retrieve the full sequences of those hits.
  5. Align them using MAFFT or another alignment program.
  6. For further analysis, such as phylogenetic analysis, use filtered versions of the above alignment. Depending on your problem, you need a different filtering method.
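As a rough illustration of steps 1 and 2, here is a minimal Python sketch that keeps hits up to a permissive E-value and cuts out only the subject regions aligned to the query. It assumes the search was saved in the standard 12-column tabular format (`-outfmt 6` in BLAST+) and that the subject sequences are already available as a dictionary of ID to sequence. The names `Hit`, `parse_tabular_hits`, and `aligned_regions` are made up for this example, and the default cutoff of 10 is only a placeholder, not a recommendation.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    subject_id: str
    s_start: int   # 1-based start of the aligned region in the subject
    s_end: int     # 1-based inclusive end of the aligned region
    evalue: float


def parse_tabular_hits(text: str, max_evalue: float = 10.0) -> List[Hit]:
    """Parse 12-column tabular BLAST output and keep hits with E-value <= max_evalue.

    Column order assumed: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue > max_evalue:
            continue
        s_start, s_end = int(fields[8]), int(fields[9])
        # For nucleotide hits on the minus strand sstart > send; normalise the order.
        # (Reverse-complementing such hits is not handled in this sketch.)
        hits.append(Hit(fields[1], min(s_start, s_end), max(s_start, s_end), evalue))
    return hits


def aligned_regions(hits: List[Hit], sequences: Dict[str, str]) -> Dict[str, str]:
    """Return only the part of each subject sequence that aligned to the query."""
    regions = {}
    for hit in hits:
        seq = sequences.get(hit.subject_id)
        if seq is None:
            continue  # sequence not available locally
        key = f"{hit.subject_id}/{hit.s_start}-{hit.s_end}"
        regions[key] = seq[hit.s_start - 1 : hit.s_end]
    return regions
```

The resulting fragments could then be written to a FASTA file and passed to the clustering step, so that unrelated domains elsewhere in the hits do not influence the clusters.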

Blindly excluding sequences from the same organism is dangerous, because they may be paralogs. But again, it depends on what you want to answer.


Thanks for your useful answer. When blasting short sequences such as bacterial 16S rRNA, all the retrieved sequences can have identical E-values and scores. If I want to build phylogenetic trees, what should I do with these sequences?


I would analyze the whole set manually, for example: 1) align ALL sequences; 2) group them by sequence similarity (most sequence alignment programs do that by default); 3) for every group of sequences referring to the same GeneID, take one, with a length closest to the average sequence length in the alignment and without regions clearly dissimilar to the other sequences in the alignment; 4) for every group of nearly identical sequences from the same species, take only one, from a representative strain (e.g. from all E. coli strains, take only the one from the K-12 strain).
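The length criterion in point 3 can be sketched in a few lines of Python. This is only an illustration under some assumptions: the records are given as `(group_key, seq_id, sequence)` tuples, where the group key might be a GeneID or a species name, and the function name `pick_representatives` is invented for the example. It does not check for dissimilar regions or choose a representative strain; those parts still need manual inspection.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def pick_representatives(records: List[Tuple[str, str, str]]) -> Dict[str, str]:
    """For each group, return the ID of the sequence whose length is closest
    to the average length over ALL sequences in the set.

    records: (group_key, seq_id, ungapped_sequence) tuples.
    Ties are resolved in favour of the sequence listed first.
    """
    if not records:
        return {}
    average_length = mean(len(seq) for _, _, seq in records)

    groups = defaultdict(list)
    for group_key, seq_id, seq in records:
        groups[group_key].append((seq_id, seq))

    return {
        group_key: min(members, key=lambda m: abs(len(m[1]) - average_length))[0]
        for group_key, members in groups.items()
    }
```

For example, with two sequences of length 4 and 8 in one group and an overall average of about 5.7, the sequence of length 4 would be kept for that group.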
