Removing highly similar protein sequences from a FASTA file
7.7 years ago

Hi,

I have a FASTA file with about 500 protein sequences that were compiled from the results of BLAST searches. Many of these sequences are quite similar to each other.

I would like to trim this FASTA file so that only one representative of each group of highly similar sequences is left behind.

My approach so far has been to use the pairwise alignment tool in Biopython, but this quickly becomes intractable, as comparing every sequence against every other one means about 250000 (500 x 500) alignments.

Are there any alternatives/better methods to go about this?

Thanks!

fasta protein sequences sequence blast alignment • 2.8k views

CD-HIT is a popular choice to do this clustering.
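For anyone landing here later, a minimal sketch of how a CD-HIT run could be set up from Python. The file names `hits.fasta` and `hits_nr90.fasta` and the helper `cd_hit_command` are hypothetical; the flags `-i`, `-o`, `-c` and `-n` and the word sizes per threshold follow my reading of the CD-HIT user guide, so check them against your installed version before relying on this.

```python
def cd_hit_command(in_fasta, out_fasta, identity=0.9):
    """Build a cd-hit command line (hypothetical helper, not part of CD-HIT)."""
    # Word size (-n) suggested by the CD-HIT user guide for each
    # identity threshold range of the protein program.
    if identity >= 0.7:
        word_size = 5
    elif identity >= 0.6:
        word_size = 4
    elif identity >= 0.5:
        word_size = 3
    elif identity >= 0.4:
        word_size = 2
    else:
        raise ValueError("thresholds below 0.4 are outside cd-hit's range")
    return [
        "cd-hit",
        "-i", in_fasta,       # input FASTA
        "-o", out_fasta,      # output FASTA with one representative per cluster
        "-c", str(identity),  # sequence identity threshold
        "-n", str(word_size), # word size matching the threshold
    ]


# Assumed file names; print the command so it can be inspected or pasted.
print(" ".join(cd_hit_command("hits.fasta", "hits_nr90.fasta", identity=0.9)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `cd-hit` is on your PATH. Lowering the identity value collapses more sequences into each cluster.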


CD-HIT worked great, thanks!


How did you compile your sequences from BLAST?


Just used the BLAST tools in Biopython.


Are you looking for the most similar sequence to a db hit? If you used an output format such as tabular, you can go back to your BLAST results and group all the sequences that hit the same database subject sequence. Then pick the one with the highest alignment percentage, fewest mismatches, lowest e-value, etc. Also, -max_target_seqs 1 should help you here to get one hit per query.
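A rough sketch of that grouping step, assuming the results were written in the default 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `best_query_per_subject` and the toy rows are made up for illustration, and ranking by e-value and then bitscore is just one possible choice.

```python
import csv
import io

# Column order of the default BLAST tabular output (assumed here).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_query_per_subject(handle):
    """Return {subject_id: query_id} keeping the best-scoring query per subject."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment headers
        hit = dict(zip(COLUMNS, row))
        # Lower e-value wins; ties are broken by higher bitscore.
        key = (float(hit["evalue"]), -float(hit["bitscore"]))
        subject = hit["sseqid"]
        if subject not in best or key < best[subject][0]:
            best[subject] = (key, hit["qseqid"])
    return {subject: query for subject, (_, query) in best.items()}


# Toy data, invented purely to exercise the function.
toy = io.StringIO(
    "q1\tsubjA\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t400\n"
    "q2\tsubjA\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-80\t330\n"
    "q3\tsubjB\t85.0\t150\t22\t1\t1\t150\t5\t154\t1e-50\t210\n"
)
print(best_query_per_subject(toy))
```

With a real file, pass an open file handle instead of the `io.StringIO` object. The surviving query IDs can then be used to pull the matching records out of the FASTA file.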


You don't have to iterate 250000 times; the maximum number of comparisons is 124750, i.e. 500 x 499 / 2 (you compare the first against the remaining 499, then the second against the remaining 498, etc.).
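To make the counting concrete, here is a small sketch that enumerates the unordered pairs and then applies a greedy redundancy filter. Note the loud assumption: `difflib.SequenceMatcher` from the standard library is used only as a crude stand-in for a real pairwise alignment score, and `greedy_filter`, the 0.9 threshold and the toy sequences are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Number of unordered pairs among 500 sequences: 499 + 498 + ... + 1.
n = 500
n_pairs = sum(1 for _ in combinations(range(n), 2))
print(n_pairs)  # 124750, the same as n * (n - 1) // 2


def greedy_filter(records, threshold=0.9):
    """Keep a sequence only if it is below `threshold` similarity to all kept ones.

    `records` is a list of (name, sequence) tuples. Longer sequences are
    visited first so that they become the representatives.
    """
    kept = []
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        is_redundant = any(
            # Stand-in similarity; swap in a real alignment identity here.
            SequenceMatcher(None, seq, kept_seq, autojunk=False).ratio() >= threshold
            for _, kept_seq in kept
        )
        if not is_redundant:
            kept.append((name, seq))
    return kept


# Toy sequences: B differs from A by a single residue, C is unrelated.
toy = [
    ("A", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
    ("B", "MKTAYIAKQRQISFVKTHFSRQLEERLGLIEVQ"),
    ("C", "GGGGGPPPPPWWWWWCCCCCHHHHH"),
]
print([name for name, _ in greedy_filter(toy)])
```

Because each sequence is only compared against the representatives kept so far, the number of comparisons is usually well below the 124750 worst case when the set is highly redundant.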

