Removing highly similar protein sequences from a fasta file

0

7.7 years ago

sudarshan1993 ▴ 10

Hi,

I have a FASTA file with about 500 protein sequences that were compiled from BLAST searches. Many of these sequences are quite similar to each other.

I would like to trim this FASTA file so that only one copy of each group of highly similar sequences is left.

My approach so far has been to use the pairwise alignment tool in Biopython, but this quickly becomes intractable, as I would have to run 250000 comparisons (500 × 500).

Are there any alternatives or better methods for this?

Thanks!

fasta protein sequences sequence blast alignment • 2.8k views

ADD COMMENT • link 7.7 years ago by sudarshan1993 ▴ 10

1

CD-HIT is a popular choice to do this clustering.
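
For anyone calling it from a script, here is a minimal sketch of how the command could be assembled in Python. The file names `hits.fasta` and `hits_nr90.fasta`, the 0.9 identity threshold, and the helper names are assumptions for illustration, not something from this thread; `-i`, `-o`, `-c`, and `-n` are standard cd-hit options, and `-n 5` is the word size suggested for thresholds of 0.7 and above.

```python
import subprocess


def build_cd_hit_command(in_fasta, out_fasta, identity=0.9, word_length=5):
    """Return the argument list for a cd-hit run on protein sequences."""
    return [
        "cd-hit",
        "-i", in_fasta,          # input FASTA file
        "-o", out_fasta,         # output FASTA, one representative per cluster
        "-c", str(identity),     # sequence identity threshold
        "-n", str(word_length),  # word size; 5 suits thresholds of 0.7 and above
    ]


def run_cd_hit(in_fasta, out_fasta, identity=0.9):
    """Run cd-hit; requires the cd-hit executable to be on PATH."""
    subprocess.run(build_cd_hit_command(in_fasta, out_fasta, identity), check=True)


# Hypothetical file names, shown only to illustrate the resulting command line.
print(" ".join(build_cd_hit_command("hits.fasta", "hits_nr90.fasta")))
```

cd-hit also writes a `.clstr` file next to the output, listing which sequences were merged into each cluster.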

ADD REPLY • link 7.7 years ago by GenoMax 141k

0

CD-HIT worked great, thanks!

ADD REPLY • link 7.7 years ago by sudarshan1993 ▴ 10

0

How did you compile your sequences from blast?

ADD REPLY • link 7.7 years ago by st.ph.n ★ 2.7k

0

I just used the BLAST tools in Biopython.

ADD REPLY • link 7.7 years ago by sudarshan1993 ▴ 10

0

Are you looking for the sequence most similar to a database hit? If you used an output format such as tabular, you can go back to your BLAST results and group all the query sequences that hit the same database subject sequence, then pick the one with the highest percent identity, fewest mismatches, lowest e-value, etc. Also, -max_target_seqs 1 should help here by limiting the output to one hit per query.
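
The grouping step could look roughly like this in Python. This is a sketch that assumes the results were written with `-outfmt 6` and its default twelve columns; the function name, the example IDs, and the tie-breaking order (lowest e-value, then highest percent identity, then fewest mismatches) are illustrative choices, not a fixed rule.

```python
import csv
import io
from collections import defaultdict

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_query_per_subject(handle):
    """Map each subject ID to the row of the query that hit it best."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        groups[row["sseqid"]].append(row)
    best = {}
    for subject, rows in groups.items():
        # Lowest e-value first, then highest identity, then fewest mismatches.
        best[subject] = min(
            rows,
            key=lambda r: (float(r["evalue"]), -float(r["pident"]), int(r["mismatch"])),
        )
    return best


# Made-up rows standing in for a real results file.
example = (
    "q1\tsubjA\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t380\n"
    "q2\tsubjA\t91.5\t200\t17\t0\t1\t200\t1\t200\t1e-90\t350\n"
    "q3\tsubjB\t99.0\t150\t1\t0\t1\t150\t1\t150\t1e-80\t300\n"
)
best = best_query_per_subject(io.StringIO(example))
print({subject: row["qseqid"] for subject, row in best.items()})
```

With a real file, pass `open("results.tsv")` (a hypothetical path) instead of the `io.StringIO` wrapper.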

ADD REPLY • link 7.7 years ago by st.ph.n ★ 2.7k

0

You don't have to iterate 250000 times; the maximum would be 124750 comparisons (you compare the first against the remaining 499, then the second against the remaining 498, etc., which sums to 500 × 499 / 2).
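
The pair count is easy to check in Python. This is a small sketch that assumes exactly 500 sequences and that comparing A with B is the same as comparing B with A; the variable and function names are made up for the example.

```python
from itertools import combinations


def count_unordered_pairs(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n = 500
all_against_all = n * n                    # every sequence against every sequence, self included
distinct_pairs = count_unordered_pairs(n)  # each pair compared once, no self-comparisons

# Enumerating the pairs explicitly gives the same count as the formula.
enumerated = sum(1 for _ in combinations(range(n), 2))

print(all_against_all, distinct_pairs, enumerated)  # 250000 124750 124750
```

`itertools.combinations` is also a convenient way to loop over the pairs themselves, since it yields each pair once and never pairs a sequence with itself.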

ADD REPLY • link 7.7 years ago by Markus ▴ 320
