Question: Removing highly similar protein sequences from a fasta file
2.8 years ago, sudarshan1993 (10) wrote:

Hi,

I have a FASTA file with about 500 protein sequences that were compiled from the results of BLAST searches. Many of these sequences are quite similar to each other.

I would like to trim this FASTA file so that only one representative of each group of highly similar sequences is left behind.

My approach so far has been to use the pairwise alignment tool in Biopython, but this quickly becomes intractable, since comparing every sequence against every other means 500 x 500 = 250000 alignments.

Are there any alternatives/better methods to go about this?

Thanks!


CD-HIT is a popular choice to do this clustering.
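
For readers who want to script this, here is a minimal sketch of how a CD-HIT run could be assembled from Python. It assumes the `cd-hit` executable is installed and on the PATH; the file names `hits.fasta` and `hits_nr90.fasta`, the helper `build_cd_hit_command`, and the 0.9 identity threshold are illustrative choices, not something specified in this thread. The `-i`, `-o`, `-c` and `-n` options are the standard CD-HIT input, output, identity-threshold and word-length flags.

```python
import shlex

def build_cd_hit_command(in_fasta, out_fasta, identity=0.9, word_size=5):
    """Return the argument list for a cd-hit run (hypothetical helper; does not execute it)."""
    return [
        "cd-hit",
        "-i", in_fasta,         # input FASTA with all sequences
        "-o", out_fasta,        # one representative per cluster is written here
        "-c", str(identity),    # sequence identity threshold for clustering
        "-n", str(word_size),   # word length; 5 is the usual choice for thresholds 0.7-1.0
    ]

# File names are placeholders for illustration.
cmd = build_cd_hit_command("hits.fasta", "hits_nr90.fasta")
print(shlex.join(cmd))

# To actually run it (requires cd-hit to be installed and the input file to exist):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Lowering the threshold collapses more sequences into each cluster; for thresholds below 0.7 the word length should be reduced as described in the CD-HIT user guide.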

written 2.8 years ago by genomax (68k)

CD-HIT worked great, thanks!

written 2.8 years ago by sudarshan1993 (10)

How did you compile your sequences from BLAST?

written 2.8 years ago by st.ph.n (2.5k)

I just used the BLAST tools in Biopython.

written 2.8 years ago by sudarshan1993 (10)

Are you looking for the most similar sequence to a database hit? If you used an output format such as tabular output, you can go back to your BLAST results and group all the sequences that hit the same database subject sequence. Then pick the one with the highest percent identity, the fewest mismatches, the lowest e-value, etc. Also, -max_target_seqs 1 should help you here to get one hit per query.
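
The grouping described above could be sketched roughly as follows. This assumes the results were saved in the default 12-column BLAST+ tabular layout (`-outfmt 6`); the function name `best_query_per_subject`, the ranking order (e-value, then percent identity, then mismatches) and the example rows are made up for illustration.

```python
from collections import defaultdict

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_query_per_subject(lines):
    """Map each subject ID to the query ID of its best-ranked hit."""
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comment lines
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        hits[row["sseqid"]].append(row)

    def rank(row):
        # Lowest e-value first, then highest % identity, then fewest mismatches.
        return (float(row["evalue"]), -float(row["pident"]), int(row["mismatch"]))

    return {subject: min(rows, key=rank)["qseqid"] for subject, rows in hits.items()}

# Invented example rows, tab-separated as BLAST would write them.
example = [
    "q1\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380",
    "q2\tsubjA\t91.0\t200\t18\t0\t1\t200\t1\t200\t1e-80\t310",
    "q3\tsubjB\t99.0\t150\t1\t0\t1\t150\t1\t150\t1e-70\t290",
]
best = best_query_per_subject(example)
print(best)
```

In practice the lines would come from the saved results file, e.g. `best_query_per_subject(open(path))` with `path` pointing at your own tabular output.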

written 2.8 years ago by st.ph.n (2.5k)

You don't have to iterate 250000 times; the maximum would be 124750 comparisons (you compare the first against the remaining 499, then the second against the remaining 498, etc.).
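
As a quick check of the arithmetic: the number of unordered pairs among n sequences is n(n-1)/2, which is 124750 for n = 500; counting each sequence against itself as well gives n(n+1)/2 = 125250. A small sketch using only the standard library (the sequence IDs are placeholders):

```python
from itertools import combinations
from math import comb

n = 500
pairs = comb(n, 2)          # distinct pairs: n * (n - 1) / 2
with_self = pairs + n       # also comparing each sequence to itself: n * (n + 1) / 2
print(pairs, with_self)

# itertools.combinations yields each unordered pair exactly once,
# so it can drive the pairwise comparison loop directly.
ids = ["seqA", "seqB", "seqC", "seqD"]   # placeholder IDs
small_pairs = list(combinations(ids, 2))
print(len(small_pairs), small_pairs[0])
```

This halves the work relative to the full 500 x 500 grid, but the number of alignments still grows quadratically with the number of sequences.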

written 2.8 years ago by Markus (250)