Dear Friends, I have a bunch of proteins from one particular family in different species, e.g. Species 1: 240, Species 2: 100 and Species 3: 300. I want to do a comparative analysis between the species. Before that, I must remove redundant or overlapping proteins within each species. I have already selected the proteins from the respective species using HMMER. I need to screen them further to get the actual number of proteins of this family present in each species. For that, I am using BLAST+ to blast each species against itself and treat sequences that share more than 70% identity as one group, so that I keep one sequence and remove all the others in that group. In this way I can trim the protein sequences, e.g. Species 1 from 240 down to around 70 or 80, depending on their similarity. I am not sure whether this idea is correct. If it is, is it possible to set the percentage identity threshold in BLAST+ using -perc_identity 70? I have tried it, but it shows "Unknown argument: "perc_identity"". Could anyone suggest how to get rid of similar or repeated sequences in the family?
Consider an alternative approach: collapse the redundant sequences with the CD-HIT suite of tools, which clusters sequences at a chosen identity threshold and keeps one representative per cluster.
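As a rough illustration of what such a run could look like, here is a minimal Python sketch that only assembles a CD-HIT command line for a 70% identity threshold. The file names are hypothetical, and the helper `build_cd_hit_command` is made up for this example. `-i`, `-o`, `-c` and `-n` are CD-HIT options; the word size of 5 follows the CD-HIT guide's suggestion for thresholds between 0.7 and 1.0, so please check it against your installed version.

```python
def build_cd_hit_command(fasta_in, fasta_out, identity=0.7, word_size=5):
    """Return the argument list for one CD-HIT run.

    -c is the sequence identity threshold and -n is the word size.
    """
    return [
        "cd-hit",
        "-i", fasta_in,
        "-o", fasta_out,
        "-c", str(identity),
        "-n", str(word_size),
    ]


if __name__ == "__main__":
    # Hypothetical file names: one FASTA file of HMMER hits per species.
    cmd = build_cd_hit_command("species1_hits.fasta", "species1_nr70.fasta")
    # Only print the command here; it could be passed to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Running this once per species should leave one representative sequence per cluster in each output file, which gives the reduced protein counts to compare.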
If -perc_identity is not recognized, you most likely made a mistake when typing the command line, or you are using a program that does not support this option. Have you also looked at the -qcov_hsp_perc option? Also, BLAST may not be the best tool for this if you are considering the whole sequence, because BLAST is a local alignment algorithm.
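If you still want to go the BLAST route, one workaround is to filter on the identity column of the tabular output instead of relying on a command-line flag (as far as I know, -perc_identity is a blastn option and blastp does not accept it). The sketch below is only one possible way to do it, under these assumptions: the all-vs-all search was run with something like `-outfmt "6 qseqid sseqid pident length qlen slen"`, coverage is measured on the shorter of the two sequences, and the longest member of each group is kept. The function name and the 0.8 coverage cutoff are my own choices, not anything BLAST prescribes.

```python
from collections import defaultdict


def cluster_blast_hits(lines, min_identity=70.0, min_coverage=0.8):
    """Group sequences linked by hits above both thresholds (single linkage).

    Each line must hold the tab-separated columns
    qseqid, sseqid, pident, length, qlen, slen.
    Returns {representative_id: sorted list of member ids}.
    """
    parent = {}
    seq_len = {}

    def find(x):
        # Walk up to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        qid, sid, pident, aln_len, qlen, slen = line.split("\t")[:6]
        qlen, slen = int(qlen), int(slen)
        for name, length in ((qid, qlen), (sid, slen)):
            parent.setdefault(name, name)
            seq_len[name] = length
        # Assumption: coverage is alignment length over the shorter sequence.
        coverage = int(aln_len) / min(qlen, slen)
        if qid != sid and float(pident) >= min_identity and coverage >= min_coverage:
            parent[find(qid)] = find(sid)

    clusters = defaultdict(list)
    for name in parent:
        clusters[find(name)].append(name)

    result = {}
    for members in clusters.values():
        # Keep the longest member; ties are broken alphabetically.
        rep = min(members, key=lambda m: (-seq_len[m], m))
        result[rep] = sorted(members)
    return result


if __name__ == "__main__":
    # Made-up hits for three sequences, not real data.
    example = [
        "A\tA\t100.0\t200\t200\t200",
        "A\tB\t85.0\t190\t200\t195",
        "B\tB\t100.0\t195\t195\t195",
        "C\tC\t100.0\t150\t150\t150",
        "A\tC\t40.0\t60\t200\t150",
    ]
    print(cluster_blast_hits(example))
```

Note that this is single-linkage grouping, so a chain of pairwise hits can pull two sequences into the same group even when they share less than 70% identity with each other. The coverage check is there because a local alignment can show high identity over only a short stretch of the sequence.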