I have a list of protein sequences and a list of peptides I want to assign to the protein sequences.
I tried cd-hit-2d (http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide) for this:
    cd-hit-2d -d 0 -i proteins.faa -i2 peptides.faa -o matched_peptides -c 1.0

For some peptides I get no match at all, even though there should be one. For example, pep5 does not appear in the cluster file:
    $ grep -P "pep5\." *clstr
    $

    >pep5
    SVVLLDEVEK
    >PROKKA_43260
    ...APY___SVVLLDEVEK___AHPDVLEMFFQVFDKGLMDDAEGREIDFRNTVIIL TSNAGSQHIMQACFEKDEELGGAV...
Could this be because the peptide sequence is too short?
For other peptides, I noticed that they appear in only one cluster even though they match multiple proteins identically:
    $ grep -P "pep7\." *clstr -C 5
    >Cluster 57774
    0 502aa, >PROKKA_167265... *
    1 12aa, >pep7... at 100.00%
    2 11aa, >pep318... at 100.00%
However, pep7 should match more than once:
    >pep7
    VVNPLGEPIDGK
    >PROKKA_167265
    ...ILGEYKHIEEGFTVKRTGTIFSVPVG EGMLGR____VVNPLGEPIDGK____GPIQT...
    >PROKKA_136748
    ....VILGEYKHIEEGFTVKRTGTIFSVPVG EAMLGR____VVNPLGEPIDGK____GPILTDKVRPV...
Is it general cd-hit behavior to assign a sequence only to the first cluster it matches, and is there a way to control or change this?
Overall, is this tool (https://research.bioinformatics.udel.edu/peptidematch/commandlinetool.jsp) better suited to this kind of task? I would also appreciate other suggestions, ideally ones that allow a certain number of mismatches.
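
To make the desired behavior concrete, here is a minimal Python sketch of what I am after: report every protein and position where a peptide occurs, optionally tolerating a few substitutions. The function name `find_peptide_hits`, its `max_mismatches` parameter, and the toy protein fragments (cut from the examples above, with the `...` and `___` markers removed) are my own illustration and not part of cd-hit or PeptideMatch. It is a brute-force scan, so it would likely be too slow for a large proteome, and it only handles substitutions, not gaps.

```python
from typing import Dict, List, Tuple

# (protein name, 0-based start position, number of mismatches)
Hit = Tuple[str, int, int]


def find_peptide_hits(
    proteins: Dict[str, str],
    peptides: Dict[str, str],
    max_mismatches: int = 0,
) -> Dict[str, List[Hit]]:
    """Return, for each peptide, every protein window that matches it
    with at most `max_mismatches` substitutions (no gaps)."""
    hits: Dict[str, List[Hit]] = {name: [] for name in peptides}
    for pep_name, pep in peptides.items():
        for prot_name, prot in proteins.items():
            # Slide a peptide-sized window along the protein.
            for start in range(len(prot) - len(pep) + 1):
                window = prot[start:start + len(pep)]
                mismatches = sum(a != b for a, b in zip(pep, window))
                if mismatches <= max_mismatches:
                    hits[pep_name].append((prot_name, start, mismatches))
    return hits


# Toy fragments only, not the full sequences.
proteins = {
    "PROKKA_43260": "APYSVVLLDEVEKAHPDVLEMFF",
    "PROKKA_167265": "EGMLGRVVNPLGEPIDGKGPIQT",
    "PROKKA_136748": "EAMLGRVVNPLGEPIDGKGPILTDKVRPV",
}
peptides = {
    "pep5": "SVVLLDEVEK",
    "pep7": "VVNPLGEPIDGK",
}

if __name__ == "__main__":
    for pep_name, pep_hits in find_peptide_hits(proteins, peptides).items():
        for prot_name, start, mismatches in pep_hits:
            print(pep_name, prot_name, start, mismatches)
```

With this kind of output, pep7 would be listed once per protein it occurs in, which is the behavior I cannot get from the cd-hit-2d cluster file.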