Hi, I'm trying to make a list of the amino acid sequences that are present in all of a selection of FASTA files I have. To make things confusing, the files all use different feature IDs. Is there a script I can run that would be capable of doing this?
1) Change the headers in each FASTA file so that they include the file name.
For example, if a sequence in file1.fasta has the header >protein1, change it to >file1_protein1
2) Then merge all the fasta files into one file.
3) Run CD-HIT on the merged file, setting the sequence identity threshold with the -c option (for example, -c 0.9 for 90% identity).
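Steps 1 and 2 can be done in a single pass. Here is a minimal Python sketch, assuming plain-text FASTA files and using the file name without its extension as the prefix; the function name merge_fastas and the output path are my own placeholders, not part of any tool.

```python
from pathlib import Path


def merge_fastas(fasta_paths, merged_path):
    """Prefix every header with its file's stem and write all records to one file."""
    with open(merged_path, "w") as out:
        for path in map(Path, fasta_paths):
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        # ">protein1" in file1.fasta becomes ">file1_protein1"
                        line = f">{path.stem}_{line[1:]}"
                    # Make sure a file without a trailing newline does not
                    # run into the first header of the next file.
                    out.write(line if line.endswith("\n") else line + "\n")
```

For example, merge_fastas(["file1.fasta", "file2.fasta"], "merged.fasta") would produce the input for step 3.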
CD-HIT will then generate a cluster file listing which sequences are similar to each other, along with the representative sequence of each cluster. Because each sequence header already carries the name of the file it came from, you can easily see which proteins are present in multiple FASTA files.
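To pull out only the clusters that contain a sequence from every file, the .clstr output can be parsed with a short script. This is a rough sketch, assuming the usual .clstr layout (a ">Cluster N" line followed by member lines such as "0	100aa, >file1_protein1... *") and headers prefixed as in step 1; the function name shared_clusters is made up for illustration. By default CD-HIT shortens the names it writes to the .clstr file, so running it with -d 0 helps keep the full prefixed ID.

```python
import re
from collections import defaultdict

# Member lines look like: "0\t100aa, >file1_protein1... *"
MEMBER = re.compile(r">(.+?)\.\.\.")


def shared_clusters(clstr_lines, prefixes):
    """Return {cluster name: member IDs} for clusters covering every prefix."""
    # Try longer prefixes first so "file10" is not mistaken for "file1".
    ordered = sorted(prefixes, key=len, reverse=True)
    clusters = defaultdict(list)
    current = None
    for line in clstr_lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]
        elif line and current is not None:
            match = MEMBER.search(line)
            if match:
                clusters[current].append(match.group(1))

    shared = {}
    for name, members in clusters.items():
        seen = set()
        for member in members:
            for prefix in ordered:
                if member.startswith(prefix + "_"):
                    seen.add(prefix)
                    break
        # Keep the cluster only if every input file contributed a sequence.
        if seen == set(ordered):
            shared[name] = members
    return shared
```

You would call it with the open .clstr file and the list of file prefixes, for example shared_clusters(open("clustered.clstr"), ["file1", "file2"]), where clustered.clstr is whatever output name you gave CD-HIT.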