I have three protein .fasta files (file1.fasta, file2.fasta, and file3.fasta), each in the following format:
>MOLCJCMO_00002 [gene=cysS] [locus_tag=HCW_RS04050] [protein=cysteine--tRNA ligase] [protein_id=WP_014660951.1][location=complement(860733..862130)] [gbkey=CDS]
MKIFDTHLKQKVPFEPLIENQATIYVCGPTVYDDAHLGHARSAIVFDLLERTLTLSGYQV
TLIKNFTDIDDKIINKANQENIDITELSARYIQSYNQDMNALNIKTPNFKPKASHYIDAM
>MOLCJCMO_00003 [locus_tag=HCW_RS04040] [protein=ABC transporter ATP-binding protein] [protein_id=WP_014660949.1][location=855780..856547] [gbkey=CDS]
MFLEIEGLSFSYAPSKPILRDITFSVPKNCITSVLAPNGTGKTTLFKCILGILRPDAHSI
MRVDKQELGVLKPHEKARLIAYIPQEESNVFNFSVLDFVLMGKAARLNLFGAPSAKHIQE
My goal is to obtain the list of amino acid sequences that are in file2.fasta and/or file3.fasta but not in file1.fasta. Ideally, these would represent the genes present in file2 and/or file3 but absent from file1. I am hesitant to compare headers alone, because the same amino acid sequence could have different headers in different files.
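For concreteness, the comparison I have in mind could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions: sequences are compared by exact string identity after joining wrapped lines, upper-casing, and dropping a trailing `*`, so near-identical proteins (for example, a single substitution) would still be reported as different. The function names `parse_fasta`, `load_sequences`, and `unique_to_queries` are my own hypothetical helpers, not part of any existing tool.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs, joining sequences wrapped over several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_sequences(path):
    """Map each normalized sequence in a file to the (path, header) pairs carrying it."""
    by_sequence = {}
    with open(path) as handle:
        for header, sequence in parse_fasta(handle):
            # Assumption: case and a trailing stop symbol '*' are not meaningful.
            key = sequence.upper().rstrip("*")
            by_sequence.setdefault(key, []).append((path, header))
    return by_sequence


def unique_to_queries(reference_path, query_paths):
    """Return sequences found in any query file but absent from the reference file."""
    reference = set(load_sequences(reference_path))
    unique = {}
    for path in query_paths:
        for sequence, sources in load_sequences(path).items():
            if sequence not in reference:
                unique.setdefault(sequence, []).extend(sources)
    return unique


# Example usage with the file names from this post:
# hits = unique_to_queries("file1.fasta", ["file2.fasta", "file3.fasta"])
# for sequence, sources in hits.items():
#     for path, header in sources:
#         print(f">{header} (from {path})")
#     print(sequence)
```

Keying on the sequence rather than the header is what avoids the header-mismatch problem described above; the headers are kept only so that each reported sequence can be traced back to its file.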
My files contain hundreds of sequences, so I need an automated way to do this. Is anyone aware of tools for comparing gene overlap between .fasta files? Thank you.