Find overlapping sequences between pairs of protein fasta files
22 months ago
suzuBell ▴ 60

I have three protein .fasta files (file1.fasta, file2.fasta, and file3.fasta), each in the following format:

>MOLCJCMO_00002 [gene=cysS] [locus_tag=HCW_RS04050] [protein=cysteine--tRNA ligase] [protein_id=WP_014660951.1][location=complement(860733..862130)] [gbkey=CDS]
MKIFDTHLKQKVPFEPLIENQATIYVCGPTVYDDAHLGHARSAIVFDLLERTLTLSGYQV
TLIKNFTDIDDKIINKANQENIDITELSARYIQSYNQDMNALNIKTPNFKPKASHYIDAM
>MOLCJCMO_00003 [locus_tag=HCW_RS04040] [protein=ABC transporter ATP-binding protein] [protein_id=WP_014660949.1][location=855780..856547] [gbkey=CDS]
MFLEIEGLSFSYAPSKPILRDITFSVPKNCITSVLAPNGTGKTTLFKCILGILRPDAHSI
MRVDKQELGVLKPHEKARLIAYIPQEESNVFNFSVLDFVLMGKAARLNLFGAPSAKHIQE


My goal is to obtain the list of amino acid sequences that are in file2.fasta and/or file3.fasta but not in file1.fasta. These would ideally represent the genes present in file2 and/or file3 but not in file1. I am hesitant to simply look at the headers to accomplish this task in case the same amino acid sequence could have different headers in different files.

My files contain hundreds of sequences, so I need to find an automated way to accomplish this task. I am curious if anyone is aware of tools to accomplish this task (comparing gene overlaps between .fasta files). Thank you.

fasta overlap protein • 775 views
22 months ago

Linearize the FASTA sequences (one record per line), extract the sequence column with cut and sort it, then use comm to get the sequences common to two files (or unique to one of them).
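
The same idea (linearize, then compare by exact sequence rather than by header) can be sketched in Python. This is only a minimal sketch, not a tested pipeline: the function names `linearize` and `sequences_not_in_reference` are made up for illustration, and it assumes that "the same sequence" means an exact, case-insensitive string match over the full length.

```python
from collections import defaultdict


def linearize(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sequences_not_in_reference(reference_lines, *query_line_sets):
    """Map each query sequence absent from the reference to its headers.

    Assumption: identity means an exact full-length match, ignoring case.
    """
    reference = {seq.upper() for _, seq in linearize(reference_lines)}
    novel = defaultdict(list)
    for lines in query_line_sets:
        for header, seq in linearize(lines):
            if seq.upper() not in reference:
                novel[seq.upper()].append(header)
    return dict(novel)


if __name__ == "__main__":
    # File names taken from the question; adjust the paths as needed.
    with open("file1.fasta") as f1, open("file2.fasta") as f2, open("file3.fasta") as f3:
        result = sequences_not_in_reference(f1, f2, f3)
    for seq, headers in result.items():
        print(">" + " | ".join(headers))
        print(seq)
```

Keying on the sequence rather than the header means that a protein present in both file2 and file3 is reported once, with both headers attached.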

22 months ago
Mensur Dlakic ★ 14k

When you say sequences that overlap between the two files, I assume you mean sequences that are identical.

Make a BLAST database from file1. Search file2 and file3 against it, keeping a single match (the top hit) per query and writing tabular output. Sort the output by the third column (percent identity between the query and its match). Every query with 100 in that column should be removed, because it has an identical match in file1; the queries that remain are the ones absent from file1.
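
The filtering step can be sketched in Python as follows. It assumes the standard 12-column BLAST tabular format (`-outfmt 6`), where column 1 is the query ID and column 3 is the percent identity; the function name and the `min_identity` parameter are hypothetical. Two caveats: queries with no hit at all never appear in the BLAST output, so the full list of query IDs has to be supplied separately, and 100% identity over a local alignment does not by itself guarantee that the two sequences are identical over their full length.

```python
def queries_without_identical_hit(blast_lines, all_query_ids, min_identity=100.0):
    """Return the query IDs that have no hit at or above min_identity.

    blast_lines: lines of tab-separated BLAST output (assumed -outfmt 6:
                 qseqid, sseqid, pident, ...).
    all_query_ids: every query ID from the searched FASTA file, including
                   queries that produced no hit.
    """
    identical = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        qseqid, pident = fields[0], float(fields[2])
        if pident >= min_identity:
            identical.add(qseqid)
    return [q for q in all_query_ids if q not in identical]


if __name__ == "__main__":
    # Made-up example rows, for illustration only.
    rows = [
        "q1\ts1\t100.000\t60\t0\t0\t1\t60\t1\t60\t1e-30\t120",
        "q2\ts7\t98.333\t60\t1\t0\t1\t60\t1\t60\t1e-28\t115",
    ]
    print(queries_without_identical_hit(rows, ["q1", "q2", "q3"]))
```

To be stricter about full-length identity, one option is to also compare the alignment length (column 4) with the query and subject lengths.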