Question

Find overlapping sequences between pairs of protein fasta files

4.4 years ago

suzuBell ▴ 60

I have three protein .fasta files (file1.fasta, file2.fasta, and file3.fasta), each in the following format:

>MOLCJCMO_00002 [gene=cysS] [locus_tag=HCW_RS04050] [protein=cysteine--tRNA ligase] [protein_id=WP_014660951.1][location=complement(860733..862130)] [gbkey=CDS]
MKIFDTHLKQKVPFEPLIENQATIYVCGPTVYDDAHLGHARSAIVFDLLERTLTLSGYQV
TLIKNFTDIDDKIINKANQENIDITELSARYIQSYNQDMNALNIKTPNFKPKASHYIDAM
>MOLCJCMO_00003 [locus_tag=HCW_RS04040] [protein=ABC transporter ATP-binding protein] [protein_id=WP_014660949.1][location=855780..856547] [gbkey=CDS]
MFLEIEGLSFSYAPSKPILRDITFSVPKNCITSVLAPNGTGKTTLFKCILGILRPDAHSI
MRVDKQELGVLKPHEKARLIAYIPQEESNVFNFSVLDFVLMGKAARLNLFGAPSAKHIQE

My goal is to obtain the list of amino acid sequences that are in file2.fasta and/or file3.fasta but not in file1.fasta. These would ideally represent the genes present in file2 and/or file3 but not in file1. I am hesitant to simply look at the headers to accomplish this task in case the same amino acid sequence could have different headers in different files.

My files contain hundreds of sequences, so I need to find an automated way to accomplish this task. I am curious if anyone is aware of tools to accomplish this task (comparing gene overlaps between .fasta files). Thank you.

fasta overlap protein • 1.6k views

ADD COMMENT • link updated 4.4 years ago by Mensur Dlakic ★ 27k • written 4.4 years ago by suzuBell ▴ 60

score 0 · Answer 1 · 2019-12-11

4.4 years ago

Pierre Lindenbaum 161k

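Linearize the FASTA files so that each record sits on one line, extract the sequence column with cut and sort it, then compare the sorted lists with comm, which can report the lines common to two files or the lines unique to either one.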
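The same idea (compare the sequences themselves, ignoring headers) can also be sketched in Python instead of a shell pipeline. This is only a minimal illustration, not the pipeline described above: the function names `read_fasta`, `normalize`, and `not_in_reference` are hypothetical, and it assumes that "same gene" means an exact string match after upper-casing and dropping a trailing `*`.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def normalize(seq: str) -> str:
    # Assumption: letter case and a trailing stop symbol (*) should not matter.
    return seq.upper().rstrip("*")


def not_in_reference(
    reference: Iterable[str], queries: Iterable[Iterable[str]]
) -> Dict[str, List[str]]:
    """Map each sequence absent from `reference` to the query headers carrying it."""
    ref_seqs = {normalize(seq) for _, seq in read_fasta(reference)}
    unique: Dict[str, List[str]] = {}
    for query in queries:
        for header, seq in read_fasta(query):
            key = normalize(seq)
            if key not in ref_seqs:
                unique.setdefault(key, []).append(header)
    return unique
```

With the three files from the question, this could be called as `not_in_reference(open("file1.fasta"), [open("file2.fasta"), open("file3.fasta")])`. Because the comparison is an exact match, two proteins differing by a single residue count as different; the BLAST approach in the other answer is the place to look if near-identical matches should also be excluded.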

score 0 · Answer 2 · 2019-12-11

When you say sequences that overlap between the two files, I assume you mean sequences that are identical.

Make a BLAST database from file1. Search file2 and file3 against it, requesting a single match (the top hit) per query and tabular output. Sort the output by the third column (percent identity between the query and its match). Every query with 100 in that column should be removed, because it has an identical match in file1; whatever remains, together with any queries that produced no hit at all, is absent from file1.
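The filtering step can be sketched in Python as follows. This is a hedged illustration that assumes the default 12-column BLAST tabular layout (`-outfmt 6`), in which column 1 is the query ID and column 3 is percent identity; the function names `identical_queries` and `queries_to_keep` are hypothetical. One caveat: percent identity is computed over the aligned region only, so a value of 100 may come from a partial-length match, and a stricter filter would also compare the alignment length with the query length.

```python
from typing import Iterable, List, Set


def identical_queries(blast_lines: Iterable[str], min_identity: float = 100.0) -> Set[str]:
    """Return query IDs that have a hit with at least `min_identity` percent identity."""
    hits: Set[str] = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # Assumed layout: qseqid, sseqid, pident, ... (default tabular columns)
        qseqid, pident = fields[0], float(fields[2])
        if pident >= min_identity:
            hits.add(qseqid)
    return hits


def queries_to_keep(all_query_ids: Iterable[str], blast_lines: Iterable[str]) -> List[str]:
    """Keep queries without an identical hit, including those with no hit at all."""
    identical = identical_queries(blast_lines)
    return [q for q in all_query_ids if q not in identical]
```

Passing the full list of query IDs matters because queries with no hit do not appear in the tabular output, yet they are exactly the sequences missing from file1.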