Question: Find overlapping sequences between pairs of protein fasta files
suzuBell wrote, 6 months ago:

I have three protein .fasta files (file1.fasta, file2.fasta, and file3.fasta), each in the following format:

>MOLCJCMO_00002 [gene=cysS] [locus_tag=HCW_RS04050] [protein=cysteine--tRNA ligase] [protein_id=WP_014660951.1][location=complement(860733..862130)] [gbkey=CDS]
MKIFDTHLKQKVPFEPLIENQATIYVCGPTVYDDAHLGHARSAIVFDLLERTLTLSGYQV
TLIKNFTDIDDKIINKANQENIDITELSARYIQSYNQDMNALNIKTPNFKPKASHYIDAM
>MOLCJCMO_00003 [locus_tag=HCW_RS04040] [protein=ABC transporter ATP-binding protein] [protein_id=WP_014660949.1][location=855780..856547] [gbkey=CDS]
MFLEIEGLSFSYAPSKPILRDITFSVPKNCITSVLAPNGTGKTTLFKCILGILRPDAHSI
MRVDKQELGVLKPHEKARLIAYIPQEESNVFNFSVLDFVLMGKAARLNLFGAPSAKHIQE

My goal is to obtain the list of amino acid sequences that are in file2.fasta and/or file3.fasta but not in file1.fasta. These would ideally represent the genes present in file2 and/or file3 but not in file1. I am hesitant to simply look at the headers to accomplish this task in case the same amino acid sequence could have different headers in different files.

My files contain hundreds of sequences, so I need to find an automated way to accomplish this task. I am curious if anyone is aware of tools to accomplish this task (comparing gene overlaps between .fasta files). Thank you.

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 6 months ago:

Linearize the FASTA sequences (one record per line), extract the sequence column with cut and sort it, then compare the sorted lists with comm to get the common sequences.
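The same idea can be sketched in Python instead of cut/sort/comm. This is only a minimal illustration, not the pipeline described above: it assumes "the same sequence" means an exact, case-insensitive string match, and the function names `read_fasta` and `unique_to_others` are made up for this example.

```python
import sys


def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines ("linearizing")."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_to_others(reference_lines, *other_files_lines):
    """Return {sequence: first header seen} for sequences absent from the reference.

    Assumption: sequences are compared as exact strings, ignoring case;
    headers are ignored for the comparison.
    """
    reference = {seq.upper() for _, seq in read_fasta(reference_lines)}
    unique = {}
    for lines in other_files_lines:
        for header, seq in read_fasta(lines):
            seq = seq.upper()
            if seq not in reference:
                unique.setdefault(seq, header)
    return unique


# Hypothetical usage: python unique_seqs.py file1.fasta file2.fasta file3.fasta
if __name__ == "__main__" and len(sys.argv) >= 3:
    with open(sys.argv[1]) as handle:
        reference_lines = handle.readlines()
    others = []
    for path in sys.argv[2:]:
        with open(path) as handle:
            others.append(handle.readlines())
    for seq, header in unique_to_others(reference_lines, *others).items():
        print(f">{header}\n{seq}")
```

Because the comparison is done on the sequences only, two records with different headers but the same amino acid sequence are treated as the same protein, which is the concern raised in the question.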

Mensur Dlakic (USA) wrote, 6 months ago:

When you say sequences that overlap between the two files, I assume you mean sequences that are identical.

Make a BLAST database out of file1. Search file2 and file3 against it, keeping a single match (the top hit) per query and requesting tabular output. Sort the output by the third column (percent sequence identity between the query and its match). Every query with 100 in that column should be removed, because it has an identical match in file1; the sequences that remain are the ones absent from file1.
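The filtering step could look roughly like the sketch below. It assumes the default 12-column BLAST+ tabular layout (qseqid, sseqid, pident, length, ...), for example as produced by something like `blastp -query file2.fasta -db file1_db -outfmt 6 -max_target_seqs 1`, where `file1_db` is a placeholder database name. The function name and the optional `query_lengths` check are my own additions, not part of the answer above: percent identity is computed over the aligned region only, so a 100 can also come from a partial match, and comparing the alignment length with the query length is one way to guard against that (it does not check the subject length).

```python
def queries_with_identical_hit(blast_lines, query_lengths=None):
    """Return the set of query IDs whose first reported hit has 100% identity.

    blast_lines: lines of default 12-column tabular BLAST output.
    query_lengths: optional {query_id: length}; when given, the alignment
    must also cover the whole query (an extra check, assumed for this sketch).
    """
    identical = set()
    seen = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid = fields[0]
        pident = float(fields[2])   # third column: percent identity
        length = int(fields[3])     # fourth column: alignment length
        if qseqid in seen:
            continue                # only look at the first (top) hit per query
        seen.add(qseqid)
        full_length = query_lengths is None or length == query_lengths.get(qseqid)
        if pident == 100.0 and full_length:
            identical.add(qseqid)
    return identical
```

The IDs to keep are then all query IDs from file2 and file3 minus the returned set. Queries with no hit at all never appear in the tabular output, so they end up in the kept set as well.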
