Question

Create list of sequences present in multiple FASTA files

0

9.8 years ago

biostars • 0

Hi, I'm trying to make a list of the amino acid sequences that are present in all of a selection of FASTA files I have. To complicate things, the same sequence has a different feature ID in each file. Is there a script I can run that can do this?

Thanks!

fasta • 2.9k views

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 9.8 years ago by biostars • 0

1

Please clarify your specific problem or add additional details to highlight exactly what you need.

ADD REPLY • link 9.8 years ago by Pierre Lindenbaum 161k

1

Your comments are supposed to be pasted in these boxes based on the forum rules.

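Yes, you can automate the renaming with sed. Anchoring the pattern with ^ restricts the change to header lines, which start with ">".

E.g.:

sed 's/^>/>file1_/' file1.fasta > file1NamesChanged.fasta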
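
The sed command above renames the headers of one file. A minimal sketch of how it might be looped over many files and merged in one go is shown here. The function name `prefix_and_merge` is hypothetical, and the sketch assumes every file ends in `.fasta` and that the file names contain no characters that are special to sed (such as `/` or `&`).

```shell
#!/bin/sh
# Prefix every FASTA header with the name of the file it came from and
# write all records to standard output.
# prefix_and_merge is a hypothetical helper name, not part of any tool.
prefix_and_merge() {
    for f in "$@"; do
        # file1.fasta -> file1 (strip the directory and the .fasta extension)
        name=$(basename "$f" .fasta)
        # Only lines starting with '>' (header lines) are changed.
        sed "s/^>/>${name}_/" "$f"
    done
}

# Example call (file names are placeholders):
#   prefix_and_merge file1.fasta file2.fasta > merged.fasta
```

Writing to standard output leaves the original files untouched, and the redirect produces the merged file that the CD-HIT step expects.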

ADD REPLY • link updated 2.5 years ago by Ram 43k • written 9.8 years ago by Prakki Rama ★ 2.7k

Ram · Answer 1 · 2014-06-27

0

9.8 years ago

Prakki Rama ★ 2.7k

One possibility:

1. Change the headers in each FASTA file according to the file name. For example, if a sequence in file1.fasta is >protein1, change it to >file1_protein1.
2. Merge all the FASTA files into one file.
3. Run CD-HIT on the merged file, setting parameters such as the sequence identity threshold.

CD-HIT will then produce a list of clusters, showing which sequences are similar to each other and which one is the representative of each cluster. Because each sequence header already carries its file name, you can easily see which proteins are present in multiple FASTA files.
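
Reading the cluster list by eye gets tedious with many files, so here is a rough sketch of how the `.clstr` file that CD-HIT writes could be filtered for clusters containing a member from every input file. The function name `clusters_in_all_files` is hypothetical. The sketch assumes headers were prefixed as `file_` as described above, that the file names themselves contain no underscore, and that member lines follow the usual `.clstr` layout (`0	123aa, >file1_protein1... *`).

```shell
#!/bin/sh
# Assumed earlier step (file names are placeholders):
#   cd-hit -i merged.fasta -o clustered -c 1.0 -d 0
# which writes the cluster list to clustered.clstr.

# clusters_in_all_files CLSTR_FILE NUMBER_OF_INPUT_FILES
# Print each cluster whose members come from all input files.
# Hypothetical helper, not part of CD-HIT.
clusters_in_all_files() {
    awk -v n="$2" '
        function report() {
            if (cluster != "" && count == n) print cluster ": " members
        }
        /^>Cluster/ {
            report()                      # finish the previous cluster
            cluster = substr($0, 2)
            members = ""; count = 0
            split("", seen)               # portable way to empty an array
            next
        }
        {
            start = index($0, ">")
            if (start == 0) next
            id = substr($0, start + 1)
            sub(/\.\.\..*$/, "", id)      # drop the trailing "... *" part
            file = id
            sub(/_.*$/, "", file)         # file prefix = text before first "_"
            if (!(file in seen)) { seen[file] = 1; count++ }
            members = members (members == "" ? "" : " ") id
        }
        END { report() }
    ' "$1"
}

# Example call for three input files:
#   clusters_in_all_files clustered.clstr 3
```

Each output line names a cluster followed by its member IDs, so the different feature IDs that one sequence has across files end up side by side.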

~Prakki Rama.

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 9.8 years ago by Prakki Rama ★ 2.7k

0

Thanks Prakki, is there a way to automate the renaming? There are quite a few sequences, and renaming them manually would take a long time.

ADD REPLY • link updated 2.5 years ago by Ram 43k • written 9.8 years ago by biostars • 0