I have a large FASTA file of 16S sequences and I want to retrieve sequences using a list of organism names. Do you know of a script that can do this?
EDIT:
Headers look like this:
>S000000859 Bacillus sp. USC14; AF346495
sequence
>S000001027 Paenibacillus borealis; KN25; AJ011325
sequence
And I have a list like the following:
Paenibacillus borealis
Paenibacillus sp. 1-18
Paenibacillus sp. 1-49
Paenibacillus sp. A9
Paenibacillus sp. Aloe-11
I want to retrieve the sequences whose headers match the names in the list.
Can you show us some example headers from your FASTA file? Right now I'm thinking of putting all the organism names in a text file and simply using grep -f.
I believe I was not clear. My 16S fasta file has sequences of hundreds of species. I have to retrieve just dozens of them using a list of organisms of interest.
If each FASTA sequence is on one line, grep alone will do.
If it's not, it would be easier to reformat each sequence onto one line and then use grep.
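For the reformatting step, here is a minimal Python sketch, assuming a plain-text FASTA file where every record starts with a ">" header line. The function names and file paths are just placeholders for illustration, not part of any existing tool.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined onto one line."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # Emit the last record in the file.
    if header is not None:
        yield header, "".join(chunks)


def write_linearized(in_path, out_path):
    """Rewrite a FASTA file so that every record is exactly two lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in linearize_fasta(src):
            dst.write(f"{header}\n{seq}\n")
```

After that, every record is exactly two lines, so a header match plus one line of trailing context (for example `-A 1` in GNU grep) picks up the whole sequence. Note that GNU grep prints a `--` separator between non-adjacent matches, which would need to be stripped from the output.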
A good way to start. Very simple. Thanks. The problem is when the sequences have different sizes.
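If the sequences span a varying number of lines, one option is to skip grep and filter whole records in a short script. This is only a sketch: it assumes the organism name is everything between the first space and the first ";" in the header, as in the examples above, and the function names and file paths are made up for illustration.

```python
def parse_fasta(lines):
    """Yield (header, sequence_lines) for each record, whatever its length."""
    header, chunks = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, chunks
            header, chunks = line, []
        elif header is not None and line.strip():
            chunks.append(line)
    if header is not None:
        yield header, chunks


def organism_name(header):
    """Assumed layout: '>ID organism name; other; fields'."""
    parts = header[1:].split(None, 1)  # drop '>' and split off the ID
    if len(parts) < 2:
        return ""
    return parts[1].split(";", 1)[0].strip()


def select_by_organism(fasta_lines, names):
    """Yield records whose organism name is in the list (case-insensitive)."""
    wanted = {n.strip().lower() for n in names if n.strip()}
    for header, seq_lines in parse_fasta(fasta_lines):
        if organism_name(header).lower() in wanted:
            yield header, seq_lines


def extract(fasta_path, names_path, out_path):
    """Write the selected records to out_path, keeping the original line wrapping."""
    with open(names_path) as names, open(fasta_path) as fasta, \
            open(out_path, "w") as out:
        for header, seq_lines in select_by_organism(fasta, list(names)):
            out.write(header + "\n")
            for seq_line in seq_lines:
                out.write(seq_line + "\n")
```

Because the comparison is on the whole name rather than a substring, a list entry such as "Paenibacillus sp. A9" will not also pull in a longer name that merely starts with it. The flip side is that a misspelled name in the list silently matches nothing, so it is worth checking how many records come back.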