Hi All,
I did find some similar questions, but I don't know how to apply the answers to my problem.
I guess it's very simple for everyone here.
I have a FASTA file containing 2000+ sequences with headers like:
>VFG0001 fhaB filamentous hemagglutinin/adhesin [FHA] [Bordetella pertussis]
The content in the last [ ] is the bacterial species name.
Also, for some other sequences, the header looks like:
>VFG2304 misL putative autotransporter [MisL] [Salmonella enterica (serovar typhimurium)]
So I just want to extract the sequences whose headers contain the keyword [Escherichia coli].
I also want to extract the sequences whose headers contain the keyword [Salmonella enterica] (I don't care about the serovar within the parentheses) into a separate FASTA file.
How can I do this simply on a Mac server (preferably without Python or Perl, just a simple script)?
Thank you
Thank you so much. It is very helpful. Just a little modification was needed, because with [ the terminal said "unmatched [".
Hi, actually there is a problem with the code. After extracting the sequences, only the first line of each sequence is shown with its header in the new file, so none of the extracted sequences are complete.
Well, I figured out that the sequences in the original database span multiple lines instead of a single line, so I converted all sequences to single-line form and the problem was solved.
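For anyone reading later, the single-line conversion mentioned above can be sketched with awk alone. This is only a minimal sketch, not necessarily the method used in this thread: the function name linearize_fasta and the file names in the comments are made up for illustration, and it assumes Unix line endings.

```shell
# linearize_fasta FILE: rewrite a FASTA file so that every record is
# exactly two lines, the header and the whole sequence on one line.
# (Hypothetical helper name; assumes Unix line endings.)
linearize_fasta() {
  awk '
    /^>/ {                       # header line
      if (open) printf "\n"      # finish the previous sequence line, if any
      print                      # print the header unchanged
      open = 0
      next
    }
    {                            # sequence line: append without a newline
      printf "%s", $0
      open = 1
    }
    END { if (open) printf "\n" }  # finish the last sequence line
  ' "$1"
}

# Example call (file names are placeholders):
#   linearize_fasta input.fasta > input.single_line.fasta
```

After this step each header is followed by exactly one sequence line, so any approach that takes a matching header plus the next line returns complete records.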
Or you could use awk, which avoids having to convert the multi-line sequences to single lines:
https://infoplatter.wordpress.com/2013/10/15/extracting-specific-fasta-records-from-a-multi-fasta-file/comment-page-1/
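In case the link goes stale, here is a rough sketch of that kind of awk approach. It is not copied from the linked post; the function name extract_fasta and the file names are assumptions for illustration. It uses awk's index() function, which does plain substring matching, so the square brackets need no escaping.

```shell
# extract_fasta PATTERN FILE: print every FASTA record whose header
# contains PATTERN as a literal substring, including all sequence lines.
# (Hypothetical helper name. PATTERN should not contain backslashes,
# because awk -v interprets escape sequences in the value.)
extract_fasta() {
  awk -v pat="$1" '
    /^>/ { keep = (index($0, pat) > 0) }  # re-decide at every header line
    keep                                   # print lines while the current record matches
  ' "$2"
}

# Example calls (file names are placeholders):
#   extract_fasta "[Escherichia coli]" input.fasta > ecoli.fasta
#   extract_fasta "[Salmonella enterica" input.fasta > salmonella.fasta
```

Leaving off the closing bracket in the second call is deliberate: it matches both [Salmonella enterica] and [Salmonella enterica (serovar typhimurium)], so the serovar is ignored.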
Sorry, I've been spending too much time analyzing short-read sequences, and overlooked multi-line FASTA. Glad you worked out a solution.