I know there are a ton of awk one-liners on here for splitting a fasta file, but here's one I have not been able to get to work or find an answer for.
A simple tool would be helpful, but I tried to use seqtk to no avail.
Please excuse me if someone has answered this one before, but I have googled and searched Biostars for a while with no awk solution in sight.
A collaborator passed me a fasta file with the output from OrthoMCL: clustered genes in a single fasta file. Within each cluster, the genes are listed alphabetically by the organism they were found in. Genes for some organisms are not present, so I can't split on the number of total organisms represented across all the fasta headers.
Any advice on how to split a fasta file where the first two characters of the header are >A, so that I end up with many fasta files, one per cluster of genes? There are multiple organisms with A as the first letter, so I don't want to split on every A. I want to split only before the first A in a series.
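To make the splitting rule concrete, here is a rough Python sketch of the behavior I'm after, not a tested solution. The function names (split_clusters, write_clusters) and the cluster_N.fasta output naming are made up for illustration. It also assumes every cluster contains at least one organism starting with A; a cluster without one would get merged into the previous cluster.

```python
def split_clusters(lines, first_letter="A"):
    """Group FASTA lines into clusters.

    A new cluster starts at a header beginning with '>' + first_letter,
    but only when the previous header did NOT begin with it, i.e. at the
    first 'A' header in a run of consecutive 'A' headers.
    """
    prefix = ">" + first_letter
    clusters = []
    current = []
    prev_was_prefix = False  # did the previous header start with the prefix?
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            is_prefix = line.startswith(prefix)
            # Split only before the first matching header in a series.
            if is_prefix and not prev_was_prefix and current:
                clusters.append(current)
                current = []
            prev_was_prefix = is_prefix
        current.append(line)
    if current:
        clusters.append(current)
    return clusters


def write_clusters(fasta_path, out_prefix="cluster_"):
    """Write each cluster to its own file (hypothetical naming scheme)."""
    with open(fasta_path) as handle:
        clusters = split_clusters(handle)
    for i, records in enumerate(clusters, start=1):
        with open(f"{out_prefix}{i}.fasta", "w") as out:
            out.write("\n".join(records) + "\n")
    return len(clusters)
```

With headers in the order >Aaa, >Abb, >Bcc, >Abb, >Czz, this would produce two files: the first holding the >Aaa, >Abb and >Bcc records, the second holding the >Abb and >Czz records.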