I'm working with FASTA files containing sequences from different organisms, and for some organisms I have more than one sequence. I would like to keep only one representative sequence per organism, and it should be the longest one in each case. I've spent some time looking for an answer and learning some command-line tools, but I couldn't get it right. My file looks roughly like this:
>Mouse01
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTGTGGTGATGATG
>Mouse02
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTG
>Chimpanzee
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human01
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human02
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATGCACGTGAGA
In this case, I'd like to keep Mouse01, Chimpanzee, and Human02.
The workflow, I think, would be:
1) Identify sequences of the same species by regex (e.g. Mouse, Human)
2) Count sequence length for species with more than one match
3) Keep only the longest sequence in species with more than one match, leave the rest (e.g. Chimpanzee) untouched.
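As a rough illustration of the three steps above, here is a minimal Python sketch (not a shell one-liner). It assumes that the species name is the first word of the header with any trailing digits removed (so Mouse01 and Mouse02 both map to Mouse), that a sequence may span several lines, and that ties are resolved in favour of the first sequence seen. The function names read_fasta, species_of, longest_per_species and filter_fasta are made up for this example, and the file paths are placeholders.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # a sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def species_of(header):
    """Step 1: derive a species key from a header.

    Assumption: the key is the first word of the header without trailing
    digits, e.g. "Mouse01" -> "Mouse". Adjust the regex for other naming schemes.
    """
    words = header.split()
    name = words[0] if words else ""
    return re.sub(r"\d+$", "", name)


def longest_per_species(records):
    """Steps 2 and 3: keep the longest record for each species key."""
    best = {}  # dicts keep insertion order (Python 3.7+)
    for header, seq in records:
        key = species_of(header)
        # Strict ">" means the first sequence wins when lengths are equal.
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return list(best.values())


def filter_fasta(in_path, out_path):
    """Read in_path, write one longest sequence per species to out_path."""
    with open(in_path) as src:
        kept = longest_per_species(read_fasta(src))
    with open(out_path, "w") as dst:
        for header, seq in kept:
            dst.write(f">{header}\n{seq}\n")
```

Calling filter_fasta("input.fasta", "longest.fasta") on the example file should keep Mouse01, Chimpanzee and Human02. Species with a single sequence pass through unchanged because their only record is trivially the longest.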
I bet there must be some magical recipe or one-liner to do this from the command line, but what would it look like?
Thanks from a very, very rookie learner of bioinformatics tools.