I've found several threads on this (rather simple) topic, but none quite simple enough. I want to remove duplicate entries from a FASTA file based on their one-line >name header, which in my case is numeric (gi).
Based on Pierre Lindenbaum's posts in other threads, you would linearise the sequences and then sort by column 1, the header (as opposed to column 2 if you wanted to sort by sequence). And then you'd employ sort unique and sed? For example, my file contains entries like this:
>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC
>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC
There are no spaces between characters or lines in my file.
So the steps would be: linearize, sort using options -k1,1 -u, then move back to FASTA using tr. Is this correct?
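For reference, here is a minimal sketch of those three steps as a single pipeline. It is not Pierre Lindenbaum's exact command: the function name dedupe_fasta is made up for illustration, the linearizing step is done with awk, and LC_ALL=C is only there to make the sort order reproducible. It assumes headers contain no tab characters. Note that wrapped sequences come out on a single line, and that when two records share a header but differ in sequence, only one of them is kept.

```shell
# Hypothetical helper (name is illustrative): reads FASTA on stdin and
# writes FASTA with one record per distinct header on stdout.
#
# Step 1 (awk):  linearize, one record per line as "header<TAB>sequence".
# Step 2 (sort): keep a single line per header (field 1) with -k1,1 -u.
# Step 3 (tr):   turn the tab back into a newline to restore FASTA layout.
dedupe_fasta() {
  awk '/^>/ { if (n++) printf "\n"; printf "%s\t", $0; next }
       { printf "%s", $0 }
       END { if (n) printf "\n" }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -u |
    tr '\t' '\n'
}
```

Usage would look something like dedupe_fasta < input.fa > deduped.fa, where both file names are placeholders.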
Update: the above (from Pierre Lindenbaum) does the job. Very good.
The only thing: I dropped the -t ' ' and -f flags in sort (they didn't seem necessary?). Also, the first line of the output file was a lone >, which I deleted manually.
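On the dropped flags: as far as I understand, -f makes sort ignore case and -t sets the field separator, so with purely numeric headers that contain no blanks, leaving them out should not change which records are kept. As for the stray line, instead of deleting it by hand, a small filter could do it. The sketch below assumes the unwanted line consists of exactly one > character and nothing else; the function name drop_empty_headers is made up for illustration.

```shell
# Hypothetical helper (name is illustrative): copies stdin to stdout,
# dropping any line that is exactly ">" with nothing after it.
drop_empty_headers() {
  sed '/^>$/d'
}
```

It could be appended to the end of the existing pipeline, e.g. ... | drop_empty_headers > deduped.fa, where the output file name is a placeholder.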