I have hundreds of large fasta files (with almost a hundred sequences in each). In some files, several sequences share the same NAME even though the sequences themselves are NOT duplicates. I have found similar posts, but they are about removing duplicate sequences: "How To Remove The Same Sequences In The Fasta Files?" and "Remove Duplicates In Fasta (Protein Seq.)".
For each name that occurs more than once, I simply want to keep one sequence and remove the others with that name (the sequences themselves are not identical). I'd think this could be a simple bash command, but I can't find a solution. I first tried to count the non-unique sequence names, but even that didn't work:
grep ">" fasta.file | uniq -c
1 >Sample1
1 >Sample2
1 >Sample3
1 >Sample1
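One likely reason every name is counted once here is that `uniq` only collapses identical lines that are adjacent, so the headers generally need to be sorted first. The following is a minimal sketch, assuming every header line starts with `>` and that the whole header line is the name; the function names `count_names` and `duplicated_names` are made up for illustration.

```shell
# count_names and duplicated_names are hypothetical helper names,
# not part of any existing tool.

# Print each header line with the number of times it occurs,
# highest count first. Sorting before uniq is what makes repeated
# names adjacent so they can be counted together.
count_names() {
    grep '^>' "$1" | sort | uniq -c | sort -rn
}

# Print only the header lines that occur more than once.
duplicated_names() {
    grep '^>' "$1" | sort | uniq -d
}
```

With the sample below, `duplicated_names fasta.file` should list only `>Sample1`.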
Any suggestions for a simple bash script or another approach? Here's my sample fasta:
>Sample1
tctctccttt
>Sample2
tctctcattt
>Sample3
tctctccttt
>Sample1
tctctccttg
So in this case, I want to keep the first Sample1 and remove the second. I have very limited scripting/bioinformatics experience, so I would greatly appreciate any help.
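For reference, the behaviour described above (keep the first record for each name, drop later records with the same name) could be sketched with `awk` roughly as follows. This is only a sketch under stated assumptions: headers start with `>`, the entire header line counts as the name, and a sequence may span several lines. The function name `dedup_by_name` and the `.dedup.fasta` output suffix are hypothetical.

```shell
# dedup_by_name is a hypothetical helper name.
# On each header line, 'keep' becomes true only if that header has not
# been seen before; every line (header or sequence) is then printed
# only while 'keep' is true. Later records with a repeated name are
# dropped regardless of their sequence.
dedup_by_name() {
    awk '/^>/ { keep = !seen[$0]++ } keep' "$1"
}

# Possible use over many files (file extension and output names are assumptions):
#   for f in *.fasta; do dedup_by_name "$f" > "${f%.fasta}.dedup.fasta"; done
```

On the sample above this would keep the first Sample1 record (tctctccttt) and drop the second (tctctccttg). Writing to a new file rather than overwriting the input makes it easy to compare the two before deleting anything.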