I have a multi-FASTA file that contains similar nucleotide sequences with different identifiers. I want to remove all of the duplicated sequences (based on the nucleotide sequence rather than the identifier). I have searched for similar problems on the internet and also on the Biostar page "How To Remove The Same Sequences In The Fasta Files?". I tried the following command on a small data set that contains only 14 entries, and it worked well with that file. But when I run it on the original file it does not work: it returns all the sequences instead of only the unique ones. (I know that my file contains around 40000 sequences, of which 1500 have similar nucleotide sequences.)
What I tried is:
sed 's/\(^>.*$\)/@\1#/' test_file | tr -d '\n' | tr "@" "\n" | tr "#" "\t" | sed '/^$/d' | sort -u -k 2,2
Does anybody have an idea what I did wrong?
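
One detail that may be worth checking, offered as a guess rather than a diagnosis: without a -t option, sort splits fields on blanks, so if the headers in the full file contain spaces, -k 2,2 refers to part of the header instead of the sequence. Windows line endings (\r) left inside the joined sequence could also keep otherwise identical sequences from matching. Below is a minimal sketch of the same pipeline with an explicit tab separator and \r removed. The function name dedup_fasta is hypothetical, and the sketch assumes that no header contains "@" or "#" and that duplicates match exactly, including letter case.

```shell
# dedup_fasta: hypothetical helper; prints one record per distinct sequence.
# Assumes headers contain neither '@' nor '#', and that duplicates are exact matches.
dedup_fasta() {
  # 1. mark each header line: '@' in front, '#' behind
  # 2. join all lines, dropping '\n' and any '\r' from Windows line endings
  # 3. put each record on its own line as: header <TAB> sequence
  # 4. drop the empty first line
  # 5. keep one line per distinct sequence, using TAB as the field separator
  # 6. turn each record back into two-line FASTA
  sed 's/\(^>.*$\)/@\1#/' "$1" \
    | tr -d '\n\r' \
    | tr '@' '\n' \
    | tr '#' '\t' \
    | sed '/^$/d' \
    | LC_ALL=C sort -t "$(printf '\t')" -u -k 2,2 \
    | tr '\t' '\n'
}
```

It could be called as dedup_fasta test_file > unique.fasta (the output file name is only an example). Which of several identical records is kept is not guaranteed to be the first one in the file.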