I have a multi-FASTA file that contains the same nucleotide sequences under different identifiers. I want to remove all the duplicated sequences (based on the nucleotide sequence rather than the identifier). I have searched for similar problems on the internet and also on the Biostar page How To Remove The Same Sequences In The Fasta Files?. I tried the following command on a small data set that contains only 14 entries, and it worked well. But when I run it on the original file it does not work: it returns all the sequences instead of only the unique ones. (I know that my file contains around 40000 sequences, of which about 1500 have duplicated nucleotide sequences.)
What I tried is:
sed 's/\(^>.*$\)/@\1#/' test_file | tr -d '\n' | tr "@" "\n" | tr "#" "\t" | sed '/^$/d' | sort -u -k 2,2
Does anybody have an idea what I did wrong?
It sounds odd that the command works on a small subset but not on the full file. Have you tried the other answers in the post you mention? I'd say the fastx-toolkit solution is better than the sed one for big datasets, since sed can become quite slow.
On the other hand, if you are still interested in doing the task with sed, I'd take a small subset of the result (one that still contains duplicates), run the command on it again, and try to figure out what is going on. It's difficult to help without knowing the input and the output.
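One thing worth checking, as a guess since the input isn't shown: by default sort splits fields at blanks, not only at tabs. If the headers in the big file contain spaces (for example ">id some description"), then -k 2,2 compares part of the header rather than the sequence, and almost every line looks unique. Below is a minimal sketch of the same pipeline with an explicit tab separator. The function name dedup_by_sequence and the final tr step that turns the output back into two-line FASTA are my additions, not part of the original command. Like the original, it assumes the headers contain no "@", "#" or tab characters.

```shell
# Hypothetical helper (not from the thread): the original pipeline, but with
# an explicit tab separator so that sort's field 2 is always the sequence.
dedup_by_sequence() {
  tab=$(printf '\t')                  # a literal tab, portable across sh/bash
  sed 's/\(^>.*$\)/@\1#/' "$1" \
    | tr -d '\n' \
    | tr '@' '\n' \
    | tr '#' '\t' \
    | sed '/^$/d' \
    | sort -t "$tab" -u -k 2,2 \
    | tr '\t' '\n'                    # added step: back to header/sequence lines
}
```

It could be run as dedup_by_sequence test_file > unique.fasta, where unique.fasta is just a placeholder name. If the output still contains duplicates, other differences between the sequences (upper vs lower case, Windows line endings) would be the next things to look at.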
I second that. I think the fastx toolkit is very suitable for this. Why not re-use tools? Someone else has put effort and energy into developing this tool and you can use it for free! Don't waste your time reinventing the wheel, I would say...