How to remove duplicated sequences from multi fasta file?
6.9 years ago
tcf.hcdg ▴ 70

Hello

I have a multi-FASTA file that contains identical nucleotide sequences under different identifiers. I want to remove all of the duplicated sequences (based on the nucleotide sequence rather than the identifier). I have searched for similar problems on the internet and also on the Biostar page How To Remove The Same Sequences In The Fasta Files?. I tried the following command on a small data set containing only 14 entries, and it worked well. But when I run it on the original file it does not work: it returns all of the sequences instead of only the unique ones. (I know that my file contains around 40000 sequences, of which about 1500 have duplicated nucleotide sequences.)

What I tried is:

sed 's/\(^>.*\)/@\1#/' test_file | tr -d '\n' | tr "@" "\n" | tr "#" "\t" | sed '/^$/d' | sort -u -k 2,2


Does anybody have an idea what I did wrong?

Thanks


It is odd that the command works on a small subset but not on the full file. Have you tried the other answers in the post you mentioned? I'd say the fastx-toolkit solution is better than the sed one for big datasets, since sed can become quite slow.

On the other hand, if you are still interested in doing the task with sed, I'd take a small subset of the result (one that still contains duplicates), run the command on it again, and try to figure out what is going on. It's difficult to help you without knowing the input and the output.
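
One thing worth checking, as a guess without seeing the data: `sort -k 2,2` splits fields on blanks by default, so if the headers in the full file contain spaces, field 2 is a piece of the header rather than the sequence, and nothing gets collapsed. The small test file may simply have had space-free headers. Below is a minimal awk sketch that avoids field splitting altogether. `dedupe_fasta` is a hypothetical helper name, and the sketch assumes that sequences should be compared case-insensitively, that wrapped sequence lines belong to one record, and that the first record of each duplicate group is the one to keep.

```shell
# Keep the first record for each distinct nucleotide sequence.
# dedupe_fasta is a hypothetical helper name, not part of any toolkit.
dedupe_fasta() {
    awk '
        # Print the pending record unless its sequence was already seen.
        function flush_record(    key) {
            key = toupper(seq)          # assumption: compare case-insensitively
            if (header != "" && !(key in seen)) {
                seen[key] = 1
                print header
                print seq
            }
        }
        { sub(/\r$/, "") }              # drop Windows carriage returns
        /^>/ { flush_record(); header = $0; seq = ""; next }
        { gsub(/[ \t]/, ""); seq = seq $0 }   # join wrapped sequence lines
        END { flush_record() }
    ' "$1"
}
```

Usage would be something like `dedupe_fasta original.fasta > unique.fasta`, where both file names are placeholders. The output is regular FASTA with each sequence on a single line, and the whole set of unique sequences is held in memory, which should be fine for around 40000 entries.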


I second that. I think the fastx toolkit is very suitable for this. Why not re-use tools? Someone else has put effort and energy into developing this tool, and you can use it for free! Don't waste your time reinventing the wheel, I would say...

6.9 years ago

I would definitely use the dedupe.sh program included in the terrific BBMap set of utilities.