Question: How to remove duplicated sequences from multi fasta file?
tcf.hcdg (European Union) wrote 2.8 years ago:

Hello

I have a multi-FASTA file that contains identical nucleotide sequences under different identifiers. I want to remove all of the duplicated sequences (based on the nucleotides rather than the identifier). I have searched for similar problems on the internet and on the Biostar page "How To Remove The Same Sequences In The Fasta Files?". I tried the following command on a small data set containing only 14 entries, and it worked well. But when I run it on the original file it does not work: it returns all the sequences instead of only the unique ones. (I know that my file contains around 40000 sequences, of which about 1500 have the same nucleotide sequence as another entry.)

What I tried is:

sed 's/\(^>.*$\)/@\1#/' test_file | tr -d '\n' | tr "@" "\n" | tr "#" "\t" | sed '/^$/d' | sort -u -k 2,2

Does anybody have an idea what I did wrong?

Thanks

Modified 2.8 years ago by Antonio R. Franco • written 2.8 years ago by tcf.hcdg

It is odd that the command works on a small subset but not on the full file. Have you tried the other answers in the post you mention? I'd say the fastx-toolkit solution is better than the sed one for big datasets; sed can become quite slow.

On the other hand, if you are still interested in doing the task with sed, I'd take a small subset of the result (one where you still have the duplicates), run the command again, and try to figure out what is going on. It's difficult to help you without seeing the input and the output.
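
One possible cause, offered as a guess since the input is not shown: by default `sort -k 2,2` splits fields on blanks, not only on tabs. If the real headers contain spaces (for example a description after the ID), field 2 is part of the header rather than the sequence, so nearly every line looks unique. Windows line endings, or "@" and "#" characters inside headers, would also break the pipeline. Below is a sketch of the same pipeline with the tab set explicitly as the field separator. The function name dedup_by_sequence is made up for illustration, and the sketch assumes headers contain no "@", "#", or tab characters.

```shell
# Hypothetical wrapper around the pipeline from the question.
# Assumption: headers contain no "@", "#", or tab characters.
dedup_by_sequence() {
    tab=$(printf '\t')
    # Mark each header line: "@" in front, "#" behind.
    sed 's/^>.*$/@&#/' "$1" \
        | tr -d '\r\n' \
        | tr '@' '\n' \
        | tr '#' '\t' \
        | sed '/^$/d' \
        | sort -t "$tab" -u -k 2,2   # compare on the sequence column only
}
```

The output has one record per line (header, tab, sequence); piping it through `tr '\t' '\n'` turns it back into FASTA. To check whether headers with blanks are present at all, `grep -c '^>.*[[:blank:]]' your_file` counts them.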

Reply written 2.8 years ago by iraun

I second that. I think the fastx toolkit is very suitable for this. Why not re-use tools? Someone else has put effort and energy into developing this tool and you can use it for free! Don't waste your time reinventing the wheel, I would say...

Reply written 2.8 years ago by b.nota • modified 2.8 years ago
Antonio R. Franco (Spain, Universidad de Córdoba) wrote 2.8 years ago:

I would definitely use the dedupe.sh program included in the terrific BBMap set of utilities.
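
For reference, a minimal sketch of calling it. `in=` and `out=` are dedupe.sh's input and output parameters; the file names and the wrapper function run_dedupe are placeholders. As far as I know, dedupe.sh by default also absorbs sequences fully contained in longer ones and treats reverse complements as matches, so `ac=f` is added here on the assumption that contained (shorter) sequences should be kept. Check the usage text printed by running dedupe.sh without arguments before relying on these settings.

```shell
# Hypothetical wrapper; requires BBMap's dedupe.sh on the PATH.
run_dedupe() {
    if ! command -v dedupe.sh >/dev/null 2>&1; then
        echo "dedupe.sh not found on PATH" >&2
        return 127
    fi
    # ac=f: assumed to switch off absorbing contained sequences
    dedupe.sh in="$1" out="$2" ac=f
}

# Example call with placeholder file names:
# run_dedupe sequences.fasta sequences.dedup.fasta
```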

Written 2.8 years ago by Antonio R. Franco