Remove duplicates in two databases
8.3 years ago
boudica5 • 0

Hello. I have two databases (.fas) that I need to merge into one. Some sequences are present in both, and I want to keep one copy of each and delete the other. Instead of removing them manually one by one, is there a way to select all duplicates and remove them? Thank you very much.

sequences assembly database • 1.7k views

By '.fas', do you mean a FASTA file (i.e. just a file, not a database)?


Yes, I have two FASTA files.

8.3 years ago
GenoMax 142k

Dedupe.sh from BBMap should be able to do this. It will accept fasta files. Input is expected to be DNA.


Thank you, do you know if it accepts fasta files with protein sequences?

Also, quoting Brian Bushnell from the link you gave me: "However, I do have another program, filterbyname.sh, that can remove all sequences from a file that either share or don't share names with sequences in another file". Again, do you know if it accepts protein sequences?

My best regards


Dedupe strictly works on DNA (or RNA) sequences. However, filterbyname.sh should work on protein sequences if you use the aminoin flag... I've just never tested it. Note that, of course, it will only address situations where sequences share identical or similar names (depending on the mode); it ignores the sequence itself.


I'll try it! Thank you all for your kind responses.


In that case you will need to use one of the solutions indicated by Pierre (though most of those appear to be for DNA too). BBMap tools are for DNA sequences AFAIK. I have edited my original post.
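For protein FASTA files, where the BBMap tools may not apply, a small script can cover the two notions of "duplicate" discussed in this thread: identical sequence content, or identical name. The following is a minimal sketch in plain Python, not taken from any of the tools mentioned above. The function names (`read_fasta`, `merge_unique`, `merge_fasta_files`) and the `by` parameter are made up for illustration, and it assumes that "duplicate" means an exact match (case-insensitive for sequences, first word of the header for names) and that the first copy encountered is the one to keep.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def merge_unique(records: Iterable[Record], by: str = "sequence") -> List[Record]:
    """Keep the first record for each key; later duplicates are dropped.

    by="sequence": duplicates are records with the same sequence (case-insensitive).
    by="name":     duplicates are records whose header has the same first word.
    """
    if by not in ("sequence", "name"):
        raise ValueError("by must be 'sequence' or 'name'")
    seen = set()
    unique: List[Record] = []
    for header, seq in records:
        if by == "sequence":
            key = seq.upper()
        else:
            words = header.split()
            key = words[0] if words else ""
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


def merge_fasta_files(path_a: str, path_b: str, out_path: str,
                      by: str = "sequence", width: int = 60) -> int:
    """Merge two FASTA files into out_path and return the number of records kept."""
    records: List[Record] = []
    for path in (path_a, path_b):
        with open(path) as handle:
            records.extend(read_fasta(handle))
    unique = merge_unique(records, by=by)
    with open(out_path, "w") as out:
        for header, seq in unique:
            out.write(f">{header}\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")
    return len(unique)
```

With hypothetical file names, `merge_fasta_files("db1.fas", "db2.fas", "merged.fas")` would write one copy of each distinct sequence, and `by="name"` would instead keep one record per name. Since nothing here depends on the alphabet, it should behave the same for DNA and protein, but it only catches exact matches: substrings, reverse complements, or near-identical sequences are not treated as duplicates.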
