Closed: How to remove duplicated sequences without using any software
4.0 years ago
Bioinfo ▴ 20

Hello,

I have a contigs file that I want to annotate with Prokka, but I get an error message saying that the file contains a duplicate sequence ID: scaffold36|size13034. That makes sense, because I merged several assembly files and then removed duplicates with cd-hit and seqkit, and I think they did not do the job perfectly.

So what I need is to remove the duplicated sequences 'manually' (or with another tool).

Basically, I have a file like this:

>scaffold1|size1334
ACTGATGATACAGATACAGAAAGTAGAGATCGATGATAGA..
>scaffold2|size23034
ACAGATGAGACAGATTGACAGATAGAGATAGAGGATAGGACAG..
>scaffold3|size11654
ATAGCGCTCGCGCGCCGCGCGGCGGGGTAGAGAGATCTTTTGAGAGAGA..
>scaffold4|size3034
TGGGGTAGAGAGAGAGAGAGAAGAGGAAGAGAGGAGAGAGGA..
>scaffold2|size23034
ACAGATGAGACAGATTGACAGATAGAGATAGAGGATAGGACAG..
>scaffold100|size304
AAAAAAATACAGATAGAGAGAGAGAGGAGAGAGAGAG..
>scaffold67|size2400
ATAGAGAGAGAGAGAGAGAGAGAGAGAGGAGAGAGAGAGA..

I want to remove the duplicated scaffold (in this case scaffold2: the line >scaffold2|size23034 and its sequence), because it appears twice.

So the output would be:

>scaffold1|size1334
ACTGATGATACAGATACAGAAAGTAGAGATCGATGATAGA..
>scaffold2|size23034
ACAGATGAGACAGATTGACAGATAGAGATAGAGGATAGGACAG..
>scaffold3|size11654
ATAGCGCTCGCGCGCCGCGCGGCGGGGTAGAGAGATCTTTTGAGAGAGA..
>scaffold4|size3034
TGGGGTAGAGAGAGAGAGAGAAGAGGAAGAGAGGAGAGAGGA..
>scaffold100|size304
AAAAAAATACAGATAGAGAGAGAGAGGAGAGAGAGAG..
>scaffold67|size2400
ATAGAGAGAGAGAGAGAGAGAGAGAGAGGAGAGAGAGAGA..

Each scaffold should appear only once. Thank you.
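For illustration, the filtering described above could be done with a few lines of plain Python and no bioinformatics package. This is only a minimal sketch under some assumptions: the ID is taken to be the header text after ">" up to the first whitespace, the first copy of each ID is kept, and later copies are dropped without comparing their sequences. The function names `dedupe_fasta` and `dedupe_fasta_file` are made up for this example.

```python
def dedupe_fasta(lines):
    """Yield FASTA lines, keeping only the first record seen for each ID."""
    seen = set()
    keep = False  # lines before the first header are dropped
    for line in lines:
        if line.startswith(">"):
            # Assumption: the ID is the header up to the first whitespace.
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            keep = seq_id not in seen
            seen.add(seq_id)
        if keep:
            # Header and all of its sequence lines (multi-line records too).
            yield line


def dedupe_fasta_file(in_path, out_path):
    """Read a FASTA file and write a copy without repeated IDs."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(dedupe_fasta(src))
```

Calling `dedupe_fasta_file("contigs.fasta", "contigs.dedup.fasta")` (both file names are placeholders) should write a file in which each scaffold ID appears once. Because only the IDs are compared, two different sequences that happen to share an ID would also be reduced to the first one, so it may be worth checking the duplicates before discarding them.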

alignment Assembly sequencing sequence • 215 views
This thread is not open. No new answers may be added