Question: Duplicate_SeqIds Blast database
sukesh1411 (South Africa) wrote, 3.1 years ago:

Hi

I could not create a BLAST database from the nucleotide (nt) text file, which contains sequences in FASTA format. The build fails with:

Error: Duplicate seq_ids are found.

How can I remove these duplicate seq_ids? Can anyone help me with this?

blast • 1.2k views
Prasad (India) wrote, 3.1 years ago:

You can go through the thread "How To Remove The Same Sequences In The Fasta Files?".

Using cd-hit or uclust (with a 100% identity and coverage cutoff), you can remove the duplicates.
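
Note that cd-hit and uclust cluster on sequence content, so two records that share an identifier but differ in sequence would both survive. If the duplicates are in the identifiers themselves, a small filter that keeps only the first record per ID may be enough. The following is a minimal sketch, not a tested pipeline: it assumes every header line starts with ">" and that the sequence ID is the first whitespace-delimited token of the header, and the function name `drop_duplicate_ids` is made up for illustration.

```python
def drop_duplicate_ids(lines):
    """Yield FASTA lines, keeping only the first record seen for each ID.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    seen = set()
    keep = False  # lines before the first header are skipped
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            keep = seq_id not in seen  # later records with a known ID are dropped
            seen.add(seq_id)
        if keep:
            yield line
```

Because it is a generator, it can be fed an open file handle and its output written line by line, so the sequences are never loaded all at once. Only the set of IDs is held in memory, which can still be large for a file the size of nt.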


I am trying uclust. The nt file I downloaded from the BLAST database is a text file containing sequences in FASTA format, but uclust does not accept it. How can I convert the text file to FASTA format?

written 3.1 years ago by sukesh1411

Can you post a few lines from your text file as an example?

written 3.1 years ago by Prasad

gi|4|emb|X17276.1| Giant Panda satellite 1 DNA
GATCCTCCCCAGGCCCCTACACCCAATGTGGAACCGGGGTCCCGAATGAAAATGCTGCTGTTCCCTGGAGGTGTTTTCCT
GGACGCTCTGCTTTGTTACCAATGAGAAGGGCGCTGAATCCTCGAAAATCCTGACCCTTTTAATTCATGCTCCCTTACTC
ACGAGAGATGATGATCGTTGATATTTCCCTGGACTGTGTGGGGTCTCAGAGACCACTATGGGGCACTCTCGTCAGGCTTC
CGCGACCACGTTCCCTCATGTTTCCCTATTAACGAAGGGTGATGATAGTGCTAAGACGGTCCCTGTACGGTGTTGTTTCT
GACAGACGTGTTTTGGGCCTTTTCGTTCCATTGCCGCCAGCAGTTTTGACAGGATTTCCCCAGGGAGCAAACTTTTCGAT
GGAAACGGGTTTTGGCCGAATTGTCTTTCTCAGTGCTGTGTTCGTCGTGTTTCACTCACGGTACCAAAACACCTTGATTA
TTGTTCCACCCTCCATAAGGCCGTCGTGACTTCAAGGGCTTTCCCCTCAAACTTTGTTTCTTGGTTCTACGGGCTG
gi|7|emb|X51700.1| Bos taurus mRNA for bone Gla protein
GTCCACGCAGCCGCTGACAGACACACCATGAGAACCCCCATGCTGCTCGCCCTGCTGGCCCTGGCCACACTCTGCCTCGC
TGGCCGGGCAGATGCAAAGCCTGGTGATGCAGAGTCGGGCAAAGGCGCAGCCTTCGTGTCCAAGCAGGAGGGCAGCGAGG
TGGTGAAGAGACTCAGGCGCTACCTGGACCACTGGCTGGGAGCCCCAGCCCCCTACCCAGATCCGCTGGAGCCCAAGAGG
GAGGTGTGTGAGCTCAACCCTGACTGTGACGAGCTAGCTGACCACATCGGCTTCCAGGAAGCCTATCGGCGCTTCTACGG
CCCAGTCTAGAGCTTGCAGCCCTGCCCACCTGGCTGGCAGCCCCCAGCTCTGGCTTCTCTCCAGGACCCCTCCCCTCCCC
GTCATCCCCGCTGCTCTAGAATAAACTCCAGAAGAGG

written 3.1 years ago by sukesh1411

Only the ">" is missing from each header line. If your file is small, you can simply replace gi| with >gi| at the start of each header in a text editor. If the file is huge, use a one-liner such as:

perl -ne 'if (/^gi\|/) { print ">", $_ } else { print }' input_file > out_file
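
Once the headers are in place, it may help to see which identifiers actually repeat before rebuilding the database. The following is a rough sketch in Python rather than Perl: it assumes headers start with ">" and that the ID is the first whitespace-delimited token of the header (BLAST's own ID parsing may differ), and `duplicate_ids` is a hypothetical helper name.

```python
from collections import Counter


def duplicate_ids(lines):
    """Return {seq_id: count} for every ID that appears in more than one header.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    counts = Counter(
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    )
    return {seq_id: n for seq_id, n in counts.items() if n > 1}
```

Passing an open file handle (for example the out_file produced above) reads it line by line. An empty result would suggest the identifiers are already unique under this simple definition of an ID.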
written 3.1 years ago by Prasad

You have already posted the same question here. It seems you do have a FASTA file, so I assume you already have the answer to this question.

modified 3.1 years ago • written 3.1 years ago by Prasad