Question: How to remove duplicate fasta sequences from a file
Manoj (Canada) wrote 6 months ago:

Hi, I have a total of 250 files containing scaffolds in FASTA format. However, several scaffolds are duplicate sequences that appear under different headers in different files. I therefore want to compare these files and produce a single file of unique scaffold sequences. Please see the following example files:

File 1:

>NODE_265_length_56_cov_170 [gcode=11] [organism=Escherichia species] [strain=strain]
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>NODE_266_length_56_cov_121 [gcode=11] [organism=Escherichia species] [strain=strain]
GGTTCATCGATAGGAATTTAAATCCCCAAAAGACTAAAAAAGCATCACAAAACGGA
>NODE_267_length_56_cov_67 [gcode=11] [organism=Escherichia species] [strain=strain]
ATTATTTTTGTGGAGCCGGAGGAAACAAACCAGACGGTTCAGATGAGGCGCTTACG
>NODE_268_length_56_cov_43 [gcode=11] [organism=Escherichia species] [strain=strain]
TCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAG

File 2:

>NODE_250_length_56_cov_292 [gcode=11] [organism=Escherichia species] [strain=strain]
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>NODE_251_length_56_cov_157 [gcode=11] [organism=Escherichia species] [strain=strain]
GGTTCATCGATAGGAATTTAAATCCCCAAAAGACTAAAAAAGCATCACAAAACGGA
>NODE_252_length_56_cov_86 [gcode=11] [organism=Escherichia species] [strain=strain]
ATTATTTTTGTGGAGCCGGAGGAAACAAACCAGACGGTTCAGATGAGGCGCTTACG
>NODE_253_length_56_cov_29 [gcode=11] [organism=Escherichia species] [strain=strain]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

OUTPUT:

File 1:

>NODE_265_length_56_cov_170 [gcode=11] [organism=Escherichia species] [strain=strain]
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>NODE_266_length_56_cov_121 [gcode=11] [organism=Escherichia species] [strain=strain]
GGTTCATCGATAGGAATTTAAATCCCCAAAAGACTAAAAAAGCATCACAAAACGGA
>NODE_267_length_56_cov_67 [gcode=11] [organism=Escherichia species] [strain=strain]
ATTATTTTTGTGGAGCCGGAGGAAACAAACCAGACGGTTCAGATGAGGCGCTTACG
>NODE_268_length_56_cov_43 [gcode=11] [organism=Escherichia species] [strain=strain]
TCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAG
>NODE_253_length_56_cov_29 [gcode=11] [organism=Escherichia species] [strain=strain]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
h.mon (Brazil) wrote 6 months ago:

Concatenate the fastas and use Dedupe or CD-HIT.
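
With 250 files, the concatenation step can be done with a shell glob. The following is only a minimal sketch: the directory name `scaffolds/`, the `.fasta` extension, and the output name `all_scaffolds.fasta` are assumptions, and the two demo files are made up so that the snippet runs as written. It uses `awk 1` rather than `cat` so that a file lacking a final newline does not get its last sequence line glued to the next file's header.

```shell
# Demo directory with two tiny made-up FASTA files; the first file
# deliberately lacks a final newline.
dir=$(mktemp -d)
mkdir "$dir/scaffolds"
printf '>s1\nACGT' > "$dir/scaffolds/a.fasta"      # no trailing newline
printf '>s2\nTTGA\n' > "$dir/scaffolds/b.fasta"

# awk 1 prints every input line followed by a newline, so records from
# consecutive files never run together (plain cat would yield "ACGT>s2").
awk 1 "$dir"/scaffolds/*.fasta > "$dir/all_scaffolds.fasta"
```

The combined file can then be passed to Dedupe or CD-HIT as suggested above.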


I tried the following command:

cd-hit-454 -i /home/kumarm/CD-HIT/phy_D-scaffold-marge.fasta -o /home/kumarm/CD-HIT/454_reads_95 -c 0.99 -M 0 -T 7

error: Fatal Error: in diag_test_aapn_est, MAX_DIAG reached Program halted !!

(reply written 6 months ago by Manoj)
finswimmer (Germany) wrote 6 months ago:

Use seqkit:

$ cat file1.fa file2.fa | seqkit rmdup -s -o out.fa
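
For readers who cannot install seqkit right away, the idea of sequence-based deduplication (keep the first record for each distinct sequence and ignore the headers) can be sketched with plain awk. This is an illustration only, not seqkit's implementation: the helper name `dedup_by_seq` and the demo records are hypothetical, sequences are compared as exact, case-sensitive strings, and reverse complements are not treated as duplicates.

```shell
# Hypothetical helper: print only the first record for each distinct sequence.
# Reads FASTA on stdin; multi-line sequences are joined before comparison.
dedup_by_seq() {
  awk '
    function emit() {
      # Print the buffered record unless its sequence was already seen.
      if (header != "" && !(seq in seen)) {
        seen[seq] = 1
        print header
        print seq
      }
    }
    /^>/ { emit(); header = $0; seq = ""; next }   # a new record starts
    { sub(/\r$/, ""); seq = seq $0 }                # append one sequence line
    END { emit() }
  '
}

# Tiny made-up input so the sketch runs as-is.
tmp=$(mktemp -d)
printf '>a1\nCCCC\n>a2\nGGTT\n' > "$tmp/file1.fa"
printf '>b1\nCCCC\n>b2\nAA\nAA\n' > "$tmp/file2.fa"

cat "$tmp/file1.fa" "$tmp/file2.fa" | dedup_by_seq > "$tmp/out.fa"
```

In this demo `b1` is dropped because its sequence matches `a1`. Note that the sketch writes each kept sequence on a single line and holds every distinct sequence in memory, so for large assemblies a dedicated tool is the better choice.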

Could you please tell me how to install seqkit?

(reply written 6 months ago by Manoj)

I strongly recommend using bioconda.

The first part of this tutorial of mine might be useful for you.

(reply written 6 months ago by finswimmer)