Question: How to remove duplicate fasta sequences from a file
Manoj wrote (5 weeks ago):

Hi, I have 250 files in total that contain scaffolds in FASTA format. However, several scaffolds are duplicated between files: the sequences are identical but the headers differ. I therefore want to compare these files and make a single file of unique scaffold sequences. Please see the following example files:

File 1:

>NODE_265_length_56_cov_170 [gcode=11] [organism=Escherichia species] [strain=strain]
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>NODE_266_length_56_cov_121 [gcode=11] [organism=Escherichia species] [strain=strain]
GGTTCATCGATAGGAATTTAAATCCCCAAAAGACTAAAAAAGCATCACAAAACGGA
>NODE_267_length_56_cov_67 [gcode=11] [organism=Escherichia species] [strain=strain]
ATTATTTTTGTGGAGCCGGAGGAAACAAACCAGACGGTTCAGATGAGGCGCTTACG
>NODE_268_length_56_cov_43 [gcode=11] [organism=Escherichia species] [strain=strain]
TCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAG

File 2:

>NODE_250_length_56_cov_292 [gcode=11] [organism=Escherichia species] [strain=strain]
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>NODE_251_length_56_cov_157 [gcode=11] [organism=Escherichia species] [strain=strain]
GGTTCATCGATAGGAATTTAAATCCCCAAAAGACTAAAAAAGCATCACAAAACGGA
>NODE_252_length_56_cov_86 [gcode=11] [organism=Escherichia species] [strain=strain]
ATTATTTTTGTGGAGCCGGAGGAAACAAACCAGACGGTTCAGATGAGGCGCTTACG
>NODE_253_length_56_cov_29 [gcode=11] [organism=Escherichia species] [strain=strain]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

OUTPUT:

File 1:

>NODE_265_length_56_cov_170 [gcode=11] [organism=Escherichia species] [strain=strain]
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>NODE_266_length_56_cov_121 [gcode=11] [organism=Escherichia species] [strain=strain]
GGTTCATCGATAGGAATTTAAATCCCCAAAAGACTAAAAAAGCATCACAAAACGGA
>NODE_267_length_56_cov_67 [gcode=11] [organism=Escherichia species] [strain=strain]
ATTATTTTTGTGGAGCCGGAGGAAACAAACCAGACGGTTCAGATGAGGCGCTTACG
>NODE_268_length_56_cov_43 [gcode=11] [organism=Escherichia species] [strain=strain]
TCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAG
>NODE_253_length_56_cov_29 [gcode=11] [organism=Escherichia species] [strain=strain]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Tags: alignment, sequence, assembly
modified 5 weeks ago by finswimmer • written 5 weeks ago by Manoj
h.mon wrote (5 weeks ago):

Concatenate the FASTA files and run Dedupe (from BBTools) or CD-HIT on the combined file.

written 5 weeks ago by h.mon

I tried the following command:

cd-hit-454 -i /home/kumarm/CD-HIT/phy_D-scaffold-marge.fasta -o /home/kumarm/CD-HIT/454_reads_95 -c 0.99 -M 0 -T 7

It failed with this error: Fatal Error: in diag_test_aapn_est, MAX_DIAG reached Program halted !!

written 5 weeks ago by Manoj
finswimmer wrote (5 weeks ago):

Use seqkit. Its rmdup command with -s removes duplicates by sequence rather than by ID:

$ cat file1.fa file2.fa | seqkit rmdup -s -o out.fa
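
If seqkit is not installed yet, the same idea (keep the first record seen for each distinct sequence) can be sketched with standard awk. This is only an illustration under a few assumptions: dedupe_fasta is a hypothetical helper name, sequences are compared case-insensitively, reverse complements are not treated as duplicates, and wrapped sequences are written back on a single line.

```shell
#!/usr/bin/env bash
# dedupe_fasta (hypothetical helper): print each distinct sequence once,
# keeping the header of the first record that carried it.
# Takes any number of FASTA files as arguments.
dedupe_fasta() {
  cat "$@" | awk '
    # Print the buffered record unless its sequence was already seen.
    function emit(   key) {
      if (header == "") return
      key = toupper(seq)
      if (!(key in seen)) {
        seen[key] = 1
        print header
        print seq
      }
    }
    /^>/ { emit(); header = $0; seq = ""; next }   # a new record starts here
         { gsub(/[ \t\r]/, ""); seq = seq $0 }     # join wrapped sequence lines
    END  { emit() }
  '
}

# Example usage (file names are placeholders):
#   dedupe_fasta file1.fa file2.fa > out.fa
```

For many input files, a glob such as `dedupe_fasta scaffolds/*.fa > unique_scaffolds.fasta` should work the same way, where the directory and output names are again placeholders. Every input file needs to end with a newline, otherwise cat will join the last sequence line of one file to the first header of the next.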
written 5 weeks ago by finswimmer

Could you please let me know how to install seqkit?

written 5 weeks ago by Manoj

I strongly recommend using bioconda.

The first part of this tutorial of mine might be useful for you.

written 4 weeks ago by finswimmer