Question: How to remove duplicate sequences in a FASTA file using Python?
horsedog wrote (2.1 years ago):

I have two FASTA files, file 1 and file 2. They share a lot of overlapping sequences, but not all of them. I want to merge the two files into one file, file 3, and remove the duplicated part, keeping only the unique records. Is there a Python code example for this? A duplicate here means the exact same query name and sequence, like these two:

>YP_204112.2
MEHYISLLVKSIFIENMALSFFLGMCTFLAVSKKVKTSFGLGVAVVVVLTIAVPVNNLVYTYLLKENALV
AGVDLTFLSFITFIGVIAALVQILEMILDRFFPPLYNALGIFLPLITVNCAIFGGVSFMVQRDYNFAESV
VYGFGSGIGWMLAIVALAGIREKMKYSDVPPGLRGLGITFITVGLMALGFMSFSGVQL

>YP_204112.2
MEHYISLLVKSIFIENMALSFFLGMCTFLAVSKKVKTSFGLGVAVVVVLTIAVPVNNLVYTYLLKENALV
AGVDLTFLSFITFIGVIAALVQILEMILDRFFPPLYNALGIFLPLITVNCAIFGGVSFMVQRDYNFAESV
VYGFGSGIGWMLAIVALAGIREKMKYSDVPPGLRGLGITFITVGLMALGFMSFSGVQL

Many thanks!


Is there a reason you want a Python solution every time? CD-HIT is meant for this sort of application.

— genomax (2.1 years ago)
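For reference, a typical CD-HIT invocation for collapsing identical protein sequences might look like the line below; the file names are placeholders, and the flags are CD-HIT's standard ones (check cd-hit -h for your version):

cd-hit -i merged.fa -o nodup.fa -c 1.0

Here -c 1.0 asks for clusters at 100% identity, and CD-HIT keeps one representative per cluster. Note that it clusters by sequence rather than by record name, and at -c 1.0 a shorter sequence contained in a longer one may also be collapsed.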

Well, thank you very much! I'll take a look at CD-HIT. No particular reason; I'm just practising Python at the moment, so I'd like to see how people solve this problem with it.

— horsedog (2.1 years ago)
Alex Reynolds wrote (2.1 years ago):

Shorter option via awk:

$ cat one.fa two.fa | awk -vRS=">" '!a[$0]++ { print ">"$0; }' - > answer.fa

If you want to replicate this with Python, you could use a dictionary (or a set) to store unique records; see the sketch below.

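A minimal sketch of that idea, assuming the same one.fa, two.fa, and answer.fa file names as the awk one-liner; a record is treated as a duplicate only when the header and the sequence are both identical:

def read_records(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith('>'):
                if header is not None:
                    yield header, '\n'.join(seq)
                header, seq = line, []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, '\n'.join(seq)

# Keep the first occurrence of each (header, sequence) pair,
# mirroring the awk one-liner above.
seen = set()
with open('answer.fa', 'w') as out:
    for path in ('one.fa', 'two.fa'):
        for header, seq in read_records(path):
            if (header, seq) not in seen:
                seen.add((header, seq))
                out.write(header + '\n' + seq + '\n')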

Hi, it said "invalid -v".

— horsedog (2.1 years ago)

Are you using GNU awk? If you're using OS X, install GNU awk via Homebrew: brew install gawk

— Alex Reynolds (2.1 years ago)

Hi again. You're right, it works now, but it's odd: answer.fa contains even more lines than file 1 and file 2 combined, which makes it seem like the command added records rather than removing them (I used wc -l to count the lines).

— horsedog (2.1 years ago)

Maybe grab the first few lines of both files and try it out on those test files. I'm not sure why you get that result; this is a pretty common use of awk.

— Alex Reynolds (2.1 years ago)

I like the simplicity! This method almost worked for me, but it was printing empty lines and one line containing only '>'. I'm not an expert in awk, so I filtered the results with grep:

cat one.fa two.fa | awk -v RS=">" '!a[$0]++ { print ">"$0; }' - | grep -Ev '^\s*$|^>\s*$' > answer.fa
— souzademedeiros (20 months ago)
tiago211287 wrote (2.1 years ago):

This task can be accomplished quickly with FASTA/Q Collapser (fastx_collapser from the FASTX-Toolkit); a typical invocation is sketched below.

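A typical run might look like the line below (the file names are placeholders; -i and -o are the standard FASTX-Toolkit options):

fastx_collapser -i file3.fa -o collapsed.fa

One caveat: fastx_collapser replaces the original headers with rank-count identifiers, so it collapses identical sequences but does not preserve the query names the question asks about.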
lakhujanivijay wrote (2.1 years ago):

A one-liner using seqkit:

zcat fasta.fa.gz | seqkit rmdup -s -i -m -o clean.fa.gz -d duplicated.fa.gz -D duplicated.detail.txt

Fast, and it does the job nicely. Being able to inspect the duplicates is very helpful. The best option for me so far.

— gildas.lepennetier (6 months ago)
pauley-p wrote (22 months ago):

This is a Python-based alternative for your issue, from the BiologPy repo:

https://github.com/MJChimal/BiologPy/blob/master/drop_unique_records.py

Hope it still helps! :)

shoujun.gu wrote (2.1 years ago):

This Python code combines any number of input FASTA files and writes a single output file with duplicated records removed. Note: if your input FASTA files are too big to load into memory, this code will fail.

input_files = [list of input file names]
output_file = 'output_file_name'

# Collect every record, re-attaching the leading '>' so each entry is a
# complete FASTA record (header plus sequence lines).
records = []
for name in input_files:
    with open(name) as handle:
        chunks = handle.read().split('>')[1:]
        records.extend('>' + chunk.strip() + '\n' for chunk in chunks)

# set() keeps a single copy of each identical record; note that it also
# discards the original record order.
with open(output_file, 'w') as out:
    out.write(''.join(set(records)))
michau wrote (15 months ago):

Learn to use the Biopython library. It's handy as hell, and you can use any supported format for input and output.

from Bio import SeqIO

# Open with 'w' (rather than 'a') so reruns don't append to an old file.
with open('output.fasta', 'w') as outFile:
    record_ids = set()  # a set makes the membership test O(1)
    for record in SeqIO.parse('input.fasta', 'fasta'):
        if record.id not in record_ids:
            record_ids.add(record.id)
            SeqIO.write(record, outFile, 'fasta')

It should be noted that this only checks for duplicates based on their IDs, which is not always the most robust way to do it. It would probably be best to check for duplicated sequences, which is what most of the other solutions are doing; see the variant below.

— Joe (15 months ago)
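A minimal variation on the Biopython answer above that applies Joe's suggestion, keying on the (id, sequence) pair instead of the ID alone, which matches the question's definition of a duplicate (the file names are the same placeholders as above):

from Bio import SeqIO

with open('output.fasta', 'w') as outFile:
    seen = set()
    for record in SeqIO.parse('input.fasta', 'fasta'):
        # Key on the ID and the sequence together, so a record is only
        # dropped when both the name and the sequence match.
        key = (record.id, str(record.seq))
        if key not in seen:
            seen.add(key)
            SeqIO.write(record, outFile, 'fasta')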