Question: How to remove duplicate sequences in a FASTA file using Python?
0
5 months ago by
horsedog30
horsedog30 wrote:

I have two FASTA files, file 1 and file 2. They share many sequences, but not all of them. I want to merge the two files into a single file, file 3, and remove the duplicates, keeping only one copy of each record. Is there a Python code example for this? By "duplicate" I mean records with exactly the same query name and sequence, like these two:

>YP_204112.2
MEHYISLLVKSIFIENMALSFFLGMCTFLAVSKKVKTSFGLGVAVVVVLTIAVPVNNLVYTYLLKENALV
AGVDLTFLSFITFIGVIAALVQILEMILDRFFPPLYNALGIFLPLITVNCAIFGGVSFMVQRDYNFAESV
VYGFGSGIGWMLAIVALAGIREKMKYSDVPPGLRGLGITFITVGLMALGFMSFSGVQL

>YP_204112.2
MEHYISLLVKSIFIENMALSFFLGMCTFLAVSKKVKTSFGLGVAVVVVLTIAVPVNNLVYTYLLKENALV
AGVDLTFLSFITFIGVIAALVQILEMILDRFFPPLYNALGIFLPLITVNCAIFGGVSFMVQRDYNFAESV
VYGFGSGIGWMLAIVALAGIREKMKYSDVPPGLRGLGITFITVGLMALGFMSFSGVQL

Many thanks!

python • 902 views
ADD COMMENTlink modified 10 weeks ago by pauley-p10 • written 5 months ago by horsedog30
1

Is there a reason you seem to want a Python solution every time? CD-HIT is meant for this sort of application.

ADD REPLYlink written 5 months ago by genomax50k

Thank you very much! I'll take a look at CD-HIT. No particular reason; I've been practising Python recently, so I'd like to see how people solve problems with Python.

ADD REPLYlink written 5 months ago by horsedog30
2
4 months ago by
tiago211287990
USA
tiago211287990 wrote:

This task can be accomplished with FASTA/Q Collapser quickly.

ADD COMMENTlink modified 4 months ago • written 4 months ago by tiago211287990
1
5 months ago by
Alex Reynolds24k
Seattle, WA USA
Alex Reynolds24k wrote:

Shorter option via awk:

$ cat one.fa two.fa | awk -vRS=">" '!a[$0]++ { print ">"$0; }' - > answer.fa

If you want to replicate this with Python, you could look at using a dictionary to store unique keys.
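A minimal sketch of that dictionary idea, assuming plain-text FASTA input and treating two records as duplicates only when both the header and the full sequence match. The function names (read_fasta, unique_records, write_fasta) and the 70-character line width are illustrative choices, not part of any library:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_records(paths):
    """Collect unique (header, sequence) pairs as dict keys, in first-seen order."""
    seen = {}
    for path in paths:
        with open(path) as handle:
            for header, seq in read_fasta(handle):
                # A dict keeps one entry per key, so repeats are dropped here
                seen.setdefault((header, seq), None)
    return seen


def write_fasta(records, out_path, width=70):
    """Write (header, sequence) pairs, wrapping sequences at `width` characters."""
    with open(out_path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")
```

With the file names from the awk example, this would be called as write_fasta(unique_records(["one.fa", "two.fa"]), "answer.fa"). Because sequence lines are joined before comparison, two records that differ only in line wrapping count as the same record.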

ADD COMMENTlink modified 5 months ago • written 5 months ago by Alex Reynolds24k

Hi, it said "invalid -v".

ADD REPLYlink written 5 months ago by horsedog30

Are you using GNU awk? If you're using OS X, install GNU awk via Homebrew: brew install gawk

ADD REPLYlink written 5 months ago by Alex Reynolds24k

Hi again. You're right, it works now, but it's weird: "answer.fa" contains even more lines than file 1 plus file 2, which suggests it added lines rather than removing them? (I used wc -l to count lines.)

ADD REPLYlink written 5 months ago by horsedog30

Maybe grab the first few lines of both files and try it out on those test files. I'm not sure why you get that result; this is a pretty common use of awk.

ADD REPLYlink written 5 months ago by Alex Reynolds24k

I like the simplicity! This method almost worked for me, but it printed empty lines and one line containing only '>'. I'm not an expert in awk, so I filtered the results using grep:

cat one.fa two.fa | awk -v RS=">" '!a[$0]++ { print ">"$0; }' - | grep -Ev '^\s*$|^>\s*$' > answer.fa
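A likely explanation for the stray '>' line and the empty lines, illustrated in Python as a small sketch with made-up records: splitting on ">" leaves an empty chunk before the first header, and every chunk keeps its trailing newline, so printing ">" plus the chunk plus another newline produces the extra lines.

```python
# Two made-up records, as they would appear in a FASTA file
text = ">id1\nAAA\n>id2\nCCC\n"

# Splitting on ">" mirrors using ">" as the record separator
records = text.split(">")
# records[0] is the empty string before the first ">";
# every other record still ends with its own newline

# Dropping empty chunks and stripping whitespace avoids both artefacts
cleaned = [">" + r.strip() for r in records if r.strip()]
print("\n".join(cleaned))
```

This is also why the output can have more lines than the two inputs combined when few records are duplicated: each kept record gains one blank line.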
ADD REPLYlink modified 19 days ago • written 19 days ago by souzademedeiros0
1
4 months ago by
Vijay Lakhujani2.6k
India
Vijay Lakhujani2.6k wrote:

One-liner using seqkit:

zcat fasta.fa.gz | seqkit rmdup -s -i -m -o clean.fa.gz -d duplicated.fa.gz -D duplicated.detail.txt
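For anyone curious how sequence-based deduplication might look in Python, here is a rough sketch that compares records by sequence only, ignores case, and stores MD5 digests instead of full sequences to save memory. It is an illustration of the idea and not how seqkit is implemented; the function name dedup_by_sequence and the (header, sequence) tuple input are assumptions.

```python
import hashlib


def dedup_by_sequence(records):
    """Split (header, sequence) pairs into first occurrences and later duplicates.

    Records are compared by sequence only, case-insensitively.
    """
    seen = set()
    kept, duplicates = [], []
    for header, seq in records:
        # Store a fixed-size digest rather than the whole sequence
        digest = hashlib.md5(seq.upper().encode("utf-8")).digest()
        if digest in seen:
            duplicates.append((header, seq))
        else:
            seen.add(digest)
            kept.append((header, seq))
    return kept, duplicates
```

Note that this differs from the question's definition of a duplicate: two records with different names but the same sequence are collapsed into one, and the first one seen is the one kept.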
ADD COMMENTlink written 4 months ago by Vijay Lakhujani2.6k
1
10 weeks ago by
pauley-p10
UNAM
pauley-p10 wrote:

This is a Python-based alternative for your issue that uses BioPy:

https://github.com/MJChimal/BiologPy/blob/master/drop_unique_records.py

Hope it still helps! :)

ADD COMMENTlink written 10 weeks ago by pauley-p10
0
4 months ago by
shoujun.gu340
Rockville/MD
shoujun.gu340 wrote:

This Python code combines any number of input FASTA files and writes one output file with duplicated records removed. Note: it reads each file fully into memory, so it will fail if the input files are too large to fit in memory.

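input_files = ['one.fa', 'two.fa']  # example names; replace with your input files
output_file = 'answer.fa'           # example name; replace with your output file

holder = []
for path in input_files:
    with open(path) as handle:
        # Split on '>' and drop the empty chunk before the first header
        rec = handle.read().split('>')[1:]
    # Strip surrounding whitespace so identical records compare equal
    holder.extend('>' + i.strip() for i in rec)

# dict.fromkeys() drops duplicates and, unlike set(), keeps first-seen order
total = '\n'.join(dict.fromkeys(holder)) + '\n'

with open(output_file, 'w') as out:
    out.write(total)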
ADD COMMENTlink modified 4 months ago • written 4 months ago by shoujun.gu340
Powered by Biostar version 2.3.0