Remove Duplicates In Fasta (Protein Seq.)
10.7 years ago
IsmailM ▴ 110

Hi, I am trying to remove duplicate entries from a FASTA file (containing protein sequences). I have looked at similar posts and tried some of their fixes, but options like the faidx toolkit only work with nucleotide FASTA files, and so far I haven't been able to find something that actually works.

At the moment, I extract the seq. IDs and run the following script:

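#!/usr/bin/env ruby
# Print the unique, sorted sequence IDs found in a FASTA file.

filename = ARGV.first
text = File.read(filename)
# An ID is the header token: '>' at the start of a line up to the first whitespace.
entry_id = /^>\S+/

text.scan(entry_id).uniq.sort.each do |id|
  puts id
end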

This removes all duplicate IDs. I then use samtools to extract the sequences from the original file, which gives me a much smaller file with the seq IDs and their respective sequences.

However, this is too slow on large files: it has taken 4 hours so far and is still running.

Is there an alternative method that is a little faster?

EDIT:

I am running this on 5 species, using a batch file to run the scripts. The largest file consists of 332,369 sequences, of which only 72,552 are unique.

Secondly, I would be grateful if you could write any scripts in Ruby, since that is the programming language I can just about get my head around at the moment. I'm a total beginner in programming/bioinformatics.

Many Thanks
Ismail

fasta duplicates • 6.1k views

How many protein entries are there in the file, and how big is the file? Do you want to remove duplicates when two entries have the same header, or when they have the same sequence? Personally, I believe a simple Python or Perl script should get you the result within a minute.
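
Since Ruby was requested, here is a minimal sketch of the header-based case: one streaming pass that keeps the first record seen for each ID, so nothing needs a second extraction step. It assumes the ID is the header token up to the first whitespace; the method name `dedup_fasta_by_id` is made up for illustration and is not from any library.

```ruby
#!/usr/bin/env ruby
require 'set'

# Copy FASTA records from `input` to `output`, keeping only the first
# record seen for each ID (assumption: the ID is the header token up to
# the first whitespace). Returns the number of records written.
def dedup_fasta_by_id(input, output)
  seen = Set.new
  keep = false
  written = 0
  input.each_line do |line|
    if line.start_with?('>')
      id = line[/\A>(\S*)/, 1]
      # Set#add? returns nil when the ID was already present.
      keep = !seen.add?(id).nil?
      written += 1 if keep
    end
    # Sequence lines follow the keep/skip decision of their header.
    output.print(line) if keep
  end
  written
end

# Hypothetical usage: ruby dedup_by_id.rb input.fasta > unique.fasta
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  File.open(ARGV.first) { |fasta| dedup_fasta_by_id(fasta, $stdout) }
end
```

Memory use grows only with the number of distinct IDs, not with the file size, because sequences are written out as they are read.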


replied by editing main post


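If you only care about having unique sequences and don't mind losing the IDs, you could run this: sed -n '2~2p' file.fasta | sort -u. If you want to see how often each sequence appeared, replace sort -u with sort | uniq -c. Note that this assumes every sequence sits on a single line (two lines per record) and that you have GNU sed, since the 2~2 address form is a GNU extension.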
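
For the sequence-based case, a rough Ruby equivalent that also copes with sequences wrapped over several lines and keeps the IDs might look like the sketch below. The method name `group_ids_by_sequence` is hypothetical, and the comparison is an exact, case-sensitive string match on the joined sequence.

```ruby
#!/usr/bin/env ruby

# Read FASTA records from `input` and group them by sequence.
# Returns a Hash mapping each distinct sequence to the list of IDs
# that carry it, in order of first appearance.
def group_ids_by_sequence(input)
  groups = Hash.new { |hash, key| hash[key] = [] }
  id = nil
  seq = ''.dup
  # Store the record collected so far; skipped until a header has been seen.
  flush = lambda { groups[seq] << id unless id.nil? }
  input.each_line do |line|
    line = line.strip
    if line.start_with?('>')
      flush.call
      id = line[/\A>(\S*)/, 1]
      seq = ''.dup
    else
      # Join wrapped sequence lines into one string.
      seq << line
    end
  end
  flush.call
  groups
end

# Hypothetical usage: ruby dedup_by_seq.rb input.fasta > unique.fasta
# Writes one record per distinct sequence, labelled with the first ID
# and the number of copies found.
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  File.open(ARGV.first) do |fasta|
    group_ids_by_sequence(fasta).each do |sequence, ids|
      puts ">#{ids.first} copies=#{ids.size}"
      puts sequence
    end
  end
end
```

Unlike the ID-based approach, this holds every distinct sequence in memory, which should still be manageable for a few hundred thousand proteins.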

10.7 years ago
Rm 8.3k

If comparisons are at the sequence level, you can use CD-HIT or uclust at a given sequence identity cutoff.
