Question: Remove duplicates in fasta file based on ID
7
gravatar for oliver.bayfield
5.5 years ago by
United States
oliver.bayfield160 wrote:

I've found several threads on this (rather simple) topic but none quite simple enough, which is to remove entries in a fasta file based on their one liner ">name", which in my case is numeric (gi).

Based on Pierre Lindenbaum's posting on other comments, you would linearise the sequences and then sort by column 1 (as opposed to column 2 if you wanted to sort by sequence). And then you'd employ sort unique and sed? 

      >123456

      AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

      >123456

      AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

There are no spaces between characters or lines in my file.

sort sed fasta • 16k views
ADD COMMENTlink modified 2.0 years ago by michau50 • written 5.5 years ago by oliver.bayfield160
2

linerarize, sort using options -k1,1 -u , move back to fasta using tr

ADD REPLYlink written 5.5 years ago by Pierre Lindenbaum131k
Is this correct?     

sed -e '/^>/s/$/@/' -e 's/^>/#/' filein.fa |\
tr -d '\n' | tr "#" "\n" | tr "@" "\t" |\
sort -u -t ' ' -f -k1,1 |\
sed -e 's/^/>/' -e 's/\t/\n/' > fileout.fa
ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by oliver.bayfield160

Update, the above (from Pierre Lindenbaum) does the job. Very good.
The only thing, I dropped the -t '  ' and -f flags in sort (didn't seem necessary?). And the first line in the output file gives a single line of '>', which I manually deleted.

ADD REPLYlink written 5.5 years ago by oliver.bayfield160
17
gravatar for lh3
5.5 years ago by
lh332k
United States
lh332k wrote:
awk '/^>/{f=!d[$1];d[$1]=1}f' in.fa > out.fa
ADD COMMENTlink written 5.5 years ago by lh332k

This solution is perfect, i would very appreciate a simple explanation for it.

ADD REPLYlink written 3.8 years ago by Caesar10

explanation /^>/: when matches line start with > do this block {f=!d[$1];d[$1]=1}. f=!d[$1]: f is false only if sequence name does exists in d[$1], and keeps being false until new sequence. d[$1]=1: register sequence name; f is true, print line.

ADD REPLYlink written 7 weeks ago by biocyberman810
5
gravatar for iraun
5.5 years ago by
iraun3.8k
Norway
iraun3.8k wrote:

Try this command:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file.fa | awk '!seen[$1]++'

In the first part the fasta is converted to tabular format and in the second part the duplicated ID's are removed.

You'll need to transform to fasta format again, but it's not complicated ;)

 

ADD COMMENTlink written 5.5 years ago by iraun3.8k

airan thanks that seems to work on a test file - any chance you could guide me doing the conversion back to fasta? I'm rather inexperienced here. Thanks

ADD REPLYlink written 5.5 years ago by oliver.bayfield160
2

Extending airan solution for converting it to a fasta format:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' biostarhelp.txt | awk '!seen[$1]++' | awk -v OFS="\n" '{print $1,$2}'
ADD REPLYlink written 5.5 years ago by Varun Gupta1.2k

And in case your headers have spaces and you also want to consider the sequence

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' biostarhelp.txt | awk '!seen[$0]++' | awk -v OFS="\n" '{for(i=2;i<NF;i++) head = head " " $i; print $1 " " head,$NF; head = ""}'
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by marcosmorgan110
3
gravatar for oliver.bayfield
5.5 years ago by
United States
oliver.bayfield160 wrote:

Thanks for your comments. I went with the orignal suggestion of Pier's:

sed -e '/^>/s/$/@/' -e 's/^>/#/' filein.fa | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -k1,1 | sed -e 's/^/>/' -e 's/\t/\n/' > fileout.fa

The first line of the output contains a '>' only, which I deleted manually.

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by oliver.bayfield160

please see: A: Best bioinfo one-liners? for a better way to linearize fasta. at the end you can convert the lines to fasta using ` tr "\t" "\n" `

 

 

 

ADD REPLYlink written 5.5 years ago by Pierre Lindenbaum131k
2
gravatar for Varun Gupta
5.5 years ago by
Varun Gupta1.2k
United States
Varun Gupta1.2k wrote:
This would work assuming you have sequence in one line and it is not split over multiple lines.
 
#!/usr/bin/perl-w
use strict;
use warnings;

my %id2seq=();
my $key = '';
while(<>){
    chomp;
    if($_ =~ /^>(.+)/){
        $key = $1;
    }else{
        $id2seq{$key} = $_;
    }
}

foreach(keys %id2seq){
    print join("\n",">".$_,$id2seq{$_}),"\n";
}
ADD COMMENTlink written 5.5 years ago by Varun Gupta1.2k

For multi-line sequences, i think you can use $id2seq{$key} .= $_

ADD REPLYlink written 5.5 years ago by biolab1.2k

@biolab

It doesn't work if 2 same id's are adjacent some thing like this

>123456

AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

CCGG

>123456

AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

CCGG

IT will append it for the same header

ADD REPLYlink written 5.5 years ago by Varun Gupta1.2k

Hi, Varun Gupta, your point is very good. I further modified the script to allow multi-line sequences. If a header is duplicated, only the 1st is outputted.

#!/usr/bin/perl
use strict;
use warnings;

my (%id2seq, %seen);
my ($key, $duplicate);
while(<>) {
    chomp;
    if($_ =~ /^>(.+)/){
        $key = $1;
        if (exists $seen{$key}) {
            print STDERR "Attention: header $key duplicated.\n";
            $duplicate  = 1;
        } else {
            $seen{$key} = 1;
            $duplicate  = 0;
        }
    } else {
        ($duplicate == 1) ? (next) : ($id2seq{$key} .= $_);
    }
}

foreach(keys %id2seq) {
    print join("\n",">".$_,$id2seq{$_}),"\n";
}
ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by biolab1.2k
2
gravatar for tomc
5.5 years ago by
tomc80
United States
tomc80 wrote:

Assuming dup.fasta has one '>defline' followed by one line of sequence

awk '/^>/{id=$0;getline;arr[id]=$0}END{for(id in arr)printf("%s\n%s\n",id,arr[id])}' dup.fasta > uniq.fasta

there is no requirement duplicates be adjacent, there is no guarantee the output order is related to the input order.

 

 

 

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by tomc80
1
gravatar for Eslam Samir
3.8 years ago by
Eslam Samir100
Germany / Würzburg / Universität Würzburg (IMIB)
Eslam Samir100 wrote:

Here is my free program on Github Sequence database curator (https://github.com/Eslam-Samir-Ragab/Sequence-database-curator)

It is a very fast program and it can deal with:

  1. Nucleotide sequences
  2. Protein sequences

It can work under Operating systems:

  1. Windows
  2. Mac
  3. Linux

It also works for:

  1. Fasta format
  2. Fastq format

Best Regards

ADD COMMENTlink written 3.8 years ago by Eslam Samir100
0
gravatar for anp375
5.5 years ago by
anp375170
anp375170 wrote:

if you use perl, you could put the sequences in a hash with the name/number as the key and the sequence as the value. then you could print it to a new file, or empty your file and print it back in. Every duplicate you come across will just replace the previous one.

ADD COMMENTlink written 5.5 years ago by anp375170

Could you give an example @anp375?

ADD REPLYlink written 5.5 years ago by oliver.bayfield160

sure, in a few hours though

ADD REPLYlink written 5.5 years ago by anp375170
1

use Bio::SeqIO;

use strict;

my $protein_fasta = "/Somefilepath/protein.txt";
my $protein_out = ">/Somefilepath/reduced.txt";

my $seq_in = Bio::SeqIO->new(-file => "$protein_fasta", -format =>'Fasta');# This is a constructor for SeqIO
# The filename specifies whether it is read or write
# The big arrows are called fat commas. They are just commas used for documenting. -format => 'Fasta' is
# just -format, 'Fasta' but the arrow shows the relationship for readability
my $seq_out = Bio::SeqIO->new(-file => "$protein_out", -format => 'Fasta');# SeqIO out because of >


# Those constructors set up the filehandles. The method to get the sequences inside is called 'next_seq'
# It returns a generically formatted sequence rather than Fasta

my %seqs; # This is a hash. I'm assuming the name you are using is an accession number, so I'll make the keys accession numbers. Only one of each will be left.
while(my$seq = $seq_in -> next_seq){

    $seqs{$seq->accession_number} = $seq;
}

foreach(values %seqs){

$seq_out->write_seq($_);# write_seq converts the generic sequence to a Fasta sequence and writes
    # it to the file

}

 

I hope that works. If the number is not the accesion number, there is a list of other ids here: http://search.cpan.org/dist/BioPerl/Bio/Seq.pm#accession_number

ADD REPLYlink written 5.5 years ago by anp375170
0
gravatar for michau
2.0 years ago by
michau50
Germany/Berlin/MPIMG
michau50 wrote:

Copied from my other post: A: How to remove duplicate sequences in fasta file using python?

Learn to use Biopython library. It's handy as hell. You can use any format as in/out

from Bio import SeqIO

with open('output.fasta', 'a') as outFile:
    record_ids = list()\
    for record in SeqIO.parse('input.fasta', 'fasta'):
        if record.id not in record_ids:
            record_ids.append( record.id )
            SeqIO.write(record, outFile, 'fasta')
  
ADD COMMENTlink modified 23 months ago • written 2.0 years ago by michau50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2081 users visited in the last hour