Question: Remove duplicates in fasta file based on ID
1
gravatar for oliver.bayfield
2.3 years ago by
United States
oliver.bayfield90 wrote:

I've found several threads on this (rather simple) topic but none quite simple enough, which is to remove entries in a fasta file based on their one liner ">name", which in my case is numeric (gi).

Based on Pierre Lindenbaum's posting on other comments, you would linearise the sequences and then sort by column 1 (as opposed to column 2 if you wanted to sort by sequence). And then you'd employ sort unique and sed? 

      >123456

      AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

      >123456

      AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

There are no spaces between characters or lines in my file.

sort sed fasta • 5.2k views
ADD COMMENTlink modified 7 months ago by Eslam Samir90 • written 2.3 years ago by oliver.bayfield90
3
awk '/^>/{f=!d[$1];d[$1]=1}f' in.fa > out.fa
ADD REPLYlink written 2.3 years ago by lh330k

This solution is perfect, i would very appreciate a simple explanation for it.

ADD REPLYlink written 6 months ago by Caesar10
2

linerarize, sort using options -k1,1 -u , move back to fasta using tr

ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum98k
Is this correct?     

sed -e '/^>/s/$/@/' -e 's/^>/#/' filein.fa |\
tr -d '\n' | tr "#" "\n" | tr "@" "\t" |\
sort -u -t ' ' -f -k1,1 |\
sed -e 's/^/>/' -e 's/\t/\n/' > fileout.fa
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by oliver.bayfield90

Update, the above (from Pierre Lindenbaum) does the job. Very good.
The only thing, I dropped the -t '  ' and -f flags in sort (didn't seem necessary?). And the first line in the output file gives a single line of '>', which I manually deleted.

ADD REPLYlink written 2.3 years ago by oliver.bayfield90
3
gravatar for iraun
2.3 years ago by
iraun3.1k
Norway
iraun3.1k wrote:

Try this command:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file.fa | awk '!seen[$1]++'

In the first part the fasta is converted to tabular format and in the second part the duplicated ID's are removed.

You'll need to transform to fasta format again, but it's not complicated ;)

 

ADD COMMENTlink written 2.3 years ago by iraun3.1k

airan thanks that seems to work on a test file - any chance you could guide me doing the conversion back to fasta? I'm rather inexperienced here. Thanks

ADD REPLYlink written 2.3 years ago by oliver.bayfield90
1

Extending airan solution for converting it to a fasta format:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' biostarhelp.txt | awk '!seen[$1]++' | awk -v OFS="\n" '{print $1,$2}'
ADD REPLYlink written 2.3 years ago by Varun Gupta870
2
gravatar for Varun Gupta
2.3 years ago by
Varun Gupta870
United States
Varun Gupta870 wrote:
This would work assuming you have sequence in one line and it is not split over multiple lines.
 
#!/usr/bin/perl-w
use strict;
use warnings;

my %id2seq=();
my $key = '';
while(<>){
    chomp;
    if($_ =~ /^>(.+)/){
        $key = $1;
    }else{
        $id2seq{$key} = $_;
    }
}

foreach(keys %id2seq){
    print join("\n",">".$_,$id2seq{$_}),"\n";
}
ADD COMMENTlink written 2.3 years ago by Varun Gupta870

For multi-line sequences, i think you can use $id2seq{$key} .= $_

ADD REPLYlink written 2.3 years ago by biolab950

@biolab

It doesn't work if 2 same id's are adjacent some thing like this

>123456

AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

CCGG

>123456

AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

CCGG

IT will append it for the same header

ADD REPLYlink written 2.3 years ago by Varun Gupta870

Hi, Varun Gupta, your point is very good. I further modified the script to allow multi-line sequences. If a header is duplicated, only the 1st is outputted.

#!/usr/bin/perl
use strict;
use warnings;

my (%id2seq, %seen);
my ($key, $duplicate);
while(<>) {
    chomp;
    if($_ =~ /^>(.+)/){
        $key = $1;
        if (exists $seen{$key}) {
            print STDERR "Attention: header $key duplicated.\n";
            $duplicate  = 1;
        } else {
            $seen{$key} = 1;
            $duplicate  = 0;
        }
    } else {
        ($duplicate == 1) ? (next) : ($id2seq{$key} .= $_);
    }
}

foreach(keys %id2seq) {
    print join("\n",">".$_,$id2seq{$_}),"\n";
}
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by biolab950
2
gravatar for oliver.bayfield
2.3 years ago by
United States
oliver.bayfield90 wrote:

Thanks for your comments. I went with the orignal suggestion of Pier's:

sed -e '/^>/s/$/@/' -e 's/^>/#/' filein.fa | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -k1,1 | sed -e 's/^/>/' -e 's/\t/\n/' > fileout.fa

The first line of the output contains a '>' only, which I deleted manually.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by oliver.bayfield90

please see: A: Best bioinfo one-liners? for a better way to linearize fasta. at the end you can convert the lines to fasta using ` tr "\t" "\n" `

 

 

 

ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum98k
1
gravatar for tomc
2.3 years ago by
tomc60
United States
tomc60 wrote:

Assuming dup.fasta has one '>defline' followed by one line of sequence

awk '/^>/{id=$0;getline;arr[id]=$0}END{for(id in arr)printf("%s\n%s\n",id,arr[id])}' dup.fasta > uniq.fasta

there is no requirement duplicates be adjacent, there is no guarantee the output order is related to the input order.

 

 

 

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by tomc60
1
gravatar for Eslam Samir
7 months ago by
Eslam Samir90
Egypt / Cairo / Microbiology & Immunology Department
Eslam Samir90 wrote:

Here is my free program on Github Sequence database curator (https://github.com/Eslam-Samir-Ragab/Sequence-database-curator)

It is a very fast program and it can deal with:

  1. Nucleotide sequences
  2. Protein sequences

It can work under Operating systems:

  1. Windows
  2. Mac
  3. Linux

It also works for:

  1. Fasta format
  2. Fastq format

Best Regards

ADD COMMENTlink written 7 months ago by Eslam Samir90
0
gravatar for anp375
2.3 years ago by
anp37580
anp37580 wrote:

if you use perl, you could put the sequences in a hash with the name/number as the key and the sequence as the value. then you could print it to a new file, or empty your file and print it back in. Every duplicate you come across will just replace the previous one.

ADD COMMENTlink written 2.3 years ago by anp37580

Could you give an example @anp375?

ADD REPLYlink written 2.3 years ago by oliver.bayfield90

sure, in a few hours though

ADD REPLYlink written 2.3 years ago by anp37580
1

use Bio::SeqIO;

use strict;

my $protein_fasta = "/Somefilepath/protein.txt";
my $protein_out = ">/Somefilepath/reduced.txt";

my $seq_in = Bio::SeqIO->new(-file => "$protein_fasta", -format =>'Fasta');# This is a constructor for SeqIO
# The filename specifies whether it is read or write
# The big arrows are called fat commas. They are just commas used for documenting. -format => 'Fasta' is
# just -format, 'Fasta' but the arrow shows the relationship for readability
my $seq_out = Bio::SeqIO->new(-file => "$protein_out", -format => 'Fasta');# SeqIO out because of >


# Those constructors set up the filehandles. The method to get the sequences inside is called 'next_seq'
# It returns a generically formatted sequence rather than Fasta

my %seqs; # This is a hash. I'm assuming the name you are using is an accession number, so I'll make the keys accession numbers. Only one of each will be left.
while(my$seq = $seq_in -> next_seq){

    $seqs{$seq->accession_number} = $seq;
}

foreach(values %seqs){

$seq_out->write_seq($_);# write_seq converts the generic sequence to a Fasta sequence and writes
    # it to the file

}

 

I hope that works. If the number is not the accesion number, there is a list of other ids here: http://search.cpan.org/dist/BioPerl/Bio/Seq.pm#accession_number

ADD REPLYlink written 2.3 years ago by anp37580
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 562 users visited in the last hour