Question: Remove duplicates in fasta file based on ID
1
gravatar for oliver.bayfield
3.1 years ago by
United States
oliver.bayfield100 wrote:

I've found several threads on this (rather simple) topic but none quite simple enough, which is to remove entries in a fasta file based on their one liner ">name", which in my case is numeric (gi).

Based on Pierre Lindenbaum's posting on other comments, you would linearise the sequences and then sort by column 1 (as opposed to column 2 if you wanted to sort by sequence). And then you'd employ sort unique and sed? 

      >123456

      AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

      >123456

      AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

There are no spaces between characters or lines in my file.

sort sed fasta • 6.9k views
ADD COMMENTlink modified 16 months ago by Eslam Samir100 • written 3.1 years ago by oliver.bayfield100
3
awk '/^>/{f=!d[$1];d[$1]=1}f' in.fa > out.fa
ADD REPLYlink written 3.1 years ago by lh330k

This solution is perfect, i would very appreciate a simple explanation for it.

ADD REPLYlink written 15 months ago by Caesar10
2

linerarize, sort using options -k1,1 -u , move back to fasta using tr

ADD REPLYlink written 3.1 years ago by Pierre Lindenbaum108k
Is this correct?     

sed -e '/^>/s/$/@/' -e 's/^>/#/' filein.fa |\
tr -d '\n' | tr "#" "\n" | tr "@" "\t" |\
sort -u -t ' ' -f -k1,1 |\
sed -e 's/^/>/' -e 's/\t/\n/' > fileout.fa
ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by oliver.bayfield100

Update, the above (from Pierre Lindenbaum) does the job. Very good.
The only thing, I dropped the -t '  ' and -f flags in sort (didn't seem necessary?). And the first line in the output file gives a single line of '>', which I manually deleted.

ADD REPLYlink written 3.1 years ago by oliver.bayfield100
3
gravatar for iraun
3.1 years ago by
iraun3.3k
Norway
iraun3.3k wrote:

Try this command:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file.fa | awk '!seen[$1]++'

In the first part the fasta is converted to tabular format and in the second part the duplicated ID's are removed.

You'll need to transform to fasta format again, but it's not complicated ;)

 

ADD COMMENTlink written 3.1 years ago by iraun3.3k

airan thanks that seems to work on a test file - any chance you could guide me doing the conversion back to fasta? I'm rather inexperienced here. Thanks

ADD REPLYlink written 3.1 years ago by oliver.bayfield100
1

Extending airan solution for converting it to a fasta format:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' biostarhelp.txt | awk '!seen[$1]++' | awk -v OFS="\n" '{print $1,$2}'
ADD REPLYlink written 3.1 years ago by Varun Gupta1000
3
gravatar for oliver.bayfield
3.1 years ago by
United States
oliver.bayfield100 wrote:

Thanks for your comments. I went with the orignal suggestion of Pier's:

sed -e '/^>/s/$/@/' -e 's/^>/#/' filein.fa | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -k1,1 | sed -e 's/^/>/' -e 's/\t/\n/' > fileout.fa

The first line of the output contains a '>' only, which I deleted manually.

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by oliver.bayfield100

please see: A: Best bioinfo one-liners? for a better way to linearize fasta. at the end you can convert the lines to fasta using ` tr "\t" "\n" `

 

 

 

ADD REPLYlink written 3.1 years ago by Pierre Lindenbaum108k
2
gravatar for Varun Gupta
3.1 years ago by
Varun Gupta1000
United States
Varun Gupta1000 wrote:
This would work assuming you have sequence in one line and it is not split over multiple lines.
 
#!/usr/bin/perl-w
use strict;
use warnings;

my %id2seq=();
my $key = '';
while(<>){
    chomp;
    if($_ =~ /^>(.+)/){
        $key = $1;
    }else{
        $id2seq{$key} = $_;
    }
}

foreach(keys %id2seq){
    print join("\n",">".$_,$id2seq{$_}),"\n";
}
ADD COMMENTlink written 3.1 years ago by Varun Gupta1000

For multi-line sequences, i think you can use $id2seq{$key} .= $_

ADD REPLYlink written 3.1 years ago by biolab1000

@biolab

It doesn't work if 2 same id's are adjacent some thing like this

>123456

AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

CCGG

>123456

AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

CCGG

IT will append it for the same header

ADD REPLYlink written 3.1 years ago by Varun Gupta1000

Hi, Varun Gupta, your point is very good. I further modified the script to allow multi-line sequences. If a header is duplicated, only the 1st is outputted.

#!/usr/bin/perl
use strict;
use warnings;

my (%id2seq, %seen);
my ($key, $duplicate);
while(<>) {
    chomp;
    if($_ =~ /^>(.+)/){
        $key = $1;
        if (exists $seen{$key}) {
            print STDERR "Attention: header $key duplicated.\n";
            $duplicate  = 1;
        } else {
            $seen{$key} = 1;
            $duplicate  = 0;
        }
    } else {
        ($duplicate == 1) ? (next) : ($id2seq{$key} .= $_);
    }
}

foreach(keys %id2seq) {
    print join("\n",">".$_,$id2seq{$_}),"\n";
}
ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by biolab1000
1
gravatar for tomc
3.1 years ago by
tomc70
United States
tomc70 wrote:

Assuming dup.fasta has one '>defline' followed by one line of sequence

awk '/^>/{id=$0;getline;arr[id]=$0}END{for(id in arr)printf("%s\n%s\n",id,arr[id])}' dup.fasta > uniq.fasta

there is no requirement duplicates be adjacent, there is no guarantee the output order is related to the input order.

 

 

 

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by tomc70
1
gravatar for Eslam Samir
16 months ago by
Eslam Samir100
Egypt / Cairo / Microbiology & Immunology Department
Eslam Samir100 wrote:

Here is my free program on Github Sequence database curator (https://github.com/Eslam-Samir-Ragab/Sequence-database-curator)

It is a very fast program and it can deal with:

  1. Nucleotide sequences
  2. Protein sequences

It can work under Operating systems:

  1. Windows
  2. Mac
  3. Linux

It also works for:

  1. Fasta format
  2. Fastq format

Best Regards

ADD COMMENTlink written 16 months ago by Eslam Samir100
0
gravatar for anp375
3.1 years ago by
anp375140
anp375140 wrote:

if you use perl, you could put the sequences in a hash with the name/number as the key and the sequence as the value. then you could print it to a new file, or empty your file and print it back in. Every duplicate you come across will just replace the previous one.

ADD COMMENTlink written 3.1 years ago by anp375140

Could you give an example @anp375?

ADD REPLYlink written 3.1 years ago by oliver.bayfield100

sure, in a few hours though

ADD REPLYlink written 3.1 years ago by anp375140
1

use Bio::SeqIO;

use strict;

my $protein_fasta = "/Somefilepath/protein.txt";
my $protein_out = ">/Somefilepath/reduced.txt";

my $seq_in = Bio::SeqIO->new(-file => "$protein_fasta", -format =>'Fasta');# This is a constructor for SeqIO
# The filename specifies whether it is read or write
# The big arrows are called fat commas. They are just commas used for documenting. -format => 'Fasta' is
# just -format, 'Fasta' but the arrow shows the relationship for readability
my $seq_out = Bio::SeqIO->new(-file => "$protein_out", -format => 'Fasta');# SeqIO out because of >


# Those constructors set up the filehandles. The method to get the sequences inside is called 'next_seq'
# It returns a generically formatted sequence rather than Fasta

my %seqs; # This is a hash. I'm assuming the name you are using is an accession number, so I'll make the keys accession numbers. Only one of each will be left.
while(my$seq = $seq_in -> next_seq){

    $seqs{$seq->accession_number} = $seq;
}

foreach(values %seqs){

$seq_out->write_seq($_);# write_seq converts the generic sequence to a Fasta sequence and writes
    # it to the file

}

 

I hope that works. If the number is not the accesion number, there is a list of other ids here: http://search.cpan.org/dist/BioPerl/Bio/Seq.pm#accession_number

ADD REPLYlink written 3.1 years ago by anp375140
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1074 users visited in the last hour