FASTA sequence sorting based on the header
5
5
Entering edit mode
7.4 years ago
Eva_Maria ▴ 190

Hello I have a Fasta file which contains more than 10 lakh sequences

example

>lcl|DS180868.1_cds_EDM56185.1_1 [protein=putative lipopolysaccharide biosynthesis glycosyltransferase] [protein_id=EDM56185.1] [location=complement(116..658)]
TCAGCAAACGGTCTCAGACAATGCGATTATCACCATTAAAGGCACACCGACAAAAGCTGGTGAATCTGGTGCAGCACTTA

based on the protein name (example protein=putative lipopolysaccharide biosynthesis glycosyltransferase) i want to sort all the sequences , similar protein name one by one

Is there any AWK or PERL programme s there for it

awk perl fasta sequence • 13k views
ADD COMMENT
3
Entering edit mode
7.4 years ago
michael.ante ★ 3.8k

I can only recommend using the FAST Analysis of Sequences Toolbox (FAST). It has a sorting function called 'fassort'. It is based on bioperl and was for my purposes always very fast.

$ man fassort
NAME
       fassort - sort sequences based on identifiers

SYNOPSIS
       fassort [OPTION]... [MULTIFASTA-FILE]...

DESCRIPTION
       fassort takes sequence or alignment data as input, and outputs sequence records sorted by some user-defined criteria. By default, sequences are sorted ASCIIbetically by their identifiers. Optionally sequences may be sorted by their sequences, descriptions or components of descripions.
 ...
ADD COMMENT
2
Entering edit mode
7.4 years ago

Assuming all your inputs are single-line FASTA (header on one line, and sequence on the next) and that you have the protein name in the same position in the header:

$ awk ' \
    BEGIN { \
        FS = "\\[|=|\\]"; \
        idx = 1; \
    } \
    { \
        if (NR % 2) { \
            header = $0; \
            protein = $3; \
        } \
        else { \
            sequence = $0; \
            headers[protein] = header; \
            sequences[protein] = sequence; \
            proteins[idx++] = protein; \
        } \
    } \
    END { \
        n = asort(proteins); \
        for (i = 1; i <= n; i++) { \
            print headers[proteins[i]]"\n"sequences[proteins[i]]; \
        } \
    }' records.fa > sorted_records.fa

If you have multiline FASTA and want to convert it to single-line form, you could take a look at the answers here: Multiline Fasta To Single Line Fasta

Edit: If your input has headers with duplicate protein names, I added an answer with a Perl script that does the same thing as the awk script, but handles duplicate names.

ADD COMMENT
0
Entering edit mode

it is running perfectly but out put is coming with out header

ADD REPLY
0
Entering edit mode

Are you working with a Windows-sourced text file with weird line endings, maybe? Is your input a true single-line FASTA file? Might be worth it to double-check your inputs. Perhaps troubleshoot the value of header in the else block with a print header; statement.

ADD REPLY
0
Entering edit mode

After changing header into single-line form the code is running perfectly

thank you Alex

ADD REPLY
0
Entering edit mode

Just curious, do you have FASTA records where there could be duplicates of protein names?

ADD REPLY
0
Entering edit mode

yes, obviously. therefore storing records with map/hash/dict is not the best way.

ADD REPLY
1
Entering edit mode

There are a number of ways to deal with collisions in hash tables that all work, but awk is probably not the best tool for that, which is why I ask.

ADD REPLY
2
Entering edit mode
7.4 years ago

Try SeqKit, a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. Just download the binary executable files and run.

Fake sequences:

$ cat seqs.fa
>lcl|DS180868.1_cds_EDM56185.1_1 [protein=putative lipopolysaccharide biosynthesis glycosyltransferase] [protein_id=EDM56185.1] [location=complement(116..658)]
TCAGCAAACGGTCTCAGACAATGCGATTATCACCATTAAAGGCACACCGACAAAAGCTGGTGAATCTGGTGCAGCACTTA
>lcl|DS000000.1_cds_EDM00000.1_1 [protein=another protein] [protein_id=EDM56185.1] [location=complement(116..658)]
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
>lcl|DS011111.2_cds_EDM01111.1_1 [protein=putative another protein 2] [protein_id=EDM56185.1] [location=complement(116..658)]
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt

Just ONE command with SeqKit. By default, seqkit sort sorts sequences by sequence identifier. For this case, we use --id-regexp "protein=(.+?)\]" to specify sequence IDs.

$ seqkit sort --id-regexp "protein=(.+?)\]" seqs.fa 
[INFO] read sequences ...
[INFO] 3 sequences loaded
[INFO] sorting ...
[INFO] output ...
>lcl|DS000000.1_cds_EDM00000.1_1 [protein=another protein] [protein_id=EDM56185.1] [location=complement(116..658)]
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
cccccccccccccccccccc
>lcl|DS011111.2_cds_EDM01111.1_1 [protein=putative another protein 2] [protein_id=EDM56185.1] [location=complement(116..658)]
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttt
>lcl|DS180868.1_cds_EDM56185.1_1 [protein=putative lipopolysaccharide biosynthesis glycosyltransferase] [protein_id=EDM56185.1] [location=complement(116..658)]
TCAGCAAACGGTCTCAGACAATGCGATTATCACCATTAAAGGCACACCGACAAAAGCTGG
TGAATCTGGTGCAGCACTTA

If you have large number of sequences, use --two-pass mode which has lower memory usage:

$ seqkit sort --id-regexp "protein=(.+?)\]" --two-pass seqs.fa

See more detail of seqkit sort.

------------[update]---------------

Protein names could by duplicated, for example, "hypothetical protein". Therefore, storing records with map/hash/dict is not the best way, which needs unique keys. You need to create new unique names for sorting and keep the original names for output.

A better method is:

  1. Converting FASTA format to tab-delimited table.
  2. Creating a new column containing the protein names.
  3. Sorting by the new column with sort or other command
  4. Converting back to FASTA format.

Here's an implemention combining SeqKit and csvtk.

FASTA format was 1) converted to 3-column tabular format (header, sequence, quality (blank for FASTA)), and 2) protein names in FASTA header column (1th) was extracted as a new (4th) column. The table was then 3) sorted (in memory) by 4th column and 4) converted back to FASTA format.

cat seqs.fa | seqkit fx2tab | csvtk -H -t mutate -p "protein=(.+?)\]" -n protein | csvtk -H -t sort -k 4 | seqkit tab2fx > sorted_by_protein.fa

Fake sequences (with duplicated protein names)

$ cat seqs.fa
>id1 [protein=putative glycosyltransferase]
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>id2 [protein=hypothetical protein]
ccccccccccccccccccccccccccccccccccccccccccccc
>id3 [protein=putative another protein 2]
ttttttttttttttttttttttttttttttttttttttttttttt
>id4 [protein=hypothetical protein]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Result

$ head sorted_by_protein.fa 
>id4 [protein=hypothetical protein]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>id2 [protein=hypothetical protein]
ccccccccccccccccccccccccccccccccccccccccccccc
>id3 [protein=putative another protein 2]
ttttttttttttttttttttttttttttttttttttttttttttt
>id1 [protein=putative glycosyltransferase]
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
ADD COMMENT
0
Entering edit mode

I got an error like this

[ERRO] duplicated sequences found: hypothetical protein. use "seqkit rename" to rename duplicated IDs

then i renamed all duplicates by using this comment

seqkit rename seqs.fa

but again i got same error

ADD REPLY
1
Entering edit mode

run

seqkit rename --id-regexp "protein=(.+?)\]" seqs.fa
ADD REPLY
1
Entering edit mode

Note that changs of IDs (protein names) by seqkit rename will remain after sorting. See another solution in following reply of mine.

ADD REPLY
1
Entering edit mode

A better solution combining SeqKit and csvtk.

FASTA format was converted to tabular format, and protein name in FASTA header column (1th) was extracted as a new (4th) column. The table was then sorted by 4th column in memory and converted back to FASTA format.

cat seqs.fa | seqkit fx2tab | csvtk -H -t mutate -p  "protein=(.+?)\]" -n protein | csvtk -H -t sort -k 4 | seqkit tab2fx  > sorted_by_protein.fa

csvtk -H -t sort -k 4 could be replaced by sort -k4,4

ADD REPLY
2
Entering edit mode
7.4 years ago

To deal with duplicate protein names, you might use Perl instead of awk. It's easier to deal with duplicate keys in a Perl hash table, in that duplicate element hits can be pushed to the end of an array.

It's relatively simple and fast and doesn't require installing any special software:

#!/usr/bin/env perl                                                                                                                                                                                                                                                       

use strict;
use warnings;

my $header;
my $sequence;
my $protein;
my $data;

while (<STDIN>) {
    chomp $_;
    if ($_ =~ /^>/) {
        $header = $_;
        if ($header =~ /\[protein=([A-Za-z_0-9 ]*)/) {
            $protein = $1;
            push(@{$data->{$protein}->{header}}, $header);
        }
        else {
            die "No protein name found!\n";
        }
    }
    else {
        $sequence = $_;
        push(@{$data->{$protein}->{sequence}}, $sequence);
    }
}

foreach my $key (sort keys %{$data}) {
    my $length = @{$data->{$key}->{header}};
    for (my $idx = 0; $idx < $length; $idx++) {
        $header = $data->{$key}->{header}->[$idx];
        $sequence = $data->{$key}->{sequence}->[$idx];
        print STDOUT "$header\n$sequence\n";
    }
}

To use:

$ sort_records.pl < records.fa > sorted_records.fa
ADD COMMENT
2
Entering edit mode
7.4 years ago

you may try this simple yet powerful combination of bash commands:

perl -pe 's/[\r\n]+/;/g; s/>/\n>/g' input.fasta \
| sort -t"[" -k2,2V | sed 's/;/\n/g' | sed '/^$/d'

where perl is used to linearize all lines and to split them into 1 line per label+sequence, sort is used to version sort by the field you're interested in, and final seds are used to split labels from sequences and to remove all empty lines generated through the process.

ADD COMMENT

Login before adding your answer.

Traffic: 1583 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6