Question

FASTA sequence sorting based on the header

5

Entering edit mode

7.4 years ago

Eva_Maria ▴ 190

Hello I have a Fasta file which contains more than 10 lakh sequences

example

>lcl|DS180868.1_cds_EDM56185.1_1 [protein=putative lipopolysaccharide biosynthesis glycosyltransferase] [protein_id=EDM56185.1] [location=complement(116..658)]
TCAGCAAACGGTCTCAGACAATGCGATTATCACCATTAAAGGCACACCGACAAAAGCTGGTGAATCTGGTGCAGCACTTA

based on the protein name (example protein=putative lipopolysaccharide biosynthesis glycosyltransferase) i want to sort all the sequences , similar protein name one by one

Is there any AWK or PERL programme s there for it

awk perl fasta sequence • 13k views

ADD COMMENT • link updated 7.4 years ago by michael.ante ★ 3.8k • written 7.4 years ago by Eva_Maria ▴ 190

score 3 · Answer 1 · 2016-11-23

I can only recommend using the FAST Analysis of Sequences Toolbox (FAST). It has a sorting function called 'fassort'. It is based on bioperl and was for my purposes always very fast.

$ man fassort
NAME
       fassort - sort sequences based on identifiers

SYNOPSIS
       fassort [OPTION]... [MULTIFASTA-FILE]...

DESCRIPTION
       fassort takes sequence or alignment data as input, and outputs sequence records sorted by some user-defined criteria. By default, sequences are sorted ASCIIbetically by their identifiers. Optionally sequences may be sorted by their sequences, descriptions or components of descripions.
 ...

score 2 · Answer 2 · 2016-11-19

2

Entering edit mode

7.4 years ago

Alex Reynolds 35k

Assuming all your inputs are single-line FASTA (header on one line, and sequence on the next) and that you have the protein name in the same position in the header:

$ awk ' \
    BEGIN { \
        FS = "\\[|=|\\]"; \
        idx = 1; \
    } \
    { \
        if (NR % 2) { \
            header = $0; \
            protein = $3; \
        } \
        else { \
            sequence = $0; \
            headers[protein] = header; \
            sequences[protein] = sequence; \
            proteins[idx++] = protein; \
        } \
    } \
    END { \
        n = asort(proteins); \
        for (i = 1; i <= n; i++) { \
            print headers[proteins[i]]"\n"sequences[proteins[i]]; \
        } \
    }' records.fa > sorted_records.fa

If you have multiline FASTA and want to convert it to single-line form, you could take a look at the answers here: Multiline Fasta To Single Line Fasta

Edit: If your input has headers with duplicate protein names, I added an answer with a Perl script that does the same thing as the awk script, but handles duplicate names.

ADD COMMENT • link 7.4 years ago by Alex Reynolds 35k

0

Entering edit mode

it is running perfectly but out put is coming with out header

ADD REPLY • link 7.4 years ago by Eva_Maria ▴ 190

0

Entering edit mode

Are you working with a Windows-sourced text file with weird line endings, maybe? Is your input a true single-line FASTA file? Might be worth it to double-check your inputs. Perhaps troubleshoot the value of header in the else block with a print header; statement.

ADD REPLY • link 7.4 years ago by Alex Reynolds 35k

0

Entering edit mode

After changing header into single-line form the code is running perfectly

thank you Alex

ADD REPLY • link 7.4 years ago by Eva_Maria ▴ 190

0

Entering edit mode

Just curious, do you have FASTA records where there could be duplicates of protein names?

ADD REPLY • link 7.4 years ago by Alex Reynolds 35k

0

Entering edit mode

yes, obviously. therefore storing records with map/hash/dict is not the best way.

ADD REPLY • link 7.4 years ago by shenwei356 8.4k

1

Entering edit mode

There are a number of ways to deal with collisions in hash tables that all work, but awk is probably not the best tool for that, which is why I ask.

ADD REPLY • link 7.4 years ago by Alex Reynolds 35k

score 2 · Answer 3 · 2016-11-19

Try SeqKit, a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. Just download the binary executable files and run.

Fake sequences:

$ cat seqs.fa
>lcl|DS180868.1_cds_EDM56185.1_1 [protein=putative lipopolysaccharide biosynthesis glycosyltransferase] [protein_id=EDM56185.1] [location=complement(116..658)]
TCAGCAAACGGTCTCAGACAATGCGATTATCACCATTAAAGGCACACCGACAAAAGCTGGTGAATCTGGTGCAGCACTTA
>lcl|DS000000.1_cds_EDM00000.1_1 [protein=another protein] [protein_id=EDM56185.1] [location=complement(116..658)]
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
>lcl|DS011111.2_cds_EDM01111.1_1 [protein=putative another protein 2] [protein_id=EDM56185.1] [location=complement(116..658)]
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt

Just ONE command with SeqKit. By default, seqkit sort sorts sequences by sequence identifier. For this case, we use --id-regexp "protein=(.+?)\]" to specify sequence IDs.

$ seqkit sort --id-regexp "protein=(.+?)\]" seqs.fa 
[INFO] read sequences ...
[INFO] 3 sequences loaded
[INFO] sorting ...
[INFO] output ...
>lcl|DS000000.1_cds_EDM00000.1_1 [protein=another protein] [protein_id=EDM56185.1] [location=complement(116..658)]
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
cccccccccccccccccccc
>lcl|DS011111.2_cds_EDM01111.1_1 [protein=putative another protein 2] [protein_id=EDM56185.1] [location=complement(116..658)]
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttt
>lcl|DS180868.1_cds_EDM56185.1_1 [protein=putative lipopolysaccharide biosynthesis glycosyltransferase] [protein_id=EDM56185.1] [location=complement(116..658)]
TCAGCAAACGGTCTCAGACAATGCGATTATCACCATTAAAGGCACACCGACAAAAGCTGG
TGAATCTGGTGCAGCACTTA

If you have large number of sequences, use --two-pass mode which has lower memory usage:

$ seqkit sort --id-regexp "protein=(.+?)\]" --two-pass seqs.fa

See more detail of seqkit sort.

------------[update]---------------

Protein names could by duplicated, for example, "hypothetical protein". Therefore, storing records with map/hash/dict is not the best way, which needs unique keys. You need to create new unique names for sorting and keep the original names for output.

A better method is:

Converting FASTA format to tab-delimited table.
Creating a new column containing the protein names.
Sorting by the new column with sort or other command
Converting back to FASTA format.

Here's an implemention combining SeqKit and csvtk.

FASTA format was 1) converted to 3-column tabular format (header, sequence, quality (blank for FASTA)), and 2) protein names in FASTA header column (1th) was extracted as a new (4th) column. The table was then 3) sorted (in memory) by 4th column and 4) converted back to FASTA format.

cat seqs.fa | seqkit fx2tab | csvtk -H -t mutate -p "protein=(.+?)\]" -n protein | csvtk -H -t sort -k 4 | seqkit tab2fx > sorted_by_protein.fa

Fake sequences (with duplicated protein names)

$ cat seqs.fa
>id1 [protein=putative glycosyltransferase]
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>id2 [protein=hypothetical protein]
ccccccccccccccccccccccccccccccccccccccccccccc
>id3 [protein=putative another protein 2]
ttttttttttttttttttttttttttttttttttttttttttttt
>id4 [protein=hypothetical protein]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Result

$ head sorted_by_protein.fa 
>id4 [protein=hypothetical protein]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>id2 [protein=hypothetical protein]
ccccccccccccccccccccccccccccccccccccccccccccc
>id3 [protein=putative another protein 2]
ttttttttttttttttttttttttttttttttttttttttttttt
>id1 [protein=putative glycosyltransferase]
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

score 2 · Answer 4 · 2016-11-23

To deal with duplicate protein names, you might use Perl instead of awk. It's easier to deal with duplicate keys in a Perl hash table, in that duplicate element hits can be pushed to the end of an array.

It's relatively simple and fast and doesn't require installing any special software:

#!/usr/bin/env perl                                                                                                                                                                                                                                                       

use strict;
use warnings;

my $header;
my $sequence;
my $protein;
my $data;

while (<STDIN>) {
    chomp $_;
    if ($_ =~ /^>/) {
        $header = $_;
        if ($header =~ /\[protein=([A-Za-z_0-9 ]*)/) {
            $protein = $1;
            push(@{$data->{$protein}->{header}}, $header);
        }
        else {
            die "No protein name found!\n";
        }
    }
    else {
        $sequence = $_;
        push(@{$data->{$protein}->{sequence}}, $sequence);
    }
}

foreach my $key (sort keys %{$data}) {
    my $length = @{$data->{$key}->{header}};
    for (my $idx = 0; $idx < $length; $idx++) {
        $header = $data->{$key}->{header}->[$idx];
        $sequence = $data->{$key}->{sequence}->[$idx];
        print STDOUT "$header\n$sequence\n";
    }
}

To use:

$ sort_records.pl < records.fa > sorted_records.fa

score 2 · Answer 5 · 2016-11-23

you may try this simple yet powerful combination of bash commands:

perl -pe 's/[\r\n]+/;/g; s/>/\n>/g' input.fasta \
| sort -t"[" -k2,2V | sed 's/;/\n/g' | sed '/^$/d'

where perl is used to linearize all lines and to split them into 1 line per label+sequence, sort is used to version sort by the field you're interested in, and final seds are used to split labels from sequences and to remove all empty lines generated through the process.