Question

Sort Multiple Fasta Numerically

1

Entering edit mode

12.1 years ago

PoGibas 5.1k

My multiple fasta looks something like that:

>chr2_1000-1020

NNNNNNNNNNNNNNNN

>chr2_1-20

NNNNNNNNNNNN

>chr2_20-40

...........

>chr2-200-220

And I want to sort all the sequences in numerical order. Any suggestions?

multiple fasta • 9.2k views

ADD COMMENT • link updated 2.1 years ago by Asan Emirsale • 0 • written 12.1 years ago by PoGibas 5.1k

4

Entering edit mode

12.1 years ago

Frédéric Mahé ★ 3.2k

This simple command line should do the trick:

sed -e '/>/s/^/@/' -e '/>/s/$/#/' file.fasta | tr -d "\n" | tr "@" "\n" | sort -t "_" -k2n | tr "#" "\n" | sed -e '/^$/d'

Here is an explanation: first add special characters around the fasta header, remove new line symbols, sort numerically on the second part of the fasta header, put back new line characters, remove the empty line created during the process.

ADD COMMENT • link 12.1 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

This is interesting but it will not sort sequences such as the last one in the example correctly.

ADD REPLY • link 12.1 years ago by SES 8.6k

0

Entering edit mode

True. I considered the hyphen in the last one as a typo and replaced it with an underscore. am I wrong?

ADD REPLY • link 12.1 years ago by Frédéric Mahé ★ 3.2k

2

Entering edit mode

12.1 years ago

Pierre Lindenbaum 161k

Linearize your sequence as described here. But use awk or perl to pad your chrom/position with the same number of zeros.

e.g:

>chr02_000000001-000000020 NNNN...
>chr02_000000020-000000040 NNNN...
>chr02_000000200-000000240 NNNN...

then sort on this first column... I'm too lazy to write the script tonight.

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.1 years ago by Pierre Lindenbaum 161k

2

Entering edit mode

12.1 years ago

SES 8.6k

Here is a solution using cdbfasta that is fast and won't keep all the data in memory. The caveat is that cdbfasta has limitations with the size of the index it can create, but I don't know how many sequences you have so it's worth a shot. Assuming your file is named "myseqs.fasta" then:

grep ">" myseqs.fasta | sed 's/>//g' | sort -k1.6n > myseqs_idlist.txt
cdbfasta myseqs.fasta -o myseqs.fasta.index
cat myseqs_idlist.txt | cdbyank myseqs.fasta.index > myseqs_sorted.fasta
rm myseqs.fasta.index

ADD COMMENT • link 12.1 years ago by SES 8.6k

0

Entering edit mode

Nice!

This approach solves exactly the question posed, is quite performant, and introduces a very useful generic tool, cdbfasta.

However, I think cdbyank is needs another argument, namely, myseqs.fasta.cidx

and, finally, the missing clean-up step:

rm myseqs.fasta.cidx

ADD REPLY • link 12.1 years ago by Malcolm.Cook ★ 1.5k

0

Entering edit mode

Nice catch Malcolm! I actually tested it to make sure the command worked but just made a typo. The command is correct now.

ADD REPLY • link 12.1 years ago by SES 8.6k

0

Entering edit mode

2.1 years ago

Asan Emirsale • 0

Hello! Also if you want to sort by sequence length you can try this for fastq

 <reads.fastq bioawk -c fastx '{print length($seq)"\t@"$name"\t"$seq"\t+\t"$qual }' | sort -V -r | awk '{print $2 "\n" $3 "\n" $4 "\n" $5}' > filtered.fastq

and this for fasta

<reads.fasta bioawk -c fastx '{print length($seq)"\t>"$name"\t"$seq }' | sort -V -r | awk '{print $2 "\n" $3}' > filtered.fasta

ADD COMMENT • link 2.1 years ago by Asan Emirsale • 0

ATpoint · Accepted Answer · 2012-03-22

The following is not the best answer, but I cannot help mentioning my alpha-numeric sort, which I use daily. It is modified from unix sort and sorts "chr10,chr2" to "chr2,chr10" and also coordinates in the right way. You can use Pierre's command line to turn fasta into one line, I will just use bioawk for simplicity:

bioawk -c fastx '{print}' in.fa | sort -k1,1N | awk '{print ">"$1;print $2}'

where "sort" here is compiled from by modified sort. For me, this alpha-numeric sort has reduced a lot of tedious works related to coordinate sorting.

Edit 2.3.21 by ATpoint

In the original answer it was awk -c fastx instead of bioawk, I made that change now.

One might also try sort -k1,1V to get a simple natural sort here in case that is of interest.