Sort Multiple Fasta Numerically
4
1
Entering edit mode
9.5 years ago
PoGibas 4.9k

My multiple fasta looks something like that:

>chr2_1000-1020

NNNNNNNNNNNNNNNN

>chr2_1-20

NNNNNNNNNNNN

>chr2_20-40

...........

>chr2-200-220

And I want to sort all the sequences in numerical order. Any suggestions?

multiple fasta • 7.0k views
7
Entering edit mode
9.5 years ago
lh3 32k

The following is not the best answer, but I cannot help mentioning my alpha-numeric sort, which I use daily. It is modified from unix sort and sorts "chr10,chr2" to "chr2,chr10" and also coordinates in the right way. You can use Pierre's command line to turn fasta into one line, I will just use bioawk for simplicity:

bioawk -c fastx '{print}' in.fa | sort -k1,1N | awk '{print ">"$1;print$2}'


where "sort" here is compiled from by modified sort. For me, this alpha-numeric sort has reduced a lot of tedious works related to coordinate sorting.

Edit 2.3.21 by ATpoint

In the original answer it was awk -c fastx instead of bioawk, I made that change now.

One might also try sort -k1,1V to get a simple natural sort here in case that is of interest.

4
Entering edit mode
9.5 years ago

This simple command line should do the trick:

sed -e '/>/s/^/@/' -e '/>/s/$/#/' file.fasta | tr -d "\n" | tr "@" "\n" | sort -t "_" -k2n | tr "#" "\n" | sed -e '/^$/d'


Here is an explanation: first add special characters around the fasta header, remove new line symbols, sort numerically on the second part of the fasta header, put back new line characters, remove the empty line created during the process.

0
Entering edit mode

This is interesting but it will not sort sequences such as the last one in the example correctly.

0
Entering edit mode

True. I considered the hyphen in the last one as a typo and replaced it with an underscore. am I wrong?

2
Entering edit mode
9.5 years ago

Linearize your sequence as described here. But use awk or perl to pad your chrom/position with the same number of zeros.

e.g:

>chr02_000000001-000000020 NNNN...
>chr02_000000020-000000040 NNNN...
>chr02_000000200-000000240 NNNN...


then sort on this first column... I'm too lazy to write the script tonight.

2
Entering edit mode
9.5 years ago
SES 8.5k

Here is a solution using cdbfasta that is fast and won't keep all the data in memory. The caveat is that cdbfasta has limitations with the size of the index it can create, but I don't know how many sequences you have so it's worth a shot. Assuming your file is named "myseqs.fasta" then:

grep ">" myseqs.fasta | sed 's/>//g' | sort -k1.6n > myseqs_idlist.txt
cdbfasta myseqs.fasta -o myseqs.fasta.index
cat myseqs_idlist.txt | cdbyank myseqs.fasta.index > myseqs_sorted.fasta
rm myseqs.fasta.index

0
Entering edit mode

Nice!

This approach solves exactly the question posed, is quite performant, and introduces a very useful generic tool, cdbfasta.

However, I think cdbyank is needs another argument, namely, myseqs.fasta.cidx

and, finally, the missing clean-up step:

rm myseqs.fasta.cidx

0
Entering edit mode

Nice catch Malcolm! I actually tested it to make sure the command worked but just made a typo. The command is correct now.