Question: Sort Multiple Fasta Numerically
1
gravatar for PoGibas
7.3 years ago by
PoGibas4.8k
Vilnius
PoGibas4.8k wrote:

My multiple fasta looks something like that:

>chr2_1000-1020

NNNNNNNNNNNNNNNN

>chr2_1-20

NNNNNNNNNNNN

>chr2_20-40

...........

>chr2-200-220

And I want to sort all the sequences in numerical order. Any suggestions?

fasta multiple • 5.7k views
ADD COMMENTlink written 7.3 years ago by PoGibas4.8k
6
gravatar for lh3
7.3 years ago by
lh331k
United States
lh331k wrote:

The following is not the best answer, but I cannot help mentioning my alpha-numeric sort, which I use daily. It is modified from unix sort and sorts "chr10,chr2" to "chr2,chr10" and also coordinates in the right way. You can use Pierre's command line to turn fasta into one line, I will just use bioawk for simplicity:

awk -c fastx '{print}' in.fa | sort -k1,1N | awk '{print ">"$1;print $2}'

where "sort" here is compiled from by modified sort. For me, this alpha-numeric sort has reduced a lot of tedious works related to coordinate sorting.

ADD COMMENTlink written 7.3 years ago by lh331k
4
gravatar for Frédéric Mahé
7.3 years ago by
France, Montpellier, CIRAD
Frédéric Mahé2.9k wrote:

This simple command line should do the trick:

sed -e '/>/s/^/@/' -e '/>/s/$/#/' file.fasta | tr -d "\n" | tr "@" "\n" | sort -t "_" -k2n | tr "#" "\n" | sed -e '/^$/d'

Here is an explanation: first add special characters around the fasta header, remove new line symbols, sort numerically on the second part of the fasta header, put back new line characters, remove the empty line created during the process.

ADD COMMENTlink written 7.3 years ago by Frédéric Mahé2.9k

This is interesting but it will not sort sequences such as the last one in the example correctly.

ADD REPLYlink written 7.3 years ago by SES8.2k

True. I considered the hyphen in the last one as a typo and replaced it with an underscore. am I wrong?

ADD REPLYlink written 7.3 years ago by Frédéric Mahé2.9k
2
gravatar for Pierre Lindenbaum
7.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

linearize your sequence as described here: http://biostar.stackexchange.com/questions/18881/choosing-random-set-of-seqs-from-larger-set/18888#18888 . But use awk or perl to pad your chrom/position with the same number of zeros.

e.g:

>chr02_000000001-000000020 NNNN...
>chr02_000000020-000000040 NNNN...
>chr02_000000200-000000240 NNNN...

then sort on this first column... I'm too lazy to write the script tonight.

ADD COMMENTlink written 7.3 years ago by Pierre Lindenbaum121k
2
gravatar for SES
7.3 years ago by
SES8.2k
Vancouver, BC
SES8.2k wrote:

Here is a solution using cdbfasta that is fast and won't keep all the data in memory. The caveat is that cdbfasta has limitations with the size of the index it can create, but I don't know how many sequences you have so it's worth a shot. Assuming your file is named "myseqs.fasta" then:

grep ">" myseqs.fasta | sed 's/>//g' | sort -k1.6n > myseqs_idlist.txt
cdbfasta myseqs.fasta -o myseqs.fasta.index
cat myseqs_idlist.txt | cdbyank myseqs.fasta.index > myseqs_sorted.fasta
rm myseqs.fasta.index
ADD COMMENTlink modified 7.3 years ago • written 7.3 years ago by SES8.2k

Nice!

This approach solves exactly the question posed, is quite performant, and introduces a very useful generic tool, cdbfasta.

However, I think cdbyank is needs another argument, namely, myseqs.fasta.cidx

and, finally, the missing clean-up step:

rm myseqs.fasta.cidx
ADD REPLYlink written 7.3 years ago by Malcolm.Cook1.1k

Nice catch Malcolm! I actually tested it to make sure the command worked but just made a typo. The command is correct now.

ADD REPLYlink written 7.3 years ago by SES8.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1232 users visited in the last hour