How to sort a FASTA file
10 months ago
Mo ▴ 40

Hi,

I have a FASTA file with sequences like this:

>gene2_barcode2
CGGTGGCATTGGC

>gene1_barcode1
GCTACGTAGCTAG

>gene2_barcode1
GCCGTACGTTAGA

>gene1_barcode2
TCGTACGAGTCAC

I want to sort this file by gene name, so that gene1, gene2, gene3, and so on appear in order rather than randomly. How can I do this?

Thanks a lot.

fasta • 1.1k views
10 months ago

You can use seqkit.

seqkit sort -N in.fa

Alternatively, if each of your sequences really is on a single line:

paste - - < in.fa | sort -Vk1 | tr "\t" "\n"

Both of these commands use natural (version) sorting, so that, for example, gene2 sorts before gene10.

Also, I'm not sure if this applies to your actual data, but FASTA files shouldn't have a blank line between entries.

10 months ago

linearize, sort, restore

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  in.fa | sort -t $'\t' -k1,1V | tr "\t" "\n"
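The same linearize, sort, restore idea can be sketched in plain Python, which may be easier to read than the one-liner. This is only a minimal sketch: the function name sort_fasta_text is made up for illustration, it skips blank lines, and it sorts headers as plain strings rather than with the version sort (V) used in the command above.

```python
def sort_fasta_text(text):
    """Return FASTA text with records sorted by header (plain string order)."""
    records = []                 # list of (header, sequence) pairs
    header, chunks = None, []

    # Linearize: join wrapped sequence lines so each record is one pair.
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue             # ignore blank lines between entries
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    # Sort: order the records by their header line.
    records.sort(key=lambda record: record[0])

    # Restore: write each record back as a header line plus one sequence line.
    return "".join(f"{h}\n{s}\n" for h, s in records)


example = """>gene2_barcode2
CGGTGGCATTGGC

>gene1_barcode1
GCTACG
TAGCTAG
"""
print(sort_fasta_text(example), end="")
```

To use it on a file, read the file contents into a string, pass the string to the function, and write the result to a new file.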
10 months ago

Using Python:

This code reads your FASTA file, stores the entries in a dictionary and writes them back to a new FASTA file in a sorted order. It assumes that your FASTA file is formatted properly with each sequence header preceded by a ">".

from Bio import SeqIO

# read the FASTA file
sequences = SeqIO.to_dict(SeqIO.parse("input.fasta", "fasta"))

# sort sequences by keys (headers)
sorted_sequences = dict(sorted(sequences.items()))

# write the sorted sequences to a new FASTA file
with open("sorted.fasta", "w") as output_handle:
    SeqIO.write(sorted_sequences.values(), output_handle, "fasta")

You need to replace "input.fasta" with the name of your FASTA file. The code will then write the sorted sequences to a file named "sorted.fasta"; you can change this to any filename you prefer. Since the script uses Biopython, you need to install it if you have not already: pip install biopython.

Overall, the code sorts your sequences alphabetically by header. For headers such as "gene1" and "gene2", where the numbers have the same number of digits, alphabetical order matches numerical order. Headers like "gene1", "gene11", "gene2" will not be sorted numerically, because "gene11" comes alphabetically before "gene2". In that case, you'd need to give your headers consistent formatting, such as "gene01", "gene02", "gene11", or add code to sort numerically.
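One possible way to sort numerically is a natural-sort key. The sketch below uses only the standard library; the helper name natural_key and the example header "gene11_barcode1" are made up for illustration and are not part of the script above.

```python
import re

def natural_key(header):
    """Split a header into text and number chunks so digits compare as integers."""
    # re.split with a capturing group keeps the digit runs, which always
    # land at the odd positions of the resulting list.
    parts = re.split(r"(\d+)", header)
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]

headers = [
    "gene2_barcode2",
    "gene11_barcode1",   # hypothetical header, added to show the difference
    "gene1_barcode1",
    "gene2_barcode1",
    "gene1_barcode2",
]

sorted_headers = sorted(headers, key=natural_key)
for name in sorted_headers:
    print(name)
```

In the Biopython script, the same key could be applied to the dictionary items, for example sorted(sequences.items(), key=lambda item: natural_key(item[0])).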

Good luck

10 months ago
Jesse ▴ 740

With seqmagick:

seqmagick convert --sort name-asc file.fa file_sorted.fa

(Or seqmagick mogrify to overwrite the original file. --line-wrap 0 is also helpful if you're like me and find wrapped FASTAs annoying.)

It can sort by name or length, in either order:

  --sort {length-asc,length-desc,name-asc,name-desc}
                        Perform sorting by length or name, ascending or
                        descending. ASCII sorting is performed for names