Question

how to rearrange fasta file according to its length

1

Entering edit mode

8.7 years ago

Eva_Maria ▴ 190

Hi

I want to rearrange my fasta file according to length of sequence

next-gen sequence • 21k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.7 years ago by Eva_Maria ▴ 190

3

Entering edit mode

Hi, welcome to Biostars.

In this forum, showing that one spent some time to search for a solution beforehand (what has been tried / which language...? ) is much appreciated. Moreover, whereas this request is quite clear, paying a bit attention to the form makes the forum easier and more pleasant to browse.

Feel free to edit your own question in order to fullfill these expectations. Thanks.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.7 years ago by Manu Prestat 4.1k

score 15 · Answer 1 · 2017-01-10

15

Entering edit mode

7.3 years ago

Pierre Lindenbaum 161k

answering because

This seems like simple #genomics question that doesn't seem to have a simple answer. Is there a tool to re-order a fasta file on seq length?
— Julie Blommaert (@Julie_B92) January 10, 2017

:-P

use awk to linearize the fasta: http://stackoverflow.com/documentation/bioinformatics/4194
a second awk to insert a column with the length of the sequence
sort on first column
remove the first colum
convert back to fasta

    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fasta  |\
    awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
    sort -k1,1n | cut -f 2- | tr "\t" "\n"

.

ADD COMMENT • link 7.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I love the solutions using GNU utils. Wish I was better at awk!

Do you know if anyone has an at all comprehensive list of simple bioinfo file manipulations like this using just built in tools? It would be a really cool resource for people who aren't very 'sys-admin-y' and get stuck when installing things (or just don't have permissions!)

ADD REPLY • link 7.3 years ago by Joe 21k

2

Entering edit mode

biostars + search field!

ADD REPLY • link 7.3 years ago by Alex Reynolds 35k

0

Entering edit mode

Maybe this one? stephenturner/oneliners: Useful bash one-liners for bioinformatics. https://github.com/stephenturner/oneliners#awk--sed-for-bioinformatics

ADD REPLY • link 7.1 years ago by AK ★ 2.2k

0

Entering edit mode

Note that this solution sorts sequences by increasing length. To sort by decreasing length, simply modify the sort call with the r parameter: sort -k1,1nr

ADD REPLY • link 2.1 years ago by acvill ▴ 340

score 6 · Answer 2 · 2017-01-10

6

Entering edit mode

7.3 years ago

Brian Bushnell 20k

Using the BBMap package:

sortbyname.sh in=file.fa out=sorted.fa length descending

Default is to sort by name, but it can also sort by length or quality.

ADD COMMENT • link 7.3 years ago by Brian Bushnell 20k

score 3 · Answer 3 · 2017-02-25

Sorting by seq length using seqkit:

$ seqkit sort -l hairpin.fa

Filtering by seq length using seqkit seq:

# before filtering
$ seqkit stat hairpin.fa
file        format  type  num_seqs    sum_len  min_len  avg_len  max_len
hairpin.fa  FASTA   RNA     28,645  2,949,871       39      103    2,354

# length >= 100
$ seqkit seq --min-len 100 hairpin.fa | seqkit stat
file  format  type  num_seqs    sum_len  min_len  avg_len  max_len
-     FASTA   RNA     10,975  1,565,486      100    142.6    2,354

Never worry about the installation of the seqkit (download), it provide sexecutable binary files for Linux/Windows/OS X. Just download, decompress and immediately use.

Ram · Answer 4 · 2017-02-27

Here's a quick GUI in python

 #!/usr/bin/env python

import Tkinter, tkFileDialog
from Tkinter import *
from Bio import SeqIO

class App(object):
        def __init__(self):
                self.root = Tk()
                self.root.wm_title("Format Fasta")

                self.inp = StringVar(self.root)
                Label(self.root, text = "\nPlease provide the FASTA file containing your sequences.").pack()
                Button(self.root, text = "FASTA", command=lambda:self.inp.set(tkFileDialog.askopenfilename())).pack()

                self.output = StringVar(self.root)
                Label(self.root, text = "\nPlease enter a prefix for your output file.").pack()
                Entry(self.root, textvariable=self.output).pack()
                Label(self.root, text = "").pack()

                self.request = StringVar(self.root)
                Label(self.root, text = "\nPlease enter the min. length of a sequence to keep.").pack()
                Entry(self.root, textvariable = self.request).pack()
                Label(self.root, text = "").pack()

                Label(self.root, text = "").pack()
                Button(self.root, text = "Run", command = self.clickedrun).pack()
                Button(self.root, text = "Exit", command = sys.exit).pack()

                self.root.geometry("375x425")
                self.root.mainloop()

        def clickedrun(self):
                length = self.request.get()
                prefix = self.output.get()
                Label(self.root, text = "\nTrimming sequences to first " + length + " bp..", fg='blue').pack()
                inpfile = self.inp.get()
                outfile = prefix + '.fasta'
                with open(inpfile, 'rU') as f:
                        records = list(SeqIO.parse(f, "fasta"))
                with open(outfile, 'w') as out:
                        for r in range(len(records)):
                                if len(records[r].seq) > length:
                                        print >> out, '>' + records[r].id, '\n', records[r].seq
                Label(self.root, text="\nDone!", fg='blue').pack()


App()

Copy/paste, save as Python file. Click the 'FASTA' button to provide the path to the input fasta. Then there are two entry fields, one for the output file prefix, and another for the desired minimum length. Click 'Run', and you will get a new file in the same directory as your input file.

Ram · Answer 5 · 2015-08-11

1

Entering edit mode

8.7 years ago

GouthamAtla 12k

I would use pyfaidx

Get length of each sequence and sort (ascending or descending):

faidx  --transform chromsizes test.fasta | sort -k2,2n > sorted_list

Then extract sequences in that order:

from pyfaidx import Fasta
sq = Fasta("test.fasta")

with open("sorted_list") as regions:
    for line in regions:
        cord=line.split()
        print ">"+sq[cord[0]].long_name
        print sq[cord[0]]

or you could use script given in this repository.

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.7 years ago by GouthamAtla 12k

0

Entering edit mode

thank u very much sir

I want to extract all sequences of length above 30 in a large fasta file

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.7 years ago by Eva_Maria ▴ 190

0

Entering edit mode

faidx  --transform chromsizes test.fasta | awk '{if ($2>=30) print }' | sort -k2,2n > sorted_list

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.7 years ago by GouthamAtla 12k

0

Entering edit mode

You can also use faFilter to extract sequences. Other options are also available.

$./faFilter

-v - invert match, select non-matching records.
    -minSize=N - Only pass sequences at least this big.
    -maxSize=N - Only pass sequences this size or smaller.
    -maxN=N Only pass sequences with fewer than this number of N's
    -uniq - Removes duplicate sequence ids, keeping the first.
    -i    - make -uniq ignore case so sequence IDs ABC and abc count as dupes.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.7 years ago by venu 7.1k

0

Entering edit mode

thank u sir

ADD REPLY • link 8.7 years ago by Eva_Maria ▴ 190

0

Entering edit mode

Hi, sorry if this is obvious. How do you sort the fasta file into descending order using he faidx tool?

ADD REPLY • link 4.2 years ago by adamhannaford1 • 0

0

Entering edit mode

You can sort fasta by using BBMAP

example cmd

./sortbyname.sh in=input.fa out=out.fa descending

ADD REPLY • link updated 4.2 years ago by Ram 43k • written 4.2 years ago by Eva_Maria ▴ 190

Ram · Answer 6 · 2015-08-11

A somewhat contrived way to do it with only Unix tools. Keep sequences longer than 20 and sort them in decreasing order of length:

MIN_LEN=20
INFILE=seqs.fa

awk -v RS=">" 'NR > 1{sub("\n", "\t", $0); gsub("\n", "_", $0); sub("_$", "", $0); print ">"$0}' $INFILE \
| awk -v MIN_LEN=$MIN_LEN -v FS="\t" -v OFS="\t" '{if(length($2) > MIN_LEN) {print $0, length($2)}}' \
| sort -k3,3nr \
| awk -v FS="\t" '{gsub("_", "\n", $2); print $1 "\n" $2}'

(Sequence names must not contain the tab character)

score 0 · Answer 7 · 2017-02-25

0

Entering edit mode

7.1 years ago

AK ★ 2.2k

At first thought (FASTX-Toolkit + awk):

fasta_formatter -i input.fasta -t \
  | awk -F $'\t' '{print length($2) "\t" $0}' \
  | sort -k1,1nr \
  | awk '{print ">" $2 "\n" $3}' \
  | fasta_formatter -w 80

ADD COMMENT • link 7.1 years ago by AK ★ 2.2k