Question: How to rearrange a FASTA file by sequence length
1
5.4 years ago by
akhilvbioinfo160
India, chennai
akhilvbioinfo160 wrote:

Hi,

I want to rearrange my FASTA file according to sequence length.

next-gen forum sequence • 11k views
ADD COMMENTlink modified 3.9 years ago by st.ph.n2.5k • written 5.4 years ago by akhilvbioinfo160
3

Hi, welcome to Biostars.

In this forum, showing that you spent some time searching for a solution beforehand (what has been tried, which language, etc.) is much appreciated. Moreover, although this request is quite clear, paying a bit of attention to form makes the forum easier and more pleasant to browse.

Feel free to edit your own question to fulfill these expectations. Thanks.

ADD REPLYlink modified 14 months ago by _r_am32k • written 5.4 years ago by Manu Prestat4.0k
12
4.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum132k wrote:

answering because

:-P

    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fasta  |\
    awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
    sort -k1,1n | cut -f 2- | tr "\t" "\n"
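
For readers who find awk hard to follow, the same linearise-then-sort idea can be sketched in plain Python. This is only an illustration and not part of the original answer: the names `read_fasta` and `sort_by_length` are made up here, the demo records are toy data, and all records are held in memory.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def sort_by_length(records, descending=False):
    """Return records ordered by sequence length; ties keep input order."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=descending)


# Toy input; for a real file, pass an open file handle to read_fasta instead.
demo = [">a desc", "ACGT", "AC", ">b", "A", ">c", "ACG"]
for name, seq in sort_by_length(read_fasta(demo)):
    print(f">{name}\n{seq}")
```

Passing `descending=True` gives longest-first output, which corresponds to changing `sort -k1,1n` to `sort -k1,1nr` in the pipeline.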

.

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Pierre Lindenbaum132k

I love the solutions using GNU utils. Wish I was better at awk!

Do you know if anyone has an at all comprehensive list of simple bioinfo file manipulations like this using just built in tools? It would be a really cool resource for people who aren't very 'sys-admin-y' and get stuck when installing things (or just don't have permissions!)

ADD REPLYlink written 4.0 years ago by Joe18k
2

biostars + search field!

ADD REPLYlink written 4.0 years ago by Alex Reynolds31k

Maybe this one? stephenturner/oneliners: Useful bash one-liners for bioinformatics. https://github.com/stephenturner/oneliners#awk--sed-for-bioinformatics

ADD REPLYlink written 3.9 years ago by AK2.0k
5
4.0 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

Using the BBMap package:

sortbyname.sh in=file.fa out=sorted.fa length descending

Default is to sort by name, but it can also sort by length or quality.

ADD COMMENTlink written 4.0 years ago by Brian Bushnell17k
2
3.9 years ago by
shenwei3565.7k
China
shenwei3565.7k wrote:

Sorting by seq length using seqkit:

$ seqkit sort -l hairpin.fa

Filtering by seq length using seqkit seq:

# before filtering
$ seqkit stat hairpin.fa
file        format  type  num_seqs    sum_len  min_len  avg_len  max_len
hairpin.fa  FASTA   RNA     28,645  2,949,871       39      103    2,354

# length >= 100
$ seqkit seq --min-len 100 hairpin.fa | seqkit stat
file  format  type  num_seqs    sum_len  min_len  avg_len  max_len
-     FASTA   RNA     10,975  1,565,486      100    142.6    2,354

Don't worry about installing seqkit (download): it provides executable binary files for Linux/Windows/OS X. Just download, decompress, and use immediately.
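
To make the columns in the `seqkit stat` tables concrete, here is a rough Python sketch of what they summarise, together with an inclusive minimum-length filter (matching the `length >= 100` comment). It is not how seqkit is implemented; `length_stats` and `keep_min_len` are names invented for this illustration, and the sequences are toy data.

```python
def length_stats(lengths):
    """Summarise sequence lengths using the column names from the table above."""
    if not lengths:
        return {"num_seqs": 0, "sum_len": 0, "min_len": 0, "avg_len": 0.0, "max_len": 0}
    total = sum(lengths)
    return {
        "num_seqs": len(lengths),
        "sum_len": total,
        "min_len": min(lengths),
        "avg_len": round(total / len(lengths), 1),
        "max_len": max(lengths),
    }


def keep_min_len(seqs, min_len):
    """Keep sequences whose length is at least min_len (inclusive)."""
    return [s for s in seqs if len(s) >= min_len]


seqs = ["ACGU" * 30, "ACGU" * 10, "ACGU" * 25]  # toy data: lengths 120, 40, 100
print(length_stats([len(s) for s in seqs]))
print(length_stats([len(s) for s in keep_min_len(seqs, 100)]))
```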

ADD COMMENTlink modified 3.8 years ago • written 3.9 years ago by shenwei3565.7k
1
5.4 years ago by
geek_y11k
Barcelona
geek_y11k wrote:

I would use pyfaidx

Get length of each sequence and sort (ascending or descending):

faidx  --transform chromsizes test.fasta | sort -k2,2n > sorted_list

Then extract sequences in that order:

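from pyfaidx import Fasta

sq = Fasta("test.fasta")

# sorted_list holds one "name length" line per sequence, already sorted by length
with open("sorted_list") as regions:
    for line in regions:
        cord = line.split()
        print(">" + sq[cord[0]].long_name)  # full header line
        print(sq[cord[0]])                  # full sequence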

or you could use script given in this repository.

ADD COMMENTlink modified 14 months ago by _r_am32k • written 5.4 years ago by geek_y11k

Thank you very much, sir.

I want to extract all sequences with length above 30 from a large FASTA file.

ADD REPLYlink modified 14 months ago by _r_am32k • written 5.4 years ago by akhilvbioinfo160
faidx  --transform chromsizes test.fasta | awk '{if ($2>=30) print }' | sort -k2,2n > sorted_list
ADD REPLYlink modified 14 months ago by _r_am32k • written 5.4 years ago by geek_y11k

You can also use faFilter to extract sequences. Other options are also available.

$ ./faFilter

    -v         - invert match, select non-matching records.
    -minSize=N - Only pass sequences at least this big.
    -maxSize=N - Only pass sequences this size or smaller.
    -maxN=N    - Only pass sequences with fewer than this number of N's.
    -uniq      - Removes duplicate sequence ids, keeping the first.
    -i         - make -uniq ignore case so sequence IDs ABC and abc count as dupes.
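
As a rough illustration of what the size and uniqueness options described above mean, here is a small Python sketch. It is not faFilter's implementation: the function `fa_filter` and its parameter names are hypothetical, and applying the size checks before the duplicate check is an assumption, since the help text does not say which comes first.

```python
def fa_filter(records, min_size=None, max_size=None, uniq=False, ignore_case=False):
    """Filter (name, sequence) pairs by length, optionally dropping repeated names."""
    seen = set()
    kept = []
    for name, seq in records:
        if min_size is not None and len(seq) < min_size:
            continue  # shorter than the minimum size
        if max_size is not None and len(seq) > max_size:
            continue  # longer than the maximum size
        if uniq:
            key = name.lower() if ignore_case else name
            if key in seen:
                continue  # keep only the first record with this name
            seen.add(key)
        kept.append((name, seq))
    return kept


# Toy records: two names differing only in case, plus one short sequence.
records = [("ABC", "A" * 40), ("abc", "A" * 50), ("short", "A" * 10)]
print([name for name, _ in fa_filter(records, min_size=30)])
print([name for name, _ in fa_filter(records, min_size=30, uniq=True, ignore_case=True)])
```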
ADD REPLYlink modified 14 months ago by _r_am32k • written 5.4 years ago by venu6.7k

Thank you, sir.

ADD REPLYlink written 5.4 years ago by akhilvbioinfo160

Hi, sorry if this is obvious. How do you sort the FASTA file into descending order using the faidx tool?

ADD REPLYlink written 11 months ago by adamhannaford10
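
A note on the question above: the ordering in the pyfaidx answer comes from `sort`, not from `faidx`, so changing `sort -k2,2n` to `sort -k2,2nr` reverses it. The same step can be sketched in Python, assuming the table is tab-separated `name<TAB>length` lines; `sort_sizes` is a name invented here and the table is made-up data.

```python
def sort_sizes(table_text, descending=True):
    """Sort 'name<TAB>length' lines by the numeric length column."""
    rows = []
    for line in table_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        rows.append((name, int(length)))
    return sorted(rows, key=lambda row: row[1], reverse=descending)


sizes = "seq1\t250\nseq2\t30\nseq3\t1200\n"  # made-up example table
for name, length in sort_sizes(sizes):
    print(f"{name}\t{length}")
```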

You can sort a FASTA file using BBMap.

Example command:

./sortbyname.sh in=input.fa out=out.fa length descending
ADD REPLYlink modified 11 months ago by _r_am32k • written 11 months ago by akhilvbioinfo160
1
3.9 years ago by
st.ph.n2.5k
Philadelphia, PA
st.ph.n2.5k wrote:

Here's a quick GUI in Python:

    #!/usr/bin/env python3

    import sys
    import tkinter as tk
    from tkinter import filedialog

    from Bio import SeqIO  # Biopython


    class App(object):
        def __init__(self):
            self.root = tk.Tk()
            self.root.wm_title("Format Fasta")

            self.inp = tk.StringVar(self.root)
            tk.Label(self.root, text="\nPlease provide the FASTA file containing your sequences.").pack()
            tk.Button(self.root, text="FASTA", command=lambda: self.inp.set(filedialog.askopenfilename())).pack()

            self.output = tk.StringVar(self.root)
            tk.Label(self.root, text="\nPlease enter a prefix for your output file.").pack()
            tk.Entry(self.root, textvariable=self.output).pack()
            tk.Label(self.root, text="").pack()

            self.request = tk.StringVar(self.root)
            tk.Label(self.root, text="\nPlease enter the min. length of a sequence to keep.").pack()
            tk.Entry(self.root, textvariable=self.request).pack()
            tk.Label(self.root, text="").pack()

            tk.Label(self.root, text="").pack()
            tk.Button(self.root, text="Run", command=self.clickedrun).pack()
            tk.Button(self.root, text="Exit", command=sys.exit).pack()

            self.root.geometry("375x425")
            self.root.mainloop()

        def clickedrun(self):
            # The entry field returns text, so convert it before comparing lengths.
            length = int(self.request.get())
            prefix = self.output.get()
            tk.Label(self.root, text="\nKeeping sequences of at least %d bp.." % length, fg='blue').pack()
            inpfile = self.inp.get()
            outfile = prefix + '.fasta'
            with open(inpfile) as f:
                records = list(SeqIO.parse(f, "fasta"))
            with open(outfile, 'w') as out:
                for record in records:
                    # Inclusive comparison: the field asks for a minimum length.
                    if len(record.seq) >= length:
                        out.write('>' + record.id + '\n' + str(record.seq) + '\n')
            tk.Label(self.root, text="\nDone!", fg='blue').pack()


    App()

Copy/paste and save as a Python file (Biopython must be installed). Click the 'FASTA' button to provide the path to the input FASTA. Then there are two entry fields, one for the output file prefix and another for the desired minimum length. Click 'Run', and a new file named after your prefix (with a .fasta extension) is written to the directory the script is run from, unless the prefix includes a path.

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by st.ph.n2.5k
0
5.4 years ago by
dariober11k
WCIP | Glasgow | UK
dariober11k wrote:

A somewhat contrived way to do it with only Unix tools. Keep sequences longer than 20 and sort them in decreasing order of length:

MIN_LEN=20
INFILE=seqs.fa

awk -v RS=">" 'NR > 1{sub("\n", "\t", $0); gsub("\n", "_", $0); sub("_$", "", $0); print ">"$0}' $INFILE \
| awk -v MIN_LEN=$MIN_LEN -v FS="\t" -v OFS="\t" '{seq = $2; gsub("_", "", seq); if(length(seq) > MIN_LEN) {print $0, length(seq)}}' \
| sort -k3,3nr \
| awk -v FS="\t" '{gsub("_", "\n", $2); print $1 "\n" $2}'

(Sequence names must not contain the tab character)

ADD COMMENTlink modified 14 months ago by _r_am32k • written 5.4 years ago by dariober11k
0
3.9 years ago by
AK2.0k
Taipei
AK2.0k wrote:

At first thought (FASTX-Toolkit + awk):

fasta_formatter -i input.fasta -t \
  | awk -F $'\t' '{print length($2) "\t" $0}' \
  | sort -k1,1nr \
  | awk -F $'\t' '{print ">" $2 "\n" $3}' \
  | fasta_formatter -w 80
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by AK2.0k