Question: How to rearrange a FASTA file according to sequence length
0
4.4 years ago by
akhilvbioinfo140
India, chennai
akhilvbioinfo140 wrote:

Hi,

I want to rearrange my FASTA file according to the length of each sequence.

next-gen forum sequence • 8.8k views
ADD COMMENTlink modified 2.9 years ago by st.ph.n2.5k • written 4.4 years ago by akhilvbioinfo140
3

Hi, welcome to Biostars.

In this forum, showing that you have spent some time searching for a solution beforehand (what has been tried, which language, etc.) is much appreciated. Moreover, although this request is quite clear, paying a bit of attention to the form makes the forum easier and more pleasant to browse.

Feel free to edit your own question to fulfill these expectations. Thanks.

ADD REPLYlink modified 10 weeks ago by RamRS25k • written 4.4 years ago by Manu Prestat3.9k
11
3.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum125k wrote:

answering because

:-P

    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fasta  |\
    awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
    sort -k1,1n | cut -f 2- | tr "\t" "\n"
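
For readers less comfortable with awk, the same linearize-then-sort idea can be sketched in plain Python. This is only a minimal sketch using the standard library: the names `read_fasta` and `sort_fasta_by_length` are made up for this example, and it assumes the whole file fits in memory.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def sort_fasta_by_length(handle, descending=False):
    """Return all records sorted by sequence length (ties keep input order)."""
    return sorted(read_fasta(handle), key=lambda rec: len(rec[1]), reverse=descending)


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("input.fasta") for a real file.
    demo = io.StringIO(">a\nACGT\nAC\n>b\nA\n>c\nACG\n")
    for name, seq in sort_fasta_by_length(demo):
        print(">" + name)
        print(seq)
```

Like the awk version, this writes each sequence on a single line; pass `descending=True` to get the longest sequences first.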

.

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Pierre Lindenbaum125k

I love the solutions using GNU utils. Wish I was better at awk!

Do you know if anyone has a reasonably comprehensive list of simple bioinfo file manipulations like this using just built-in tools? It would be a really cool resource for people who aren't very 'sys-admin-y' and get stuck when installing things (or just don't have permissions!)

ADD REPLYlink written 3.0 years ago by Joe15k
2

biostars + search field!

ADD REPLYlink written 3.0 years ago by Alex Reynolds29k

Maybe this one? stephenturner/oneliners: Useful bash one-liners for bioinformatics. https://github.com/stephenturner/oneliners#awk--sed-for-bioinformatics

ADD REPLYlink written 2.9 years ago by SMK1.9k
5
3.0 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

Using the BBMap package:

sortbyname.sh in=file.fa out=sorted.fa length descending

Default is to sort by name, but it can also sort by length or quality.

ADD COMMENTlink written 3.0 years ago by Brian Bushnell17k
1
4.4 years ago by
geek_y10k
Barcelona
geek_y10k wrote:

I would use pyfaidx

Get length of each sequence and sort (ascending or descending):

faidx  --transform chromsizes test.fasta | sort -k2,2n > sorted_list

Then extract sequences in that order:

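from pyfaidx import Fasta

sq = Fasta("test.fasta")

# sorted_list holds one "name length" line per sequence, already ordered by length
with open("sorted_list") as regions:
    for line in regions:
        cord = line.split()
        if not cord:
            continue  # skip blank lines
        print(">" + sq[cord[0]].long_name)
        print(sq[cord[0]])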

or you could use script given in this repository.

ADD COMMENTlink modified 8 weeks ago by RamRS25k • written 4.4 years ago by geek_y10k

Thank you very much, sir.

I want to extract all sequences of length above 30 in a large FASTA file.

ADD REPLYlink modified 8 weeks ago by RamRS25k • written 4.4 years ago by akhilvbioinfo140
faidx  --transform chromsizes test.fasta | awk '{if ($2>=30) print }' | sort -k2,2n > sorted_list
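
If only the length filter is needed, a large file can also be streamed one record at a time, without an index or a sort step. The following is a rough sketch in plain Python, not part of pyfaidx: `filter_fasta` and its `min_len` parameter are hypothetical names chosen for this example.

```python
import io


def filter_fasta(handle, min_len):
    """Yield (header, sequence) for records whose length is >= min_len.

    Only one record is held in memory at a time.
    """
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record: emit it if long enough.
            if header is not None and sum(map(len, chunks)) >= min_len:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    # The last record has no following header, so handle it here.
    if header is not None and sum(map(len, chunks)) >= min_len:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("test.fasta") for a real file.
    demo = io.StringIO(">short\nACGT\n>long\n" + "A" * 20 + "\n" + "C" * 15 + "\n")
    for name, seq in filter_fasta(demo, min_len=30):
        print(">" + name)
        print(seq)
```

The `>=` comparison mirrors the awk filter above; use `min_len=31` if "above 30" should exclude sequences of exactly 30.
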
ADD REPLYlink modified 8 weeks ago by RamRS25k • written 4.4 years ago by geek_y10k

You can also use faFilter to extract sequences. Other options are also available.

$ ./faFilter

    -v - invert match, select non-matching records.
    -minSize=N - Only pass sequences at least this big.
    -maxSize=N - Only pass sequences this size or smaller.
    -maxN=N Only pass sequences with fewer than this number of N's
    -uniq - Removes duplicate sequence ids, keeping the first.
    -i    - make -uniq ignore case so sequence IDs ABC and abc count as dupes.
ADD REPLYlink modified 8 weeks ago by RamRS25k • written 4.4 years ago by venu6.3k

Thank you, sir.

ADD REPLYlink written 4.4 years ago by akhilvbioinfo140
1
2.9 years ago by
shenwei3565.0k
China
shenwei3565.0k wrote:

Sorting by seq length using seqkit:

$ seqkit sort -l hairpin.fa

Filtering by seq length using seqkit seq:

# before filtering
$ seqkit stat hairpin.fa
file        format  type  num_seqs    sum_len  min_len  avg_len  max_len
hairpin.fa  FASTA   RNA     28,645  2,949,871       39      103    2,354

# length >= 100
$ seqkit seq --min-len 100 hairpin.fa | seqkit stat
file  format  type  num_seqs    sum_len  min_len  avg_len  max_len
-     FASTA   RNA     10,975  1,565,486      100    142.6    2,354
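
The summary columns above are just the count, total, minimum, mean and maximum of the sequence lengths. As a rough illustration only (this is not how seqkit itself is implemented), the same numbers can be computed in plain Python; `length_stats` is a hypothetical helper written for this example.

```python
import io


def length_stats(handle):
    """Return num_seqs, sum_len, min_len, avg_len, max_len for a FASTA stream."""
    lengths = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # start counting a new record
        elif lengths:
            lengths[-1] += len(line)   # wrapped lines add to the current record
    if not lengths:
        return {"num_seqs": 0, "sum_len": 0, "min_len": 0, "avg_len": 0.0, "max_len": 0}
    total = sum(lengths)
    return {
        "num_seqs": len(lengths),
        "sum_len": total,
        "min_len": min(lengths),
        "avg_len": round(total / len(lengths), 1),
        "max_len": max(lengths),
    }


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("hairpin.fa") for a real file.
    demo = io.StringIO(">a\nACGU\nAC\n>b\nA\n>c\nACG\n")
    print(length_stats(demo))
```

Checking these numbers before and after filtering is a quick way to confirm that a length cutoff did what was intended.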

Never worry about installing seqkit (download): it provides executable binary files for Linux/Windows/OS X. Just download, decompress, and use it immediately.

ADD COMMENTlink modified 2.8 years ago • written 2.9 years ago by shenwei3565.0k
1
2.9 years ago by
st.ph.n2.5k
Philadelphia, PA
st.ph.n2.5k wrote:

Here's a quick GUI in Python 3 (requires Biopython):

    #!/usr/bin/env python3

    import sys
    from tkinter import Tk, StringVar, Label, Button, Entry, filedialog
    from Bio import SeqIO

    class App(object):
        def __init__(self):
            self.root = Tk()
            self.root.wm_title("Format Fasta")

            # Path of the input FASTA file, chosen through a file dialog
            self.inp = StringVar(self.root)
            Label(self.root, text="\nPlease provide the FASTA file containing your sequences.").pack()
            Button(self.root, text="FASTA", command=lambda: self.inp.set(filedialog.askopenfilename())).pack()

            # Prefix of the output file (written as <prefix>.fasta)
            self.output = StringVar(self.root)
            Label(self.root, text="\nPlease enter a prefix for your output file.").pack()
            Entry(self.root, textvariable=self.output).pack()
            Label(self.root, text="").pack()

            # Minimum sequence length to keep
            self.request = StringVar(self.root)
            Label(self.root, text="\nPlease enter the min. length of a sequence to keep.").pack()
            Entry(self.root, textvariable=self.request).pack()
            Label(self.root, text="").pack()

            Label(self.root, text="").pack()
            Button(self.root, text="Run", command=self.clickedrun).pack()
            Button(self.root, text="Exit", command=sys.exit).pack()

            self.root.geometry("375x425")
            self.root.mainloop()

        def clickedrun(self):
            # Entry widgets return strings, so convert before comparing lengths
            min_len = int(self.request.get())
            prefix = self.output.get()
            Label(self.root, text="\nKeeping sequences of at least " + str(min_len) + " bp..", fg='blue').pack()
            inpfile = self.inp.get()
            outfile = prefix + '.fasta'
            with open(inpfile) as f, open(outfile, 'w') as out:
                for record in SeqIO.parse(f, "fasta"):
                    # ">=" so that the entered value really is the minimum length kept
                    if len(record.seq) >= min_len:
                        out.write(">" + record.id + "\n" + str(record.seq) + "\n")
            Label(self.root, text="\nDone!", fg='blue').pack()


    App()

Copy and paste the code and save it as a Python file. Click the 'FASTA' button to choose the input FASTA file. There are two entry fields: one for the output file prefix and another for the desired minimum length. Click 'Run', and a new file named <prefix>.fasta is written to the directory the script was started from.

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by st.ph.n2.5k
0
4.4 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

A somewhat contrived way to do it with only Unix tools. Keep sequences longer than 20 and sort them in decreasing order of length:

MIN_LEN=20
INFILE=seqs.fa

awk -v RS=">" 'NR > 1{sub("\n", "\t", $0); gsub("\n", "_", $0); sub("_$", "", $0); print ">"$0}' $INFILE \
| awk -v MIN_LEN=$MIN_LEN -v FS="\t" -v OFS="\t" '{if(length($2) > MIN_LEN) {print $0, length($2)}}' \
| sort -k3,3nr \
| awk -v FS="\t" '{gsub("_", "\n", $2); print $1 "\n" $2}'

(Sequence names must not contain the tab character)

ADD COMMENTlink modified 8 weeks ago by RamRS25k • written 4.4 years ago by dariober10k
0
2.9 years ago by
SMK1.9k
SMK1.9k wrote:

At first thought (FASTX-Toolkit + awk):

fasta_formatter -i input.fasta -t \
  | awk -F $'\t' '{print length($2) "\t" $0}' \
  | sort -k1,1nr \
  | awk -F $'\t' '{print ">" $2 "\n" $3}' \
  | fasta_formatter -w 80
ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by SMK1.9k