Question: How To Filter Multi Fasta By Length??
6
gravatar for HG
4.1 years ago by
HG980
Germany
HG980 wrote:

i save this bit of code in a file called, removesmalls, made it executable “chmod +x”

#!/bin/awk -f
!/^>/ {next}
 {getline s}
length(s) >= i { print $0 “\n” s }

and then i used it with the following command $ awk removesmalls i=200 contigs.fa > contigs-l200.fa

Result :Now its generating the out put file also but without any data

Please help me out.

awk • 14k views
ADD COMMENTlink modified 2.9 years ago by Biostar ♦♦ 20 • written 4.1 years ago by HG980
4

Hi, you can change the record separator RS into > and turn your script into a short one-liner:

awk 'BEGIN {RS = ">" ; ORS = ""} length($2) >= 200 {print ">"$0}' contigs.fa > contigs-l200.fa

If you want the minimum size to be an external variable, use the -v option:

awk -v min="200" 'BEGIN {RS = ">" ; ORS = ""} length($2) >= min {print ">"$0}' contigs.fa > contigs-l200.fa

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Frédéric Mahé2.6k

Thank you so much

ADD REPLYlink written 4.1 years ago by HG980
15
gravatar for Vivek Krishnakumar
4.1 years ago by
Rockville, MD
Vivek Krishnakumar240 wrote:

Your original awk script assumes that the FASTA sequence following the header line spans only one line, which is not normally the case. As suggested by SES above, bioawk is definitely a neat alternative.

A traditional Perl based approach would be like so:

## removesmalls.pl
#!/usr/bin/perl
use strict;
use warnings;

my $minlen = shift or die "Error: `minlen` parameter not provided\n";
{
    local $/=">";
    while(<>) {
        chomp;
        next unless /\w/;
        s/>$//gs;
        my @chunk = split /\n/;
        my $header = shift @chunk;
        my $seqlen = length join "", @chunk;
        print ">$_" if($seqlen >= $minlen);
    }
    local $/="\n";
}

You would then invoke the script like so:

$ perl removesmalls.pl 200 contigs.fasta > contigs-l200.fasta

This solution is making use of the Perl record separator (which is a "\n" newline by default) and switching it to ">" so that when iterating through the FASTA file, we encounter whole FASTA records. This addresses the issue I mentioned above.

ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by Vivek Krishnakumar240

Thank you so much for your assessment ...

ADD REPLYlink written 4.1 years ago by HG980

Hi this script is really helpful. I have run it with my fasta files (Assembled transcriptomes) removing contigs less than 300 and the output files do not have these contigs.

Is there a way to print an output file for contigs with a minimum and maximum lenght for example for 300 to 600 bp. I need to do this task to separate (filter) my fasta files by contig length: 1) to see how many contigs of a specific lenght are (>300, 300 to 600, 600 to 1200, 1200 to 1800,,, etc) 2)These filtered files i will blast against a Uniprot dataset to see how many of these hit off?

Thanks for your help

ADD REPLYlink written 15 months ago by jmaguilarcam0

You can extend the script above to include a maxlen parameter to extract sequences within a range of lengths, like so:

# filterfastarange.pl
#!/usr/bin/perl
use strict;
use warnings;

my $minlen = shift or die "Error: `minlen` parameter not provided\n";
my $maxlen = shift or die "Error: `maxlen` parameter not provided\n";
{
    local $/=">";
    while(<>) {
        chomp;
        next unless /\w/;
        s/>$//gs;
        my @chunk = split /\n/;
        my $header = shift @chunk;
        my $seqlen = length join "", @chunk;
        print ">$_" if($seqlen >= $minlen and $seqlen <= maxlen);
    }
    local $/="\n";
}

You would the invoke this script like so:

perl filterfastarange.pl 301 600 contigs.fasta > contigs-gt300-lte600.fasta

Please note, the if() condition above assumes a closed interval [minlen, maxlen], i.e. sequence length is greater than or equal to minlen and less than or equal to maxlen. So your filter intervals should be defined appropriately: [301, 600], [601, 1200], [1201, 1800] and so on, so as to not cause any overlaps between the consecutive sequence sets.

You could easily extend this script to loop through your desired ranges.

Hope this helps!

ADD REPLYlink modified 14 months ago • written 14 months ago by Vivek Krishnakumar240
3
gravatar for SES
4.1 years ago by
SES7.9k
Vancouver, BC
SES7.9k wrote:

I recommend you try bioawk, which is a modified version of awk that will parse some common sequence formats. In my opinion, there is no reason to mess with trying to parse files if someone has already figured it out and created some easy to use methods. Also, I would argue against writing a script for simple tasks when a single line of code will do the job. Here is a one-liner to get sequences over 50 bp (will take fasta or fastq):

bioawk -c fastx '{ if(length($seq) > 50) { print ">"$name; print $seq }}' some_seq.fastq
ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by SES7.9k

bioawk i tried earlier : I found some problem to install it. Once i am doing like this

$ bioawk -c fastx '{ if(length($seq) > 1000) { print ">"$name; print $seq }}' list4a.fasta 
bioawk: command not found

Then i tried to install bioawk like this

$ make 
make: `awk' is up to date.

I could not solve it . could you please check it once ??

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by HG980

I see, you will need to compile bioawk first, then create a link to awk and name it bioawk. This is not strictly necessary, but I do this so bioawk does not conflict with the system awk (both are named 'awk'). After you type make to compile it, just create a link ln -s awk bioawk and try again. Your shell will not know it's there so you'll have to add it to your PATH, or specify the full path to bioawk.

ADD REPLYlink written 4.1 years ago by SES7.9k
$ ln -s awk bioawk
    ln: failed to create symbolic link `bioawk': File exists
ADD REPLYlink written 4.1 years ago by HG980

There is no file named 'bioawk' in the distribution, which indicates you typed that command more than once. What did you see when you tried to execute the named file that exists?

ADD REPLYlink written 4.1 years ago by SES7.9k

bioawk seems to be a very nice tool!

ADD REPLYlink written 4.1 years ago by sanchezcavani210

and using awk filtering:

bioawk -c fastx '(length($seq) > 50) { print ">"$name"\n"$seq }'
ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by brentp22k
1
gravatar for cts
4.1 years ago by
cts1.5k
Pasadena
cts1.5k wrote:

In the text you give you use typographers quotes (the curly double quotes), which may be stuffing things up also could it be that your missing the -f between awk and removesmalls. Copying your text and using the following command worked for me:

awk -f removesmalls i=200 test.fa >someoutput.fa

also you should be able to do:

./removesmalls i=200 test.fa >someoutput.fa

ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by cts1.5k

awk -f removesmalls i=200 list4a.fasta > list4aout.fasta
i tried with this command but same result it is generating file without any data Here is the link from where i got the code http://bioinfofly.wordpress.com/2012/07/09/binawk-f-nex/ colud you please ckeck once in your system

ADD REPLYlink written 4.1 years ago by HG980

Simply copying the code from your question or from the wordpress article produces an error for me. As I say in my answer above, the text in your question contains typographers quotes (also called smart quotes or curly quotes) around the \n in your print statement. You need to remove them and replace them with normal quotes. This can be simply done by deleting those two characters and replacing them with a standard ". With this modification and with the -f change the code you give works for me

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by cts1.5k
0
gravatar for SRKR
4.1 years ago by
SRKR160
Visakhapatnam
SRKR160 wrote:

Here is a PHP solution for your problem using Regular Expression. I have used a limit of length 30 here, you change it as per your wish. The sequences will be in fasta format in seqs.txt file and the resulting screened sequences will be written to screened_seqs.txt:

<?php
$readfile = "seqs.txt";
$writefile = "screened_seqs.txt";
$screenlen = 30;
$fr = fopen($readfile, "r");
$text = fread($fr, filesize($readfile));
$fw = fopen($writefile, "w");
$ntext = preg_replace("/\>[^\n]{1,}\n[ATGCatgc\n]{1,".($screenlen-1)."}\>/", ">", $text);
fwrite($fw, $ntext);
fclose($fr);
fclose($fw);
?>

Hope this helps you...

ADD COMMENTlink written 4.1 years ago by SRKR160

Thanks a lot.....

ADD REPLYlink written 4.1 years ago by HG980
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 643 users visited in the last hour