How to filter .fasta file based on conditional statement
4
1
Entering edit mode
3 months ago
rob_DNA ▴ 20

Hi all,

I have a .fasta file resulting from vsearch clustering. The sequences in the .fasta file look like:

>centroid=211650b5-4541-47e4-a7a4-3659962f9818;seqs=2236   
GAGATGATGATGATATAATT

the "seqs" parameter in the sequence header, reflects the number of reads of that cluster consensus that was present in the original input file.

I now want to remove sequences that have a value of "seqs" below a certain threshold. (for example 10) I want to use a conditional statement for this, but I cannot seem to find software that can be used for this. I checked things like SeqKit and Seqtk, but these only allow for regular expression filtering. I also find it hard to use bash/awk, as it is .fasta format.

I'd need something like (in pseudocode):

 for sequence in fasta:
           if seqs < value:
           remove sequence

How could I filter based on a conditional statement? Thanks!

abundance filtering fasta vsearch • 870 views
ADD COMMENT
4
Entering edit mode
3 months ago

Your post inspired me to add this functionality into BBTools (in the next release). I've been using custom code in individual tools to filter by cutoffs in header terms like "count=123", "score=0.57", or "cov=37.2", but a more generalized approach would have been better.

Edit: OK! Now released in v39.06:

reformat.sh in=x.fasta out=filtered.fasta tag=seqs= delimiter=; minvalue=10

Other relevant flags:

tag=                    Look for this tag in the header to filter by the next value.  To filter reads
                        with a header like 'foo,depth=5.5,bar' where you only want depths
                        of at least 3, the necessary flags would be 'tag=depth= minvalue=3 delimiter=,'
delimiter=              Character after the end of the value, such as delimiter=X.  Control and
                        whitespace symbols may be spelled out, like delimiter=tab or delimiter=pipe.
                        The tag may contain the delimiter.  If the value is the last term in the header,
                        the delimiter doesn't matter but is still required.
minvalue=               If set, only accept a numeric value of at least this.
maxvalue=               If set, only accept a numeric value of at most this.
value=                  If set, only accept a string value of exactly this.
invertfilters=f         (invert) Output failing reads instead of passing reads.

For completeness, here are the control character delimiters you can spell out when they become problematic (some of them are difficult to escape):

    space
    tab
#   pound
>   greaterthan
<   lessthan
=   equals
:   colon
;   semicolon
!   bang
&   and or ampersand
"   quote or doublequote
'   singlequote or apostrophe
\   backslash
^   hat or caret
$   dollar
.   dot
|   pipe or or
?   questionmark
*   star or asterisk
+   plus
(   openparen
)   closeparen
[   opensquare
{   opencurly
ADD COMMENT
0
Entering edit mode

Thank you very much @Brian Bushnell for this addition to BBTools! Based on all replies, I am quite surprised that such a tool did not exist yet.

I am getting an error however when I run following line: reformat.sh in=input.fasta out=output.fasta tags=seqs= delimiter=; minvalue=10

error =unknown parameter tag=seqs= Exception in thread "main" java.lang.AssertionError: Unknown parameter tag=seqs= at jgi.ReformatReads.<init>(ReformatReads.java:228) at jgi.ReformatReads.main(ReformatReads.java:52)

the entries in the input fasta file look like:

>centroid=40352020-a4fc-4473-9c18-ef20ed476c2d;seqs=2501 AAGAAATTTAATAATTTTGAAAATGGATTTTTTTTTTTGTTTTGGCAAGAGCATGAGAGCTTTTACTGGGCAAGAAGACAAGAGATGGAGAGTCCAGCCGGGCCTGCGCTTAAGTGCGCGGTCTTGCTAGGCTTGTAAGTTTCTTTCTTGCTATTCCAA

I downloaded BBMap_39.06.tar.gz

ADD REPLY
0
Entering edit mode

Looks like we each made a mistake here. First off, the flag is "tag", not "tags". Second, "delimiter=;" doesn't work because it's a control character. So you can do either of these, which I just tested:

reformat.sh in=input.fasta out=output.fasta tag=seqs= delimiter=semicolon minvalue=10

or

reformat.sh in=input.fasta out=output.fasta tag=seqs= delimiter="\;" minvalue=10
ADD REPLY
0
Entering edit mode

First of all, sorry for the late reply Brian.

I still get the same error when I run the commands above:

 reformat.sh in=test_fasta.fasta out=test.fasta tag=seqs=
 delimiter=semicolon minvalue=10 java -ea -Xms300m -cp
 /home/robbert/miniconda3/opt/bbmap-39.01-1/current/ jgi.ReformatReads
 in=test_fasta.fasta out=test.fasta tag=seqs= delimiter=semicolon
 minvalue=10 Executing jgi.ReformatReads [in=test_fasta.fasta,
 out=test.fasta, tag=seqs=, delimiter=semicolon, minvalue=10]

 Unknown parameter tag=seqs= Exception in thread "main"
 java.lang.AssertionError: Unknown parameter tag=seqs=  at
 jgi.ReformatReads.<init>(ReformatReads.java:228)   at
 jgi.ReformatReads.main(ReformatReads.java:52)

my java version =

 java --version openjdk 11.0.15-internal 2022-04-19 OpenJDK Runtime
 Environment (build 11.0.15-internal+0-adhoc..src) OpenJDK 64-Bit
 Server VM (build 11.0.15-internal+0-adhoc..src, mixed mode)

I now made a test_fasta.fasta file which headers look like on the image, and which only contains the 3 sequences displayed in the image: enter image description here

is there something I'm missing?

ADD REPLY
0
Entering edit mode

You're using BBTools v39.01. I just added the new flag in 39.06 :)

ADD REPLY
0
Entering edit mode

Oh wow that was a bit stupid.. I downloaded 39.06 but I accidentally used 39.01... anyway, thank you very much for your help!! I now used the correct version and it works like a charm :-)

ADD REPLY
3
Entering edit mode
3 months ago
tshtatland ▴ 180

Use SeqKit to convert fasta to tsv and back, and use Perl/awk to filter like so:

seqkit fx2tab inp.fasta | perl -F'\t' -lane '%val = map { split /=/ } split /;\s*/, $F[0]; print if ( $val{seqs} > 1 and $val{centroid} ne "bar" );' | seqkit tab2fx > out.fasta

In this example I filter on seqs and centroid fields.

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/\t/' : Split into @F on TAB, rather than on whitespace.

Use conda to install seqkit or many other common bioinformatics packages. For example:

conda install --channel bioconda seqkit

or

conda create --channel bioconda --name seq seqkit

See also:

ADD COMMENT
1
Entering edit mode
3 months ago

assuming the 'seq' is always the last item in the fasta header;

awk -F '='  '/^>/ {X=int($NF)< 12345;} {if(X==1) print;}' in.fasta
ADD COMMENT
0
Entering edit mode
3 months ago
size_t ▴ 120

try this perl script in bash shell:

perl -e 'open F,shift; $/=">"; <F>; while(<F>){chomp; $_ =~ m/seqs=(\d+)/; next if $1 < 10; print ">$_"; } close F;' your.fa

ADD COMMENT

Login before adding your answer.

Traffic: 1059 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6