Filter Fasta using regexp on header
7.7 years ago
sacha ★ 2.4k

I would like to filter my FASTA file using a regexp on the header. For example, keep only the sequences where size != 0:

>A1;size=43
ACGTATATATATATATATAT
>A1;size=21
ACGTATATATATATATATAT
>A1;size=4
ACGTATATATATATATATAT
>A1;size=0
ACGTATATATATATATATAT
>A1;size=14
ACGTATATATATATATATAT
7.7 years ago
sacha ★ 2.4k

Just found an incredible tool!

https://github.com/bcthomas/pullseq

I did the task with the following command (the pattern is quoted so the shell does not try to expand the brackets as a glob):

pullseq -i test.fasta -g 'size=[^0]'
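Before filtering, it can help to check which headers a pattern actually selects. The sketch below is only an illustration: `matching_headers` is a hypothetical helper name, and it uses plain `grep -E`, whose regex flavour may differ from pullseq's for more complex patterns. Note that `size=[^0]` only tests the first character after `size=`, so it assumes sizes are written without leading zeros.

```shell
# Hypothetical helper: list the headers in a FASTA file that match a regular expression.
# $1 = FASTA file, $2 = extended regular expression (as understood by grep -E)
matching_headers() {
    # First keep only header lines, then apply the pattern to them.
    grep '^>' "$1" | grep -E -- "$2"
}
```

For example, `matching_headers test.fasta 'size=[^0]'` prints the headers that the pattern would keep.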
7.7 years ago
Sej Modha 5.3k

If the file is not too big and each sequence is stored on a single line rather than folded to a fixed width, you can use GNU grep to select the headers with a non-zero size together with the line that follows each of them.

grep -A1 --no-group-separator '^>.*size=[^0]' file.fa
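The grep approach relies on every record having exactly one sequence line. If the sequences are folded across several lines, a small awk filter is one alternative. This is a minimal sketch, not a tested tool: `filter_nonzero_size` is a hypothetical function name, and it assumes the header ends in `size=N` exactly as in the question (no trailing `;` or whitespace).

```shell
# Hypothetical helper: print only the records whose header does not end in "size=0".
# Works for both single-line and folded (multi-line) FASTA.
# $1 = FASTA file
filter_nonzero_size() {
    awk '
        /^>/ { keep = ($0 !~ /size=0$/) }   # on each header, decide whether to keep the record
        keep                                # print header and sequence lines while keep is true
    ' "$1"
}
```

For example, `filter_nonzero_size file.fa > filtered.fa` writes the kept records to a new file. The decision is made once per header and then applied to every following sequence line, so line folding does not matter.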
7.7 years ago

You can use pyfaidx for this:

$ pip install pyfaidx
$ faidx -g 'size=[^0]' test.fasta
7.7 years ago

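Here's a SeqKit solution using the grep subcommand. The -v flag inverts the match, so records whose ID ends in size=0 are dropped.

seqkit grep -v -r -p 'size=0$' seq.fa

It's ultra fast, and SeqKit has many more functions worth exploring.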
