Filtering fasta file based on identifier
9.4 years ago
jyu429 ▴ 120

Hi, I have a fasta file with many segments and I want to filter out all the segments that have a "P" in the identifier of the segment. Is there a conventional way to do so? Thanks.

fasta filter • 4.1k views

Thank you!

9.4 years ago
Ram 43k
bioawk -c fastx '$name ~ /P/ { print ">"$name; print $seq }' <sequences.fa

If you wanna take all except those with a "P",

bioawk -c fastx '$name !~ /P/ { print ">"$name; print $seq }' <sequences.fa

bioawk is available from its GitHub repository (lh3/bioawk).


Neat. I've not found a use for bioawk before but this seems perfect.


It kinda clicked out of the blue for me yesterday. Now I'm gonna add this to my arsenal of regular-use tools :-)


How can I give transcriptome.fasta and headerlist.txt to this command?


What are those two files?

9.4 years ago

just awk

awk '/^>/{N=0} /^>P/{N=1} {if(N)print}' *.fa

Maybe

/^>\S*P\S*/

To match identifiers (up to the first space) that contain P rather than just identifiers that start with P.
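
As far as I know, `\S` is a GNU awk extension, so the pattern above may not work in other awk implementations. A minimal sketch of the same idea without it, assuming the identifier is the first whitespace-delimited field of the header line; the function name `filter_ids_with_p` and the sample records are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: keep FASTA records whose identifier contains "P".
# The identifier is taken as the header text after ">" up to the first blank.
filter_ids_with_p() {
    awk '
        /^>/ {
            id = substr($1, 2)           # $1 is ">identifier"; drop the ">"
            keep = (index(id, "P") > 0)  # 1 if the identifier contains P
        }
        keep                             # print header and sequence lines of kept records
    ' "$@"
}

# Made-up sample data for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>seq1 Protein-coding
ACGT
>segP2 example
GGCC
TTAA
>Pseg3
AAAA
EOF

# seq1 is dropped: the "P" is in its description, not in its identifier.
filter_ids_with_p "$sample"
rm -f "$sample"
```

Changing the bare `keep` pattern to `!keep` should give the opposite selection, that is, every record whose identifier has no "P".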


Would this not print only headers, Pierre?


No. Without a 'next' statement, awk goes on to test every remaining pattern against the current line.
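
A small sketch of that behaviour, for anyone who wants to see it. The function names `show_fallthrough` and `show_next` and the two-line input are made up for the demo; the first program has the same rule layout as the one-liner in the answer, the second shows what `next` changes:

```shell
#!/bin/sh
# Without "next", every rule whose pattern matches runs for the same line.
show_fallthrough() {
    awk '
        /^>/  { print "rule 1 (any header): " $0 }
        /^>P/ { print "rule 2 (header starting with P): " $0 }
              { print "rule 3 (no pattern): " $0 }
    '
}

# With "next", awk skips the remaining rules and moves to the next input line.
show_next() {
    awk '
        /^>/ { print "header: " $0; next }
             { print "sequence: " $0 }
    '
}

# The header line triggers all three rules; the sequence line only rule 3.
printf '>P1\nACGT\n' | show_fallthrough

# Each line is now handled by exactly one rule.
printf '>P1\nACGT\n' | show_next
```

This is why the one-liner prints sequence lines too: the final `{if(N)print}` block has no pattern, so it runs for every line, headers included.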


Oops, I read it wrong. I read it as the if(N) print being in the same {} as the N=1. My bad!


But where does the output go? Sorry for my ignorance.


It goes to "standard out" ("stdout"), which is the terminal by default. You can redirect it to a file like:

awk '/^>/{N=0} /^>P/{N=1} {if(N)print}' *.fa > out.fa