Here is a one-liner that requires seqtk (https://github.com/lh3/seqtk) and pcregrep. It first puts each multi-line FASTA sequence onto a single line (seqtk seq -l0), then puts each FASTA header and its sequence on the same line separated by a tab (paste - -), next keeps only the lines with an AED between 0.00 and 0.49 (pcregrep --buffer-size 3000000000 " AED:0.[0-4][0-9]"), and finally splits the header and sequence back into two lines by converting the tab into a newline (tr '\t' '\n').
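To see what each stage does, here is a small sketch that runs the same steps on a toy file. The two records and their headers are made up for illustration, and the awk function linearize is only a hypothetical stand-in for seqtk seq -l0 so that the example runs without seqtk installed. It also uses plain grep rather than pcregrep.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)

# Toy input: two made-up records; the first sequence is wrapped over two lines.
printf '%s\n' \
  '>gene1-mRNA-1 protein AED:0.12 eAED:0.12' \
  'MKT' \
  'LLV' \
  '>gene2-mRNA-1 protein AED:0.75 eAED:0.75' \
  'MAAA' > "$workdir/toy.fasta"

# Hypothetical stand-in for `seqtk seq -l0`: one header line, then one sequence line.
# Assumes the first line of the file is a header.
linearize() {
  awk '/^>/ { if (NR > 1) print seq; print; seq = ""; next }
       { seq = seq $0 }
       END { if (NR > 0) print seq }' "$1"
}

# Same stages as the one-liner: linearize, join header and sequence with a tab,
# keep AED 0.00-0.49, then split header and sequence back onto two lines.
linearize "$workdir/toy.fasta" \
  | paste - - \
  | grep -e " AED:0.[0-4][0-9]" \
  | tr '\t' '\n' > "$workdir/toy.filtered.fasta"

cat "$workdir/toy.filtered.fasta"
```

Only the gene1 record should survive, with its sequence on one line. The leading space in the pattern is what stops it from matching the eAED field.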
--buffer-size is not available in all versions of pcregrep (I am using pcregrep version 8.38 2015-11-23 on Ubuntu 16.04). The --buffer-size option might be needed for very long transcripts, since after paste each record sits on a single long line.
Here is the complete command:
seqtk seq -l0 OryPal1.all.fasta.all.maker.proteins.fasta | paste - - | pcregrep --buffer-size 3000000000 " AED:0.[0-4][0-9]" | tr '\t' '\n' > OryPal1.all.fasta.all.maker.proteins.0-0.49_AED.fasta
OryPal1.all.fasta.all.maker.proteins.fasta is the output from MAKER's
grep -e " AED:0.[0-4][0-9]" also works in place of pcregrep --buffer-size 3000000000 " AED:0.[0-4][0-9]".
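One small caveat about the pattern itself: the dot after the 0 is unescaped, so it matches any character, not just a literal period. With typical MAKER headers this is unlikely to matter, but a stricter variant escapes it. The sketch below compares the two on made-up header lines; the variable names are just for illustration.

```shell
# Pattern as used above (dot matches any character) and a stricter variant.
pattern_loose=' AED:0.[0-4][0-9]'
pattern_strict=' AED:0\.[0-4][0-9]'

# Two made-up headers: one real-looking AED value, one with a stray character.
headers=$(printf '%s\n' '>a AED:0.25' '>b AED:0x25')

# Count the matching lines for each pattern.
loose_count=$(printf '%s\n' "$headers" | grep -c -e "$pattern_loose")
strict_count=$(printf '%s\n' "$headers" | grep -c -e "$pattern_strict")

echo "loose: $loose_count, strict: $strict_count"   # prints "loose: 2, strict: 1"
```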
16 months ago by jean.elbers