I have a fasta file with 396 protein sequences (the header on one line and the amino acid codes on several lines). Some of these sequences contain ambiguous or exceptional amino acid codes (e.g., B, J, O, U, Z, X, --). I want to remove the sequences containing such codes and generate a new fasta file. How can I do this in the Ubuntu terminal? Thanks in advance.
If you don't mind ending up with a fasta file where each sequence is on a single line, you could give the following a try:
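perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' <your_file> | paste - - | grep -vP '\t.*[BJOUZX-]' | tr "\t" "\n"
This will first join each sequence onto a single line
perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' , then put each header and its sequence on one line (tab separated)
paste - - , then remove the lines whose sequence contains any of the characters you don't want
grep -vP '\t.*[BJOUZX-]' , and finally split header and sequence back onto two lines
tr "\t" "\n"
The -P (Perl regex) flag matters here: without it GNU grep does not interpret \t as a tab, so the match would not be restricted to the sequence column. The character class includes X and - because they are in your list; drop them from the class if you want to keep such sequences.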