How to remove sequences containing ambiguous amino acid codes from a FASTA file?
2.3 years ago
arriyaz.nstu ▴ 30

I have a FASTA file with 396 protein sequences (each header on one line, followed by the amino acid codes across several lines). Some of these sequences contain ambiguous or exceptional amino acid codes (e.g., B, J, O, U, Z, X, -- ). I want to remove every sequence that contains such a code and write the rest to a new FASTA file. How can I do this in the Ubuntu terminal? Thanks in advance.

sequence • 1.5k views

I'm not sure what your goal is, but simply removing the ambiguous amino acids from a sequence is likely not the best idea, as it will change or destroy the overall context of that protein.

Replacing the ambiguous ones with X, for instance, should work. Normally none of the tools that deal with protein sequences should have a problem with X as an "amino acid".


I want to remove every whole sequence that contains an ambiguous amino acid code.


I will use these sequences for population conservancy analysis in the Immune Epitope Database. Sequences containing ambiguous amino acids cause errors during this analysis. I don't know whether replacing the ambiguous codes with X will work, as I haven't tried that approach. Could you also suggest how I can replace the ambiguous amino acids with X?


Building on the one-liner Pierre Lindenbaum provided below:

sed '/^[^>]/s/[BJOUZ]/X/g' in.fa  > out.fa

This replaces every occurrence of B, J, O, U, or Z on the sequence lines with an X; header lines (those starting with >) are left untouched.

Removing the whole sequence makes sense as well, but will require different code.
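To see the substitution on a small input before running it on the real file, the same sed expression can be wrapped in a function and fed a toy record. This is only a sketch: the function name mask_ambiguous and the sample record are made up for illustration, and the character class is the one from the command above (extend it if other codes should be masked too).

```shell
# Hypothetical helper: mask ambiguous residue codes with X.
# Reads FASTA on stdin, writes the masked FASTA on stdout.
mask_ambiguous() {
  # /^[^>]/ restricts the substitution to lines that do not start with ">",
  # so letters inside header lines are never rewritten.
  sed '/^[^>]/s/[BJOUZ]/X/g'
}

# Toy record (invented for this demo): the header contains B, O and Z on purpose.
# Prints the header unchanged, then MKXLLX and AXXX.
printf '>seq1 BOZ protein\nMKBLLZ\nAUJO\n' | mask_ambiguous
```

On the real data the call would be mask_ambiguous < in.fa > out.fa, which is equivalent to the command above.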


Thank you for your explanation. I am trying this code as well. Could you please suggest code to remove the whole sequence?

2.3 years ago

If you don't mind ending up with a FASTA file where each sequence is on a single line, you could give the following a try:

perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' <your_file> | paste - - | grep -vP "\t.*[BJOUZ]" | tr "\t" "\n"

This first puts each sequence on a single line (perl -pe '$. > 1 and /^>/ ? print "\n" : chomp'), then joins each header and its sequence into one tab-separated line (paste - -), then drops the lines whose sequence contains any of the unwanted characters (grep -vP "\t.*[BJOUZ]"; the -P option is needed so that GNU grep reads \t as a tab rather than a literal t), and finally splits header and sequence back into two lines (tr "\t" "\n"). Extend the bracket expression, e.g. to [BJOUZX-], if X and gap characters should be rejected as well.
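If the original multi-line layout should be preserved, one alternative is to buffer each record in awk and print it only when none of its sequence lines contains a rejected code. This is a minimal sketch rather than a tested pipeline for the asker's data: the function name filter_ambiguous is hypothetical, and the character class [BJOUZX-] is an assumption taken from the codes listed in the question (uppercase only).

```shell
# Hypothetical helper: drop every FASTA record whose sequence contains
# an ambiguous residue code. Reads FASTA on stdin, writes the surviving
# records, with their original line breaks, on stdout.
filter_ambiguous() {
  awk '
    # Print the buffered record only if a header was seen and no
    # sequence line of that record matched the rejected codes.
    function emit() {
      if (header != "" && keep) printf "%s", record
    }
    /^>/ {
      emit()               # finish the previous record, if any
      header = $0
      record = $0 "\n"     # start buffering the new record
      keep = 1
      next
    }
    {
      record = record $0 "\n"
      # Assumed set of rejected codes; "-" is last so it is literal.
      if ($0 ~ /[BJOUZX-]/) keep = 0
    }
    END { emit() }         # the last record has no following header
  '
}
```

It would be called as filter_ambiguous < in.fa > out.fa. Comparing grep -c '^>' in.fa with grep -c '^>' out.fa shows how many records were removed.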


Thank you very much.
