Question: How to remove sequences containing ambiguous amino acid codes from a FASTA file?
12 days ago, arriyaz.nstu wrote:

I have a FASTA file with 396 protein sequences (each header on one line, followed by the amino acid codes on several lines). Some of these sequences contain ambiguous or exceptional amino acid codes (e.g., B, J, O, U, Z, X, -- ). I want to remove the sequences containing such codes and generate a new FASTA file. How can I do this in the Ubuntu terminal? Thanks in advance.

modified 12 days ago by Pierre Lindenbaum • written 12 days ago by arriyaz.nstu

Not sure what your goal is, but simply removing those ambiguous AAs from the sequence is likely not the best idea, as you will change or destroy the overall context of that protein.

Replacing the ambiguous ones with X, for instance, should work. Normally none of the tools dealing with protein sequences should have a problem with X as an "amino acid".

written 12 days ago by lieven.sterck

I want to remove every whole sequence that contains an ambiguous AA code.

written 12 days ago by arriyaz.nstu

I will use these sequences for population conservancy analysis in the Immune Epitope Database. Sequences containing ambiguous AA codes cause errors during this analysis. I don't know whether replacing the ambiguous codes with X will work, as I haven't tried that approach. Could you also suggest how I can replace ambiguous AAs with X?

written 12 days ago by arriyaz.nstu

Building on the one-liner Pierre Lindenbaum provided below:

sed '/^[^>]/s/[BJOUZ]/X/g' in.fa  > out.fa

This will replace all occurrences of B, J, O, U and Z with an X on the sequence lines only; header lines (those starting with >) are left untouched.
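To see the effect of the substitution, here is a minimal sketch on a toy file. The file names, the temporary directory and the two records are made up for illustration and are not from the original data; the sed command itself is the one shown above.

```shell
# Toy input (hypothetical records, for illustration only)
tmpdir=$(mktemp -d)
cat > "$tmpdir/in.fa" <<'EOF'
>seq1 BZ-tagged header stays untouched
MKTBZAA
LLUOJ
>seq2
MKTAYIAK
EOF

# On every line that does not start with ">", replace B, J, O, U, Z with X
sed '/^[^>]/s/[BJOUZ]/X/g' "$tmpdir/in.fa" > "$tmpdir/out.fa"

cat "$tmpdir/out.fa"
```

The header of seq1 keeps its B and Z, while the sequence lines become MKTXXAA and LLXXX. Note that the match is case-sensitive, so lowercase residues would not be replaced.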

Removing the whole sequence makes sense as well, but will require different code.

written 12 days ago by lieven.sterck

Thank you for your explanation. I am trying this code as well. Could you please suggest code to remove the whole sequence?

written 12 days ago by arriyaz.nstu
12 days ago, lieven.sterck (VIB, Ghent, Belgium) wrote:

If you don't mind ending up with a FASTA file where each sequence is on a single line, you could give the following a try:

perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' <your_file> | paste - - | grep -vP '\t.*[BJOUZ]' | tr "\t" "\n"

This first puts each sequence on a single line (perl -pe '$. > 1 and /^>/ ? print "\n" : chomp'), then joins header and sequence into one tab-separated line (paste - -), then drops the lines whose sequence part contains the characters you don't want (grep -vP '\t.*[BJOUZ]'), and finally splits header and sequence back onto two lines (tr "\t" "\n"). The -P option is needed because GNU grep's default regex syntax does not treat \t as a tab, so without it the pattern would match a literal "t" in the header instead.
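If the original line wrapping should be preserved instead, a rough alternative is to buffer each record in awk and print it only when none of its sequence lines contains an unwanted code. This is a sketch, not part of the answer above: the function name emit, the variable names and the toy records are assumptions made for illustration.

```shell
# Toy input (hypothetical records, for illustration only)
tmpdir=$(mktemp -d)
cat > "$tmpdir/in.fa" <<'EOF'
>seq1 Zeta chain
MKTAYIAK
QRQISFVK
>seq2
MKTBYIAK
QRQISFVK
>seq3
MKTAYIAK
QRUISFVK
>seq4 clean
ACDEFGHIK
EOF

awk '
  # Print the buffered record unless it was flagged as bad
  function emit() { if (header != "" && !bad) printf "%s", record }
  # Header line: output the previous record, then start a new buffer
  /^>/ { emit(); header = $0; record = $0 "\n"; bad = 0; next }
  # Sequence line: append it and flag the record if it has an unwanted code
  { record = record $0 "\n"; if ($0 ~ /[BJOUZ]/) bad = 1 }
  END { emit() }
' "$tmpdir/in.fa" > "$tmpdir/out.fa"

cat "$tmpdir/out.fa"
```

Only seq1 and seq4 survive here, with their line breaks intact; the Z in the seq1 header is ignored because only sequence lines are tested. To also drop sequences with X or gap characters, the bracket expression could be extended to [BJOUZX-]. Running grep -c '^>' on the input and output files gives a quick count of how many records were removed.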

written 12 days ago by lieven.sterck

Thank you very much.

written 12 days ago by arriyaz.nstu