Question: Remove entire sequences that include specific letters
Jason wrote (2.2 years ago):

Hey all,

I need help with a shell command or Perl script.

I have 400 sequences, and some of them contain X, B, or Z.

I want to remove every sequence that contains X, B, or Z from the FASTA file.

I found this shell command, but it only removes the letters themselves from the sequences:

sed '/^[^>]/s/[XZB]//g' input_file.fasta > output_file.fasta

My goal is to remove any sequence that includes these letters.

Each of my sequences is on a single line.

For example:

>sp|Q9M7X9|CITRX_ARATH Thioredoxin-like protein CITRX, chloroplastic OS=Arabidopsis thaliana OX=3702 GN=CITRX PE=1 SV=1
MALVQSRTFPHLNTPLSPILSSLHAPSSLFIXREIRPVAAPXXSSTAGNLPFSPLTRPRKLLCPPPRGKFVREDYLVKKLSAQELQELVKGDRKVPLIVDFYATWCGPCILMAQELEMLAVEYESNAIIVKVDTDDEYEFARDMQVRGLPTLFFISPDPSKDAIRTEGLIPLQMMHDIIDNEM
>sp|P22217|TRX1_YEAST Thioredoxin-1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX1 PE=1 SV=3
MVTQFKTASEFDSAIAQDKLVVVDFYATWCGPCKMIAPMIEKFSEQYPQADFYKLDVDELGDVAQKNEVSAMPTLLLFKNGKEVAKVVGANPAAIKQAIAANA
>sp|P22803|TRX2_YEAST Thioredoxin-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX2 PE=1 SV=3
MVTQLKSASEYDSALASGDKLVVVDFFATWCGPCKMIAPMIEKFAEQYSDAAFYKLDVDEVSDVAQKAEVSSMPTLIFYKGGKEVTRVVGANPAAIKQAIASNV
>sp|Q99MD6|TRXR3_MOUSE Thioredoxin reductase 3 OS=Mus musculus OX=10090 GN=Txnrd3 PE=1 SV=3
MEKPPSPPPPPRAQTSPGLGKVGVLPNRRLGAVRGGLMSBBRRARLASPGTSRPSSEAREELRRRLRDLIEGNRVMIFSKSYCPHSTRVKELFSSLGVVYNILELDQVDDGASVQEVLTEISNQKTVPNIFV

The result should drop every sequence that includes X, B, or Z:

>sp|P22217|TRX1_YEAST Thioredoxin-1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX1 PE=1 SV=3 
MVTQFKTASEFDSAIAQDKLVVVDFYATWCGPCKMIAPMIEKFSEQYPQADFYKLDVDELGDVAQKNEVSAMPTLLLFKNGKEVAKVVGANPAAIKQAIAANA
>sp|P22803|TRX2_YEAST Thioredoxin-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX2 PE=1 SV=3 
MVTQLKSASEYDSALASGDKLVVVDFFATWCGPCKMIAPMIEKFAEQYSDAAFYKLDVDEVSDVAQKAEVSSMPTLIFYKGGKEVTRVVGANPAAIKQAIASNV

Will you remove sequences containing J?

    A   Ala Alanine
    B   Asx Aspartic acid or Asparagine [2]
    C   Cys Cysteine
    D   Asp Aspartic Acid
    E   Glu Glutamic Acid
    F   Phe Phenylalanine
    G   Gly Glycine
    H   His Histidine
    I   Ile Isoleucine
    J       Isoleucine or Leucine [4]
    K   Lys Lysine
    L   Leu Leucine
    M   Met Methionine
    N   Asn Asparagine
    O       pyrrolysine [6]
    P   Pro Proline
    Q   Gln Glutamine
    R   Arg Arginine
    S   Ser Serine
    T   Thr Threonine
    U   Sec selenocysteine [5,6]
    V   Val Valine
    W   Trp Tryptophan
    Y   Tyr Tyrosine
    Z   Glx Glutamine or Glutamic acid [2]
    X   unknown amino acid
    .   gaps
    *   End
Reference:
    1. http://www.bioinformatics.org/sms/iupac.html
    2. http://www.dnabaser.com/articles/IUPAC%20ambiguity%20codes.html
    3. http://www.bioinformatics.org/sms2/iupac.html
    4. http://www.matrixscience.com/blog/non-standard-amino-acid-residues.html
    5. http://www.sbcs.qmul.ac.uk/iupac/AminoAcid/A2021.html#AA21
    6. https://en.wikipedia.org/wiki/Amino_acid
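
To decide which of these codes matter for a given file, it can help to tally them first. The following is a minimal sketch, not part of the original comment: the function name count_ambiguous is hypothetical, the letter set B, J, O, U, X, Z is taken from the table above, and only uppercase letters are counted.

```shell
# Tally ambiguous / non-standard residue codes in the sequence lines of a
# FASTA file. Header lines (starting with ">") are skipped, so letters in
# names such as "CITRX_ARATH" are not counted.
count_ambiguous() {
    awk '
        BEGIN { n = split("B J O U X Z", code, " ") }
        /^>/  { next }                      # ignore header lines
        {
            for (i = 1; i <= n; i++)        # gsub returns the number of matches
                count[code[i]] += gsub(code[i], code[i])
        }
        END {
            for (i = 1; i <= n; i++)
                if (count[code[i]] > 0) print code[i], count[code[i]]
        }
    ' "$1"
}
```

Usage: count_ambiguous input_file.fasta prints one "letter count" pair per line for each code that occurs at least once.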

(Comment by shenwei356, written 2.2 years ago)
JC wrote (2.2 years ago):

You don't need a substitution, search for a match in your sequence:

#!/usr/bin/perl

use strict;
use warnings;

$/ = "\n>"; # Read Fasta sequences in blocks
while (<>) {
    s/^>//; s/>\z//; # strip only the record delimiters, not '>' inside headers
    my ($seq_id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq;
    next if ($seq =~ m/[ZXB]/); # skip sequences with Z, X or B
    print ">$_";
}

Usage: perl removeSeqs.pl < FASTA_IN > FASTA_OUT
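
For single-line FASTA like the example in the question, the same match-and-skip idea can also be sketched in awk. This is only an illustration under that assumption (one sequence line directly after each header); the function name filter_fasta is hypothetical, and the match is case-sensitive, as in the Perl script.

```shell
# Drop every FASTA record whose sequence contains X, B or Z.
# Assumes each sequence sits on a single line right after its header.
filter_fasta() {
    awk '
        /^>/ { header = $0; next }              # remember the header, print nothing yet
        NF && !/[XBZ]/ { print header; print }  # clean sequence line: emit header + sequence
    ' "$1"
}
```

Usage: filter_fasta input_file.fasta > output_file.fasta

Because the header is only stored and never tested, an X, B or Z in a sequence name (for example CITRX_ARATH) does not cause a record to be dropped.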

shenwei356 wrote (2.2 years ago):

Try seqkit grep:

seqkit grep -i -s -r -p '[zxb]' -v

# cat test.fa | seqkit grep --ignore-case --by-seq --use-regexp --pattern '[zxb]' --invert-match

#  seqkit grep -i -s -p z -p x -p b -v