Question: remove entire sequences that include specific letters
Jason wrote (15 days ago):

Hey All

I need help with a shell command or a Perl script.

I have 400 sequences, and some of them contain X, B, or Z.

I want to remove every sequence that contains X, B, or Z from a FASTA file, header included.

I found this shell command, but it only deletes those letters from the sequence lines:

sed '/^[^>]/s/[XBZ]//g' input_file.fasta > output_file.fasta

My goal, however, is to remove any sequence that includes these letters.

Each of my sequences is on a single line.

For example:

>sp|Q9M7X9|CITRX_ARATH Thioredoxin-like protein CITRX, chloroplastic OS=Arabidopsis thaliana OX=3702 GN=CITRX PE=1 SV=1
MALVQSRTFPHLNTPLSPILSSLHAPSSLFIXREIRPVAAPXXSSTAGNLPFSPLTRPRKLLCPPPRGKFVREDYLVKKLSAQELQELVKGDRKVPLIVDFYATWCGPCILMAQELEMLAVEYESNAIIVKVDTDDEYEFARDMQVRGLPTLFFISPDPSKDAIRTEGLIPLQMMHDIIDNEM
>sp|P22217|TRX1_YEAST Thioredoxin-1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX1 PE=1 SV=3
MVTQFKTASEFDSAIAQDKLVVVDFYATWCGPCKMIAPMIEKFSEQYPQADFYKLDVDELGDVAQKNEVSAMPTLLLFKNGKEVAKVVGANPAAIKQAIAANA
>sp|P22803|TRX2_YEAST Thioredoxin-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX2 PE=1 SV=3
MVTQLKSASEYDSALASGDKLVVVDFFATWCGPCKMIAPMIEKFAEQYSDAAFYKLDVDEVSDVAQKAEVSSMPTLIFYKGGKEVTRVVGANPAAIKQAIASNV
>sp|Q99MD6|TRXR3_MOUSE Thioredoxin reductase 3 OS=Mus musculus OX=10090 GN=Txnrd3 PE=1 SV=3
MEKPPSPPPPPRAQTSPGLGKVGVLPNRRLGAVRGGLMSBBRRARLASPGTSRPSSEAREELRRRLRDLIEGNRVMIFSKSYCPHSTRVKELFSSLGVVYNILELDQVDDGASVQEVLTEISNQKTVPNIFV

The result should drop every sequence that includes X, B, or Z:

>sp|P22217|TRX1_YEAST Thioredoxin-1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX1 PE=1 SV=3 
MVTQFKTASEFDSAIAQDKLVVVDFYATWCGPCKMIAPMIEKFSEQYPQADFYKLDVDELGDVAQKNEVSAMPTLLLFKNGKEVAKVVGANPAAIKQAIAANA
>sp|P22803|TRX2_YEAST Thioredoxin-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX2 PE=1 SV=3 
MVTQLKSASEYDSALASGDKLVVVDFFATWCGPCKMIAPMIEKFAEQYSDAAFYKLDVDEVSDVAQKAEVSSMPTLIFYKGGKEVTRVVGANPAAIKQAIASNV
sequencing • 108 views
modified 14 days ago by shenwei356 • written 15 days ago by Jason

Will you remove sequences containing J?

    A   Ala Alanine
    B   Asx Aspartic acid or Asparagine [2]
    C   Cys Cysteine
    D   Asp Aspartic Acid
    E   Glu Glutamic Acid
    F   Phe Phenylalanine
    G   Gly Glycine
    H   His Histidine
    I   Ile Isoleucine
    J       Isoleucine or Leucine [4]
    K   Lys Lysine
    L   Leu Leucine
    M   Met Methionine
    N   Asn Asparagine
    O       pyrrolysine [6]
    P   Pro Proline
    Q   Gln Glutamine
    R   Arg Arginine
    S   Ser Serine
    T   Thr Threonine
    U   Sec selenocysteine [5,6]
    V   Val Valine
    W   Trp Tryptophan
    Y   Tyr Tyrosine
    Z   Glx Glutamine or Glutamic acid [2]
    X   unknown amino acid
    .   gaps
    *   End
Reference:
    1. http://www.bioinformatics.org/sms/iupac.html
    2. http://www.dnabaser.com/articles/IUPAC%20ambiguity%20codes.html
    3. http://www.bioinformatics.org/sms2/iupac.html
    4. http://www.matrixscience.com/blog/non-standard-amino-acid-residues.html
    5. http://www.sbcs.qmul.ac.uk/iupac/AminoAcid/A2021.html#AA21
    6. https://en.wikipedia.org/wiki/Amino_acid
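
Before deciding which codes to filter on, it may help to see which letters outside the 20 standard residues actually occur in the file. The following is only a rough sketch: the function name count_nonstandard is made up for illustration, and the letter set B, J, O, U, X, Z is taken from the table above. It reads FASTA on standard input, skips header lines, and prints one "letter count" pair per line.

```shell
# count_nonstandard: hypothetical helper name, not part of any tool.
# Tallies B, J, O, U, X and Z in sequence lines only (headers are skipped).
count_nonstandard() {
    awk '
        /^>/ { next }                        # ignore header lines
        {
            line = toupper($0)               # count lower-case residues too
            for (i = 1; i <= length(line); i++) {
                c = substr(line, i, 1)
                if (c ~ /[BJOUXZ]/) count[c]++
            }
        }
        END { for (c in count) print c, count[c] }
    ' | sort
}
```

Usage, with the file name from the question: count_nonstandard < input_file.fasta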

written 14 days ago by shenwei356
JC (Mexico) wrote (15 days ago):

You don't need a substitution, search for a match in your sequence:

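#!/usr/bin/perl

use strict;
use warnings;

$/ = "\n>"; # Read FASTA records in blocks, one record per iteration
while (<>) {
    chomp;      # drop the trailing "\n>" record separator
    s/^>//;     # drop the leading ">" (only present on the first record)
    my ($seq_id, @seq) = split (/\n/, $_);
    next unless defined $seq_id;  # skip empty blocks
    my $seq = join "", @seq;
    next if ($seq =~ m/[ZXB]/); # skip sequences with Z, X or B
    print ">$seq_id\n", map { "$_\n" } @seq; # header, then the sequence lines as read
}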

Usage: perl removeSeqs.pl < FASTA_IN > FASTA_OUT
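
Since the question also asked for a shell command, the same record-by-record idea can be sketched with awk. This is a minimal sketch, not a tested replacement for the Perl script: the function name drop_ambiguous is made up for illustration, the match is case-sensitive like the Perl version, and multi-line sequences are joined onto one line in the output.

```shell
# drop_ambiguous: hypothetical helper name. Reads FASTA on stdin and
# prints only the records whose sequence contains no X, B or Z.
drop_ambiguous() {
    awk '
        # Print the buffered record if its sequence is free of X, B and Z.
        function emit() {
            if (header != "" && seq !~ /[XBZ]/) printf "%s\n%s\n", header, seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }  # new record starts
        { seq = seq $0 }                              # collect sequence lines
        END { emit() }                                # flush the last record
    '
}
```

Usage, with the file names from the question: drop_ambiguous < input_file.fasta > output_file.fasta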

shenwei356 (China) wrote (14 days ago):

Try seqkit grep (usage).

seqkit grep -i -s -r -p '[zxb]' -v

# cat test.fa | seqkit grep --ignore-case --by-seq --use-regexp --pattern '[zxb]' --invert-match

#  seqkit grep -i -s -p z -p x -p b -v