Question: remove entire sequences that include specific letters
Jason wrote (18 months ago):

Hey All

I need help with a shell command or Perl script.

I have 400 sequences, and some of them contain X, B, or Z.

I want to remove every sequence that contains X, B, or Z from the FASTA file.

I found this shell command, but it only deletes those letters from the sequences:

sed '/^[^>]/s/[XZB]//g' input_file.fasta > output_file.fasta

My goal is to remove every sequence that includes these letters.

Each of my sequences is on a single line.

For example:

>sp|Q9M7X9|CITRX_ARATH Thioredoxin-like protein CITRX, chloroplastic OS=Arabidopsis thaliana OX=3702 GN=CITRX PE=1 SV=1
MALVQSRTFPHLNTPLSPILSSLHAPSSLFIXREIRPVAAPXXSSTAGNLPFSPLTRPRKLLCPPPRGKFVREDYLVKKLSAQELQELVKGDRKVPLIVDFYATWCGPCILMAQELEMLAVEYESNAIIVKVDTDDEYEFARDMQVRGLPTLFFISPDPSKDAIRTEGLIPLQMMHDIIDNEM
>sp|P22217|TRX1_YEAST Thioredoxin-1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX1 PE=1 SV=3
MVTQFKTASEFDSAIAQDKLVVVDFYATWCGPCKMIAPMIEKFSEQYPQADFYKLDVDELGDVAQKNEVSAMPTLLLFKNGKEVAKVVGANPAAIKQAIAANA
>sp|P22803|TRX2_YEAST Thioredoxin-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX2 PE=1 SV=3
MVTQLKSASEYDSALASGDKLVVVDFFATWCGPCKMIAPMIEKFAEQYSDAAFYKLDVDEVSDVAQKAEVSSMPTLIFYKGGKEVTRVVGANPAAIKQAIASNV
>sp|Q99MD6|TRXR3_MOUSE Thioredoxin reductase 3 OS=Mus musculus OX=10090 GN=Txnrd3 PE=1 SV=3
MEKPPSPPPPPRAQTSPGLGKVGVLPNRRLGAVRGGLMSBBRRARLASPGTSRPSSEAREELRRRLRDLIEGNRVMIFSKSYCPHSTRVKELFSSLGVVYNILELDQVDDGASVQEVLTEISNQKTVPNIFV

The result should drop every sequence that contains X, B, or Z:

>sp|P22217|TRX1_YEAST Thioredoxin-1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX1 PE=1 SV=3 
MVTQFKTASEFDSAIAQDKLVVVDFYATWCGPCKMIAPMIEKFSEQYPQADFYKLDVDELGDVAQKNEVSAMPTLLLFKNGKEVAKVVGANPAAIKQAIAANA
>sp|P22803|TRX2_YEAST Thioredoxin-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX2 PE=1 SV=3 
MVTQLKSASEYDSALASGDKLVVVDFFATWCGPCKMIAPMIEKFAEQYSDAAFYKLDVDEVSDVAQKAEVSSMPTLIFYKGGKEVTRVVGANPAAIKQAIASNV

Will you remove sequences containing J?

    A   Ala Alanine
    B   Asx Aspartic acid or Asparagine [2]
    C   Cys Cysteine
    D   Asp Aspartic acid
    E   Glu Glutamic acid
    F   Phe Phenylalanine
    G   Gly Glycine
    H   His Histidine
    I   Ile Isoleucine
    J   Xle Isoleucine or Leucine [4]
    K   Lys Lysine
    L   Leu Leucine
    M   Met Methionine
    N   Asn Asparagine
    O   Pyl Pyrrolysine [6]
    P   Pro Proline
    Q   Gln Glutamine
    R   Arg Arginine
    S   Ser Serine
    T   Thr Threonine
    U   Sec Selenocysteine [5,6]
    V   Val Valine
    W   Trp Tryptophan
    Y   Tyr Tyrosine
    Z   Glx Glutamine or Glutamic acid [2]
    X   Xaa Unknown amino acid
    .   gaps
    *   End
References:
    1. http://www.bioinformatics.org/sms/iupac.html
    2. http://www.dnabaser.com/articles/IUPAC%20ambiguity%20codes.html
    3. http://www.bioinformatics.org/sms2/iupac.html
    4. http://www.matrixscience.com/blog/non-standard-amino-acid-residues.html
    5. http://www.sbcs.qmul.ac.uk/iupac/AminoAcid/A2021.html#AA21
    6. https://en.wikipedia.org/wiki/Amino_acid

Comment written 18 months ago by shenwei356.
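
Before deciding which letters to filter on, it may help to see which non-standard codes actually occur in the file. The sketch below tallies every character on the sequence lines that is not one of the 20 standard uppercase amino-acid letters. The function name `count_nonstandard` is hypothetical, and the sketch assumes a grep that supports `-o` (GNU and BSD grep do; POSIX does not require it).

```shell
# Hypothetical helper: tally characters outside the 20 standard amino acids.
# Anything that is not an uppercase standard letter is reported, so lowercase
# residues, "*" and "." show up in the tally as well.
count_nonstandard() {
    grep -v '^>' "$1" |                      # drop header lines
        grep -o '[^ACDEFGHIKLMNPQRSTVWY]' |  # one non-standard character per line
        sort | uniq -c                       # count occurrences of each character
}

# Usage: count_nonstandard input_file.fasta
```

Each output line is a count followed by the character it refers to, so letters such as J, U, or O would be visible here if they are present.
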
JC wrote (18 months ago):

You don't need a substitution; just test each sequence for a match and skip the record if it hits:

#!/usr/bin/perl

use strict;
use warnings;

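$/ = "\n>"; # Read FASTA records in blocks, split at each "\n>"
while (<>) {
    s/^>//;   # strip the leading ">" of the first record
    s/>\z//;  # strip the ">" left over from the record separator
    my ($seq_id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq;
    next if ($seq =~ m/[ZXB]/); # skip sequences with Z, X or B
    print ">$_";
}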
}

Usage: perl removeSeqs.pl < FASTA_IN > FASTA_OUT
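
Since the question also asks for a shell command, the same match-and-skip idea can be sketched with awk. This is a minimal sketch rather than a tested replacement for the Perl script: the function name `filter_fasta` is hypothetical, and it assumes uppercase residues. It tests only the sequence lines, which matters here because the example headers themselves contain X, B, and Z (for instance `CITRX_ARATH`).

```shell
# Hypothetical helper: print only the FASTA records whose sequence
# contains none of X, B or Z. Headers are never tested.
filter_fasta() {
    awk '
        # Print the buffered record if its sequence is free of X, B and Z.
        function emit() {
            if (header != "" && seq !~ /[XBZ]/) printf "%s\n%s", header, body
        }
        /^>/ { emit(); header = $0; seq = ""; body = ""; next }  # new record starts
        { seq = seq $0; body = body $0 "\n" }                    # collect sequence lines
        END { emit() }                                           # flush the last record
    ' "$1"
}

# Usage: filter_fasta input_file.fasta > output_file.fasta
```

Buffering each record before printing means the sketch also works when a sequence is wrapped over several lines, not only for the one-line layout in the question.
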

shenwei356 wrote (18 months ago):

Try seqkit grep (see its usage documentation).

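seqkit grep -i -s -r -p '[zxb]' -v input_file.fasta > output_file.fasta

# The same options in long form, reading from stdin:
# cat test.fa | seqkit grep --ignore-case --by-seq --use-regexp --pattern '[zxb]' --invert-match

# Alternative without a regular expression, one pattern per letter:
# seqkit grep -i -s -p z -p x -p b -v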
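
Whichever approach is used, the result can be sanity-checked with plain grep. The sketch below assumes nothing about the filtering tool; the helper names `count_records` and `is_clean` are hypothetical. Comparing the record count before and after shows how many sequences were dropped, and `is_clean` confirms that no sequence line still contains X, B, or Z.

```shell
# Hypothetical helper: print the number of FASTA records (header lines) in a file.
count_records() {
    grep -c '^>' "$1"
}

# Hypothetical helper: succeed only if no sequence line contains X, B or Z.
# Header lines are excluded first, because they may legitimately contain these letters.
is_clean() {
    ! grep -v '^>' "$1" | grep -q '[XBZ]'
}

# Usage:
#   count_records input_file.fasta
#   count_records output_file.fasta
#   is_clean output_file.fasta && echo "no X, B or Z left"
```
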