Question: select a specific length from protein file
Jason0 wrote:

Hey everyone, I would like to do this with a shell command or a Perl script.

I have 200 protein sequences of different lengths.

I want to keep only the sequences shorter than 310 residues. Any sequence with a length of 310 or more should not be saved to the new file.

For example, here I have three proteins with different lengths. All sequences are in one line.

>sp|P01892|1A02_HUMAN HLA class I histocompatibility antigen, A-2 alpha chain OS=Homo sapiens OX=9606 GN=HLA-A PE=1 SV=1
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGILVLFGAVITGAVVAAVMWRRKSSDRKGGSYSQAASSDSAQGSDVSLTACKV

>sp|P61916|NPC2_HUMAN NPC intracellular cholesterol transporter 2 OS=Homo sapiens OX=9606 GN=NPC2 PE=1 SV=1
MRFLAATFLLLALSTAAQAEPVQFKDCGSVDGVIKEVNVSPCPTQPCQLSKGQSYSVNVT
FTSNIQSKSSKAVVHGILMGVPVPFPIPEPDGCKSGINCPIQKDKTYSYLNKLPVKSEYP
SIKLVVEWQLQDDKNQSLFCWEIPVQIVSHL

>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVL
TAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKR
TRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCESDLRHY
YDSTIELCVGDPEIKKTSFKGDSGGPLVCNKVAQGIVSYGRNNGMPPRACTKVSSFVHWI
KKTMKRY

Result:

>sp|P61916|NPC2_HUMAN NPC intracellular cholesterol transporter 2 OS=Homo sapiens OX=9606 GN=NPC2 PE=1 SV=1
MRFLAATFLLLALSTAAQAEPVQFKDCGSVDGVIKEVNVSPCPTQPCQLSKGQSYSVNVT
FTSNIQSKSSKAVVHGILMGVPVPFPIPEPDGCKSGINCPIQKDKTYSYLNKLPVKSEYP
SIKLVVEWQLQDDKNQSLFCWEIPVQIVSHL

>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVL
TAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKR
TRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCESDLRHY
YDSTIELCVGDPEIKKTSFKGDSGGPLVCNKVAQGIVSYGRNNGMPPRACTKVSSFVHWI
KKTMKRY
modified 8 weeks ago by JC8.0k • written 8 weeks ago by Jason0

Take a look at seqkit (https://bioinf.shenwei.me/seqkit/usage/). You should be able to find a tool that will do what you need.
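As a minimal sketch of what that could look like, assuming seqkit is installed and on your PATH: its `seq` subcommand has a `--max-len` (`-M`) option that keeps sequences no longer than the given value, so "shorter than 310" becomes a maximum of 309. The function name `filter_shorter_than` is only an illustrative wrapper, not part of seqkit.

```shell
# Print only the FASTA records shorter than a cutoff, using seqkit.
# $1 = length cutoff (exclusive), $2 = input FASTA file
filter_shorter_than() {
    # --max-len is inclusive, so subtract 1 to get "strictly shorter than"
    seqkit seq --max-len "$(( $1 - 1 ))" "$2"
}
```

Usage would be something like `filter_shorter_than 310 proteins.fasta > filtered.fasta`, where both file names are placeholders.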

written 8 weeks ago by genomax70k

Asked from time to time: How To Filter Multi Fasta By Length??

There are other posts with solutions; please search the site in case the ones from the link above (or genomax's suggestion) don't suit your needs.

written 8 weeks ago by h.mon26k

JC8.0k wrote:
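#!/usr/bin/perl

use strict;
use warnings;

my $max_size = 310; # keep only sequences strictly shorter than this
$/ = "\n>";         # read the FASTA file one record at a time

while (<>) {
    chomp;          # drop the trailing "\n>" record separator
    s/^>//;         # the first record still carries its leading ">"
    s/\s+\z//;      # drop trailing newlines and blank lines
    my ($header, @seq) = split (/\n/, $_);
    next unless defined $header; # skip empty records
    my $seq = join ("", @seq);
    $seq =~ s/\s+//g;            # ignore stray whitespace inside the sequence
    print ">$_\n" if ( length($seq) < $max_size );
}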

Usage:

perl filterBySize.pl < FASTA_IN > FASTA_OUT
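
If you would rather stay in the shell, here is a minimal awk sketch of the same idea. It assumes a plain FASTA file in which every record starts with a ">" header line; the function name `filter_fasta_by_length` is hypothetical. Unlike the Perl script, it writes each kept sequence on a single line.

```shell
# Print only the FASTA records whose sequence is shorter than a cutoff.
# $1 = length cutoff (exclusive), $2 = input FASTA file
filter_fasta_by_length() {
    awk -v max="$1" '
        # Print the record collected so far if it is short enough.
        function emit() {
            if (header != "" && length(seq) < max) print header "\n" seq
        }
        # A header line closes the previous record and starts a new one.
        /^>/ { emit(); header = $0; seq = ""; next }
        # Any other line is sequence: strip whitespace and append it.
        { gsub(/[ \t\r]/, ""); seq = seq $0 }
        # The last record has no following header, so handle it here.
        END { emit() }
    ' "$2"
}
```

For example: `filter_fasta_by_length 310 FASTA_IN > FASTA_OUT`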
written 8 weeks ago by JC8.0k