Question: Extract specific fasta sequences from multiple multi-fasta files located in a directory using Perl.
0
gravatar for MB
7 months ago by
MB20
MB20 wrote:

I have a fasta sequence in a file called "seq.fasta", which I want to search into multiple multi-fasta files located in a directory "/home/fasta_sequences" and print all the matching sequences to a file "output.fasta".

seq.fasta file looks like:

>KNH123.1_gallus_gallus
NDEAHPWKPLRPGDIRGPCPGLNTLASHGYLPRNGVATPVQIINAVQEGLNFDNQAAVFA
TYAAHLVDGNLITDLLSIGRKTRLTGPDPPPPASVGGLNEHGTFEGDASMTRGDAFFGNN
HDFNETLFEQLVD

/fasta_sequences/ directory looks like:

Homo_sapiens.AG_v1.pep.all.fa
Aspergillus_aculeatus_atcc_16872.Aspac1.pep.all.fa
.
.

* Homo_sapiens.AG_v1.pep.all.fa* file looks like:

    >OJJ94681_Homo_sapiens
    MKITPIPVGVVGENWANIHDPVERRRAQNRIAQRGYRRRLRKGEERNTAQASAATLPSPP
    QQQHDQVLPPPPLSLPPPLQSYSLHPPTPPTLPSDVSPYTSPADPLMFTGEPWDPPLSLD
    VLFPDPVAYSAYPSPEPTSSGSPTLTSTTEGRSGPAHGHTALPLALPLPDLLHEPTPGLT
    HRAALHGHEWTPLHLAARQGHTRVVQLLLTSDVHHRGAVNSVLAVLLRAGGDPSLPDSAG
    RTTRRGETAFHLAVQARQHAVVRVLLEHPASYYSVNAQDCWGRTPLHLACESNQQELVEM
    LILAGAQLDIRDFEDQTPLHSACLGG
    >OJJ94682_Homo_sapiens
    AQERKARLYAVFGGQGNDEHYFEELRTLWQTYRPLVEELIIPASSLLQRESMGGGSSRFY
    HNGLDALSWLEHPERTPASAYLIGAPISMPLIGLLQLAWYRVVGRVAGVTPAQLQTALAG
    TTGHSQGILPATVIAVAQSWTQFDALALDALRVLLAIGACSQDQFAVAQLPPAIVANAEA
    HGEGFPGPMLSVADFPVAQVRSYVADVNAHLPEDARIEIALINGPHNIVLAGPPLSLHGF
    NVWLRAVKAPASGVQHRIPFSQRRPEIRNRFLPITAAFHSSYLAEVVPEVLARVPGLTIS
    GEQLRIPVYCTRTGLDLRTRGAANLVPELVAMITEQPLVWPKATQFPGATHILAFGPEGS
    AGIGSLTNRLKEGTGVRTILASGGHHSHSARAEQLGDRS
.
.

I know how to search for a string in multiple files

 #!/usr/bin/perl
    {
      local @ARGV = glob "/home/fasta_sequences/*.fa";
      while (<>) {
        print "$ARGV:$.:$_" if /TTHACVGSYYLSEDQKKLMNIFFTYRP/;
      } continue {
        close ARGV if eof;
      }
    }

But I am unable to find a way to do the same with a complete fasta sequence.

Any help would be appreciated!

perl fasta • 324 views
ADD COMMENTlink modified 7 months ago by h.mon24k • written 7 months ago by MB20
1

Hello MB,

you haven't just have the problem to find the sequence in another file. You also have to find out in which sequence it is contained if you have multiple sequences in one file.

So my question is if you realy want/need to use perl or if something like standard unix tools or e.g. seqkit is also suitable?

fin swimmer

ADD REPLYlink written 7 months ago by finswimmer11k

Thanks for your answer. I will try to use grep command from seqkit.

ADD REPLYlink written 7 months ago by MB20
2
gravatar for Dattatray Mongad
7 months ago by
National Centre for Cell Science, Pune
Dattatray Mongad260 wrote:

You can use faSomeRecords.


faSomeRecords - Extract multiple fa records

usage:

faSomeRecords in.fa listFile out.fa

options:

-exclude - output sequences not in the list file.

in.fa:- input fasta file

listFile:- file containg headers

ADD COMMENTlink modified 7 months ago by finswimmer11k • written 7 months ago by Dattatray Mongad260

Hello doctor.dee005,

OP wants to look up given sequence in other files. But faSomeRecords is looking for sequence names, isn't it?

fin swimmer

ADD REPLYlink written 7 months ago by finswimmer11k

yes, faSomeRecords searches for headers only.

ADD REPLYlink written 7 months ago by Dattatray Mongad260

Thanks, that helped!

ADD REPLYlink written 7 months ago by MB20
2
gravatar for h.mon
7 months ago by
h.mon24k
Brazil
h.mon24k wrote:

Do you have to use Perl? CDHIT has an utility which does (almost) what you want:

CD-HIT-2D compares 2 protein datasets (db1, db2). It identifies the sequences in db2 that are similar to db1 at a certain threshold. The input are two protein datasets (db1, db2) in fasta format and the output are two files: a fasta file of proteins in db2 that are not similar to db1 and a text file that lists similar sequences between db1 & db2.

ADD COMMENTlink written 7 months ago by h.mon24k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1035 users visited in the last hour