Question

Help Needed To Extract Sequence Information From A Database

0

10.9 years ago

redspider19800915 ▴ 40

Help needed for a simple question:
I have a local database containing millions of sequences in FASTA format as follows:

>leaf_1
AAGACCATTCGAGCTTATCTCTTC
>leaf_2
ATGGAGAAGGAAATGAAGAGCAGT
>leaf_3
TGGCTGTAAGTCATACCTGTCA
>leaf_4
CGCGGAGTAGATCAGTTTGGTA
>leaf_5
AGTAACGGCTTTACAAGAATCAAA
......

I now have a query list of sequences of interest, and I need to extract their sequences from the above database. For example, I need the leaf_2, leaf_4 and leaf_5 sequences to be retrieved and output in tab-delimited format as follows:

>leaf_2    ATGGAGAAGGAAATGAAGAGCAGT
>leaf_4    CGCGGAGTAGATCAGTTTGGTA
>leaf_5    AGTAACGGCTTTACAAGAATCAAA

Could anyone provide a Perl script for me? Thanks a lot!

perl data database • 1.8k views

ADD COMMENT • link updated 10.9 years ago by Michael 54k • written 10.9 years ago by redspider19800915 ▴ 40

3

http://whathaveyoutried.com/

ADD REPLY • link 10.9 years ago by Pierre Lindenbaum 161k

0

I think this should be the first thing anyone on Biostars reads before asking a question, as it encourages them to try first and ask second.

ADD REPLY • link 10.9 years ago by Medhat 9.7k

score 0 · Answer 1 · 2013-05-22

Citing the link Pierre has given: "This is not problem solving, and software engineering/bioinformatics(mine) is entirely about problem solving."

Here is your script:

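#!/usr/bin/env perl
use strict;
use warnings;

## assumption (not in the original post): the select list and the FASTA
## file are given on the command line, in that order
my ($select_file, $fasta_file) = @ARGV;
die "Usage: $0 <select_list> <fasta_file>\n" unless defined $fasta_file;

my %select = ();

## now read your select list into that hash
## we are using a hash, because that is the fastest way to look up things
open(my $SELECT, '<', $select_file) or die "Cannot open $select_file: $!\n";
while (my $id = <$SELECT>) {
    $id =~ s/^\s*>?|\s+$//g;    ## strip whitespace and an optional leading '>'
    $select{$id} = 1 if length $id;
}
close $SELECT;

## now parse the fasta file
open(my $FASTA, '<', $fasta_file) or die "Cannot open $fasta_file: $!\n";
my $print_it = 0;               ## true while we are inside a selected record
while (my $line = <$FASTA>) {
    chomp $line;
    ## for every line that
    ## starts with a '>', use a regex to get the identifier
    if ($line =~ /^>(\S+)/) {
        print "\n" if $print_it;    ## finish the previous selected record
        ## check if key exists in %select
        $print_it = exists $select{$1};
        print ">$1\t" if $print_it;
    }
    elsif ($print_it) {
        print $line;                ## sequence, possibly spread over several lines
    }
}
print "\n" if $print_it;
close $FASTA;

__END__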
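
The approach outlined in the comments above (load the wanted IDs into a hash, then scan the FASTA records and print the ones whose ID is a key) can be tried on the sample data from the question without any files. This is only a minimal sketch: the helper name `extract_selected` is hypothetical, and it assumes that the identifier is the first word after `>` and that the whole FASTA text fits in memory, which would not hold for a database with millions of sequences.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): takes a hash reference of
# wanted IDs and a FASTA string, returns one ">id<TAB>sequence" line per hit.
sub extract_selected {
    my ($wanted, $fasta_text) = @_;
    my @out;
    my ($id, $seq);

    # Store the record collected so far if its ID is a key of the lookup hash.
    my $flush = sub {
        push @out, ">$id\t$seq" if defined $id && exists $wanted->{$id};
    };

    for my $line (split /\r?\n/, $fasta_text) {
        if ($line =~ /^>(\S+)/) {    # header: ID is the first word after '>'
            $flush->();              # close the previous record
            ($id, $seq) = ($1, '');
        }
        elsif (defined $id) {        # sequence line; records may span lines
            $line =~ s/\s+//g;
            $seq .= $line;
        }
    }
    $flush->();                      # the last record has no following header
    return @out;
}

# Sample data copied from the question.
my %select = map { $_ => 1 } qw(leaf_2 leaf_4 leaf_5);
my $fasta = <<'END_FASTA';
>leaf_1
AAGACCATTCGAGCTTATCTCTTC
>leaf_2
ATGGAGAAGGAAATGAAGAGCAGT
>leaf_3
TGGCTGTAAGTCATACCTGTCA
>leaf_4
CGCGGAGTAGATCAGTTTGGTA
>leaf_5
AGTAACGGCTTTACAAGAATCAAA
END_FASTA

print "$_\n" for extract_selected(\%select, $fasta);
```

Run on the sample, this should print the leaf_2, leaf_4 and leaf_5 records, one per line, with a tab between the header and the sequence. For the real database, reading the file line by line instead of holding it in a string is the safer choice.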