Help Needed To Extract Sequence Information From A Database
10.9 years ago

Help needed with a simple question:
I have a local database containing millions of sequences in FASTA format, as follows:

>leaf_1
AAGACCATTCGAGCTTATCTCTTC
>leaf_2
ATGGAGAAGGAAATGAAGAGCAGT
>leaf_3
TGGCTGTAAGTCATACCTGTCA
>leaf_4
CGCGGAGTAGATCAGTTTGGTA
>leaf_5
AGTAACGGCTTTACAAGAATCAAA
......

I now have a query list of selected sequences of interest, for which I need to extract the sequence information from the above database. For example, I need the leaf_2, leaf_4 and leaf_5 sequences to be retrieved and output in tab-delimited format as follows:

>leaf_2    ATGGAGAAGGAAATGAAGAGCAGT
>leaf_4    CGCGGAGTAGATCAGTTTGGTA
>leaf_5    AGTAACGGCTTTACAAGAATCAAA

Could anyone provide a Perl script for me? Thanks a lot!

perl data database • 1.8k views

I think this should be given to everyone on Biostars to read before asking a question, as it would help them to try first and ask second.

10.9 years ago
Michael 54k

Citing the link Pierre has given: "This is not problem solving, and software engineering/bioinformatics(mine) is entirely about problem solving."

Here is your script:

#!/usr/bin/env perl
use strict;
use warnings;

my %select = ();

## now read your select list into that hash
## we are using a hash, because that is the fastest way to look up things
while (<SELECT>) {
    .....
}
## now parse the fasta file
while (<FASTA>) {
    ## for every line that starts with a '>',
    ## use a regex to get the identifier
    ## check if key exists in %select
    if (exists...) {
        print $_;
    }
}

__END__
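
For readers who want to see the outline above filled in, here is one possible sketch, not a definitive implementation. It assumes the query list holds one identifier per line (with or without a leading '>') and that the identifier is the text between '>' and the first whitespace of each header. The subroutine names read_select and extract_selected, and the file names in the usage comment, are hypothetical and chosen only for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Read one identifier per line into a hash for fast lookups.
sub read_select {
    my ($fh) = @_;
    my %select;
    while ( my $line = <$fh> ) {
        $line =~ s/^\s+|\s+$//g;    # trim whitespace, including the newline
        $line =~ s/^>//;            # tolerate a leading '>'
        next if $line eq '';
        $select{$line} = 1;
    }
    return \%select;
}

# Walk the FASTA data once and return "header<TAB>sequence" strings
# for every record whose identifier is a key of the hash.
sub extract_selected {
    my ( $fh, $select ) = @_;
    my @out;
    my $keep = 0;
    while ( my $line = <$fh> ) {
        $line =~ s/\s+$//;          # strip newline and trailing whitespace
        next if $line eq '';
        if ( $line =~ /^>(\S+)/ ) {
            $keep = exists $select->{$1};
            push @out, ">$1\t" if $keep;
        }
        elsif ($keep) {
            $out[-1] .= $line;      # also joins sequences wrapped over several lines
        }
    }
    return @out;
}

# Usage (hypothetical file names): perl extract.pl ids.txt database.fasta
if ( @ARGV == 2 ) {
    open my $sel_fh,   '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $fasta_fh, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    my $select = read_select($sel_fh);
    print "$_\n" for extract_selected( $fasta_fh, $select );
}
```

Only the query list is held in memory; the database is streamed line by line, which should matter with millions of sequences. Records come out in database order, not in the order of the query list.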