Perl Script For Retrieval Of Gene Sequence (In Fasta Format) From Ncbi
4
1
Entering edit mode
13.6 years ago
Vars22 ▴ 20

Hello everybody,

My question is how I can retrieve a gene sequence by providing the geneID as user input from NCBI and then store it in an Excel sheet.

thank you,

regards: vars22

sequence retrieval fasta perl • 14k views
ADD COMMENT
9
Entering edit mode

Something does not make sense to me here. If you want to store the sequence in an Excel sheet (which I think is a bad idea to start with), why would you want the sequence to be in FASTA format?

ADD REPLY
6
Entering edit mode

Answer to second part of question: you almost certainly do not want to store sequence data in Excel. You can prove me wrong by explaining the sequence analysis functionality in Excel that you plan to use.

ADD REPLY
3
Entering edit mode

what did you try so far ?

ADD REPLY
0
Entering edit mode

Additionally, if the goal is to retrieve sequences based on a certain user input, why is it crucial that this is done in Perl?

ADD REPLY
0
Entering edit mode

Might be an idea to output to .csv? Excel can still read it then and easier to parse programmatically?

ADD REPLY
0
Entering edit mode

i only want the sequences if it will be in fasta format i can easily extract the sequences.i have geneIDs with me of which i need sequences. i have used a module lwp and set the path of ncbi but i the result always shows connection timed out.

ADD REPLY
0
Entering edit mode

thanks for the help i m grateful

ADD REPLY
0
Entering edit mode

plz suggest me something in perl not bioperl. the bioperl link you have sent me i have already been there.thanks for the other links.

ADD REPLY
10
Entering edit mode
13.6 years ago
Peter 6.0k

For the first part, use the NCBI Entrez interface (perhaps via BioPerl since you want to use Perl).

For the second part about Excel, As others have said too, my answer is don't do it. For one set of reasons, see: Zeeberg et al (2004) Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.

ADD COMMENT
2
Entering edit mode

++ for the reference :)

ADD REPLY
4
Entering edit mode
13.6 years ago

Take a look at these links?

http://www.bioperl.org/wiki/HOWTO:Getting_Genomic_Sequences

http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#esummary_-.3E_efetch

Outputting to .csv would be better, if you really want to use Excel to view these. I have used Bio::DB::Fasta and Text::CSV_XS for similar stuff in the past!

You could get the sequence using Bio::DB::EUtilities as described here. I would build a Fasta sequence object from that using Bio::Seq before writing it to a Fasta file using Bio::SeqIO.


If you are determined to look at Excel then these links might be useful?

http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO/excel.html

http://www.ibm.com/developerworks/linux/library/l-pexcel/

Also, some stuff around page 567 in Bioinformatics: sequence and genome analysis (2nd Ed)

ADD COMMENT
0
Entering edit mode
13.6 years ago
Julien ▴ 160

NCBI provides a script-generation website at:

http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi

Follow the directions on the screen, and it will create a perl script for you. It's verbose and comes with a lot of disclaimers, but it is very easy to modify to your needs.

ADD COMMENT
0
Entering edit mode
13.6 years ago
Vars22 ▴ 20

The code for retrieving gene sequence using gene IDs

#!/usr/bin/perl
use Bio::DB::EUtilities;
use Bio::SeqIO;

system(cls);

print "Enter the input file name: "; #ask user for the file name containing the gene IDs
$filename=[?];                   #take the file name as input
open(INPUT,$filename) or die "Cannot open input file";
@arrayids=[?];                   #store the Gene IDs in an array
close INPUT;

foreach $ea(@arrayids)
{
    chomp($ea);
}

$len=scalar(@arrayids);

for($i=0;$i<$len;$i++)
{
    $fi=$arrayids[$i];

    $file_nam[$i]=join('',$fi,".gbe"); #joins the genbank extension with the gene ID

    open(FILE,">$file_nam[$i]") or die "cannot open $file_nam....";

    my @ids=$arrayids[$i];
    my $eutil   = Bio::DB::EUtilities->new(-eutil => 'esummary',
                                           -email => 'user@xyz.com',
                                           -db    => 'gene',
                                           -id    => \@ids);

    my $fetcher = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                           -db      => 'nucleotide',
                                           -rettype => 'fasta');#returns fasta #sequence you can change the format acc to ur needs

    while (my $docsum = $eutil->next_DocSum) {
        my ($item) = $docsum->get_Items_by_name('GenomicInfoType');

        my %item_data = map {$_ => 0} qw(ChrAccVer ChrStart ChrStop);

        while (my $sub_item = $item->next_subItem) {
            if (exists $item_data{$sub_item->get_name}) {
                $item_data{$sub_item->get_name} = $sub_item->get_content;
            }
        }

        for my $check (qw(ChrAccVer ChrStart ChrStop)) {
            die "$check not set" unless $item_data{$check};
        }

        my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
        printf("Retrieving %s, from %d-%d, strand %d\n", $item_data{ChrAccVer},
                                                        $item_data{ChrStart},
                                                        $item_data{ChrStop},
                                                        $strand
                                                        );

        $fetcher->set_parameters(-id => $item_data{ChrAccVer},
                                 -seq_start => $item_data{ChrStart} + 1,
                                 -seq_stop  => $item_data{ChrStop} + 1,
                                 -strand    => $strand);
        print FILE $fetcher->get_Response->content;
    }
    close FILE;
}

for($i=0;$i<$len;$i++)
{
    $fii=$arrayids[$i];
    $old_file_nam[$i]=join('',$fii,".gbe"); 
    $new_file_nam[$i]=join('',$fii,".fasta");
    $in  = Bio::SeqIO->new(-file => "$file_nam[$i]" , '-format' => 'GenBank');
    $out = Bio::SeqIO->new(-file => ">$new_file_nam[$i]" , '-format' => 'Fasta');

    while ( my $seq = $in->next_seq() )
    {
            $out->write_seq($seq);
    }
}

print "the new file names :n@new_file_nam";

for($i=0;$i<$len;$i++)
{
    $temfi=$arrayids[$i];
    $file_del=join('',$temfi,".gbe");
    system("del $file_del");
}
ADD COMMENT

Login before adding your answer.

Traffic: 1796 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6