Question: How To Bulk Retrieve Dna Fasta Given Gene Uids?
0
gravatar for random26078
6.1 years ago by
random2607820
random2607820 wrote:

I have read the post here Program or script to go from partial protein sequence to DNA and read through the eutilities manual but I am unsure how to use the 'history' parameter with my data which I understand that I need to use because I want to retrieve a huge number of nucleotide sequences.

I have a file of > 800,000 UIDs with one gene UID per line. I want to retrieve the DNA fasta file from NCBI for each UID.

Here is what I have tried. This works on a short list but my list of over 800,000 UIDs is too big. I don't understand where to include the history parameters for my search and retrieval which I believe will allow me to download all of the fasta files I need. Can anyone tell me how I should incorporate this in my script?

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::DB::EUtilities;

# this needs to be a list of EntrezGene unique IDs
my @ids = qw(3974211 1343189 2657315 7871523 6394675 3289281 11593755 11702744 12745110 12932362);

#open my $UIDLIST, '<', 'E:/uid-list.txt';
open my $OUT, '>', 'E:/gene-uid-list.fasta';

my $eutil   = Bio::DB::EUtilities->new(-eutil => 'esummary',
                                       -email => 'ma@b.org',
                                   -db    => 'gene',
                                   -id    => \@ids);

my $fetcher = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                   -email => 'ma@b.org',
                                   -db      => 'nucleotide',
                                   -rettype => 'fasta');

while (my $docsum = $eutil->next_DocSum) {
    # to ensure we grab the right ChrStart information, we grab the Item above
    # it in the Item hierarchy (visible via print_all from the eutil instance).
    my ($item) = $docsum->get_Items_by_name('GenomicInfoType');
    my %item_data = map {$_ => 0} qw(ChrAccVer ChrStart ChrStop);

    while (my $sub_item = $item->next_subItem) {
        if (exists $item_data{$sub_item->get_name}) {
            $item_data{$sub_item->get_name} = $sub_item->get_content;
        }
    }
    # check to make sure everything is set
    for my $check (qw(ChrAccVer ChrStart ChrStop)) {
        die "$check not set" unless $item_data{$check};
    }

    my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
    printf("Retrieving %s, from %d-%d, strand %d\n", $item_data{ChrAccVer},
                                                $item_data{ChrStart},
                                                $item_data{ChrStop},
                                                $strand
                                                );

    $fetcher->set_parameters(-id => $item_data{ChrAccVer},
                         -seq_start => $item_data{ChrStart} + 1,
                         -seq_stop  => $item_data{ChrStop} + 1,
                         -strand    => $strand);
    #print $OUT $fetcher->get_Response->content;
}
close $OUT; #close $UIDLIST; exit;
perl gene bioperl eutils • 2.6k views
ADD COMMENTlink modified 6.1 years ago by biostar150 • written 6.1 years ago by random2607820
1

Have you read the Bioperl EUtilities Cookbook? In particular, the section "How do I retrieve a long list of sequences using a query?"

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Neilfws48k

Yes, I have read that section. My problem with using a history is that all of my UIDs are in a file I have locally. I am not retrieving the UIDs through an efetch call.

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by random2607820
0
gravatar for random26078
6.1 years ago by
random2607820
random2607820 wrote:

I made some adjustments to my script and have it working. My only problem now is that I keep getting Bad Gateway errors. My revised code is below. Does anyone have a better way to retrieve >800,000 DNA sequences from a list of UID gene ids?

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::DB::EUtilities;

#This script takes as input a list of UIDs, one per line
#and returns the DNA fasta sequence.

my @uid;

open my $OUT, '>', 'E:/gene-uid-list.fasta';
open my $UIDLIST, '<', 'E:/gene-uid-list.txt' or die "Could not open uid list file: $!\n";

while(<$UIDLIST>){
    my $line = $_;
    chomp $line;
    push @uid, $line;
    my $eutil   = Bio::DB::EUtilities->new(-eutil => 'esummary',
                                              -email => 'ma@b.org',
                                             -db    => 'gene',
                                             -id    => \@uid);

 my $fetcher = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                          -email => 'ma@b.org',
                                         -db      => 'nucleotide',
                                          -rettype => 'fasta');

while (my $docsum = $eutil->next_DocSum) {    # to ensure we grab the right ChrStart information, we grab the Item above it in the Item hierarchy (visible via print_all from the eutil instance).
    my ($item) = $docsum->get_Items_by_name('GenomicInfoType');
    my %item_data = map {$_ => 0} qw(ChrAccVer ChrStart ChrStop);    #creates a hash for each key (accession, start and stop) and sets their values to 0.

    while (my $sub_item = $item->next_subItem) {
        if (exists $item_data{$sub_item->get_name}) {
            $item_data{$sub_item->get_name} = $sub_item->get_content;    #gets the accession, chr start and chr stop coordinates and sets the value for the key.
        }
    }
# check to make sure everything is set
    for my $check (qw(ChrAccVer ChrStart ChrStop)) {
        die "$check not set" unless $item_data{$check};
    }

    my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
    printf("Retrieving %s, from %d-%d, strand %d\n", $item_data{ChrAccVer},
                                                     $item_data{ChrStart},
                                                     $item_data{ChrStop},
                                                     $strand);

    $fetcher->set_parameters(-id => $item_data{ChrAccVer},
                             -seq_start => $item_data{ChrStart} + 1,
                              -seq_stop  => $item_data{ChrStop} + 1,
                              -strand    => $strand);
    print $OUT $fetcher->get_Response->content;
}
sleep 1;    #wait one second 
pop @uid;    #discard the current UID off of the array.
}

close $OUT;
close $UIDLIST;

exit;
ADD COMMENTlink modified 6.1 years ago by SES8.3k • written 6.1 years ago by random2607820

I think the idea is to send all of your IDs to the server and retrieve them in batches, but I may be misunderstanding (I'll try to find out if this is true). The code above is sending one ID at a time, so you would still be making >800,000 queries.

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by SES8.3k
0
gravatar for biostar
6.1 years ago by
biostar150
biostar150 wrote:

If you have stand-alone BLAST in your computer then you can do it easily using the following command..

blastdbcmd -db database_name -dbtype nucl/proc -entry_batch file_with_uid.txt
ADD COMMENTlink written 6.1 years ago by biostar150

Using the BLAST+ blastdbcmd is a good idea but won't work in my case. I am using protein (gene) UIDs and I need to get nucleotide FASTA sequences that those proteins come from.

ADD REPLYlink written 6.1 years ago by random2607820
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1067 users visited in the last hour