Question: How do I make this Perl script work to fetch sequences from NCBI using gene symbols?
3 months ago
United States
MAPK wrote:

I have a text file called org.txt and the Perl script below. The script works if I hard-code a single gene symbol, e.g. $name="SS1G_03709";, but it doesn't work when I try to loop over all the gene symbols. I tried to loop over each $name and print the output to the test_organism_seqs.fa file, but something seems to be wrong with reading the file and with the looping step. I am new to Perl, so I would really appreciate it if someone could help me resolve this issue. Thanks!




use warnings;
use strict;
use LWP::Simple;

open (my $OUT, '>', '/home/owner/test_organism_seqs.fa') || die "Can't open file:$!";

    my @names = split('\n', $_);
    foreach my $name(@names){

my $db = 'nuccore';
my $query = "$name+AND+srcdb_refseq[PROP]";

#base URL
my $base = '';
my $url= $base . "esearch.fcgi?db=$db&term=$query&usehistory=y";

#Run the search using the URL created above
my $output = get($url);

#Web Environment. This parameter specifies the Web Environment that 
#contains the UID list to be provided as input to ESummary. Usually 
#this WebEnv value is obtained from the output of a previous ESearch, 
#EPost or ELink call. The WebEnv parameter must be used in 
#conjunction with query_key.
my $web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);

#Query key. This integer specifies which of the UID lists attached to the given 
#Web Environment will be used as input to ESummary.  Query keys are obtained
# from the output of previous ESearch, EPost or ELink calls.  The query_key 
#parameter must be used in conjunction with WebEnv.
my $key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);

$url = $base . "esummary.fcgi?db=$db&query_key=$key&WebEnv=$web";

#Run the search using the esummary URL created above
my $docsums = get($url);

$url = $base . "efetch.fcgi?db=$db&query_key=$key&WebEnv=$web";
$url.= "&rettype=fasta&retmode=text";

#Run the search using the efetch URL created above.
my $data = get($url);
print $OUT "$data";

close $OUT;
modified 3 months ago by genomax • written 3 months ago by MAPK

You don't have to use this Perl script to retrieve sequences. See this recent comment for inspiration. It should be possible to figure out the changes you need to make to that command line (hints: change the database, use efetch -format fasta, etc.).

modified 3 months ago • written 3 months ago by genomax

Thanks. So the db can be "gene"? I don't want to fetch every nucleotide record for a given gene symbol, only the gene sequence.

written 3 months ago by MAPK

Your identifiers do not appear to be in the gene database so you can't use that. nuccore is your choice there. Pay attention to the query since you will need to change it unless you want to download entire genome sequences.
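One way to see how many records a query would pull down, before fetching anything, is to read the hit count that esearch reports. The sketch below assumes NCBI EDirect is installed and that esearch prints an XML summary containing a <Count> element. count_hits is a made-up helper name, not part of EDirect. The count only says how many records match, not how long each one is.

```shell
# count_hits is a hypothetical helper, not an EDirect command.
# Assumes esearch (NCBI EDirect) is on PATH and prints a <Count> element.
count_hits() {
    # stdin is redirected so esearch cannot swallow input meant for a caller's loop
    esearch -db nuccore -query "$1" < /dev/null |
        sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p'
}

# Example call, using the query from the Perl script in the question:
# count_hits "SS1G_03709 AND srcdb_refseq[PROP]"
```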

modified 3 months ago • written 3 months ago by genomax

@genomax Sorry, but I found this a bit tricky. I tried something like this: efetch -format fasta -db nuccore -query "SS1G_03709"+AND+srcdb_refseq[PROP]+AND+[GENE]", but I don't get anything back.

modified 3 months ago • written 3 months ago by MAPK

-query is not a legal parameter for efetch. You need to use it with esearch first, then pipe the output to efetch. Also, your query string should be changed so that [GENE] is attached to the term SS1G_03709, i.e. the term is searched in the gene field. With those changes, the command will be:

esearch -db nuccore -query "SS1G_03709[GENE] AND srcdb_refseq[PROP]" | efetch -format acc

Note, I used -format acc with efetch here for brevity. Change it to -format fasta to get the sequence in FASTA format. That said, do you need the sequence of the genomic RefSeq as well? If not, add AND biomol_rna[PROP] to your query and that will return only RefSeq RNAs.
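To run the esearch | efetch command above for every gene symbol in a file, a shell loop along these lines might work. It is a minimal sketch under a few assumptions: EDirect is installed, org.txt holds one gene symbol per line, and the names build_query and fetch_sequences are made up for illustration. The < /dev/null on esearch is a guard, on the assumption that esearch may otherwise read the loop's input, and the one-second pause is just a conservative courtesy to NCBI.

```shell
# Sketch only: build_query and fetch_sequences are hypothetical helper names.
# Assumes NCBI EDirect (esearch, efetch) is installed and on PATH.

# Build the Entrez query for one gene symbol (same form as the command above).
build_query() {
    printf '%s[GENE] AND srcdb_refseq[PROP]' "$1"
}

# Read one gene symbol per line from file $1, write all FASTA output to file $2.
fetch_sequences() {
    local infile=$1 outfile=$2 name
    : > "$outfile"                       # start with an empty output file
    while IFS= read -r name || [ -n "$name" ]; do
        [ -z "$name" ] && continue       # skip blank lines
        # stdin is redirected so esearch cannot consume the rest of the symbol list
        esearch -db nuccore -query "$(build_query "$name")" < /dev/null |
            efetch -format fasta >> "$outfile"
        sleep 1                          # brief pause between requests
    done < "$infile"
}

# Example call (file names taken from the question):
# fetch_sequences org.txt test_organism_seqs.fa
```

To restrict the results to RefSeq RNAs, AND biomol_rna[PROP] can be appended to the string that build_query prints.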

written 3 months ago by vkkodali
