Question: How do I make this perl script work to fetch sequences from NCBI using gene symbols?
gravatar for MAPK
2.2 years ago by
MAPK1.7k wrote:

I have a text file called org.txt. I also have this perl script below. This script works if I just have $name="SS1G_03709";, but doesn't work when I want to loop over all gene symbols. I tried to loop over each $name and print the output in test_organism_seqds.fa file, but there seems to be something wrong reading file and in looping step. I am new in perl so I would really appreciate if someone could help me resolve this issue? thanks!




use warnings;
use strict;
use LWP::Simple;

open (my $OUT, '>', '/home/owner/test_organism_seqs.fa') || die "Can't open file:$!";

    my @names = split('\n', $_);
    foreach my $name(@names){

my $db = 'nuccore';
my $query = "$name+AND+srcdb_refseq[PROP]";

#base URL
my $base = '';
my $url= $base . "esearch.fcgi?db=$db&term=$query&usehistory=y";

#Run the search using the URL created above
my $output = get($url);

#Web Environment. This parameter specifies the Web Environment that 
#contains the UID list to be provided as input to ESummary. Usually 
#this WebEnv value is obtained from the output of a previous ESearch, 
#EPost or ELink call. The WebEnv parameter must be used in 
#conjunction with query_key.
my $web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);

#Query key. This integer specifies which of the UID lists attached to the given 
#Web Environment will be used as input to ESummary.  Query keys are obtained
# from the output of previous ESearch, EPost or ELink calls.  The query_key 
#parameter must be used in conjunction with WebEnv.
my $key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);

$url = $base . "esummary.fcgi?db=$db&query_key=$key&WebEnv=$web";

#Run the search using the esummary URL created above
my $docsums = get($url);

$url = $base . "efetch.fcgi?db=$db&query_key=$key&WebEnv=$web";
$url.= "&rettype=fasta&retmode=text";

#Run the search using the efetch URL created above.
my $data = get($url);
print $OUT "$data";

close $OUT;
eutils perl • 1.2k views
ADD COMMENTlink modified 21 months ago by josev.die20 • written 2.2 years ago by MAPK1.7k

You don't have to use this perl script for retrieving sequences. See this recent comment for inspiration. It should be possible to figure out changes you need in that command line (hints: change database, use efetch -format fasta etc).

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by GenoMax94k

Thanks, so the db can be "gene"? I don't want to fetch all different nucleotides for the given gene symbol, but only gene sequence.

ADD REPLYlink written 2.2 years ago by MAPK1.7k

Your identifiers do not appear to be in the gene database so you can't use that. nuccore is your choice there. Pay attention to the query since you will need to change it unless you want to download entire genome sequences.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by GenoMax94k

@genomax Sorry, but found it a bit tricky. Tried something like this efetch -format fasta -db nuccore -query "SS1G_03709"+AND+srcdb_refseq[PROP]+AND+[GENE]", but won't get anything.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by MAPK1.7k

-query is not a legal parameter for efetch. You need to use use it with esearch first, then pipe the output to efetch. Also, your query string should be changed to indicate that the term SS1G_03709 is the term for [GENE]. With those changes, the command will be:

esearch -db nuccore -query "SS1G_03709[GENE] AND srcdb_refseq[PROP]" | efetch -format acc

Note, I used -format acc with efetch here for brevity. Change it to -fromat fasta for sequence in FASTA format. That said, do you need the sequence of the genomic RefSeq as well? If not, add AND biomol_rna[PROP] to your query and that'll return only RefSeq RNAs.

ADD REPLYlink written 2.2 years ago by vkkodali2.4k
gravatar for josev.die
21 months ago by
josev.die20 wrote:

If you are open to use R, this code will do the work for you.


 #Loop over your gene symbols, fetch the protein sequence and put it into a fasta file
save_AAfasta <- function(IDs, nameFile) {
     for(i in seq(length(IDs))) {
      myterm = paste0(org[1],'[ALL]')
      mysearch = entrez_search(db = 'gene', term = myterm) = entrez_link(dbfrom = 'gene', id = mysearch$ids, db = 'protein')
      prot_id =$links$gene_protein_refseq
      protein_fasta <- entrez_fetch(db="protein", id=prot_id, rettype="fasta")
      write(protein_fasta, file= paste(nameFile, ".fasta", sep = ""), append = TRUE)

Now, you can define a vector with your gene symbols and just call the function:

org = c('SS1G_03709','SS1G_07286','SS1G_06430','SS1G_01676','SS1G_08825',
save_AAfasta(org, "Downloads/myproteins")
ADD COMMENTlink written 21 months ago by josev.die20
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2072 users visited in the last hour