Question: How do I make this perl script work to fetch sequences from NCBI using gene symbols?
0
gravatar for Genetics
13 months ago by
Genetics1.5k
United States
Genetics1.5k wrote:

I have a text file called org.txt. I also have this perl script below. This script works if I just have $name="SS1G_03709";, but doesn't work when I want to loop over all gene symbols. I tried to loop over each $name and print the output in test_organism_seqds.fa file, but there seems to be something wrong reading file and in looping step. I am new in perl so I would really appreciate if someone could help me resolve this issue? thanks!

org.txt

SS1G_03709
SS1G_07286
SS1G_06430
SS1G_01676
SS1G_08825
SS1G_01347

code:

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;

open (my $OUT, '>', '/home/owner/test_organism_seqs.fa') || die "Can't open file:$!";

open(INFILE,"</org.txt>){
    chomp;
    my @names = split('\n', $_);
    foreach my $name(@names){

my $db = 'nuccore';
my $query = "$name+AND+srcdb_refseq[PROP]";

#base URL
my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $url= $base . "esearch.fcgi?db=$db&term=$query&usehistory=y";

#Run the search using the URL created above
my $output = get($url);

#Web Environment. This parameter specifies the Web Environment that 
#contains the UID list to be provided as input to ESummary. Usually 
#this WebEnv value is obtained from the output of a previous ESearch, 
#EPost or ELink call. The WebEnv parameter must be used in 
#conjunction with query_key.
my $web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);

#Query key. This integer specifies which of the UID lists attached to the given 
#Web Environment will be used as input to ESummary.  Query keys are obtained
# from the output of previous ESearch, EPost or ELink calls.  The query_key 
#parameter must be used in conjunction with WebEnv.
my $key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);

$url = $base . "esummary.fcgi?db=$db&query_key=$key&WebEnv=$web";

#Run the search using the esummary URL created above
my $docsums = get($url);

$url = $base . "efetch.fcgi?db=$db&query_key=$key&WebEnv=$web";
$url.= "&rettype=fasta&retmode=text";

#Run the search using the efetch URL created above.
my $data = get($url);
print $OUT "$data";
}
}

close $OUT;
exit;
eutils perl • 641 views
ADD COMMENTlink modified 7 months ago by josev.die10 • written 13 months ago by Genetics1.5k
1

You don't have to use this perl script for retrieving sequences. See this recent comment for inspiration. It should be possible to figure out changes you need in that command line (hints: change database, use efetch -format fasta etc).

ADD REPLYlink modified 13 months ago • written 13 months ago by genomax74k

Thanks, so the db can be "gene"? I don't want to fetch all different nucleotides for the given gene symbol, but only gene sequence.

ADD REPLYlink written 13 months ago by Genetics1.5k
1

Your identifiers do not appear to be in the gene database so you can't use that. nuccore is your choice there. Pay attention to the query since you will need to change it unless you want to download entire genome sequences.

ADD REPLYlink modified 13 months ago • written 13 months ago by genomax74k

@genomax Sorry, but found it a bit tricky. Tried something like this efetch -format fasta -db nuccore -query "SS1G_03709"+AND+srcdb_refseq[PROP]+AND+[GENE]", but won't get anything.

ADD REPLYlink modified 13 months ago • written 13 months ago by Genetics1.5k
1

-query is not a legal parameter for efetch. You need to use use it with esearch first, then pipe the output to efetch. Also, your query string should be changed to indicate that the term SS1G_03709 is the term for [GENE]. With those changes, the command will be:

esearch -db nuccore -query "SS1G_03709[GENE] AND srcdb_refseq[PROP]" | efetch -format acc
NW_001820832.1
XM_001595570.1

Note, I used -format acc with efetch here for brevity. Change it to -fromat fasta for sequence in FASTA format. That said, do you need the sequence of the genomic RefSeq as well? If not, add AND biomol_rna[PROP] to your query and that'll return only RefSeq RNAs.

ADD REPLYlink written 12 months ago by vkkodali1.4k
0
gravatar for josev.die
7 months ago by
josev.die10
josev.die10 wrote:

If you are open to use R, this code will do the work for you.

#Dependency
library(rentrez)

 #Loop over your gene symbols, fetch the protein sequence and put it into a fasta file
save_AAfasta <- function(IDs, nameFile) {
     for(i in seq(length(IDs))) {
      myterm = paste0(org[1],'[ALL]')
      mysearch = entrez_search(db = 'gene', term = myterm)
      prot.link = entrez_link(dbfrom = 'gene', id = mysearch$ids, db = 'protein')
      prot_id = prot.link$links$gene_protein_refseq
      protein_fasta <- entrez_fetch(db="protein", id=prot_id, rettype="fasta")
      write(protein_fasta, file= paste(nameFile, ".fasta", sep = ""), append = TRUE)
     }
  }

Now, you can define a vector with your gene symbols and just call the function:

org = c('SS1G_03709','SS1G_07286','SS1G_06430','SS1G_01676','SS1G_08825',
    'SS1G_01347')
save_AAfasta(org, "Downloads/myproteins")
ADD COMMENTlink written 7 months ago by josev.die10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2075 users visited in the last hour