Question: error while downloading taxonomy using E-utilities
4 weeks ago by Bioinfonext (150), Korea
Bioinfonext wrote:

Hi,

I am using the command below to extract bacterial taxonomy from a text file of gene IDs (one ID per line), but it fails with an error. How can I resolve it? There are around 60,000 gene IDs in the arsM.ID.txt file.

I need the taxonomy in this format:

AM774418.1      Archaea; Euryarchaeota; Halobacteria; Halobacteriales; Halobacteriaceae; Halobacterium  Halobacterium salinarum R1

LOPV01000533.1  Archaea; Euryarchaeota; Halobacteria; Haloferacales; Haloferacaceae; Haloferax  Haloferax sp. SB29

CP003125.1      Bacteria; Firmicutes; Bacilli; Bacillales; Bacillaceae; Geobacillus; Geobacillus thermoleovorans group  Geobacillus thermoleovorans CCB_US3_UF5

BAWP01000033.1  Bacteria; Firmicutes; Bacilli; Bacillales; Bacillaceae; Parageobacillus Parageobacillus thermoglucosidasius NBRC 107763


efetch -db nuccore -format gbc -id arsM.ID.txt|xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism >arsm.taxonomy



Error:


501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=arsM.ID.txt&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=3052771@login4.pri.kelvin2.alces.network'
Result of do_post http request is
$VAR1 = bless( {
                 '_content' => 'LWP will support https URLs if the LWP::Protocol::https module
is installed.
',
                 '_msg' => 'Protocol scheme \'https\' is not supported (LWP::Protocol::https not installed)',
                 '_headers' => bless( {
                                        'content-type' => 'text/plain',
                                        'client-warning' => 'Internal response',
                                        '::std_case' => {
                                                          'client-date' => 'Client-Date',
                                                          'client-warning' => 'Client-Warning'
                                                        },
                                        'client-date' => 'Mon, 22 Jul 2019 14:38:30 GMT'
                                      }, 'HTTP::Headers' ),
                 '_rc' => 501,
                 '_request' => bless( {
                                        '_content' => 'db=nuccore&id=arsM.ID.txt&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=3052771@login4.pri.kelvin2.alces.network',
                                        '_uri' => bless( do{\(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi')}, 'URI::https' ),
                                        '_headers' => bless( {
                                                               'content-type' => 'application/x-www-form-urlencoded',
                                                               'user-agent' => 'libwww-perl/6.39'
                                                             }, 'HTTP::Headers' ),
                                        '_method' => 'POST'
                                      }, 'HTTP::Request' )
               }, 'HTTP::Response' );

vi arsM.ID.txt

AAAK03000116.1

AAAL02000001.1

AACD01000079.1

AACS02000002.1

AADV02000002.1
modified 4 weeks ago by genomax (70k) • written 4 weeks ago by Bioinfonext (150)

The user should first do a fresh reinstall of EDirect, most importantly running the "./edirect/setup.sh" command at the end of the installation instructions, in order to get all of the Perl modules properly loaded.

Then he should execute the following commands:

  export NCBI_API_KEY=${redacted}

  cat arsM.ID.txt |
  epost -db nucleotide -format acc |
  efetch -format gbc |
  xtract -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism

This method makes the most efficient use of the server. The only potential issue is that the result will not be sorted in the original order:

  AADV02000002.1    Bacteria; Cyanobacteria; ...; Crocosphaera    Crocosphaera watsonii WH 8501
  AACS02000002.1    Eukaryota; Fungi; ...; Coprinopsis            Coprinopsis cinerea okayama7#130
  AAAL02000001.1    Bacteria; Proteobacteria; ...; Xylella        Xylella fastidiosa Dixon
  AAAK03000116.1    Bacteria; Firmicutes; ...; Enterococcus       Enterococcus faecium DO
  AACD01000079.1    Eukaryota; Fungi; ...; Aspergillus            Aspergillus nidulans FGSC A4

and the -sort argument is not supported by the underlying epost.fcgi server. If he really needs it in the original order, then using a for loop is necessary, though time-consuming and inefficient.
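
If the original order matters, an alternative to the per-ID loop is to keep the batch approach and reorder the output afterwards. A minimal sketch, assuming the xtract output was saved to a file, is tab-separated, and has the accession.version in the first column exactly as it appears in the ID list; reorder_by_ids is a hypothetical helper name, not part of EDirect:

```shell
# reorder_by_ids IDS_FILE RESULTS_FILE
# Print the rows of RESULTS_FILE in the order the accessions appear in IDS_FILE.
# Assumes tab-separated rows with accession.version in column 1.
reorder_by_ids() {
  awk -F'\t' '
    NR == FNR { row[$1] = $0; next }           # first file: index result rows by accession
    { id = $1; gsub(/[[:space:]]/, "", id) }   # second file: strip stray whitespace from the ID
    id != "" && (id in row) { print row[id] }  # emit the stored row if one was returned
  ' "$2" "$1"
}
```

For example, if the pipeline output was saved as arsm.unsorted.tsv (an illustrative file name), then reorder_by_ids arsM.ID.txt arsm.unsorted.tsv > arsm.taxonomy prints the rows in list order. IDs for which nothing was returned are skipped silently.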

modified 29 days ago by genomax (70k) • written 29 days ago by DCGenomics (320)

The user should first do a fresh reinstall of EDirect

Unfortunately this user is using a cluster and is not able to do anything with installed software.

modified 29 days ago • written 29 days ago by genomax (70k)

Thanks. Since I was getting the error on the HPC server, I installed EDirect on an iMac following the instructions at https://www.ncbi.nlm.nih.gov/books/NBK179288/

Then I tried the command below to extract the taxonomy and got an error:

$ cat id.txt | epost -db nuccore | efetch -db nuccore -format gbc >taxonomy.list

WebEnv value not found in post output

WebEnv value not found in fetch input

I will try the bash script with the API key that you shared and see whether it works.

Thanks

written 28 days ago by Bioinfonext (150)

I used this script, but it shows an error:

script:

#!/bin/bash
export NCBI_API_KEY=${redacted}

  cat id.txt |
  epost -db nucleotide -format acc |
  efetch -format gbc |
  xtract -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism


bash arsM.sh 
WebEnv value not found in post output
Db value not found in fetch input
written 28 days ago by Bioinfonext (150)

Did you follow the install instructions fully? Especially this part ./edirect/setup.sh.

Can you use my script below and see what it produces?

modified 28 days ago • written 28 days ago by genomax (70k)

error is

501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)

answer is:

https://stackoverflow.com/questions/21123620

written 4 weeks ago by Pierre Lindenbaum (122k)

Have you signed up for an NCBI API key, and are you using it? If you are submitting a long list of queries, NCBI may be limiting the number of requests you can make.

modified 4 weeks ago • written 4 weeks ago by genomax (70k)

Thanks. I put the API key at the end of the command, but it still shows the same error:

efetch -db nuccore -format gbc -id arsM.ID.txt|xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism >arsm.taxonomy&api_key=${redacted}

I am working on an HPC. Which module should I load to resolve this issue?

Thanks, bioinfonext

modified 4 weeks ago by genomax (70k) • written 4 weeks ago by Bioinfonext (150)

That is the wrong place to put the API key. I suggest that you export it as a variable in your shell session (or permanently in your .bashrc or .profile). Run export NCBI_API_KEY=your_key, and then run your command only up to the output redirection > arsm.taxonomy, with nothing after the file name.

I assume the error about https is not critical, since you have used NCBI eutils on this machine successfully before?
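
To make the placement concrete, here is a minimal sketch of the export approach. The key value is a placeholder, and check_api_key is a hypothetical helper (not part of EDirect) that only confirms the variable is visible to the shell before a long run is started:

```shell
# Placeholder value; substitute the key from your NCBI account settings.
export NCBI_API_KEY="your_key"

# Hypothetical helper: fail early if the key is missing from the environment.
check_api_key() {
  if [ -z "${NCBI_API_KEY:-}" ]; then
    echo "NCBI_API_KEY is not set" >&2
    return 1
  fi
  echo "NCBI_API_KEY is set"
}

check_api_key
```

With the key exported this way, the efetch command line itself stays unchanged and simply ends at the output file name.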

modified 4 weeks ago • written 4 weeks ago by genomax (70k)

Thanks, let me try to run this as a bash script.

bioinfonext

written 4 weeks ago by Bioinfonext (150)

Hi,

I am trying to run the above command with this script, but it is not running. Could you please advise whether there is any error in the script?

#!/bin/bash
API_KEY="redacted"

#SBATCH --job-name=taxonomy
#SBATCH –-ntasks=10
#SBATCH --partition=lowpri
#SBATCH --time=80:30:00
#SBATCH --output=/users/3052771/sharedscratch/arsenic_amplicon/armM_Amplicon
module load e-utilities/03.02.19

efetch -db nuccore -format gbc -id arsM.ID.txt |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism&api_key=${API_KEY} >arsm.taxonomy

List of gene IDs:

vi arsM.ID.txt

AAAK03000116.1

AAAL02000001.1

AACD01000079.1

AACS02000002.1

AADV02000002.1
modified 4 weeks ago by genomax (70k) • written 4 weeks ago by Bioinfonext (150)

I see you are all tracking down the API key, but from my own experience I can say that the LWP issue is critical; see Pierre's first comment. On your HPC, you might need to load an additional module that provides Perl's LWP with HTTPS support.
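
A quick way to check whether Perl can see the HTTPS module named in the error message. This sketch only tests that the module loads with the perl that is first on your PATH, which is assumed (not verified) to be the one efetch uses; has_perl_module is a hypothetical helper name:

```shell
# has_perl_module MODULE
# Succeeds if the perl on PATH can load MODULE.
has_perl_module() {
  perl -M"$1" -e 1 2>/dev/null
}

if has_perl_module LWP::Protocol::https; then
  echo "LWP::Protocol::https is available"
else
  echo "LWP::Protocol::https is missing; LWP cannot make https requests"
fi
```

Running this before and after loading a candidate module on the cluster shows whether that module actually fixes the problem.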

written 4 weeks ago by Carambakaracho (1.5k)

Presumably eutils has worked on this cluster based on past questions posted by this poster. We are going with that premise.

modified 4 weeks ago • written 4 weeks ago by genomax (70k)

fair enough, taking the full user profile into account is first class service - kudos!

written 4 weeks ago by Carambakaracho (1.5k)
4 weeks ago by genomax (70k), United States
genomax wrote:

Bioinfonext: I finally looked at the command you were using. You can't give the -id option a file containing a list of IDs; it expects the IDs themselves. You will need to do something like this, passing them one at a time (which works fine):

$ for i in `cat id.txt` ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism; done
AAAK03000116.1  Bacteria; Firmicutes; Bacilli; Lactobacillales; Enterococcaceae; Enterococcus   Enterococcus faecium DO
AAAL02000001.1  Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales; Xanthomonadaceae; Xylella   Xylella fastidiosa Dixon
AACD01000079.1  Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Eurotiomycetes; Eurotiomycetidae; Eurotiales; Aspergillaceae; Aspergillus    Aspergillus nidulans FGSC A4
AACS02000002.1  Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina; Agaricomycetes; Agaricomycetidae; Agaricales; Psathyrellaceae; Coprinopsis   Coprinopsis cinerea okayama7#130
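
A slightly more defensive variant of the same loop, as a sketch. It assumes the ID file may contain blank lines or Windows line endings, and it redirects efetch's standard input so the command cannot consume the rest of the ID list; clean_ids and fetch_taxonomy are hypothetical helper names:

```shell
# clean_ids FILE
# Emit one accession per line, dropping blank lines, surrounding whitespace and CRs.
clean_ids() {
  tr -d '\r' < "$1" | awk 'NF { print $1 }'
}

# fetch_taxonomy FILE
# Look up each accession in turn and print accession, lineage and organism.
fetch_taxonomy() {
  clean_ids "$1" | while IFS= read -r acc; do
    # < /dev/null keeps efetch from reading the remaining IDs from the loop's input
    efetch -db nuccore -format gbc -id "$acc" < /dev/null |
      xtract -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism
  done
}
```

Usage would be fetch_taxonomy id.txt > taxonomy.txt. The one-request-per-ID behaviour is the same as in the for loop; only the input handling differs.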
modified 4 weeks ago • written 4 weeks ago by genomax (70k)

Hi genomax,

If I run the same command as you on the server, it shows the same error as above:

$ for i in `cat id.txt` ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism; done
501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AAAK03000116.1&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=3052771@login4.pri.kelvin2.alces.network'
Result of do_post http request is
$VAR1 = bless( {
                 '_request' => bless( {
                                        '_headers' => bless( {
                                                               'content-type' => 'application/x-www-form-urlencoded',
                                                               'user-agent' => 'libwww-perl/6.39'
                                                             }, 'HTTP::Headers' ),
                                        '_method' => 'POST',
                                        '_content' => 'db=nuccore&id=AAAK03000116.1&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=3052771@login4.pri.kelvin2.alces.network',
                                        '_uri' => bless( do{\(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi')}, 'URI::https' )
                                      }, 'HTTP::Request' ),
                 '_rc' => 501,
                 '_msg' => 'Protocol scheme \'https\' is not supported (LWP::Protocol::https not installed)',
                 '_headers' => bless( {
                                        '::std_case' => {
                                                          'client-warning' => 'Client-Warning',
                                                          'client-date' => 'Client-Date'
                                                        },
                                        'content-type' => 'text/plain',
                                        'client-date' => 'Tue, 23 Jul 2019 08:52:30 GMT',
                                        'client-warning' => 'Internal response'
                                      }, 'HTTP::Headers' ),
                 '_content' => 'LWP will support https URLs if the LWP::Protocol::https module
is installed.
'
               }, 'HTTP::Response' );

501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AAAL02000001.1&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=3052771@login4.pri.kelvin2.alces.network'
Result of do_post http request is

And if I run it as a bash script like the one below, I get a different error. I am not sure whether I need to put a slash after api_key in the for loop?

bash taxonomy.sh 
runtime: failed to create new OS thread (have 4 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
/opt/apps/e-utilities/edirect/efetch: fork: retry: No child processes
taxonomy.sh: fork: retry: Resource temporarily unavailable

runtime stack:
runtime.throw(0x5cc863, 0x9)
    /usr/local/go/src/runtime/panic.go:608 +0x72
runtime.newosproc(0xc000010a80)
    /usr/local/go/src/runtime/os_linux.go:166 +0x1c0

here is the bash script:

#!/bin/bash
API_KEY="redacted"


for i in `cat id.txt` ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism&api_key=${API_KEY} >>arsm.taxonomy.txt

done

thanks a lot for your time and help.

written 4 weeks ago by Bioinfonext (150)

Can you confirm that you are able to use eutils successfully on this machine? Does your cluster have direct internet access from the server/node you are running this from? That is obviously needed.

As you can see, I don't have a problem getting the loop to work.
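
One likely reason the script version behaves differently from the one-liner: in a shell, & ends a command and runs it in the background, so "&api_key=${API_KEY}" is never passed to efetch. Inside the for loop, that means every pipeline is backgrounded and the loop launches them all at once, which would fit the fork and "max user processes" errors. No slash is needed after api_key; that text should simply not be on the line. A small demonstration that needs no network access (demo_value is a placeholder):

```shell
# What the shell does with a trailing "&api_key=...":
# the & ends the command and runs it in the background,
# and the rest of the line is an ordinary variable assignment.
echo "pretend this is the efetch pipeline" &api_key=demo_value
wait   # let the background job finish

echo "api_key as seen by the shell: $api_key"
```

The key reaches EDirect through the NCBI_API_KEY environment variable instead, so the loop body can stay exactly as in the interactive version.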

modified 4 weeks ago • written 4 weeks ago by genomax (70k)

Did you use the correct file name in your command? id.txt is something I had made up on my machine with a small list of IDs.

modified 4 weeks ago • written 4 weeks ago by genomax (70k)

With a small list it also shows an error:

$ for i in 'small.id.txt' ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism; done

501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=small.id.txt&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=305@login4.pri.alces.network'
Result of do_post http request is
$VAR1 = bless( {
                 '_headers' => bless( {
                                        'client-date' => 'Tue, 23 Jul 2019 10:02:46 GMT',
                                        '::std_case' => {
                                                          'client-date' => 'Client-Date',
                                                          'client-warning' => 'Client-Warning'
                                                        },
                                        'client-warning' => 'Internal response',
                                        'content-type' => 'text/plain'
                                      }, 'HTTP::Headers' ),
                 '_msg' => 'Protocol scheme \'https\' is not supported (LWP::Protocol::https not installed)',
                 '_rc' => 501,
                 '_content' => 'LWP will support https URLs if the LWP::Protocol::https module
is installed.
',
                 '_request' => bless( {
                                        '_method' => 'POST',
                                        '_content' => 'db=nuccore&id=small.id.txt&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=3052771@login4.pri.kelvin2.alces.network',
                                        '_uri' => bless( do{\(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi')}, 'URI::https' ),
                                        '_headers' => bless( {
                                                               'content-type' => 'application/x-www-form-urlencoded',
                                                               'user-agent' => 'libwww-perl/6.39'
                                                             }, 'HTTP::Headers' )
                                      }, 'HTTP::Request' )
               }, 'HTTP::Response' );
modified 4 weeks ago • written 4 weeks ago by Bioinfonext (150)

Perhaps something changed (an operating system update?) that appears to have broken HTTPS support in your LWP installation. I will point you back to Pierre's comment above (C: error while downloading taxonomy using E-utilities).

You will need to talk with your sys admins to get this fixed.

written 4 weeks ago by genomax (70k)

Hi genomax,

Thanks for all your help. Is there any other way to download the taxonomy for a large set of gene IDs from NCBI? E-utilities has a problem on the server, and the admin is not proactive about resolving it.

written 4 weeks ago by Bioinfonext (150)

You could try to create web URLs with the queries, though that will not be as convenient as the command-line eutils. I am not sure whether you can combine the two methods; I will have to look into it.
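
As a rough illustration of the URL approach: the error output earlier in the thread shows the address and parameters that efetch posts, so the same query can be written as a plain URL and fetched with a tool such as curl. This is a sketch under assumptions: efetch_url is a hypothetical helper, and api_key is the query parameter that E-utilities accepts for the key:

```shell
# efetch_url ACCESSION
# Print an E-utilities efetch URL for one nuccore record in INSDSeq XML (gbc) format.
efetch_url() {
  local base='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
  local url="${base}?db=nuccore&id=${1}&rettype=gbc&retmode=xml"
  # Append the API key only if one is set in the environment.
  if [ -n "${NCBI_API_KEY:-}" ]; then
    url="${url}&api_key=${NCBI_API_KEY}"
  fi
  printf '%s\n' "$url"
}
```

Something like curl -s "$(efetch_url AAAK03000116.1)" could then be piped into the same xtract command, since xtract reads XML from standard input. Whether curl and xtract both work on a given cluster node would still need checking.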

written 29 days ago by genomax (70k)

And when I use the script below on the HPC server, it shows a different error:

script:

#!/bin/bash
API_KEY="{reduced}"


for i in `cat id.txt` ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism&api_key=${API_KEY}

done


taxonomy.sh: line 5: efetch: command not found
taxonomy.sh: line 5: xtract: command not found
taxonomy.sh: line 5: efetch: command not found
taxonomy.sh: line 5: xtract: command not found
written 28 days ago by Bioinfonext (150)

Hi genomax,

This works perfectly on the iMac. I am now using it on the large gene ID file and it is downloading; let's see whether it can download the taxonomy for all gene IDs.

$ for i in `cat id.download.txt` ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism; done
AAAK03000116.1  Bacteria; Firmicutes; Bacilli; Lactobacillales; Enterococcaceae; Enterococcus   Enterococcus faecium DO
AAAL02000001.1  Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales; Xanthomonadaceae; Xylella   Xylella fastidiosa Dixon
AACD01000079.1  Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Eurotiomycetes; Eurotiomycetidae; Eurotiales; Aspergillaceae; Aspergillus    Aspergillus nidulans FGSC A4
AACS02000002.1  Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina; Agaricomycetes; Agaricomycetidae; Agaricales; Psathyrellaceae; Coprinopsis   Coprinopsis cinerea okayama7#130
modified 28 days ago • written 28 days ago by Bioinfonext (150)

Should be able to. Then transfer the file to your cluster.
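
With a list this long, it may be worth making the run resumable in case it stops partway. A minimal sketch, assuming the results file is the tab-separated xtract output with accession.version in column 1, matching the IDs in the list; remaining_ids is a hypothetical helper name:

```shell
# remaining_ids IDS_FILE DONE_FILE
# Print the accessions from IDS_FILE that have no row yet in DONE_FILE.
remaining_ids() {
  touch "$2"   # treat a missing results file as "nothing fetched yet"
  awk -F'\t' '
    FILENAME == ARGV[1] { seen[$1] = 1; next }   # accessions already in the results
    { id = $1; gsub(/[[:space:]]/, "", id) }     # clean up the ID from the list
    id != "" && !(id in seen) { print id }       # keep only the ones still missing
  ' "$2" "$1"
}
```

For example, remaining_ids id.download.txt taxonomy.txt > todo.txt (illustrative file names) gives a shorter list to feed back into the loop, appending to the same results file with >>. IDs that legitimately return nothing will show up again on every pass.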

written 28 days ago by genomax (70k)

Hi genomax,

It extracted the taxonomy for 20,000 gene IDs out of 82,696 and then threw an error:

$ for i in `cat id.txt` ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism >>taxonomy.txt; done
502 Bad Gateway
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=CNGY01000043.1&rettype=gbc&retmode=xml&edirect_os=darwin&edirect=11.7&tool=edirect&email=ygupta@admins-imac.sobs.qub.ac.uk'
Result of do_post http request is
$VAR1 = bless( {
                 '_protocol' => 'HTTP/1.1',
                 '_msg' => 'Bad Gateway',
                 '_request' => bless( {
                                        '_headers' => bless( {
                                                               'user-agent' => 'libwww-perl/6.05',
                                                               '::std_case' => {
                                                                                 'if-ssl-cert-subject' => 'If-SSL-Cert-Subject'
                                                                               },
                                                               'content-type' => 'application/x-www-form-urlencoded'
                                                             }, 'HTTP::Headers' ),
                                        '_uri_canonical' => bless( do{\(my $o = 'h

Thanks for all your help, bioinfonext
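
A 502 like this is usually a transient server-side failure rather than a problem with the ID, so one option is to retry the failed lookup after a pause. A minimal sketch, assuming an empty result means the request failed; retry_if_empty and one_record are hypothetical helper names, and the retry count and delay are arbitrary:

```shell
# retry_if_empty TRIES DELAY COMMAND [ARGS...]
# Run COMMAND until it prints something on stdout, at most TRIES times,
# sleeping DELAY seconds between attempts. Returns 1 if every attempt was empty.
retry_if_empty() {
  local tries=$1 delay=$2 attempt=1 out
  shift 2
  while [ "$attempt" -le "$tries" ]; do
    out=$("$@" < /dev/null) || true   # ignore the exit status; judge by the output
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    if [ "$attempt" -lt "$tries" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# one_record ACCESSION
# The lookup from the loop above, wrapped so it can be retried as a unit.
one_record() {
  efetch -db nuccore -format gbc -id "$1" |
    xtract -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism
}
```

Inside the loop, the efetch and xtract pipeline would become retry_if_empty 3 5 one_record "${i}" >> taxonomy.txt, and IDs that still fail after the last attempt could be written to a separate file for a later pass.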

written 27 days ago by Bioinfonext (150)

Hi,

Thanks a lot for all of your help. I have successfully downloaded the taxonomy for the sequences using the command genomax suggested.

Thanks again for such a great platform and the generous help.

Kind regards, Bioinfonext

modified 4 days ago • written 27 days ago by Bioinfonext (150)