Question

error while downloading taxonomy using E-utilities

0

4.8 years ago

Bioinfonext ▴ 460

Hi,

I am using the command below to extract bacterial taxonomy from a text file of gene IDs (one ID per line), but it returns an error. How can I resolve it? There are around 60,000 gene IDs in the arsM.ID.txt file.

I need taxonomy in this format:

AM774418.1      Archaea; Euryarchaeota; Halobacteria; Halobacteriales; Halobacteriaceae; Halobacterium  Halobacterium salinarum R1

LOPV01000533.1  Archaea; Euryarchaeota; Halobacteria; Haloferacales; Haloferacaceae; Haloferax  Haloferax sp. SB29

CP003125.1      Bacteria; Firmicutes; Bacilli; Bacillales; Bacillaceae; Geobacillus; Geobacillus thermoleovorans group  Geobacillus thermoleovorans CCB_US3_UF5

BAWP01000033.1  Bacteria; Firmicutes; Bacilli; Bacillales; Bacillaceae; Parageobacillus Parageobacillus thermoglucosidasius NBRC 107763


efetch -db nuccore -format gbc -id arsM.ID.txt|xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism >arsm.taxonomy



Error:


501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=arsM.ID.txt&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=3052771@login4.pri.kelvin2.alces.network'
Result of do_post http request is
$VAR1 = bless( {
                 '_content' => 'LWP will support https URLs if the LWP::Protocol::https module
is installed.
',
                 '_msg' => 'Protocol scheme \'https\' is not supported (LWP::Protocol::https not installed)',
                 '_headers' => bless( {
                                        'content-type' => 'text/plain',
                                        'client-warning' => 'Internal response',
                                        '::std_case' => {
                                                          'client-date' => 'Client-Date',
                                                          'client-warning' => 'Client-Warning'
                                                        },
                                        'client-date' => 'Mon, 22 Jul 2019 14:38:30 GMT'
                                      }, 'HTTP::Headers' ),
                 '_rc' => 501,
                 '_request' => bless( {
                                        '_content' => 'db=nuccore&id=arsM.ID.txt&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=3052771@login4.pri.kelvin2.alces.network',
                                        '_uri' => bless( do{\(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi')}, 'URI::https' ),
                                        '_headers' => bless( {
                                                               'content-type' => 'application/x-www-form-urlencoded',
                                                               'user-agent' => 'libwww-perl/6.39'
                                                             }, 'HTTP::Headers' ),
                                        '_method' => 'POST'
                                      }, 'HTTP::Request' )
               }, 'HTTP::Response' );

vi arsM.ID.txt

AAAK03000116.1

AAAL02000001.1

AACD01000079.1

AACS02000002.1

AADV02000002.1

software error bash ncbi amplicon sequencing • 2.9k views

ADD COMMENT • link updated 4.8 years ago by GenoMax 141k • written 4.8 years ago by Bioinfonext ▴ 460

1

error is

501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)

answer is:

https://stackoverflow.com/questions/21123620

ADD REPLY • link 4.8 years ago by Pierre Lindenbaum 161k

1

The user should first do a fresh reinstall of EDirect, most importantly running the "./edirect/setup.sh" command at the end of the installation instructions, in order to get all of the Perl modules properly loaded.

Then he should execute the following commands:

  export NCBI_API_KEY=${redacted}

  cat arsM.ID.txt |
  epost -db nucleotide -format acc |
  efetch -format gbc |
  xtract -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism

This method makes the most efficient use of the server. The only potential issue is that the result will not be sorted in the original order:

  AADV02000002.1    Bacteria; Cyanobacteria; ...; Crocosphaera    Crocosphaera watsonii WH 8501
  AACS02000002.1    Eukaryota; Fungi; ...; Coprinopsis            Coprinopsis cinerea okayama7#130
  AAAL02000001.1    Bacteria; Proteobacteria; ...; Xylella        Xylella fastidiosa Dixon
  AAAK03000116.1    Bacteria; Firmicutes; ...; Enterococcus       Enterococcus faecium DO
  AACD01000079.1    Eukaryota; Fungi; ...; Aspergillus            Aspergillus nidulans FGSC A4

and the -sort argument is not supported by the underlying epost.fcgi server. If he really needs it in the original order, then using a for loop is necessary, though time-consuming and inefficient.

ADD REPLY • link updated 4.8 years ago by GenoMax 141k • written 4.8 years ago by DCGenomics ▴ 330
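
Note on restoring the input order: if the batched epost/efetch route above is used and the original order matters, one alternative to a per-ID loop may be to reorder the output afterwards. This is only a minimal sketch. It assumes the xtract output is tab-separated with the accession.version in the first column (as in the examples in this thread) and that the ID file holds the same accession.version strings. The function name reorder_by_ids and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the rows of a tab-separated table in the
# order given by an ID list (one accession.version per line).
# Assumes column 1 of the table is the accession.version.
# Usage: reorder_by_ids arsM.ID.txt arsm.taxonomy > arsm.taxonomy.sorted
reorder_by_ids() {
  local ids_file=$1 table_file=$2
  awk -F'\t' '
    NR == FNR { row[$1] = $0; next }   # first file: index table rows by accession
    ($1 in row) { print row[$1] }      # second file: emit rows in ID-list order
  ' "$table_file" "$ids_file"
}
```

IDs with no matching row are skipped silently, so comparing the line counts of the two files afterwards shows whether any records are missing.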

1

Thanks. Because I was getting the error on the HPC server, I installed EDirect on an iMac following the instructions here: https://www.ncbi.nlm.nih.gov/books/NBK179288/

Then I tried the command below to extract the taxonomy and got an error:

$ cat id.txt | epost -db nuccore | efetch -db nuccore -format gbc >taxonomy.list

WebEnv value not found in post output

WebEnv value not found in fetch input

I will try the bash script with the API key that you shared and see whether it works.

Thanks

ADD REPLY • link 4.8 years ago by Bioinfonext ▴ 460

0

I used this script, but it shows an error:

script:

#!/bin/bash
export NCBI_API_KEY=${redacted}

  cat id.txt |
  epost -db nucleotide -format acc |
  efetch -format gbc |
  xtract -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism


bash arsM.sh 
WebEnv value not found in post output
Db value not found in fetch input

ADD REPLY • link 4.8 years ago by Bioinfonext ▴ 460

0

Did you follow the install instructions fully? Especially this part ./edirect/setup.sh.

Can you use my script below and see what it produces?

ADD REPLY • link 4.8 years ago by GenoMax 141k

0

The user should first do a fresh reinstall of EDirect

Unfortunately this user is using a cluster and is not able to do anything with installed software.

ADD REPLY • link 4.8 years ago by GenoMax 141k

0

Have you signed up for an NCBI API key (NCBI_API_KEY), and are you using it? If you are submitting a long list of queries, NCBI may be limiting the number of requests you can make.

ADD REPLY • link 4.8 years ago by GenoMax 141k

0

Thanks. I put the API key at the end of the command, but it still shows the same error:

efetch -db nuccore -format gbc -id arsM.ID.txt|xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism >arsm.taxonomy&api_key=${redacted}

I am working on an HPC cluster. Which module should I load to resolve this issue?

Thanks bioinfonext

ADD REPLY • link updated 4.8 years ago by GenoMax 141k • written 4.8 years ago by Bioinfonext ▴ 460

1

That is the wrong place to put the API key. I suggest that you export it as a variable in your shell session (or permanently in your .bashrc or .profile). Run export NCBI_API_KEY=your_key, then run the original command ending at > arsm.taxonomy, with nothing after the output file name.

I assume the error about https is not critical since you have used ncbi eutils on this machine successfully before?

ADD REPLY • link 4.8 years ago by GenoMax 141k
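
Note on why the appended key fails: in a shell, an unquoted & ends the command and runs it in the background, and api_key=... after it is then just a separate shell variable assignment, so the key never reaches efetch. A minimal sketch of the export approach described above, with a guard that stops early when the variable is missing. The function name require_api_key is hypothetical, and your_key is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical guard: succeed only if NCBI_API_KEY is set and non-empty.
require_api_key() {
  if [ -z "${NCBI_API_KEY:-}" ]; then
    echo "NCBI_API_KEY is not set; run: export NCBI_API_KEY=your_key" >&2
    return 1
  fi
}

# Intended use (commented out so nothing runs when this file is sourced):
#   export NCBI_API_KEY=your_key      # placeholder, not a real key
#   require_api_key || exit 1
#   ...then run the efetch | xtract pipeline with no "&api_key=..." suffix
```

Putting the export line in .bashrc or .profile, as suggested above, makes the variable available in every new session.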

0

thanks, let me try to run this as a bash script.

bioinfonext

ADD REPLY • link 4.8 years ago by Bioinfonext ▴ 460

0

Hi,

I am trying to run the above command with this script, but it is not running. Could you please advise whether there is an error in the script?

#!/bin/bash
API_KEY="redacted"

#SBATCH --job-name=taxonomy
#SBATCH –-ntasks=10
#SBATCH --partition=lowpri
#SBATCH --time=80:30:00
#SBATCH --output=/users/3052771/sharedscratch/arsenic_amplicon/armM_Amplicon
module load e-utilities/03.02.19

efetch -db nuccore -format gbc -id arsM.ID.txt |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism&api_key=${API_KEY} >arsm.taxonomy

list of geneID:

vi arsM.ID.txt

AAAK03000116.1

AAAL02000001.1

AACD01000079.1

AACS02000002.1

AADV02000002.1

ADD REPLY • link updated 4.8 years ago by GenoMax 141k • written 4.8 years ago by Bioinfonext ▴ 460

0

I see you guys tracking the API key, but from my own experience I can share that the LWP issue is critical; see Pierre's first comment. On your HPC, you might need to load an additional Perl module that provides LWP.

ADD REPLY • link 4.8 years ago by Carambakaracho ★ 3.2k
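
Note on checking for the missing module: the 501 message names LWP::Protocol::https, so a quick way to see whether the Perl on a given node can load it is to ask Perl directly. A minimal sketch; the function name has_perl_module is hypothetical, and it only tests whichever perl is first on the PATH, which may differ from the one EDirect uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the perl on PATH can load the given module.
has_perl_module() {
  perl -M"$1" -e 1 >/dev/null 2>&1
}

# Intended use (commented out so nothing runs when this file is sourced):
#   if has_perl_module LWP::Protocol::https; then
#     echo "https support for LWP is available"
#   else
#     echo "LWP::Protocol::https is missing for this perl" >&2
#   fi
```

Running the check before and after a module load on the cluster shows whether that module actually supplies the HTTPS support.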

1

Presumably eutils has worked on this cluster based on past questions posted by this poster. We are going with that premise.

ADD REPLY • link 4.8 years ago by GenoMax 141k

0

fair enough, taking the full user profile into account is first class service - kudos!

ADD REPLY • link 4.8 years ago by Carambakaracho ★ 3.2k

score 2 · Accepted Answer · 2019-07-22

2

4.8 years ago

GenoMax 141k

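Bioinfonext: I finally looked at the command you were using. You can't give the -id option a file containing a list of IDs; it expects the identifiers themselves (the error log in the question shows the file name being sent as id=arsM.ID.txt). You will need to pass the IDs in directly, for example one at a time in a loop like this (which works fine):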

$ for i in `cat id.txt` ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism; done
AAAK03000116.1  Bacteria; Firmicutes; Bacilli; Lactobacillales; Enterococcaceae; Enterococcus   Enterococcus faecium DO
AAAL02000001.1  Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales; Xanthomonadaceae; Xylella   Xylella fastidiosa Dixon
AACD01000079.1  Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Eurotiomycetes; Eurotiomycetidae; Eurotiales; Aspergillaceae; Aspergillus    Aspergillus nidulans FGSC A4
AACS02000002.1  Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina; Agaricomycetes; Agaricomycetidae; Agaricales; Psathyrellaceae; Coprinopsis   Coprinopsis cinerea okayama7#130

ADD COMMENT • link 4.8 years ago by GenoMax 141k
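
Note on reducing the number of requests: with tens of thousands of IDs, one request per ID is slow. A possible middle ground is to group the IDs into batches and pass each batch to -id. This is a sketch under the assumption that efetch accepts a comma-separated list for -id, which is worth checking against the installed EDirect version. The function name batch_ids and the batch size of 100 are hypothetical choices.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the IDs from a file as comma-separated
# batches, one batch of up to N IDs per output line. Blank lines are skipped.
batch_ids() {
  local ids_file=$1 size=${2:-100}
  awk -v n="$size" '
    NF == 0 { next }                                        # ignore blank lines
    { printf "%s%s", (count % n ? "," : ""), $1; count++ }  # join IDs with commas
    count % n == 0 { print "" }                             # close a full batch
    END { if (count % n) print "" }                         # close a partial batch
  ' "$ids_file"
}

# Intended use (commented out; needs EDirect and network access):
#   for batch in $(batch_ids id.txt 100); do
#     efetch -db nuccore -format gbc -id "$batch" |
#       xtract -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism
#   done >> taxonomy.txt
```

Because each batch is a single word with no spaces, the for loop splits on batches rather than on individual IDs.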

0

Hi genomax,

If I run the same command as you on the server, it shows the same error as above:

$ for i in `cat id.txt` ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism; done
501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AAAK03000116.1&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=3052771@login4.pri.kelvin2.alces.network'
Result of do_post http request is
$VAR1 = bless( {
                 '_request' => bless( {
                                        '_headers' => bless( {
                                                               'content-type' => 'application/x-www-form-urlencoded',
                                                               'user-agent' => 'libwww-perl/6.39'
                                                             }, 'HTTP::Headers' ),
                                        '_method' => 'POST',
                                        '_content' => 'db=nuccore&id=AAAK03000116.1&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=3052771@login4.pri.kelvin2.alces.network',
                                        '_uri' => bless( do{\(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi')}, 'URI::https' )
                                      }, 'HTTP::Request' ),
                 '_rc' => 501,
                 '_msg' => 'Protocol scheme \'https\' is not supported (LWP::Protocol::https not installed)',
                 '_headers' => bless( {
                                        '::std_case' => {
                                                          'client-warning' => 'Client-Warning',
                                                          'client-date' => 'Client-Date'
                                                        },
                                        'content-type' => 'text/plain',
                                        'client-date' => 'Tue, 23 Jul 2019 08:52:30 GMT',
                                        'client-warning' => 'Internal response'
                                      }, 'HTTP::Headers' ),
                 '_content' => 'LWP will support https URLs if the LWP::Protocol::https module
is installed.
'
               }, 'HTTP::Response' );

501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AAAL02000001.1&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=3052771@login4.pri.kelvin2.alces.network'
Result of do_post http request is

And if I run it as a bash script like the one below, I get a different error. I am not sure whether a slash is needed after api_key in the for loop.

bash taxonomy.sh 
runtime: failed to create new OS thread (have 4 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
/opt/apps/e-utilities/edirect/efetch: fork: retry: No child processes
taxonomy.sh: fork: retry: Resource temporarily unavailable

runtime stack:
runtime.throw(0x5cc863, 0x9)
    /usr/local/go/src/runtime/panic.go:608 +0x72
runtime.newosproc(0xc000010a80)
    /usr/local/go/src/runtime/os_linux.go:166 +0x1c0

here is the bash script:

#!/bin/bash
API_KEY="redacted"


for i in `cat id.txt` ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism&api_key=${API_KEY} >>arsm.taxonomy.txt

done

thanks a lot for your time and help.

ADD REPLY • link 4.8 years ago by Bioinfonext ▴ 460

0

Can you confirm that you are successfully able to use eutils on this machine? Does your cluster have direct internet access from the server/node you are running this from since that is obviously needed.

As you can see, I don't have a problem getting the loop to work.

ADD REPLY • link 4.8 years ago by GenoMax 141k

0

Did you use correct file name in your command? id.txt is something I had made up on my machine with a small list of id's.

ADD REPLY • link 4.8 years ago by GenoMax 141k

0

It shows an error with a small list too:

$ for i in 'small.id.txt' ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism; done

501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=small.id.txt&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=305@login4.pri.alces.network'
Result of do_post http request is
$VAR1 = bless( {
                 '_headers' => bless( {
                                        'client-date' => 'Tue, 23 Jul 2019 10:02:46 GMT',
                                        '::std_case' => {
                                                          'client-date' => 'Client-Date',
                                                          'client-warning' => 'Client-Warning'
                                                        },
                                        'client-warning' => 'Internal response',
                                        'content-type' => 'text/plain'
                                      }, 'HTTP::Headers' ),
                 '_msg' => 'Protocol scheme \'https\' is not supported (LWP::Protocol::https not installed)',
                 '_rc' => 501,
                 '_content' => 'LWP will support https URLs if the LWP::Protocol::https module
is installed.
',
                 '_request' => bless( {
                                        '_method' => 'POST',
                                        '_content' => 'db=nuccore&id=small.id.txt&rettype=gbc&retmode=xml&edirect_os=linux&edirect=11.7&tool=edirect&email=3052771@login4.pri.kelvin2.alces.network',
                                        '_uri' => bless( do{\(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi')}, 'URI::https' ),
                                        '_headers' => bless( {
                                                               'content-type' => 'application/x-www-form-urlencoded',
                                                               'user-agent' => 'libwww-perl/6.39'
                                                             }, 'HTTP::Headers' )
                                      }, 'HTTP::Request' )
               }, 'HTTP::Response' );

ADD REPLY • link 4.8 years ago by Bioinfonext ▴ 460

0

Perhaps something changed (an operating system update?) that broke HTTPS support in your LWP installation. I will point you back to Pierre's answer: C: error while downloading taxonomy using E-utilities

You will need to talk with your sys admins to get this fixed.

ADD REPLY • link 4.8 years ago by GenoMax 141k

0

Hi genomax,

Thanks for all your help. Is there any other way to download taxonomy for a large set of gene IDs from NCBI? E-utilities has a problem on the server, and the admin is not proactive about resolving it.

ADD REPLY • link 4.8 years ago by Bioinfonext ▴ 460

0

You could try to create web URLs with the queries, though that will not be as convenient as command-line eutils. I am not sure whether the two methods can be combined; I will have to look into it.

ADD REPLY • link 4.8 years ago by GenoMax 141k

0

And if I use the script below on the HPC server, it shows a different error:

script:

#!/bin/bash
API_KEY="{reduced}"


for i in `cat id.txt` ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism&api_key=${API_KEY}

done


taxonomy.sh: line 5: efetch: command not found
taxonomy.sh: line 5: xtract: command not found
taxonomy.sh: line 5: efetch: command not found
taxonomy.sh: line 5: xtract: command not found

ADD REPLY • link 4.8 years ago by Bioinfonext ▴ 460

0

Hi genomax,

This works perfectly on the iMac. I am now using it on a large gene ID file and it is downloading; let's see whether it can download the taxonomy for all gene IDs.

$ for i in `cat id.download.txt` ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism; done
AAAK03000116.1  Bacteria; Firmicutes; Bacilli; Lactobacillales; Enterococcaceae; Enterococcus   Enterococcus faecium DO
AAAL02000001.1  Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales; Xanthomonadaceae; Xylella   Xylella fastidiosa Dixon
AACD01000079.1  Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Eurotiomycetes; Eurotiomycetidae; Eurotiales; Aspergillaceae; Aspergillus    Aspergillus nidulans FGSC A4
AACS02000002.1  Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina; Agaricomycetes; Agaricomycetidae; Agaricales; Psathyrellaceae; Coprinopsis   Coprinopsis cinerea okayama7#130

ADD REPLY • link 4.8 years ago by Bioinfonext ▴ 460

0

Should be able to. Then transfer the file to your cluster.

ADD REPLY • link 4.8 years ago by GenoMax 141k

0

Hi genomax,

It extracted taxonomy for 20,000 gene IDs out of 82,696 and then threw an error:

$ for i in `cat id.txt` ; do efetch -db nuccore -format gbc -id ${i} |xtract  -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism >>taxonomy.txt; done
502 Bad Gateway
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=CNGY01000043.1&rettype=gbc&retmode=xml&edirect_os=darwin&edirect=11.7&tool=edirect&email=ygupta@admins-imac.sobs.qub.ac.uk'
Result of do_post http request is
$VAR1 = bless( {
                 '_protocol' => 'HTTP/1.1',
                 '_msg' => 'Bad Gateway',
                 '_request' => bless( {
                                        '_headers' => bless( {
                                                               'user-agent' => 'libwww-perl/6.05',
                                                               '::std_case' => {
                                                                                 'if-ssl-cert-subject' => 'If-SSL-Cert-Subject'
                                                                               },
                                                               'content-type' => 'application/x-www-form-urlencoded'
                                                             }, 'HTTP::Headers' ),
                                        '_uri_canonical' => bless( do{\(my $o = 'h

Thanks for all your help. bioinfonext

ADD REPLY • link 4.7 years ago by Bioinfonext ▴ 460
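
Note on resuming after a failure such as the 502 above: since the loop appends to taxonomy.txt, the run can be restarted on only the IDs that have no row yet, rather than from the top. A minimal sketch, assuming the output is tab-separated with the accession.version in column 1 and that the ID list uses the same accession.version strings. The function name remaining_ids is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the IDs from ids_file that do not yet appear
# in column 1 of done_file (the partially written, tab-separated output).
remaining_ids() {
  local ids_file=$1 done_file=$2
  if [ ! -s "$done_file" ]; then
    awk 'NF > 0' "$ids_file"           # nothing fetched yet: every ID remains
    return
  fi
  awk -F'\t' '
    NR == FNR { seen[$1] = 1; next }   # first file: accessions already fetched
    NF > 0 && !($1 in seen)            # second file: print IDs still missing
  ' "$done_file" "$ids_file"
}

# Intended use (commented out; needs EDirect and network access):
#   remaining_ids id.txt taxonomy.txt > id.todo.txt
#   for i in $(cat id.todo.txt); do
#     efetch -db nuccore -format gbc -id "$i" |
#       xtract -pattern INSDSeq -element INSDSeq_accession-version INSDSeq_taxonomy INSDSeq_organism
#   done >> taxonomy.txt
```

Repeating these two steps until id.todo.txt is empty covers transient server errors without refetching the records already saved.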

0

Hi,

Thanks a lot for all of your help. I have successfully downloaded the taxonomy for the sequences using the command suggested by genomax.

Thanks again for such a great platform and the generous help.

Kind Regards Bioinfonext

ADD REPLY • link 4.7 years ago by Bioinfonext ▴ 460