Question: Kraken: unable to download the databases from NCBI
karthic0 wrote:

Hi All,

After installing Kraken, I am trying to build the database as specified in the manual, but I am getting the following messages. Any input on this?

/Tools/kraken-master/KRAKEN$ ./kraken-build --standard --threads 40 --db /home/karthic/Databases/KRAKEN
Found jellyfish v1.1.11
Step 1/3: performing rsync dry run...
Rsync dry run complete, removing any non-existent files from manifest.
Step 2/3: Performing rsync file transfer of requested files
rsync: failed to connect to ftp.ncbi.nlm.nih.gov (165.112.9.229): Connection timed out (110)
rsync: failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41e:250::7): Network is unreachable (101)
rsync error: error in socket IO (code 10) at clientserver.c(128) [Receiver=3.1.1]
rsync_from_ncbi.pl: rsync error, exited with code 10

Thanks in Advance, KK

ADD COMMENTlink modified 5 months ago by Joseph Hughes2.6k • written 5 months ago by karthic0

You are probably behind a firewall or proxy, and Kraken is not able to reach NCBI via rsync (note the "Connection timed out" in your log). If that is the case, you may want to talk with your local sysadmins. There are solutions, but they will depend on your local setup.
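A quick way to test the firewall theory is to check whether the rsync port is reachable at all. Below is a minimal Python sketch, assuming the rsync daemon listens on its default port 873 and FTP on port 21; the helper name `can_connect` is made up for illustration and is not part of Kraken.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, DNS failures and
        # "network is unreachable" errors.
        return False
```

For example, `can_connect("ftp.ncbi.nlm.nih.gov", 873)` returning False while `can_connect("ftp.ncbi.nlm.nih.gov", 21)` returns True would suggest that only the rsync port is blocked, which is consistent with wget working and rsync timing out.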

ADD REPLYlink modified 5 months ago • written 5 months ago by genomax52k

Are you able to download anything from the NCBI ftp server using wget?

ADD REPLYlink written 5 months ago by dylan.lawrence10

Yes, I can download with wget.

ADD REPLYlink written 5 months ago by karthic0

Hi,

I was hitting the same rsync error. The way I got around it was to change the rsync_from_ncbi.pl script to use wget instead. I changed line 70 from:

if (system("rsync --no-motd --files-from=manifest.txt rsync://ftp.ncbi.nlm.nih.gov/genomes/ .") != 0) {

to

if (system("wget -nc -nH -x --cut-dirs=1 -i manifest.txt -B ftp://ftp.ncbi.nlm.nih.gov/genomes/ .") != 0) {

It worked okay once I managed to get wget to behave in the same way as the rsync command. I don't know how it will affect database updates, since I was creating a new database when I ran into this error. Good luck!
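For anyone wondering why those wget flags reproduce the rsync layout: `-i manifest.txt -B <base>` resolves each manifest entry against the base URL, `-nH` drops the host name directory, `-x` forces the remote directory structure, and `--cut-dirs=1` removes the leading `genomes/` component. Here is a rough Python sketch of that mapping. It assumes the manifest holds paths relative to `genomes/`; the function names and the example path are hypothetical and only illustrate the idea.

```python
from urllib.parse import urlsplit

# Base URL taken from the wget command above.
BASE_URL = "ftp://ftp.ncbi.nlm.nih.gov/genomes/"


def manifest_entry_to_url(entry, base=BASE_URL):
    """Turn one manifest line (assumed relative to genomes/) into a full URL."""
    return base + entry.strip().lstrip("/")


def local_path_for(url, cut_dirs=1):
    """Approximate where `wget -nH -x --cut-dirs=N` would save the file."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    dirs, filename = parts[:-1], parts[-1]
    # -nH: no host directory; --cut-dirs: skip the first N directory components.
    return "/".join(dirs[cut_dirs:] + [filename])
```

With a made-up entry such as `all/GCF_demo/demo_genomic.fna.gz`, the file ends up at the same relative path that `rsync --files-from=manifest.txt` would have written into the current directory.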

ADD REPLYlink written 4 months ago by dereksarovich0
Joseph Hughes2.6k wrote:

Since NCBI updated their FTP site and decided to phase out GenBank Identifiers (GIs), the default Kraken database update scripts no longer work.

My colleague @Sej Modha has written a Python script that helps with updating the Kraken databases: http://bioinformatics.cvr.ac.uk/blog/update-kraken-databases/

ADD COMMENTlink modified 5 months ago • written 5 months ago by Joseph Hughes2.6k

Good to know. Has this been raised as an issue with kraken developers?

ADD REPLYlink written 5 months ago by genomax52k

I believe Derrick Wood, kraken developer, has moved on to pastures new.

ADD REPLYlink written 5 months ago by Joseph Hughes2.6k

Hi Joseph,

I tried the script, but it is not working. I am getting the following error:

/Tools/kraken-master$ python Update_kraken_db.py
  File "Update_kraken_db.py", line 18
    if len(sys.argv) > 1:
    ^

ADD REPLYlink written 5 months ago by karthic0

Hi Karthic,

There is something wrong with the code formatting on the WordPress page: the code formatting plugin has altered the code on line 18.

Please download the script from GitHub and try again, and let me know if there are any problems.

ADD REPLYlink modified 5 months ago • written 5 months ago by Sej Modha3.1k

Hey Sej,

Thank you for the solution. The script is working.

Regards, KK

ADD REPLYlink written 5 months ago by karthic0

Hello Sej Modha, I have been using your script, but at some point the following error appears:

sys:1: DtypeWarning: Columns (20) have mixed types. Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
  File "./UpdateKrakenDatabases.py", line 118, in <module>
    get_fasta_in_kraken_format('human_genome.fa')
  File "./UpdateKrakenDatabases.py", line 98, in get_fasta_in_kraken_format
    for seq_record in records:
  File "/aplic/GOOLF/1.6.10/Python/3.3.2/lib/python3.3/site-packages/Bio/SeqIO/__init__.py", line 600, in parse
    for r in i:
  File "/aplic/GOOLF/1.6.10/Python/3.3.2/lib/python3.3/site-packages/Bio/GenBank/Scanner.py", line 478, in parse_records
    record = self.parse(handle, do_features)
  File "/aplic/GOOLF/1.6.10/Python/3.3.2/lib/python3.3/site-packages/Bio/GenBank/Scanner.py", line 462, in parse
    if self.feed(handle, consumer, do_features):
  File "/aplic/GOOLF/1.6.10/Python/3.3.2/lib/python3.3/site-packages/Bio/GenBank/Scanner.py", line 430, in feed
    self._feed_header_lines(consumer, self.parse_header())
  File "/aplic/GOOLF/1.6.10/Python/3.3.2/lib/python3.3/site-packages/Bio/GenBank/Scanner.py", line 1436, in _feed_header_lines
    structured_comment_key = re.search(r"([^#]+){0}$".format(STRUCTURED_COMMENT_START), data).group(1)
AttributeError: 'NoneType' object has no attribute 'group'

Any help?

ADD REPLYlink modified 4 months ago • written 4 months ago by guillepalou40

Hi there, I have updated the script to explicitly specify the dtype. The updated version of the script is available to download from GitHub.
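For reference, the `DtypeWarning` in the traceback above is what pandas emits when it infers different types for the same column in different chunks of a file. A minimal sketch of the usual fix follows, assuming a tab-separated input; the function name `read_table_as_text` is hypothetical and this is not necessarily how the updated script does it.

```python
import pandas as pd


def read_table_as_text(path_or_handle):
    """Read a tab-separated table with every column forced to string.

    Passing dtype explicitly stops pandas from guessing types per chunk,
    which is what triggers the mixed-types DtypeWarning.
    """
    return pd.read_csv(path_or_handle, sep="\t", dtype=str)
```

Passing `low_memory=False` instead, as the warning message itself suggests, is the other common option; it lets pandas infer types from the whole column at the cost of more memory.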

ADD REPLYlink written 4 months ago by Sej Modha3.1k

Thank you for the help!

ADD REPLYlink written 4 months ago by guillepalou40