7 weeks ago
Theresa ▴ 10

Hello everyone,

I am trying to retrieve FASTA sequences from the KEGG database to build a custom database based on KO numbers. Since most KO numbers have many linked sequences, I need an automated way to copy the sequences into one combined FASTA file. To write the Ruby script I followed this post: Is There Any Way To Retrieve Genes' Sequences In Fasta Format Using The Kegg Orthology Code?

I adapted the code a bit (Ruby):

#!/usr/bin/ruby
require "rubygems"
require "mechanize"

# fetch gene list for K01505
agent = Mechanize.new
#puts page.title


This part runs, so the code reaches the correct website. However, I am not sure whether the links on the page are stored correctly in the array created with "links = []".

# get links to each gene page
end
#puts page.uri
end


Here I am not 100% sure what the code does, but when I enable "page = link.click" and "puts page.uri", the URL of each link is printed in my shell, which I took to mean that the script actually follows the links.

# fetch and print out FASTA
fasta = agent.get(url)
puts (fasta/"//pre").inner_text
end


This part should create a FASTA file with the retrieved sequences. Although I don't get an error message, I also don't get the file. Do I need to specify an output folder or something else?

I hope someone can help me out. Thank you!

KEGG Ruby database • 365 views
5 weeks ago
kojix2 ▴ 220

Hi Theresa!

I am surprised that there are still people trying to do bioinformatics with Ruby. Besides me.

Perhaps

if link.uri.to_s =~ /dbget-bin/
end


is not working as expected, because link.uri.to_s does not include dbget-bin.
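One way to check this is to print the strings being tested and see which ones the pattern keeps. A small offline sketch, where the hrefs are made-up stand-ins for the `link.uri.to_s` values and `select_matching` is a hypothetical helper (the real page may look quite different):

```ruby
# Made-up hrefs standing in for link.uri.to_s values; not taken from KEGG.
hrefs = [
  "/dbget-bin/www_bget?aaa:gene1",
  "/entry/aaa:gene2",
  "https://www.example.org/help.html"
]

# Keep only the hrefs whose text matches the given pattern.
def select_matching(hrefs, pattern)
  hrefs.select { |href| href.to_s =~ pattern }
end

matches = select_matching(hrefs, /dbget-bin/)

# Print every candidate first, then the ones the pattern keeps.
hrefs.each { |href| puts "candidate: #{href}" }
puts "#{matches.size} of #{hrefs.size} hrefs match /dbget-bin/"
```

If the count is zero on the real page, the printed candidates show what the pattern would have to look like instead.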

I think what you are trying to do is scraping. Scraping is an important technique, but it is complicated to perform. Here I recommend using TogoWS.

http://togows.dbcls.jp/

#! /usr/bin/env ruby

require 'open-uri'
require 'json'
require 'optparse'
require 'tty-progressbar'

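# Return the genes linked to a KO entry as nested arrays of "organism:gene" IDs.
def ko2genes(koid)
  url = "http://togows.org/entry/kegg-orthology/#{koid}.json"
  tf = URI.open(url)
  ko = JSON.parse(tf.read) # parse the response; `ko` was otherwise undefined
  ko[0]['genes'].map do |k, v|
    v.map do |i|
      k + ':' + i
    end
  end
end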

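# Fetch nucleotide sequences in batches of n genes, pausing between requests.
def gene2ntseq(genes, n = 20, interval: 1)
  bar = TTY::ProgressBar.new('gene2ntseq [:bar] :current/:total :percent ET::elapsed ETA::eta :rate/s',
                             total: genes.size)
  genes.each_slice(n).map do |s_genes|
    gs = s_genes.join(',')
    url = "http://togows.org/entry/kegg-genes/#{gs}/ntseq.json"
    tf = URI.open(url)
    ary = JSON.parse(tf.read) # parse the response; `ary` was otherwise undefined
    raise if ary.size != s_genes.size
    result = s_genes.zip(ary)
    bar.advance(s_genes.size) # otherwise the progress bar never moves
    sleep(interval)
    result
  end.flatten(1)
end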

opt = OptionParser.new
@n = 20
@interval = 1.0
opt.banner = "Usage: ruby #{$0} [options] <ko>"
# each_slice needs an integer batch size
opt.on('-n INT', Integer, 'number of sequences to fetch at one time [20]') { |v| @n = v }
opt.on('-i', '--interval SEC', Float, 'interval to connect to server (seconds) [1.0]') { |v| @interval = v }
opt.parse!(ARGV)
if ARGV.empty?
  puts opt.help
  exit
end

# get all genes
genes = ko2genes(ARGV[0]).flatten

# get all seqs
seqs = gene2ntseq(genes, @n, interval: @interval)

# output as FASTA
seqs.each do |n, s|
  puts ">#{n}"
  puts s.scan(/.{1,80}/)
end


Use with caution: the script has not been tested, so bugs may well remain.
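The data handling can at least be checked offline, without contacting TogoWS. The sketch below repeats the three reshaping steps of the script on made-up input. The hash shape (organism code mapped to a list of gene names) is an assumption inferred from how the script iterates over `ko[0]['genes']`, and `sample_genes` and `wrap_sequence` are hypothetical names:

```ruby
# Made-up sample in the shape ko[0]['genes'] is assumed to have:
# organism code => list of gene names.
sample_genes = {
  "aaa" => ["gene1", "gene2"],
  "bbb" => ["gene3"]
}

# Same reshaping as ko2genes followed by flatten: "organism:gene" strings.
ids = sample_genes.map { |org, names| names.map { |name| "#{org}:#{name}" } }.flatten

# Same batching as gene2ntseq: comma-joined groups of at most 2 IDs per request.
batches = ids.each_slice(2).map { |group| group.join(",") }

# Same wrapping as the FASTA output step: at most `width` characters per line.
def wrap_sequence(seq, width = 80)
  seq.scan(/.{1,#{width}}/)
end

puts ids.inspect
puts batches.inspect
puts wrap_sequence("acgt" * 50).map(&:length).inspect
```

This only exercises the pure Ruby parts. The TogoWS responses themselves still need to be checked against a real request.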

ruby this_script.rb K01505 > k01505.fasta

ruby this_script.rb -n 40 K01505 > k01505.fasta # faster
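The `>` redirection matters because `puts` only writes to standard output, so no file appears unless the shell saves that output or the script opens a file itself. A minimal sketch of the second option, assuming the sequences are already available as name/sequence pairs; `write_fasta`, the temporary path, and the sample record are placeholders, not part of the script above:

```ruby
require "tmpdir"

# Write [name, sequence] pairs to a FASTA file, wrapping sequences at `width`.
def write_fasta(path, seqs, width = 80)
  File.open(path, "w") do |f|
    seqs.each do |name, seq|
      f.puts ">#{name}"
      f.puts seq.scan(/.{1,#{width}}/) # puts writes each array element on its own line
    end
  end
end

# Placeholder record and output location, for illustration only.
out_path = File.join(Dir.tmpdir, "example.fasta")
write_fasta(out_path, [["aaa:gene1", "acgt" * 25]])
puts File.read(out_path)
```

Shell redirection keeps the script simpler. Writing the file from Ruby is mainly useful when the progress output and the sequences should not end up in the same stream.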


I think what you are trying to do is scraping. Scraping is an important technique, but it is complicated to perform.

Bulk downloads from the KEGG database require a subscription. Your script may work, but it could get the user's IP address permanently banned if the KEGG staff detect the scraping.


My script uses TogoWS and does not access KEGG. This is what I meant. Please look at the code.


I am not familiar with Ruby or the tool you mention above. I just wanted to point out that any bulk download, direct or indirect, would likely be noticed by KEGG.


Thanks. Your point about caution in using KEGG is correct, but since we are only downloading a small portion of the data, I don't think it is a problem. TogoWS is a tool provided by the Database Center for Life Science. TogoWS caches KEGG, so it does not overload KEGG. Besides, frequent access to TogoWS automatically slows it down, so it is impossible to download large amounts of data.