Download specific sequences from databases using a Ruby-Script
20 months ago
Theresa ▴ 10

Hello everyone,

I am trying to retrieve FASTA sequences from the KEGG database to build a customized database based on KO numbers. As most KO numbers have many linked sequences, I need an automated approach that copies the sequences into a single combined FASTA file. To write the Ruby script, I followed this post: Is There Any Way To Retrieve Genes' Sequences In Fasta Format Using The Kegg Orthology Code?

I adapted the code a bit (Ruby):

#!/usr/bin/ruby
require "rubygems"
require "mechanize"

# fetch gene list for K01505 
agent = Mechanize.new 
page  = agent.get("http://www.genome.jp/dbget-bin/get_linkdb?-t+genes+ko:K01505")
#puts page.title
links = []

This part runs, meaning that the code reaches the correct website. However, I am not sure whether the links on the page are stored correctly in the array created with "links = []".

# get links to each gene page
page.links.each do |link|
  if link.uri.to_s =~ /dbget-bin/
    links << link
  end
  #page = link.click
  #puts page.uri
end

Here, I am not 100% sure what the code does, but when I enable "page = link.click" and "puts page.uri" I get the url of each link pasted in my shell, which I thought shows that the script actually enters the links.

# fetch and print out FASTA 
links.each do |link|
  url   = "http://www.genome.jp/dbget-bin/www_bget?-f+-n+n+#{link.text}"
  fasta = agent.get(url)
  puts (fasta/"//pre").inner_text
end

This part should create a FASTA file with the retrieved sequences. Although I don't get an error message, I also don't get the file. Do I need to add an output folder or something else?

I hope someone can help me out. Thank you!

KEGG Ruby database • 950 views
20 months ago
kojix2 ▴ 250

Hi Theresa!

I am surprised that there are still people trying to do bioinformatics with Ruby. Besides me.

Perhaps

if link.uri.to_s =~ /dbget-bin/
  links << link
end

is not working as expected, because link.uri.to_s does not include dbget-bin.
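One way to check this is to look at the strings the pattern is actually compared against. The sketch below is only an illustration. The hrefs are made up, not copied from the KEGG page, and it assumes that link.uri reflects the href as written in the HTML, which may be a relative path. The helper names matching_hrefs and absolute_urls are hypothetical.

```ruby
require 'uri'

# Hypothetical hrefs standing in for page.links.map { |l| l.uri.to_s };
# the real page may contain something different.
SAMPLE_HREFS = [
  '/entry/abc:0001',
  '/dbget-bin/www_bget?abc:0002',
  'https://example.org/help'
].freeze

# URL of the page the links were taken from (same as in the question).
BASE = URI('http://www.genome.jp/dbget-bin/get_linkdb?-t+genes+ko:K01505')

# Keep only the hrefs that match the pattern, as the loop in the question does.
def matching_hrefs(hrefs, pattern)
  hrefs.select { |href| href.match?(pattern) }
end

# Resolve each href against the page URL to see its absolute form.
def absolute_urls(base, hrefs)
  hrefs.map { |href| base.merge(href).to_s }
end

puts matching_hrefs(SAMPLE_HREFS, /dbget-bin/)
puts absolute_urls(BASE, SAMPLE_HREFS)
```

In the real script, printing link.uri.to_s for every link before filtering would show in the same way whether any of them contain dbget-bin.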

I think what you are trying to do is scraping. Scraping is an important technique, but it is complicated to perform. Here I recommend using TogoWS.

http://togows.dbcls.jp/


#! /usr/bin/env ruby

require 'open-uri'
require 'json'
require 'optparse'
require 'tty-progressbar'

def ko2genes(koid)
  url = "http://togows.org/entry/kegg-orthology/#{koid}.json"
  tf = URI.open(url)
  ko = JSON.parse(tf.read)
  ko[0]['genes'].map do |k, v|
    v.map do |i|
      k + ':' + i
    end
  end
end

def gene2ntseq(genes, n = 20, interval: 1)
  bar = TTY::ProgressBar.new('gene2ntseq [:bar] :current/:total :percent ET::elapsed ETA::eta :rate/s',
                             total: genes.size)
  genes.each_slice(n).map do |s_genes|
    gs = s_genes.join(',')
    url = "http://togows.org/entry/kegg-genes/#{gs}/ntseq.json"
    tf = URI.open(url)
    ary = JSON.parse(tf.read)
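    raise "expected #{s_genes.size} sequences, got #{ary.size}" if ary.size != s_genes.size
    result = s_genes.zip(ary)
    sleep(interval)
    bar.advance(s_genes.size) # the last batch may be smaller than n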
    result
  end.flatten(1)
end

opt = OptionParser.new
@n = 20
@interval = 1.0
opt.banner = "Usage: ruby #{$0} [options] <ko>"
opt.on('-n INT', Integer, 'number of sequences to fetch at one time [20]') { |v| @n = v }
opt.on('-i', '--interval SEC', Float, 'interval between requests to the server (seconds) [1.0]') { |v| @interval = v }
opt.parse!(ARGV)
if ARGV.empty?
  puts opt.help 
  exit
end

# get all genes
genes = ko2genes(ARGV[0]).flatten

# get all seqs
seqs = gene2ntseq(genes, @n, interval: @interval)

# output as FASTA
seqs.each do |n, s|
  puts ">#{n}"
  puts s.scan(/.{1,80}/)
end

Use with caution, as there is a good chance that bugs may remain in the script as it has not been tested.
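To make the batching in gene2ntseq easier to follow, here is a minimal sketch of just the URL construction, without any network access. It reuses the URL pattern from the script above. The helper name batch_urls and the gene IDs are hypothetical.

```ruby
# Group gene IDs into batches of n and build one request URL per batch,
# following the URL pattern used by gene2ntseq in the script above.
def batch_urls(genes, n = 20)
  genes.each_slice(n).map do |batch|
    "http://togows.org/entry/kegg-genes/#{batch.join(',')}/ntseq.json"
  end
end

# Made-up gene IDs, for illustration only.
genes = %w[abc:0001 abc:0002 abc:0003 abc:0004 abc:0005]
batch_urls(genes, 2).each { |url| puts url }
```

With five IDs and n = 2, the last batch holds a single ID, so a larger n means fewer requests but the final request can be shorter than the others.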

ruby this_script.rb K01505 > k01505.fasta

ruby this_script.rb -n 40 K01505 > k01505.fasta # faster
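On the question of why no file appears: puts only writes to standard output, so it is the shell redirection (> k01505.fasta) in the commands above that creates the file. If you would rather write the file from Ruby, something like the following sketch might work. The helpers format_fasta and write_fasta are hypothetical, not part of the script, and the sequences are made-up data.

```ruby
require 'tmpdir'

# Format [name, sequence] pairs as FASTA text, wrapping sequences at `width`.
def format_fasta(seqs, width = 80)
  seqs.map do |name, seq|
    ">#{name}\n" + seq.scan(/.{1,#{width}}/).join("\n") + "\n"
  end.join
end

# Write the FASTA text to `path` (hypothetical helper, not part of the script).
def write_fasta(seqs, path)
  File.write(path, format_fasta(seqs))
end

# Made-up sequences: one of 100 bases (wraps onto two lines) and one short one.
seqs = [['abc:0001', 'atgc' * 25], ['abc:0002', 'ggcc']]
path = File.join(Dir.tmpdir, 'example.fasta')
write_fasta(seqs, path)
puts File.read(path)
```

Redirecting standard output keeps the script more flexible, since the same output can also be piped into other tools.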


"I think what you are trying to do is scraping. Scraping is an important technique, but it is complicated to perform."

KEGG database bulk downloads require a subscription. While your script may work, it may result in a permanent ban of the user's IP address if the KEGG folks detect the scraping.


My script uses TogoWS and does not access KEGG. This is what I meant. Please look at the code.


I am not familiar with Ruby or the tool you mention above. I just wanted to point out that any bulk download, direct or indirect, would likely be noticed by KEGG.


Thanks. Your point about caution in using KEGG is correct, but since we are only downloading a small portion of the data, I don't think it is a problem. TogoWS is a tool provided by the Database Center for Life Science. TogoWS caches KEGG, so it does not overload KEGG. Besides, frequent access to TogoWS automatically slows it down, so it is impossible to download large amounts of data.
