Question

Download specific sequences from databases using a Ruby-Script

20 months ago

Theresa ▴ 10

Hello everyone,

I am trying to retrieve FASTA sequences from the KEGG database to build a customized database based on KO numbers. Since most KO numbers have many linked sequences, I need an automated way to collect the sequences into one combined FASTA file. To write the Ruby script, I followed this post: Is There Any Way To Retrieve Genes' Sequences In Fasta Format Using The Kegg Orthology Code?

I adapted the code a bit (Ruby):

#!/usr/bin/ruby
require "rubygems"
require "mechanize"

# fetch gene list for K01505 
agent = Mechanize.new 
page  = agent.get("http://www.genome.jp/dbget-bin/get_linkdb?-t+genes+ko:K01505")
#puts page.title
links = []

This part runs, meaning that the code reaches the correct website. However, I am not sure whether the links on the website are stored correctly in the array created with "links = []".

# get links to each gene page
page.links.each do |link|
  if link.uri.to_s =~ /dbget-bin/
    links << link
  end
  #page = link.click
  #puts page.uri
end

Here, I am not 100% sure what the code does, but when I enable "page = link.click" and "puts page.uri", the URL of each link is printed in my shell, which I took to mean that the script actually follows the links.

# fetch and print out FASTA
links.each do |link|
  url   = "http://www.genome.jp/dbget-bin/www_bget?-f+-n+n+#{link.text}"
  fasta = agent.get(url)
  puts (fasta/"//pre").inner_text
end

This part should create a FASTA file with the retrieved sequences. Although I don't get an error message, I also don't get the file. Do I need to add an output folder or something else?

I hope someone can help me out. Thank you!

KEGG Ruby database • 950 views

updated 20 months ago by GenoMax 141k • written 20 months ago by Theresa ▴ 10

score 0 · Answer 1 · 2022-08-19

Hi Theresa!

I am surprised that there are still people trying to do bioinformatics with Ruby. Besides me.

Perhaps

if link.uri.to_s =~ /dbget-bin/
  links << link
end

is not working as expected, because link.uri.to_s does not include dbget-bin.
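One way to check this is to print every link URI before filtering and see which ones match the pattern. Here is a minimal offline sketch that uses plain strings in place of Mechanize links. The sample hrefs and the helper name filter_hrefs are made up for illustration and are not taken from the actual KEGG page:

```ruby
# Hypothetical hrefs standing in for page.links.map { |l| l.uri.to_s };
# the real page may use different paths.
SAMPLE_HREFS = [
  "/entry/hsa:12345",
  "/dbget-bin/www_bget?eco:b0001",
  "",
  "https://www.kegg.jp/"
].freeze

# Return only the hrefs that match the given pattern.
def filter_hrefs(hrefs, pattern)
  hrefs.select { |href| href.match?(pattern) }
end

# Inspect what is really there before trusting the filter.
SAMPLE_HREFS.each { |href| puts href.inspect }

matched = filter_hrefs(SAMPLE_HREFS, /dbget-bin/)
warn "no link matched; adjust the pattern" if matched.empty?
puts "#{matched.size} of #{SAMPLE_HREFS.size} links matched"
```

If the real list contains no match, the links array stays empty and the later loop never runs, which would explain getting no output and no error.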

I think what you are trying to do is scraping. Scraping is an important technique, but it is complicated to perform. Here I recommend using TogoWS.

http://togows.dbcls.jp/


#! /usr/bin/env ruby

require 'open-uri'
require 'json'
require 'optparse'
require 'tty-progressbar'

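# Look up a KO entry and return its genes as nested arrays of "org:gene" IDs.
def ko2genes(koid)
  url = "http://togows.org/entry/kegg-orthology/#{koid}.json"
  tf = URI.open(url)
  ko = JSON.parse(tf.read)
  # 'genes' maps an organism code to its list of gene IDs
  ko[0]['genes'].map do |k, v|
    v.map do |i|
      k + ':' + i
    end
  end
end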

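# Fetch nucleotide sequences in batches of n genes per request.
def gene2ntseq(genes, n = 20, interval: 1)
  bar = TTY::ProgressBar.new('gene2ntseq [:bar] :current/:total :percent ET::elapsed ETA::eta :rate/s',
                             total: genes.size)
  genes.each_slice(n).map do |s_genes|
    gs = s_genes.join(',')
    url = "http://togows.org/entry/kegg-genes/#{gs}/ntseq.json"
    tf = URI.open(url)
    ary = JSON.parse(tf.read)
    # every requested gene must come back with exactly one sequence
    raise "expected #{s_genes.size} sequences, got #{ary.size}" if ary.size != s_genes.size
    result = s_genes.zip(ary)
    sleep(interval) # pause between requests
    bar.advance(s_genes.size) # the last batch may hold fewer than n genes
    result
  end.flatten(1)
end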

opt = OptionParser.new
@n = 20
@interval = 1.0
opt.banner = "Usage: ruby #{$0} [options] <ko>"
opt.on('-n INT', Integer, 'number of sequences to fetch at one time [20]') { |v| @n = v }
opt.on('-i SEC', '--interval SEC', Float, 'interval between requests to the server (seconds) [1.0]') { |v| @interval = v }
opt.parse!(ARGV)
if ARGV.empty?
  puts opt.help 
  exit
end

# get all genes
genes = ko2genes(ARGV[0]).flatten

# get all seqs
seqs = gene2ntseq(genes, @n, interval: @interval)

# output as FASTA
seqs.each do |n, s|
  puts ">#{n}"
  puts s.scan(/.{1,80}/)
end

Use with caution: the script has not been tested, so bugs may remain.
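The parts of the script that do not touch the network can be checked offline. The sketch below repeats the ID-joining step from ko2genes and the FASTA formatting from the final loop as two small functions. The function names and the sample hash are made up for illustration, and the hash only assumes the organism-to-gene-list shape that the script itself relies on.

```ruby
# Turn {"org" => ["gene1", "gene2"]} into ["org:gene1", "org:gene2"],
# mirroring what ko2genes followed by .flatten does in the script.
def join_gene_ids(genes_by_org)
  genes_by_org.flat_map do |org, ids|
    ids.map { |id| "#{org}:#{id}" }
  end
end

# Format one record as FASTA with sequence lines of at most `width` characters.
def to_fasta(name, seq, width = 80)
  ([">#{name}"] + seq.scan(/.{1,#{width}}/)).join("\n")
end

# Made-up sample data, not real KEGG entries.
sample = { "aaa" => ["G1", "G2"], "bbb" => ["G3"] }
p join_gene_ids(sample)
puts to_fasta("aaa:G1", "acgt" * 25, 60)
```

Keeping these steps as pure functions makes it possible to test them without waiting on the server.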

ruby this_script.rb K01505 > k01505.fasta

ruby this_script.rb -n 40 K01505 > k01505.fasta # faster
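The redirection (>) is what creates the file: puts only writes to standard output, which is also why the script in the question prints to the shell without producing a file. If writing the file from inside Ruby is preferred, a sketch along these lines could replace the final loop. Here write_fasta and the output path are hypothetical names, and the input is assumed to be an array of [name, sequence] pairs like the one the script builds.

```ruby
require "tmpdir"

# Write [name, sequence] pairs to `path` as FASTA, wrapping at 80 columns.
def write_fasta(path, seqs)
  File.open(path, "w") do |out|
    seqs.each do |name, seq|
      out.puts ">#{name}"
      out.puts seq.scan(/.{1,80}/)
    end
  end
end

# Made-up records for illustration only.
demo_path = File.join(Dir.tmpdir, "demo_k01505.fasta")
write_fasta(demo_path, [["aaa:G1", "acgt" * 30], ["bbb:G3", "ttga"]])
puts File.read(demo_path)
```

In the script itself, the call would be something like write_fasta("k01505.fasta", seqs) after gene2ntseq returns.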