Question: Extract Sequence From Fasta File Using Ids From A Separate File
gravatar for IsmailM
6.7 years ago by
UK/London/University College London
IsmailM110 wrote:

Hi guys,

As shown below, I have two files a fasta file (containing a sequences and sequence IDs) and a text file containing a list of sequence ID.

I would like to extract the respective sequence for each of the sequence IDs in the text file. I have tried a number of times, but I simply cannot even open the fasta file - I'm a total beginner with regards to bioruby - if you could explain any answers in detail, I would be highly grateful.

EDIT: Could you give your answers using Ruby or BioRuby. Thanks





Many Thanks

fasta • 23k views
ADD COMMENTlink modified 2.6 years ago by BPors40 • written 6.7 years ago by IsmailM110
gravatar for IsmailM
4.7 years ago by
UK/London/University College London
IsmailM110 wrote:

This is what I ended up doing - by using 'pure' ruby.

require 'bio'
# create a hash of the Fasta file
def analyse_input_file(input_file)
  input_read = {}
  biofastafile =, input_file)
  biofastafile.each_entry do |entry|
    input_read[entry.entry_id] = entry.seq

# Iterate ID file and look up in hash
def iterate_id_file(id_file, input_read, output_file)
  ids =
  ids.each_line do |id|
    # remove the inital > and the end of line character.
    id = id.gsub!(/^>/, '').chomp
    seq = input_read[id], 'a+') do |f|
      f.puts ">#{id}"
      f.puts seq

# Set up
input_file = ARGV[0] # '/Volumes/Data/input_file.fa'
id_file = ARGV[1] # '/Volumes/Data/ids.txt'
output_file = "#{input_file}.out"

# Set up
input_read = analyse_input_file(input_file)
iterate_id_file(id_file, input_read, output_file)

Just a quick explanation on how the above works... (I know I would have found this really helpful, when I asked the question)

The main concept used here is the HASH (also known as a dictionary).

The basic arrangement of a Hash is as follows

Hash_name = { 'key_1' => 'value_1', 'key_2' => 'value_2'  }

This is also written as:

Hash_name = { key_1: 'value_1', key_2: 'value_2'}

The main benefit of Hashes is that you can quickly call a value by passing the respective key as follows

puts Hash_name[key_1]
# => value_1

puts Hash_name[key_2]
# => value_2

In the analyse_input_file method, we created a hash where the key was the sequence id (i.e. entry.entry_id) and the sequence itself was the value (i.e. entry.seq)

Then in the iterate_id_file, we simple passed the id to the HASH to get the corresponding sequence....


ADD COMMENTlink modified 5 months ago by RamRS26k • written 4.7 years ago by IsmailM110
gravatar for matted
6.7 years ago by
Boston, United States
matted7.2k wrote:

Why does it need to be Ruby?

Here's a quick shell command using samtools (better than a buggy one-off Perl/Python/Ruby script in my opinion...):

cut -c 2- EXAMPLE.TXT | xargs -n 1 samtools faidx EXAMPLE.FA
ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by matted7.2k

how should we extract the multiple sequence in fasta format by using the gene_id through samtools?

ADD REPLYlink written 3.9 years ago by Bulbul Ahmed20

@Bulbul Did you find the solution?

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by bioinfo8120

@matted Would you please elaborate?

ADD REPLYlink written 2.6 years ago by bioinfo8120
gravatar for AGS
6.5 years ago by
Brooklyn, ny
AGS230 wrote:

The best tool for this, imo, is Jim Kent's faSomeRecords.

While the OP wanted a Ruby-based solution, I think listing this answer for future reference is worthwhile.

---> faSomeRecords 
faSomeRecords - Extract multiple fa records
   faSomeRecords in.fa listFile out.fa
   -exclude - output sequences not in the list file.
ADD COMMENTlink written 6.5 years ago by AGS230

Where can I find this software?

ADD REPLYlink written 4.8 years ago by Endre Bakken Stovner880

It's on UCSC

ADD REPLYlink written 4.8 years ago by PoGibas4.8k

ADD REPLYlink written 3.1 years ago by Ian5.6k

you can install it via anaconda: conda install ucsc-fasomerecords

ADD REPLYlink written 2.9 years ago by Ă–mer An200
gravatar for Ashutosh Pandey
6.7 years ago by
Ashutosh Pandey12k wrote:

This post is similar. You may have to tweak your code a little.

ADD COMMENTlink written 6.7 years ago by Ashutosh Pandey12k

Many thanks is it possible to do something similar in ruby???

It just that I don't really know any python

ADD REPLYlink written 6.7 years ago by IsmailM110
gravatar for Fedor Gusev
6.7 years ago by
Fedor Gusev210
Russian Federation
Fedor Gusev210 wrote:

Here is a script I use:

#!/usr/bin/env ruby

require 'bio'
require 'trollop'

opts = Trollop::options do
  opt :get, "Contigs to get", :type => :string, :required => true

get = File.readlines(opts[:get]).map { |x| x.chomp }

Bio::FlatFile.foreach(ARGF) do |entry|
  id = entry.definition.split.first # no idea why I did not use entry.entry_id
  puts entry if get.include?(id)

For your data you can run it like this:

ruby ./fasta-get-contigs -g <(cat example.txt | sed 's/^>//') example.fasta > result.fasta

Or you can remove leading > in the script.

But I second samtools faidx answer from @matted: from my experience, fasta parser in bioruby is too slow.

ADD COMMENTlink written 6.7 years ago by Fedor Gusev210
gravatar for BPors
2.6 years ago by
BPors40 wrote:

If you had only the sequences in the example.txt instead of the identifiers, how would you grab the identifiers and their sequences? ( in this case, the purpose was retrieving the sequences with identifiers, what I am asking is retrieving headers and sequencers with known sequences

ADD COMMENTlink written 2.6 years ago by BPors40

In that case you can use BBMap's in=a.fa ref=b.fa out=filtered.fa

However, the files have to be valid fasta or fastq.

ADD REPLYlink written 2.6 years ago by Brian Bushnell17k
gravatar for always_learning
6.7 years ago by
Doha, Qatar
always_learning1.0k wrote:
           use Data::Dumper;
           open (IN1,"$ARGV[0]");
           open (IN2,"id.txt");
       chomp $_;
       @file= split(">",$_);
    $file[0]=~ s/\s*$//g;
    if ($file[0] ne ''){
    $seq = $file[0];
    $file[1]=~ s/\s*$//g;
     if ($file[1] ne ''){
     $id = $file[1];
$val{$id} = $seq;
        $one = $_;
        $one =~ s/\s*$//g;
        print "$one\n";
        print "\n";
        #print Dumper $one;
        print  "$val{$one}\n";

Run this perl fasta.txt

ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by always_learning1.0k

Another Perl solution:


use strict;
use warnings;

$ARGV[2] or die "use LIST FASTA OUT\n";

my $list = shift @ARGV;
my $fasta = shift @ARGV;
my $out = shift @ARGV;
my %select;
open L, "$list" or die;
while (<L>) {
    $select{$_} = 1;
close L;

$/ = "\n>";
open O, ">$out" or die;
open F, "$fasta" or die;
while (<F>) {
    my ($id) = split (/\n/, $_);
    print O ">$_" if (defined $select{$id});
close F;
close O;
ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by JC9.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2347 users visited in the last hour