Extract Sequence From Fasta File Using Ids From A Separate File
7
3
Entering edit mode
8.5 years ago
IsmailM ▴ 110

Hi guys,

As shown below, I have two files a fasta file (containing a sequences and sequence IDs) and a text file containing a list of sequence ID.

I would like to extract the respective sequence for each of the sequence IDs in the text file. I have tried a number of times, but I simply cannot even open the fasta file - I'm a total beginner with regards to bioruby - if you could explain any answers in detail, I would be highly grateful.

EDIT: Could you give your answers using Ruby or BioRuby. Thanks

e.g. EXAMPLE FASTA FILE

>CATH_RAT
MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRN
HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCW
>CATH_MOUSE
TFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNG
QCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHA
>CATH_HUMAN
MNPTLILAAFCLGIASATLTFDHSLEAQWTKWKAMHNRLYGMNEEGWRRAVWEKNMKMIELHNQEYREGK
HSFTMAMNAFGDMTSEEFRQVMNGFQNRKPRKGKVFQEPLFYEAPRSVDWREKGYVTPVKNQGQCGSCWA
>CATH_MONKEY
FSATGALEGQMFRKTGRLISLSEQNLVDCSGPQGNEGCNGGLMDYAFQYVQDNGGLDSEESYPYEATEES
CKYNPKYSVANDTGFVDIPKQEKALMKAVATVGPISVAIDAGHESFLFYKEGIYFEPDCSSEDMDHGVLV

EXAMPLE .TXT FILE

   >CATH_MOUSE
   >CATH_HUMAN

Many Thanks

fasta • 26k views
ADD COMMENT
3
Entering edit mode
6.5 years ago
IsmailM ▴ 110

This is what I ended up doing - by using 'pure' ruby.

require 'bio'
# create a hash of the Fasta file
def analyse_input_file(input_file)
  input_read = {}
  biofastafile = Bio::FlatFile.open(Bio::FastaFormat, input_file)
  biofastafile.each_entry do |entry|
    input_read[entry.entry_id] = entry.seq
  end
  input_read
end

# Iterate ID file and look up in hash
def iterate_id_file(id_file, input_read, output_file)
  ids = File.read(id_file)
  ids.each_line do |id|
    # remove the inital > and the end of line character.
    id = id.gsub!(/^>/, '').chomp
    seq = input_read[id]
    File.open(output_file, 'a+') do |f|
      f.puts ">#{id}"
      f.puts seq
    end
  end
end

# Set up
input_file = ARGV[0] # '/Volumes/Data/input_file.fa'
id_file = ARGV[1] # '/Volumes/Data/ids.txt'
output_file = "#{input_file}.out"

# Set up
input_read = analyse_input_file(input_file)
iterate_id_file(id_file, input_read, output_file)

Just a quick explanation on how the above works... (I know I would have found this really helpful, when I asked the question)

The main concept used here is the HASH (also known as a dictionary).

The basic arrangement of a Hash is as follows

Hash_name = { 'key_1' => 'value_1', 'key_2' => 'value_2'  }

This is also written as:

Hash_name = { key_1: 'value_1', key_2: 'value_2'}

The main benefit of Hashes is that you can quickly call a value by passing the respective key as follows

puts Hash_name[key_1]
# => value_1

puts Hash_name[key_2]
# => value_2

In the analyse_input_file method, we created a hash where the key was the sequence id (i.e. entry.entry_id) and the sequence itself was the value (i.e. entry.seq)

Then in the iterate_id_file, we simple passed the id to the HASH to get the corresponding sequence....

Simples...

ADD COMMENT
28
Entering edit mode
8.5 years ago
matted 7.5k

Why does it need to be Ruby?

Here's a quick shell command using samtools (better than a buggy one-off Perl/Python/Ruby script in my opinion...):

cut -c 2- EXAMPLE.TXT | xargs -n 1 samtools faidx EXAMPLE.FA
ADD COMMENT
0
Entering edit mode

how should we extract the multiple sequence in fasta format by using the gene_id through samtools?

ADD REPLY
0
Entering edit mode

@Bulbul Did you find the solution?

ADD REPLY
0
Entering edit mode

@matted Would you please elaborate?

ADD REPLY
4
Entering edit mode
8.3 years ago
AGS ▴ 230

The best tool for this, imo, is Jim Kent's faSomeRecords.

While the OP wanted a Ruby-based solution, I think listing this answer for future reference is worthwhile.

---> faSomeRecords 
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.
ADD COMMENT
0
Entering edit mode

Where can I find this software?

ADD REPLY
0
Entering edit mode

It's on UCSC

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

you can install it via anaconda: conda install ucsc-fasomerecords

ADD REPLY
1
Entering edit mode
ADD COMMENT
0
Entering edit mode

Many thanks is it possible to do something similar in ruby???

It just that I don't really know any python

ADD REPLY
1
Entering edit mode
8.5 years ago
Fedor Gusev ▴ 210

Here is a script I use:

#!/usr/bin/env ruby

require 'bio'
require 'trollop'

opts = Trollop::options do
  opt :get, "Contigs to get", :type => :string, :required => true
end

get = File.readlines(opts[:get]).map { |x| x.chomp }

Bio::FlatFile.foreach(ARGF) do |entry|
  id = entry.definition.split.first # no idea why I did not use entry.entry_id
  puts entry if get.include?(id)
end

For your data you can run it like this:

ruby ./fasta-get-contigs -g <(cat example.txt | sed 's/^>//') example.fasta > result.fasta

Or you can remove leading > in the script.

But I second samtools faidx answer from @matted: from my experience, fasta parser in bioruby is too slow.

ADD COMMENT
1
Entering edit mode
4.4 years ago
BPors ▴ 50

If you had only the sequences in the example.txt instead of the identifiers, how would you grab the identifiers and their sequences? ( in this case, the purpose was retrieving the sequences with identifiers, what I am asking is retrieving headers and sequencers with known sequences

ADD COMMENT
0
Entering edit mode

In that case you can use BBMap's filterbysequence.sh:

filterbysequence.sh in=a.fa ref=b.fa out=filtered.fa

However, the files have to be valid fasta or fastq.

ADD REPLY
0
Entering edit mode
8.5 years ago
always_learning ★ 1.1k
           #use/bin/perl 
           use Data::Dumper;
           %val=();
           open (IN1,"$ARGV[0]");
           open (IN2,"id.txt");
            while(<IN1>){
       chomp $_;
       @file= split(">",$_);
    $file[0]=~ s/\s*$//g;
    if ($file[0] ne ''){
    $seq = $file[0];
    }
    $file[1]=~ s/\s*$//g;
     if ($file[1] ne ''){
     $id = $file[1];
}
$val{$id} = $seq;
}
    while(<IN2>){
        $one = $_;
        $one =~ s/\s*$//g;
        print "$one\n";
        print "\n";
        #print Dumper $one;
        print  "$val{$one}\n";
            }

Run this perl file.pl fasta.txt

ADD COMMENT
0
Entering edit mode

Another Perl solution:

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use extractSeq.pl LIST FASTA OUT\n";

my $list = shift @ARGV;
my $fasta = shift @ARGV;
my $out = shift @ARGV;
my %select;
open L, "$list" or die;
while (<L>) {
    chomp;
    s/>//g;
    $select{$_} = 1;
}
close L;

$/ = "\n>";
open O, ">$out" or die;
open F, "$fasta" or die;
while (<F>) {
    s/>//g;
    my ($id) = split (/\n/, $_);
    print O ">$_" if (defined $select{$id});
}
close F;
close O;
ADD REPLY

Login before adding your answer.

Traffic: 1858 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6