Question: Extracting Sequence From A 3Gb Fasta File
14
gravatar for Divya
9 months ago by
Divya220
Divya220 wrote:

Hi,
How to extract fasta sequence from an huge 3gb fasta file by giving sequence id as input using perl, Thanks in advance.

ADD COMMENTlink modified 9 months ago by Biostar ♦♦ 0 • written 3.9 years ago by Divya220
1

I've modified your original question, as it was not very clear. You should put an example of your input file and an example of your output. Is your input file a fasta file?

ADD REPLYlink written 3.9 years ago by Giovanni M Dall'Olio17k
22
gravatar for Neilfws
3.9 years ago by
Neilfws41k
Sydney, Australia
Neilfws41k wrote:

If I read your question correctly, you have many sequences in a large file and you want to retrieve certain sequences by some ID.

One way to do this using Perl is first to index the file. If you install Bioperl and its accessory scripts, you can do this using bp_index.pl:

This example assumes that your fasta sequences are in myfile.fa in the current directory and you want to create the index file, myIndex, also in the current directory:

bp_index.pl -dir . -fmt fasta myIndex myfile.fa

You can then retrieve by ID using bp_fetch.pl. Assuming that you are in the same directory and you want the sequence with ID myID, something like:

bp_fetch.pl -dir . -fmt fasta myIndex:myID

It's been some time since I used these tools, so you should check the syntax and read up on them at the Bioperl website. In particular, it's possible to set up environment variables which point to a location with sequence indices, which can make life easier.

ADD COMMENTlink written 3.9 years ago by Neilfws41k

Hi Neilfws,

I used below cmd to extract RNA seq from fasta file using ID number from text file:

bp_index.pl -dir . -fmt fasta myIndex ./nr_latest.fasta
time bp_fetch.pl -dir . -fmt fasta myIndex:sequence.gi.txt

### Error: Sequence sequence.gi.txt in Database myIndex is not present.

Can you please give me any suggestion.

I would really appreciate your help.

Thanka in advance!

Naresh

ADD REPLYlink modified 4 months ago by Neilfws41k • written 4 months ago by nbvasani70
14
gravatar for Yahan
3.5 years ago by
Yahan290
Yahan290 wrote:

What about calling fastacmd through a system command?

First build the database using formatdb -i pathtodb -p F -o T

the last option is important.

Next you can use

fastacmd -d pathtodatabase -s seqID1,seqID2,...

option -L start,stop allows to retrieve subsequence

integrating in a system() command makes it useable in perl code.

ADD COMMENTlink written 3.5 years ago by Yahan290
1

just my 2cents..

fastacmd -d pathtodatabase -i list_of_seq_ids_to_search_in_a_file.txt

ADD REPLYlink written 2.9 years ago by Naga350

For me it was the first place to search for it but it has limited functionality due to the 1st line format requirement. It have to be not simply fasta file. See http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html#4.2.2

ADD REPLYlink written 2.3 years ago by Maciej Jończyk310

It seems that sequence header could not be in form:

GRMZM2G030710T01|PACid:19501430 When I changed it to GRMZM2G030710T01; PACid:19501430 this two commands works ok :-)

ADD REPLYlink written 2.3 years ago by Maciej Jończyk310
7
gravatar for Hamid Ashrafi
3.8 years ago by
Hamid Ashrafi70 wrote:

This Also works if you have BioPerl. You need to have the IDs that you want to extract them from fasta file in a separate file. You can use grep to generate that file.

#!/usr/bin/perl

#######################################################

#  Author: Hamid Ashrafi                              #
# email: ashrafi@ucdavis.edu
# Pupose: It reads the fasta file and another file with
# the IDs that you want to extract from the FASTA file
# then print the IDs and seqencing that are matched in
# the FASTA file.
#
#Requirement. Bio Perl
######################################################



use strict;
use Bio::DB::Fasta;

my $database;
my $fasta_library;
my %records;
open IDFILE, "NBS_LRR_IDs.txt" or die $!;
open OUTPUT, ">NC_003070_NBS-LRR.faa" or die $!;
#  name of the library file - (here it is hardcoded)
$fasta_library = 'NC_003070_all_chr.faa';

# creates the database of the library, based on the file
$database = Bio::DB::Fasta->new("$fasta_library") or die "Failed to creat Fasta DP object on fasta library\n";


# now, it parses the file with the fasta headers you want to get
while (<IDFILE>) {



      my ($id) = (/^>*(\S+)/);  # capture the id string (without the initial ">")
      my $header = $database->header($id);
      #print "$header\n";
      print ">$header\n", $database->seq( $id ), "\n";
      print OUTPUT  ">$header\n", $database->seq( $id ), "\n";
}

exit;
ADD COMMENTlink modified 3.8 years ago by Pierre Lindenbaum58k • written 3.8 years ago by Hamid Ashrafi70
6
gravatar for Cheng Zhongshan
3.9 years ago by
Cheng Zhongshan340 wrote:

Actually, I use this perl script to extract some genes from genome, but I don't know it would be working in your case.

#!/usr/bin/perl

use Bio::Perl;
use Bio::SeqIO;
use IO::String;
use Bio::SearchIO;

#you can input you genes' name in the array or read it from other files.

my @genes_name=qw(FGSG_01562 FGSG_02646 FGSG_05604 FGSG_05896 FGSG_06871 FGSG_07409  FGSG_07952 FGSG_08962 FGSG_09852 FGSG_09945 FGSG_10286 FGSG_11792);

my $filename='fusarium_graminearum_3_proteins.fasta';

my $gb = Bio::SeqIO->new(-file   => "<$filename",
                              -format => "fasta");
my $fa = Bio::SeqIO->new(-file   => ">gball.fa",
                              -format => "fasta",
                              -flush  => 0); # go as fast as we can!

while($seq = $gb->next_seq) {

    #Sorry! Here would be with problem, if we use this "if (grep {$_=$seq->id} @genes_name;"

    $fa->write_seq($seq) if (grep {$_ eq $seq->id} @genes_name);

}
ADD COMMENTlink modified 3.9 years ago by Istvan Albert ♦♦ 39k • written 3.9 years ago by Cheng Zhongshan340
6
gravatar for Hanif Khalak
3.9 years ago by
Hanif Khalak1.1k
Doha, QA
Hanif Khalak1.1k wrote:

If you can't install BioPerl, this vanilla Perl should work quite fast if you expect only one exact ID match:

#!/usr/bin/perl

# usage:  extractSeqByID.pl SEQ123 < huge.fsa > my.fsa

use warnings;
use strict;

my $lookup = shift @ARGV;  # ID to extract

local $/ = "\n>";  # read by FASTA record

while (my $seq = <>) {
    chomp $seq;
    my ($id) = $seq =~ /^>*(\S+)/;  # parse ID as first word in FASTA header
    if ($id eq $lookup) {
        $seq =~ s/^>*.+\n//;  # remove FASTA header
        $seq =~ s/\n//g;  # remove endlines
        print "$seq\n";
        last;
    }
}
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Hanif Khalak1.1k
2
gravatar for 4Urelie
3.2 years ago by
4Urelie20
4Urelie20 wrote:

I needed to do the same thing, but I want to use a file with all the IDs in it, that is taken as an argument of the program (at the end I want to create automatically IDfiles and to a pipeline) I am a perl beginner, so it might be not clean, but I modified the script above and in my hands it is working.

And, also, thank you for this script, H. Ashrafi!

#!/usr/bin/perl

#######################################################
# Author: Hamid Ashrafi                              
# email: ashrafi@ucdavis.edu
# Pupose: It reads the fasta file and another file with the IDs that you want to extract from the FASTA file
# then print the IDs and seqencing that are matched in the FASTA file.
#
# Requirement. Bio Perl
#
# Modified by Aurelie K
#   -> put the names of files in arguments (and also use the STDOUT for the output file)
#   -> suppr index file generated during the script
######################################################

#usage = path_to/fasta_getseq.pl path_to/fasta_library path_to/IDFILE > log

if ($#ARGV != 1) {
 print "usage:  fasta_getseq.pl IDFILE fasta_library > outputfile\n";
 exit;
}

use strict;
use Bio::DB::Fasta;

my $database;
my $fasta_library = $ARGV[1];   #path to fastalibrary in the second argument
my %records;

open IDFILE, "<$ARGV[0]" or die $!; #first argument is the path of the file containing all the IDs you need to extract
open OUTPUT, <STDOUT>;

# creates the database of the library, based on the file
$database = Bio::DB::Fasta->new("$fasta_library") or die "Failed to creat Fasta DP object on fasta library\n";

# now, it parses the file with the fasta headers you want to get
while (<IDFILE>) {

  my ($id) = (/^>*(\S+)/);  # capture the id string (without the initial ">")
  my $header = $database->header($id);
  #print "$header\n";
  print ">$header\n", $database->seq( $id ), "\n";
  print OUTPUT ">$header\n", $database->seq( $id ), "\n";
}

#remove the index file that is useless for user
unlink "$fasta_library.index";

#close the filehandles
close IDFILE;
close OUTPUT;
exit;
ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by 4Urelie20

Good!It worked very well!

ADD REPLYlink written 2.8 years ago by Haiping100
2
gravatar for AGS
19 months ago by
AGS140
Brooklyn, ny
AGS140 wrote:

Simplest and I'd wager the fastest method:

Use Jim Kent's utils.

Awesome Tools

Specifically, faSomeRecords:

---> faSomeRecords 
faSomeRecords - Extract multiple fa records
usage:
faSomeRecords in.fa listFile out.fa
options:
 -exclude - output sequences not in the list file.

Or, if it really is only one sequence you are after, use faOneRecord

---> faOneRecord
faOneRecord - Extract a single record from a .FA file
usage:
 faOneRecord in.fa recordName
ADD COMMENTlink modified 19 months ago • written 19 months ago by AGS140
1
gravatar for Cheng Zhongshan
3.1 years ago by
Cheng Zhongshan340 wrote:

I use the following codes to extract fastas from a database and print each of my targets into different fastas.

#!/usr/bin/perl 
use Bio::Perl;
use Bio::SeqIO;
 use IO::String;
 use Bio::SearchIO;
 open TF,'File_contain_target_fasta_IDs.txt' 
 or die "Couldn't open this file: $!";   
 chomp (@genes_name=<TF>); 
 my $filename='Fasta_database.fasta';

      my $gb = Bio::SeqIO->new(-file   => "<$filename",
                                -format => "fasta");
     while($seq = $gb->next_seq) { 
        my @temp_array=grep {$seq->id eq $_}@genes_name;
        #this step is import for correct output
      foreach (@temp_array){
      #Only one element is given to $string, which is the key point for writing single sequence into one fasta file
         my $string=$_;
         #print $string,"\n";
         my $stringio = IO::String->new($string);
         #If the foreach function is not used here, the defaut $_, actually, here would be many $_ been transfered into $string, and this is resulted in wrong sequences written into one fasta file.
           my $out = Bio::SeqIO->new(-fh => $stringio,
                                     -format => 'fasta');
           # output goes into $string
           $out->write_seq($seq);
           # modify $string
           #$string =~ s|(>)(\w+)|$1<font color="Red">$2</font>|g;
           # print into STDOUT
           #print $string;  
           #subrountin for writing sequences into separated fasta file
           write_into_fasta($seq->id,$string);
                      }
        }

sub write_into_fasta { my $fh=$[0]; open $fh,">$fh.fasta" or die "Couldn't open this file: $!"; print $fh $[1],"n"; close $fh; }

ADD COMMENTlink written 3.1 years ago by Cheng Zhongshan340
1
gravatar for 2184687-1231-83-
2.9 years ago by
2184687-1231-83-4.5k wrote:

in terms of performance, from all open-source options that I've tried, the best so far has been cdbfasta:
http://sourceforge.net/projects/cdbfasta/
You first create an index, and then put all your sequence ids in a text file, and use cdbyank to fetch them like this:

/path/to/cdbyank big_file.fasta.cidx < list_of_ids.txt -o my_ids.fasta

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by 2184687-1231-83-4.5k
0
gravatar for Manisha
3.5 years ago by
Manisha0
Manisha0 wrote:

but if i have multiple sequences to be extracted then what should i do? if i put all sequence id in one text file and give it as input then the above program is not working

ADD COMMENTlink written 3.5 years ago by Manisha0
2

please, ask a new question for this problem, citing this post.

ADD REPLYlink written 3.5 years ago by Pierre Lindenbaum58k

I edited my answer

ADD REPLYlink written 2.9 years ago by 2184687-1231-83-4.5k

I've written a tool (called xcerpt) which takes a list of read ID's, and then scans a fasta file to extract them. No indexing or other work necessary, but of course, every time you call it, it will scan the whole file until all requested seqences are found (3GB ~ half a minute). http://hackage.haskell.org/package/clustertools

ADD REPLYlink written 2.9 years ago by Ketil3.3k
0
gravatar for Taslima
9 months ago by
Taslima0
Bangladesh
Taslima0 wrote:

I have changed a few of cheng zhongshan's script.just added to read from a file where i expect the ids should by separated by new line. here it goes:


#!/usr/bin/perl -w
use Bio::Perl;
use Bio::SeqIO;
use IO::String;
use Bio::SearchIO;

my $input = shift;
my $id_file = shift;
my $output = shift;
my @genes_name;

open(FILE,$id_file) or die "Can't open $id_file";
while (<FILE>) {
my $id =$_;
chomp($id);
push @genes_name, $id;
}

my $gb = Bio::SeqIO->new(-file   => "<$input",
                              -format => "fasta");

my $fa = Bio::SeqIO->new(-file   => ">$output",
                              -format => "fasta",
                              -flush  => 0); # go as fast as we can!

while($seq = $gb->next_seq) { 
 #Sorry! Here would be with problem, if we use this "if (grep {$_=$seq->id} @genes_name;"
$fa->write_seq($seq) if (grep {$_ eq $seq->id} @genes_name);
}
ADD COMMENTlink modified 9 months ago by Neilfws41k • written 2.9 years ago by Taslima0
0
gravatar for Ryan Thompson
2.9 years ago by
Ryan Thompson2.4k
TSRI, La Jolla, CA
Ryan Thompson2.4k wrote:

It's not Perl, but pyfasta is really easy to use for this purpose. From the help text:

Usage: extract some sequences from a fasta file. e.g.:
               pyfasta extract --fasta some.fasta --header at2g26540 at3g45640
ADD COMMENTlink written 2.9 years ago by Ryan Thompson2.4k
0
gravatar for Ketil
19 months ago by
Ketil3.3k
Ketil3.3k wrote:

The easiest is probably to use 'agrep':

agrep -d '^>' regex

will extract all sequences matching the regular expression regex. If you have a rough idea of sequence length, and you don't have agrep installed, you can use -A option to grep to look for the header line and extract a fixed number of lines after it. Or you can use 'awk':

awk 'BEGIN { RS=">" } /regex/ { print }'

which is equivalent to the agrep command.

ADD COMMENTlink written 19 months ago by Ketil3.3k

This code, although interesting, is risky to use. First, it removes the '>' in front of the sequence name. Second, it matches anything between the record markers, not only the name. What if my sequence name is ATT and ATT is found in many of the sequences or, worst, one of the sequences is ATT? Third, unless you specify a beginning of line and end of line in your regex, it will also recover partial matches (like getting sequence222 and sequence22 when searching sequence2).

ADD REPLYlink written 9 months ago by Eric Normandeau7.1k

for i in cat name.txt;do awk 'BEGIN{RS=">"} /'$i'/{print ">" $0}' test.fas;done This code can be used for multiple name searching.

ADD REPLYlink written 4 months ago by mayingfei0
Please log in to add an answer.

Help
Access
  • RSS
  • Stats
  • API

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.0.0
Traffic: 193 views in the last hour