Question: How To Extract A Sequence From A Big (6Gb) Multifasta File ?
9
gravatar for Mchimich
5.4 years ago by
Mchimich240
France
Mchimich240 wrote:

I want to extract some sequences using ID from a multifasta file.

Using perl is not possible because it gave an error when indexing the database. Maybe because of it's size? Is there any way to this differently? thanks for your help!

fasta • 36k views
ADD COMMENTlink modified 2.2 years ago by anjanaram170 • written 5.4 years ago by Mchimich240

Your question can not be answered without more details.

ADD REPLYlink written 5.4 years ago by Zev.Kronenberg11k

If BioStar still suggested similar questions, you would have seen this one which answers your (identical) question: http://www.biostars.org/post/show/1195/extracting-sequence-from-a-3gb-fasta-file/.

ADD REPLYlink written 5.4 years ago by Neilfws47k
38
gravatar for matted
5.4 years ago by
matted6.5k
Boston, United States
matted6.5k wrote:

How about samtools faidx? It seems to be less well-known, but I think it should work for you.

If you have these reads in test.fa:

>one
AAAAA
>two
CCCCC
>three
GGGGG

Then just run:

samtools faidx test.fa

This will create an index file that will allow you to do sublinear queries for each name lookup (faster than the awk/grep solutions above and easier than rolling your own script).

For example:

samtools faidx test.fa two

Will output:

>two
CCCCC

It works for multiple regions (samtools faidx test.fa one two three) and that lets you play tricks like:

xargs samtools faidx test.fa < names.txt
ADD COMMENTlink written 5.4 years ago by matted6.5k
1

I'd like to point out pyfaidx as a drop-in replacement for samtools faidx. It's quite handy if you're planning to use the regions in a Python script as well.

ADD REPLYlink written 22 months ago by Matt Shirley8.0k
1

FYI, you may need to use single quotes around your fasta header name if it's a more complicated name (e.g. gi|12345|name). Then you would need to do: samtools faidx test.fa 'gi|12345|name'

ADD REPLYlink modified 19 months ago • written 19 months ago by toddknutson50
1

How would you do this if you want to grab multiple sequences that have a similar initial ID? In my case some sequences belong to exon 1 and so begin with ">1;.....", and I would want to get all of those beginning with ">1;..." using samtools?

ADD REPLYlink written 8 months ago by balasink10
2
pip install pyfaidx

then

faidx --regex "^((?!>1;).)*$" input.fa > output.fa

or

faidx --invert-match --regex "^>1;.*$" input.fa > output.fa

The first example uses negative lookaheads, which may be more difficult to reason about, while the second example depends on the -v | --invert-match flag in the faidx script.

Here's a regex sandbox to play in: https://regex101.com/r/tJRWRY/1/

ADD REPLYlink modified 8 months ago • written 8 months ago by Matt Shirley8.0k

Please note that by default the sequence header extracted by pyfaidx is truncated on first whitespace.

ADD REPLYlink written 5 months ago by al-ash0

You can output the entire sequence header using the -f | --full-names argument. This also applies to regular expressions matching.

ADD REPLYlink written 5 months ago by Matt Shirley8.0k

I have the same question - does samtools faidx support regular expression/wildcards? (as I'm using samtools faidx in some of my previous scripts I'd prefer to stay with it rather than migrating to the python faidx)

ADD REPLYlink written 5 months ago by al-ash0

samtools does not support regular expressions matching, or any type of partial matching.

ADD REPLYlink written 5 months ago by Matt Shirley8.0k

neat example, samtools has many handy features

ADD REPLYlink written 5.4 years ago by Istvan Albert ♦♦ 75k

Nice one!I really like this way ~

ADD REPLYlink written 5.4 years ago by GAO Yang250

what 's the function of "xargs" here

ADD REPLYlink written 4.1 years ago by mayingfei10

I believe this is equivalent to writing "cat names.txt | xargs samtools faidx test.fa".

ADD REPLYlink written 4.1 years ago by Chris Miller19k
3
gravatar for Istvan Albert
5.4 years ago by
Istvan Albert ♦♦ 75k
University Park, USA
Istvan Albert ♦♦ 75k wrote:

You can use blast to extract sequences from a large file, see below. That being said I am not sure how well this works if the file is large solely because it contains a very large number of sequences. I think the original use case of BLAST optimizes for retrieving locations from long sequence.

# look at the file
$ head EC4115.fa 
>NC_011353.1 Escherichia coli O157:H7 str. EC4115 chromosome, complete genome.
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTC
TGACAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC

# generate the blast database
$ makeblastdb -dbtype nucl -out EC -in EC4115.fa -parse_seqids

# retreive an entry by id
$ blastdbcmd -db EC -entry 'NC_011353.1' | head
>lcl|NC_011353.1 Escherichia coli O157:H7 str. EC4115 chromosome, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGCTTCTGAACTG
GTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGAC
....

# query the blast database by id and coordinates
$ blastdbcmd -db EC -range 100-105 -entry 'NC_011353.1'
>lcl|NC_011353.1:100-105 Escherichia coli O157:H7 str. EC4115 chromosome, complete genome
TTAAAA
ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Istvan Albert ♦♦ 75k

Thanks for your response ! I will try that. My database is paired-ends reads that contain Millions of small sequence of 76bp . I want to use some IDs to extract the correspondents sequences but using bio::db::fasta module in perl the indexing step is too long and bug ! I'm not sure that generating a blast database will take less time ! Thanks

ADD REPLYlink written 5.4 years ago by Mchimich240
3
gravatar for Chris Miller
5.4 years ago by
Chris Miller19k
Washington University in St. Louis, MO
Chris Miller19k wrote:

I believe this perl script will do what you need (just whipped it up - excuse any typos)

my $inseq = 0

my $inFh = IO::File->new( myfile.fa ) || die "can't open file\n";
while( my $line = $inFh->getline )
{
    chomp($line);

    if($inseq){
        if($line =~ /^>/){
            $inseq = 0;
        } else {
            print $line . "\n";
            next;
        }
    }

    if($line =~ /^>/ && $line =~/$ARGV[0]/){
        $inseq = 1;
        print $line . "\n";
    } 
}
close($inFh);

Run it like:

$perl myscript.pl SequenceName1234 >output
ADD COMMENTlink written 5.4 years ago by Chris Miller19k

Here is another way to use perl to do this task. It doesn't require BioPerl.

1) The database

>HMPREF1307_00002 protoporphyrinogen IX dehydrogenase [menaquinone] [Klebsiella pneumoniae subsp. pneumoniae WGLW3]
MKTLILFSTRDGQTREIASFLASELKELGIDADTLNLNRTDVVEWHHYDRVVIGASIRYG
HFHPAVDRFVKKHLAALQALPGAFFSVNLVARKPEKRTPQTNSYTRKFLLNSPWQPQSCA
VFAGALRYPRYSWYDRFMIRLIMKMTGGETDTRKEVVYTDWQQVSRFAREIAQMARK
>HMPREF1307_00003 trk system potassium uptake protein trkH [Klebsiella pneumoniae subsp. pneumoniae WGLW3]
MHFRAITRIVGLLVILFSGTMIVPGLVALIYRDGAGRAFTQTFFVALAIGSMLWWPNRKQ
KGELKSREGFLIVVLFWTVLGSVGALPFIFSEQPNLTVTDAFFESFSGLTTTGATTLVGL
DSLPHAILFYRQMLQWFGGMGIIVLAVAILPILGVGGMQLYRAEMPGPLKDNKMRPRIAE
TAKTLWLIYVLLTIACALALWFAGMPAFDAIGHSFATIAIGGFSTHDASVGYFNSPMINS
IIAIFLLISGCNYGLHFSLLSGRSLKVYWRDPEFRMFIGVQLTLVIVCTLVLWLHNVYGS
VLTTLNQAFFQVVSMATTAGFTTDSIARWPLFLPVLLLCSAFIGGCAGSTGGGLKVIRIL
LLFKQGNRELKRLVHPNAVYSIKLGNRALPERILEAVWGFFSAYALVFIISMLAIIATGV
DDFSAFASVVATLNNLGPGLGVVADNFATMNPVAKWILIANMLFGRLEVFTLLVLFTPTF
WRE

................................

................................

2) Sequences for retrieval

HMPREF1307_00002

HMPREF1307_00003

3) The perl script to do it

#!/usr/bin/perl -w
#This script takes a list of fasta id and retrieve the corresponding fasta seqs from the database 
#Usage: perl retrieveFasta2.pl listOfId database.fasta > result

my $line = '';
my $fasta = '';
my %hash = ();
my @fileretrieve = ();
my $header = '';
my $dna = '';

open(FILE,$ARGV[0]) or die "can't open file";
while (<FILE>) {
chomp ($_);
push (@fileretrieve,$_);
}

close FILE or die "can't close file";

open(FILE,$ARGV[1]) or die "can't open file";

while (<FILE>) {

chomp ($line = $_);

if ($line =~ /^>.*$/) {
    if (exists ($hash{$line})) {
        $fasta = "dupl";
    } else {
    $fasta = $line;
    #print "$fasta\n";    
    }
    
    } elsif ($line !~ /^>.*$/) {
        if ($fasta ne "dupl"){
            $hash{$fasta} .= $line;
        }
    }

}

close FILE or die "can't close file";

while (($header,$dna) = each(%hash)) {
foreach my $file(@fileretrieve) {
    if ($header =~ /$file\s/) { 
        print "$header\n$dna\n";
    }
    
}
}

 

Note: See usage in the perl script on how to use this script. I managed to use this script to retrieve over 300 sequences from a database which is about 750Mb containing about 2 million protein sequences.  

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by lloydlow0
1
gravatar for Leszek
5.4 years ago by
Leszek3.8k
IIMCB, Poland
Leszek3.8k wrote:

If you are going to retrieve sequences multiple times, I recommend building SQL db. I recommend SQLite as it's very light, yet powerful program: you don't need to run a served, the db is in one file and it has bindings to python (and probably to other languages).
We have successfully used SQLite with sets of several millions of sequences (fasta having several GBs). The query time is much below 1 sec if you properly index your sequence table.

ADD COMMENTlink written 5.4 years ago by Leszek3.8k
1
gravatar for Manu Prestat
5.4 years ago by
Manu Prestat3.8k
Marseille, France
Manu Prestat3.8k wrote:

One of the difficulties is the lack of standards in a fasta header format. Considering you're interested in the first element in the header as an ID. Here is my python proposal (change the IDclean line if you want an other ID in your header), I used the dictionary type to index the fasta first: it makes this process and the subsequent ID extraction very fast ;-)

#!/usr/bin/python

### extracts sequences from a fasta file (arg 1)
### whose id is in the IDs file (arg 2)

import string
import sys
ListOfIds = sys.argv[1]
fastafile = sys.argv[2]

try:
    ids=open(ListOfIds, 'r')
except IOError, e:
    print "File error: ",ListOfIds
    pass


lignes = ids.readlines()
req=[]
for ligne in lignes:
    req.append(ligne.strip())


#### reading the fasta file to cut
handle = open(fastafile)

bank={}
seqIDmap={}
seq_id = handle.next()
while (seq_id[0]!=">"):
    seq_id = handle.next()
while True:
    try:
        seq = handle.next()
        line = handle.next()
        while (line[0]!=">"):
            seq = seq+line
            line = handle.next()
        bank[seq_id]=seq
        IDclean=string.split(seq_id, " ")[0][1:].strip()
        seqIDmap[IDclean]=seq_id
        seq_id = line # for the next
    except StopIteration:
        break
# last line
bank[seq_id]=seq
seqIDmap[string.split(seq_id, " ")[0][1:].strip()]=seq_id

handle.close()

######## end reading the potentially big fasta file

faName=fastafile.split("/")[-1]
listName=ListOfIds.split("/")[-1]
subsetName=listName+"-"+faName
subset = open(subsetName,"w")
nbNF=0
for i in req:
    try:
        subset.write(seqIDmap[i].strip()+"\n")
        subset.write(bank[seqIDmap[i]].strip()+"\n")
    except KeyError:
        print i, "not found in fasta"
        nbNF+=1

subset.close()

print
print nbNF, "IDs (listed above) from",listName, "have not been found in", faName
print
print "the Subset fasta file", subsetName, "is now created"
ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Manu Prestat3.8k
1

Here's a simple Python solution for one or multiple ID's, can also be adopted to read files containing a list of ID's one per line by adding an extra for loop to populate the ID list. It returns entries from ~1GB files in about a second on my 5th-gen i5.

   import sys

    #Read in target file

    #Setup list
    ids = ["AccessionNumber1", "AccessionNumber2", "etc..."]

    #Set conditional
    go = False

    #Read target file
    target = open("C:/file_path.txt")

    for line in target:
        if go == True:
            if ">" in line:
                go = False
                continue
            else:
                sys.stdout.write(line)
                continue
        else:
            for id in ids:
                if id in line:
                    go = True
                    sys.stdout.write(line)
                else:
                    continue

    target.close()
ADD REPLYlink written 10 months ago by ejspina10
0
gravatar for Istvan Albert
5.4 years ago by
Istvan Albert ♦♦ 75k
University Park, USA
Istvan Albert ♦♦ 75k wrote:

A different approach using Heng Li's awk (named below as hawk) also described here would be like so

# extract the sequence matchin a read id
$ hawk -c fastx ' $name ~ /NC_011353.1/ {print $seq }' < EC4115.fa

This example will read through the entire file upon each invocation.

ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Istvan Albert ♦♦ 75k
0
gravatar for JC
5.4 years ago by
JC6.4k
Mexico
JC6.4k wrote:

I will use some code in perl, I'm assuming that your fasta file have the ID as the only text in the fasta comment

 #!/usr/bin/perl

 # extractSeq.pl -> extract fasta sequence from a list of IDs
 use strict;
 use warnings;
 $ARGV[2] or die "use perl extractSeq.pl LIST FASTA OUPUT\n";
 my ($list, $fasta, $out) = @ARGV;
 my %list;
 open L, "$list" or die;
 while ( < L > ) {
      chomp;
      $list{$_} = 1;
  }
  close L;

  open O, ">$out" or die;
  open F, "$fasta" or die;
  $/ = "\n>";
  while ( < F > ) {
       s/>//g;
       my @seq = split (/\n/, $_);
       my $id = shift @seq;
       next unless (defined $list{$id});
       print O join "\n", ">$id", @seq;
  }
  close O;
  close F;

*edit: remove the spaces in the file handles in the "while ()" statement, autoformating is messing with the code view

ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by JC6.4k
0
gravatar for Mchimich
5.4 years ago by
Mchimich240
France
Mchimich240 wrote:

Thanks a lot everybody for your precious help ! I try your solution both on my Desktop computer and a cluster ! this is the time taking by some solution on the cluster : For Chris miller solution : the scripts works well ! but make 1m19.999s to retrieve one ID ! so it's too long :( matted solution : 0m42.859s to retrieve one ID JC solution : 1m50.973s to retrieve one ID. Istvan Albert solution : generating the blast database is time consuming ! finally i will use grep that take 0m28.827s on my computer and less and 0m1.271s on the cluster

grep -A 1 ID file.fasta >output

Of course it's a solution for only "BIG database".

Thanks everybody for taking time to help others.

ADD COMMENTlink modified 5.3 years ago • written 5.4 years ago by Mchimich240
1

but how does grepping a Fasta file for an id return the sequence for that id?

ADD REPLYlink written 5.4 years ago by Istvan Albert ♦♦ 75k

but that grep only gets the fasta header, not the full sequence. Do I missing something?

ADD REPLYlink written 5.4 years ago by JC6.4k

You could do grep -A 1 ID file.fasta to get the following line.

ADD REPLYlink written 5.4 years ago by matted6.5k

yes I know that, but in a strict sense, the fasta file can contain multiple lines of sequence

ADD REPLYlink written 5.4 years ago by JC6.4k

You're right, I was just thinking of the short read Illumina output case. I don't know how to get grep to do the general case.

ADD REPLYlink written 5.4 years ago by matted6.5k

ha, I've not known about the -A flag but it is something I've always needed ;-)

ADD REPLYlink written 5.4 years ago by Istvan Albert ♦♦ 75k

I'm sorry I forget the -A 1 in the command line ! and as matted said is for Illumina read with single line ! thanks everybody for your answers !

ADD REPLYlink written 5.4 years ago by Mchimich240

I'm looking to pull out the full FASTA entry from a multiFASTA file using grep given the ID that appears in the header. This entails grabbing two more lines from a file that looks like the following:

>Full header information with matching ID
GATATC
ATATGC
>Full header for next entry with ID
TTTAAGGG
GGGAACC

The sequence data is spread across multiple lines (obviously with line breaks). The following command / RegEx is close to what I need.

pcregrep -M 'searchID.*(\n|.)*>' query.faa

# -M multi lines
#(\n new lines and any chars) - until it finds another ">"

The following problems are:
1) I don't want / need the following full header line that it grabs looking for the next entry with the terminating ">" in the RegExpression.

2) The Regular Expression doesn't work when it's the last entry, i.e. I need to "OR" the EOF to the ">" terminating character. Any way to match up to (">" OR EOF) and not include the following header line that starts with ">" in the regular expression?

Right now it returns the following:

>Header1
GATCG
TATTTA
CATTAT
>Header2

OR returns nothing on the last entry since there isn't another entry beginning with ">"

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by shannon.steinfadt0

Thank you for this script

grep -A 1 2015.M.420nP seqs_non_chimera.fna >output1

But also I have a question: I am getting and output with a doble "-" ("--") separating each fasta line. I would like to get rid of it, if not, at least knowing that will not affect my QIIME downstring analysis.

>2015.M.420nP
TACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCACGTCGTCTGTGAGATTCCACAGCTTAACTGTGGGCGTGCAGGCGATACGGGCTGACTTGAGTACTGTAGGGGTAACTGGAATTCCTGGTGGAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTTACTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCAAACAGG
--
>2015.M.420nP
TACGTAGGTGACAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGAGCGCAGGCGGTCTGTTTAGTCTAATGTGAAAGCCCACGGCTTAACCGTGGAACGGCATTGGAAACTGACAGACTTGAATGTAGAAGAGGAAAATGGAATTCCAAGTGTAGCGGTGGAATGCGTAGATATTTGGAGGAACACCAGTGGCGAAGGCGATTTTCTGGTCTAACATTGACGCTGAGGCTCGAAAGCGTGGGGAGCGAACAGG
--
>2015.M.420nP
TACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCACGTCGTCTGTGAAATTCCACAGCTTAACTGTGGGCGTGCAGGCGATACGGGCTGACTTGAGTACTGTAGGGGTAACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTTACTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCAAACAGG

Thank you a lot for your help

ADD REPLYlink modified 14 months ago • written 14 months ago by danielavargasrobles0
0
gravatar for David Langenberger
5.4 years ago by
Deutschland
David Langenberger8.0k wrote:

There is also a UCSC tool for extracting multiple entries of a fasta file: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faSomeRecords

ADD COMMENTlink written 5.4 years ago by David Langenberger8.0k
0
gravatar for anjanaram1
2.2 years ago by
anjanaram170
United States
anjanaram170 wrote:

how can i extract the first 10 fasta sequences from a file?

ADD COMMENTlink written 2.2 years ago by anjanaram170

which type of sequence and file you have? In which programm you want to do this. There are many ways wo do this.

ADD REPLYlink written 2.2 years ago by tcf.hcdg40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1386 users visited in the last hour