Question: Searching through a multi-fasta file for records containing certain words in their description in BioPython 1.65, Python 3.4
1
gravatar for tyleraelliott
4.4 years ago by
Canada
tyleraelliott50 wrote:

I have a fasta file which contains thousands of sequences, with headers as such:

>scaffold_1|c(135298..135582)|DNA|DNA-0-1_NV

Each pipe-deliminated section of the header can vary from sequence to sequence, and some sequences might have identical headers except for the first or second sections.

I need to be able to search through this large file and pick out and print to another file specific sequences based upon their header. There needs to be degeneracy in this search however. I have seen examples where a library text file was used but only exact matches between the fasta file and library file would work.

For instance, let's say I want all sequences which have any variation on 'piggyBac' in their header (so PiggyBac, piggybac,DNA-piggyBac, etc.). 

I'm just at a loss as to how to do this exactly. Is there some way to index this file and then search the keys for variations on 'piggyBac'? If anyone has suggestions or can point me to code that does something similar it would really be helpful. 

 

I appreciate it

 

biopython • 3.4k views
ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by tyleraelliott50

I should mention I'm using Python3 and the latest release of BioPython

ADD REPLYlink written 4.4 years ago by tyleraelliott50

It would be better to say which version of Python 3 (e.g. 3.3? 3.4?) and which version of Biopython (e.g. 1.65?), partly people may be reading this question in a years time, but also in case it helps give you are more accurate answer.

ADD REPLYlink written 4.4 years ago by Peter5.8k

Edited to show versions for both

ADD REPLYlink written 4.4 years ago by tyleraelliott50
4
gravatar for Peter
4.4 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

The most elegant solution would use an iterator approach. This is based on the "Filtering a sequence file" example in the tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html only rather than filtering based on the identifiers, here we are checking the description:

from Bio import SeqIO
input_file = "big_file.fasta"
output_file = "piggybac.fasta"
# The next line is a Python generator expression - memory efficient!
records = (r for r in SeqIO.parse(input_file, "fasta") if "PIGGYBAC" in r.id.upper())
count = SeqIO.write(records, output_file, "fasta")
print("Saved %i records from %s to %s" % (count, input_file, output_file))

Updated with variable name correction pointed out by  tyleraelliott

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Peter5.8k
2
gravatar for Whetting
4.4 years ago by
Whetting1.5k
Bethesda, MD
Whetting1.5k wrote:

I would do something like

 

from Bio import SeqIO
with open("new_file.fas","w") as out:
    for record in SeqIO.parse("fasta.fas","fasta"):
        if "PIGGYBAC" in record.id.upper():
            print(">"+record.id, file=out)
           print(record.seq, file=out)

This should normalize the caps and look for your term in the header...

 

Cheers

 

 

 

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Whetting1.5k

That did the trick! All I need to do now is to be able to write those sequence to a new fasta file

ADD REPLYlink written 4.4 years ago by tyleraelliott50

see edit for file printing

ADD REPLYlink written 4.4 years ago by Whetting1.5k

Note that the OP is using Python3, so print >>out, ">"+record.id should be print(">"+record.id, file=out), or even better yet use Bio.SeqIO to write a FASTA record using the Seq object.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Matt Shirley9.0k

I did not know that python 3 used different print, but thanks for the note. Edited

ADD REPLYlink written 4.4 years ago by Whetting1.5k
0
gravatar for tyleraelliott
4.4 years ago by
Canada
tyleraelliott50 wrote:

I was going to post last night that the previous coding still gave me an error but I had reached my limit. This new coding works! Although I had to make one change to it:

if "PIGGYBAC" in record.id.upper())    modified to  

if "PIGGYBAC" in r.id.upper.())

Now it works just fine. Thanks to Dr. Cock, Matt Shirley and Whetting for all your help. 

 

 

ADD COMMENTlink written 4.4 years ago by tyleraelliott50

I'm glad we could help. Note BioStars (and StackExchange) work on a question and answer format. Your reply would be better as a comment, and you can tick an answer to accept it (which is linked to the user scoring system which some people find motivating).

ADD REPLYlink written 4.4 years ago by Peter5.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2168 users visited in the last hour