Question: Comprehension To Extract Fasta Sequences From Dataset
1
gravatar for s.charonis
6.8 years ago by
s.charonis70
s.charonis70 wrote:

Dear BioStar Community,

I am analyzing a dataset of bacterial proteins structured as follows:

'>PDB1a0na_unknown

PPRPLPVAPGSSKT

'>PDB1a1ta_ENZ

MQKGNFRNQRKTVKCFNCGKEGHIAKNCRAPRKKGCWKCGKEGHQMKDCTERQAN

etc ....

This dataset has about 15000 entries. What I am trying to do is extract all proteins according to their annotations (ENZ, MEM, unknown, etc.) to perform sequence/structure analysis on them. I am using a list comprehension to do this as follows (ENZ annotation shown here):

def retrieve_enzyme_proteins(filename):
    with open(filename) as file:
        for line in file:
           # search for the proteins annotated as enzymes (ENZ)
            if line[0]='>' and '_ENZ' in line:
                # I need a line here telling the function to jump to the next line and extract the sequence
                return [line.strip('\n') for line in file if line[0] != '>']

I have only been able to extract EVERY protein sequence so far and would like a logical structure that tells my program to look for the desired annotation (e.g. ENZ) and then jump to the next line where its sequence begins and extract that sequence? Apologies if the indentation is messed up, the copy-paste does that. I am using Python 2.7.3. Any help would be much appreciated!

fasta python • 3.5k views
ADD COMMENTlink modified 6.6 years ago by mayingfei10 • written 6.8 years ago by s.charonis70
2
gravatar for Matt Shirley
6.8 years ago by
Matt Shirley9.3k
Cambridge, MA
Matt Shirley9.3k wrote:

Sorry, I got sidetracked by the new iPhone announcements. Extending the code you provided leads me to this:

def retrieve_enzyme_proteins(filename, filt):
    """Open a fasta file, read line by line and determine if line is a header. If line 
    is a header, store annotation and write out existing sequence if annotation type is in 
    `filter`.

    filt: a list of annotations we are interested in e.g. ['ENZ', 'MEM']"""
    with open(filename, 'r') as fasta:
        annotation = None
        sequence = str()
        for line in fasta:
            if line[0] == '>': ## identify header
                header = line.strip()
                if annotation in filt:
                    return (header, sequence)
                annotation = header.split('_')[-1] ## set annotation as last element of header
                sequence = str()
                continue
            else:
                sequence += line.strip()
        if annotation in filt:
            return (header, sequence) ## return final sequence and header

And using your example file, here is how you would use the function:

>>> retrieve_enzyme_proteins('80929.fa', ['ENZ'])
('>PDB1a1ta_ENZ', 'MQKGNFRNQRKTVKCFNCGKEGHIAKNCRAPRKKGCWKCGKEGHQMKDCTERQAN')

A few things to mention:

  1. Please try to avoid using variable names such as file that conflict with the global namespace of the base Python modules.
  2. You'll want to use parentheses whenever doing multiple boolean comparisons such as if (x > 10) and (y <= 5):
  3. Note that when making comparisons such as x == 5 you use two =, and when assigning a value to an object such as x = 5 you only use one.
  4. It's usually best to think in terms of your program reading data line by line, acting on that line, then moving to the next line. There is no such easy thing as reading a line and then 'jumping forward' to do something, while implying that we are still 'on' the old line. It's much better to store lines as objects that we then use later on during the processing of a different line.
ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by Matt Shirley9.3k

@Matt Many thanks for the code and tips, the function raises "UnboundLocalError: local variable 'sequence' referenced before assignment", is this because the line sequence=str() should be before the return statement?

ADD REPLYlink written 6.8 years ago by s.charonis70

I think you might have a problem with your file, but I've updated the code above so that you shouldn't get the error. This error just meant that the line sequence += line.strip() was being evaluated before a header was encountered. You might have a malformed line at the beginning of your file.

ADD REPLYlink written 6.8 years ago by Matt Shirley9.3k

From what it looks like, OP has a blank line between the FASTA definition line and the associated sequence:

>

AAAAAAA

OP, this is not the correct format for FASTA. A blank line after a header delimits the end of that entry.

ADD REPLYlink written 6.8 years ago by pld4.8k
1
gravatar for pld
6.8 years ago by
pld4.8k
United States
pld4.8k wrote:

Here is a compact function to pull the sequences out of a multiple-entry FASTA file for those entries containing header in the FASTA definition line:

def fasta(indir, header):
    return [reduce(lambda x,y: x+y, x.split('\n')[1:]) 
         for x in filter(lambda x: header in x, open(indir, 'rb').read().split('>')[1:])]

This can be expanded into a few steps if you want:

#read the raw file
raw = open(indir, 'rb').read()  

#then i want to generate a list with each fasta entry as an index
#eg [">EMZ\n AAAA\n", ">PDB\n GGGG"]
splitfi = raw.split(">")[1:]

#then I want to collect only those who have the right header
header = "EMZ"
cleaned = filter(lambda x: header in x, splitfi)

#then i concat the sequences and drop the headers
final = [reduce(lambda x,y: x+y, x.split("\n")[1:]) for x in cleaned)

Note, with windows newlines are delimited by \r\n so you'll have to adjust the code to meet that difference.

Say your file contents are

>EMZ
AAAAAAAA
>PDB
GGGGGGG

Calling the fasta function above would look like:

`>>> fasta(myfile, 'EMZ')`
['AAAAAAAA']
ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by pld4.8k

@joe Many thanks for your help, that is quite a complicated function! I used the non-function, expanded version and it worked well. Can I ask one more thing: Having all sequences now stored within a list, each being its own list as: [ ['header', 'sequence'] is there a way to write these sequences to a file so that I can run a script on them? I've tried using the "WRITE" function python has but it raises a TypeError: expected a character buffer object when doing open()

ADD REPLYlink written 6.8 years ago by s.charonis70
1

Without seeing your code, and assuming your sequences are stored as [[h1,s1],[h2,s2]]:

def writeFasta(name, seqs):
    with open(name, "w") as fi:
        map(lambda x: fi.write(x[0] + "\n" + x[1] + "\n"), seqs)

I am assuming your are just calling write on your list, you have to iterate through the list.

ADD REPLYlink written 6.8 years ago by pld4.8k

+1 because your solution should work, bit it might be best to avoid both anonymous (lambda) functions as well as map/reduce in this answer:

def writeFasta(name, seqs):
    with open(name, "w") as fi:
        for pair in seqs:
            header, sequence = pair
            fi.write('{0}\n{1}\n'.format(header, sequence))
ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Matt Shirley9.3k

I would argue that there isn't anything wrong with using map/reduce and anonymous functions, especially when you're just using it to do some simple currying. You can take it too far, but I've never understood the argument for mapping a lambda being harder to read than a loop.

ADD REPLYlink written 6.8 years ago by pld4.8k

I don't think there is anything wrong with using them, but for someone who is beginning to learn programming paradigms I do believe that a loop is universally easier to understand than map/reduce on a list, because as you said above, eventually "you have to iterate through the list", and iteration is where most people start.

ADD REPLYlink written 6.8 years ago by Matt Shirley9.3k

@Joe: Many thanks for your help, that works fine. As you said, I just call write on my list and did iterate through it, but a TypeError was raised. That's fixed now!

ADD REPLYlink written 6.8 years ago by s.charonis70
0
gravatar for karl.stamm
6.8 years ago by
karl.stamm3.6k
United States
karl.stamm3.6k wrote:

I don't know much about Python, but if your text file is properly formatted fasta then this can be solved with grep.

Given this text file saved as temp.fasta

   >PDB1a0na_unknown 
   PPRPLPVAPGSSKT
   >PDB1a1ta_ENZ
   MQKGNFRNQRKTVKCFNCGKEGHIAKNCRAPRKKGCWKCGKEGHQMKDCTERQAN
   >ASDFG
   ABCDEFGS
   >ONETWOTHREE
   ANOTHERRANDOMSEQUENCE

I run this command:

grep -A 1 "_ENZ" temp.fasta
>PDB1a1ta_ENZ
MQKGNFRNQRKTVKCFNCGKEGHIAKNCRAPRKKGCWKCGKEGHQMKDCTERQAN

Then just strip out the lines with > by piping to another grep.

grep -A 1 "_ENZ" temp.fasta | grep -v '^>'
MQKGNFRNQRKTVKCFNCGKEGHIAKNCRAPRKKGCWKCGKEGHQMKDCTERQAN

The key here is the "-A 1" which returns the next line AFTER a search match for "_ENZ".

EDIT: What if your protein sequences span more than one line? If you need to pull out everything between the record identifier lines ( > symbol indicates new record ) then the code gets a little more tricky. A programming expert would not consider this tricky, but we're just trying to get the work done as quickly and correctly as possible, so I do not recommend trying to write any new code.

There exists a tool called faSomeRecords provided by the UCSC, compiled to Linux or Mac where you can provide a list of the record names and it will extract them all. Get your record names with

grep  "_ENZ"  temp.fasta > records.txt

Then extract those sequences with

faSomeRecords temp.fasta records.txt output.txt

Now if you need this to run under windows, we're going to have to find another similar solution.

I found this answer over here http://seqanswers.com/forums/showthread.php?t=9498 and there is much discussion of perl based solutions that should work under any operating system.

ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by karl.stamm3.6k

@karl: Many thanks for that, it almost does what I want except for one thing: almost all of the sequences span multiple lines (I put the shorter ones here for clarity). The pipeline you suggested picks up only the first line. Any way to specify all lines between >PDB annotation and the next such annotation?

ADD REPLYlink written 6.8 years ago by s.charonis70

grep works on each line individually, it wont be easy to do. You might be able to remove those line-breaks, but maybe a better idea is to find a FASTA processing toolkit. I prefer tools built by someone else because they have been tested for correctness and probably less buggy than a personal custom solution (no matter how small the task looks to an expert, we're not all experts). Extracting sequences from a multi-FASTA should be simple, and Google took me to the SeqAnswers thread immediately, where they link a binary tool from UCSC that does just this task.

ADD REPLYlink written 6.8 years ago by karl.stamm3.6k
>ASDFG
 ABCDEFGS

for such lines grep -A 1 "_ENZ" temp.fasta will not work.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by always_learning1.0k
0
gravatar for mayingfei
6.6 years ago by
mayingfei10
mayingfei10 wrote:

Well, if you have a lot of sequences with different name that need to be extracted from the dataset. You dont have to use python.
for example: you have a .txt file containing the names of the sequencing you want to extract a.txt

111111111 222222222 333333333

Then you have a datasets like this b.fas

11111111 ATCGATCTATCG 22222222 ATCCCCCCCCC 33333333 GGGGGGGGGG 44444444 AAAAAAAAAAA

you can use this code to do that :

for i in cat a.txt; do awk 'BEGIN{RS=">"}/'$i'/{print ">"$0 }' b.fas>> output;done

ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by mayingfei10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1574 users visited in the last hour