Question: Help with Biopython
0
gravatar for adeel.maliks20
3.5 years ago by
Estonia
adeel.maliks2010 wrote:

I am very new to programming in python. I have protein fasta files of species of plants.

I would like to filter them based on the number of amino acids each sequence contain. Criteria is those sequences >20 amino acids.

I am able to get the amino acids bigger than 20 with the resources on biopython cookbook. However,when i try to write them on the file. It gives me error. I am unable to reproduce it. Moreover, I would also like to have IDs of each sequence in the output file. Please help me!

Code:

import Bio
from Bio import SeqIO
for s_record in SeqIO.parse('arabidopsis_thaliana_proteome.ath.tfa','fasta'):
    name = s_record.id
    seq = s_record.seq
    seqLen = len(s_record)
    if seqLen >20:
        desired_proteins=seq
        output_file=SeqIO.write(desired_proteins, "filtered.fasta","fasta")
output_file

Thank you in advance :)

rna-seq biopython • 1.5k views
ADD COMMENTlink modified 12 days ago by Biostar ♦♦ 20 • written 3.5 years ago by adeel.maliks2010
1

you want to do seqLen = len(seq) not s_record. You also want to keep a list of the good ones, as WouterDeCoster's example specifies, and write s_record to the file, not seq. seq would just give you the sequence string, and you want to maintain headers in the format of a fasta file.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by st.ph.n2.5k

Actually, len(record) is the same as len(record.seq) and there is no need to convert it to a string.

ADD REPLYlink written 3.5 years ago by WouterDeCoster43k

Thank you :) Understandable now!

ADD REPLYlink written 3.5 years ago by adeel.maliks2010
2
gravatar for WouterDeCoster
3.5 years ago by
Belgium
WouterDeCoster43k wrote:

It's always helpful if you specify which error you got, although to me it's rather obvious what's going wrong here.

See the following (untested!) proposed solution:

from Bio import SeqIO
goodproteins = []
for record in SeqIO.parse('arabidopsis_thaliana_proteome.ath.tfa','fasta'):
    if len(record.seq) >20:
        goodproteins.append(record)
SeqIO.write(goodproteins, "filtered.fasta","fasta")

Which could also be written as:

from Bio import SeqIO
SeqIO.write([record for record in SeqIO.parse('arabidopsis_thaliana_proteome.ath.tfa','fasta') if len(record.seq) >20], "filtered.fasta","fasta")

Which is beautifully short but harder to understand and less readable :)

Let me know if something is unclear or you get more errors!

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by WouterDeCoster43k

It works wonderfully well :) Thanks so much!

ADD REPLYlink written 3.5 years ago by adeel.maliks2010

Hi,

Is it possible to add a feature "the script takes input as directory containing all the files" and produce a new filtered directory ?

ADD REPLYlink written 3.5 years ago by adeel.maliks2010
1

Sure :)

I could write it for you or point you in the right direction... I'll do the second: have a look at the glob module

import glob
unfilteredfiles = glob.glob('*.tfa")

Then you need a for loop on the unfilteredfiles. It would be better to write functions and not just all code sequentially.

ADD REPLYlink written 3.5 years ago by WouterDeCoster43k

Hi Wouter,

I was trying to make my code command line friendly, so it can input and output files from user. I used sys module. It is giving the desired output but i still want to get a second opinion if its a right approach :)

Here's a code:

import sys import Bio from Bio import SeqIO input_file=sys.argv[1] #input fasta file filtered_file=sys.argv[2] #Output fasta file goodproteins=[] with open(filtered_file,'w') as output_handle: for record in SeqIO.parse(open(input_file),'fasta'): if len(record.seq)>20:
goodproteins.append(record) SeqIO.write(goodproteins,output_handle,"fasta")

Moreover, I want to know the total number of sequences in the file prior filtering. Any advice on that would be perfect.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by adeel.maliks2010

Sorry about the messed up code in above message;

import sys
import Bio
from Bio import SeqIO
input_file=sys.argv[1] #input fasta file
filtered_file=sys.argv[2]   #Output fasta file
goodproteins=[]
with open(filtered_file,'w') as output_handle:
    for record in SeqIO.parse(open(input_file),'fasta'):
        if len(record.seq)>20:   
            goodproteins.append(record)
SeqIO.write(goodproteins,output_handle,"fasta")
ADD REPLYlink written 3.5 years ago by adeel.maliks2010
1

Try this although i am sure there is a more elegant way of doing this -

import sys
import Bio
from Bio import SeqIO
input_file=sys.argv[1] #input fasta file
filtered_file=sys.argv[2]   #Output fasta file

count = 0
for record in SeqIO.parse(input_file, "fasta"):
    count = count + 1
print("There are " + str(count) + " sequences in " + input_file)

goodproteins=[]
with open(filtered_file,'w') as output_handle:
    for record in SeqIO.parse(open(input_file),'fasta'):
        if len(record.seq)>300:
            goodproteins.append(record)
SeqIO.write(goodproteins,filtered_file,"fasta")

count2 = 0
for record in SeqIO.parse(filtered_file, "fasta"):
    count2 = count2 + 1
print("There are now " + str(count2) + " sequences in " + filtered_file + " after filtering")
ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by bongok40
2

Hi bongok,

Thanks for contributing, but it's completely ridiculous to parse the fasta file three times to count this. You could easily do this all in one loop which will make the script much quicker. Also I don't think you should import Bio as you are importing SeqIO which is all you need. No need to make the runtime longer, right? ;-)

UPDATED

Note that you could also determine the name for the filtered file based on the input_file to save time when executing the command, e.g. outputname = "filtered_" + sys.argv[1]

Do you still want the script to run on all fasta files in a directory? We could also modify it that when sys.argv[1] is a fasta file, run it on the fasta file. When sys.argv[1] is a directory, run it on all fasta files in the directory. Not too hard, good coding exercise for you ;)

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by WouterDeCoster43k
1

Hi!

Thanks, your code is definitely better! As you can see, i am still learning. :)

ADD REPLYlink written 3.5 years ago by bongok40

Hi WouterDeCoster,

Your solution to this problem is memory efficient and quicker. Something is wrong after the addition of counting the number of sequences. For some reasons, it produces only the first sequence from the file and output is only 1KB with one sequence in it.

Plus, on each file it says "Filtering retained 1 out {} sequences in this file." The answer for countGood is always one various files.

I tried to debug it on paper but i am not able to find what's wrong :(

I still want to add the feature of folder as input and produce an output. However, some of my files are with .csv extension and some with .tfa

ADD REPLYlink written 3.5 years ago by adeel.maliks2010
1

Looks like I made an embarrassing little mistake, my indentation wasn't correct. Although it surprises me that this didn't lead to an error.

Anyway, I've put the code in a Github gist now, easier to edit, less chance of errors ;)

Can you check it's alright now?

We can search for both fasta and csv files, so that's not a huge problem.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by WouterDeCoster43k

Works perfectly fine now :)

Thanks

ADD REPLYlink written 3.5 years ago by adeel.maliks2010
1

Alright, if you need help to make it work on directories let me know, but try to fix it yourself first - you'll learn more.

ADD REPLYlink written 3.5 years ago by WouterDeCoster43k

I am currently doing that using Youtube glob tutorials :D

If i get stuck, i will shout for help :D

I am really thankful for your help :)

ADD REPLYlink written 3.5 years ago by adeel.maliks2010

Hi,

I worked on it and managed to get all the files in a directory. However, i am having hard time restructuring the code to make it work in my scenario.

Here's the code;

import glob

def main():
    types=('*.csv','*.tfa')
    files_grabbed= []
    for afile in types:
        files_grabbed.extend(glob.glob(afile))

    print files_grabbed;
    return 0;



if (__name__ == "__main__"):
    main()

Please help me :)

ADD REPLYlink written 3.5 years ago by adeel.maliks2010

Why are you returning 0? I think you should return the list of files to work with.

I updated my gist with a suggestion how I would handle this. I usually have a master function calling the other functions. The first function called would perform all "sanity checking" to ensure the input is as expected (mostly important if you write scripts which are also used by other people).

Just as a side note: make it a habit of writing 'print("Hello world")' instead of 'print "Hello world"`. The second form won't work under python3 (which will in the near future replace python2).

ADD REPLYlink written 3.5 years ago by WouterDeCoster43k

Hi,

About returning 0, i didn't have idea to return files and talk with other functions through these files as you did in your modified version of code.

How the input at the command line would look like? When i am feeding the program "directory" as input... its returning "Invalid Input" and when i am giving it a fasta file as input, it gives me error "Name Error: global name "Filtered_file is not defined"

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by adeel.maliks2010
1

Gha, my code wasn't finished - just a hint on how you could do things. For running on a directory I chose the keyword 'dir' but you can easily choose that, right?

If you are going to run it on a directory you cannot have one output name (filtered_file variable). You need to base your output file based on the input file. e.g. if input file is my_beautiful_fasta.tfa you could for example chose filtered_my_beautiful_fasta.tfa or my_beautiful_fasta_filtered.tfa

Passing data through functions is quite useful, allows you to split up each task by a separate function. Easy for thinking, easy for designing, easy for modifying.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by WouterDeCoster43k

Finally i am able to solve the problem!

Thank you so much everyone, especially WouterDeCoster for contribution!

ADD REPLYlink written 3.5 years ago by adeel.maliks2010

Thanks Bongok for the message.

The difference i find you used "filtered_file" instead of "output_handle". I am thinking there could be logic behind it.

I am newbie, can you explain please?

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by adeel.maliks2010
1

I have modified the code above, now it should also give you number of sequences before and after filtering.

ADD REPLYlink written 3.5 years ago by bongok40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1266 users visited in the last hour