Question: A code for splitting FASTA file, No output is written for the file
0
gravatar for AsoInfo
3 months ago by
AsoInfo220
Bonn, Germany
AsoInfo220 wrote:

I have written code for splitting FASTA into desired number of sequences and desired number of files. But it doesn't write output for the second file or the last file. I couldn't figure it out.

The code is as follows:

file_open=open("sequence.txt","r") # Input file
count_file=2 # Number of output files
count_sequence=5 # Sequences in each output file

sequence=0 # Count total sequences until count_sequences
for i in range(count_file,0,-1): # Loop for file names generation
    sequence_file=open("sequence"+str(i)+".fasta","w") # Output file
    for line in file_open: # Reading the sequences one by one
        if line.startswith(">"):
            fasta=""
            for data_line in file_open:
                if data_line.startswith("\n"):
                    break
                else:
                    fasta=fasta+data_line

            print(sequence_file)
            print(line+fasta)

            sequence_file.write(line+fasta+"\n") # Writing the FASTA sequences to file
            sequence=sequence+1 # Increment the count of the sequence 

        if sequence==count_sequence: # If number of sequences are equal to the desired number of sequences
            sequence=0 # reset the counter
            break

In this case, sequence1.fasta is empty but sequence2.fasta have the first five sequences.

I know that there are many tools mentioned in the forum that do the desired thing but I want the specific format and output extension, that's why I wrote this but couldn't run it.

Thanks!

splitting fasta python • 251 views
ADD COMMENTlink modified 3 months ago by a.zielezinski7.0k • written 3 months ago by AsoInfo220

the loop logic for lines is not right. only one level is needed

ADD REPLYlink written 3 months ago by shenwei3563.1k

I modified the code and it worked....

ADD REPLYlink written 3 months ago by AsoInfo220

first I think you need to make an index from the fasta.

file_open = open (sys.argv[1],"r")  # Input file
seqs = {}
seq = ""
for line in file_open:

    line = line.rstrip('\n')

    if line.startswith('>'):

        if seq:

            seqs[name] = seq 
            seq = ""
        name = line

    else:

        seq = seq + line
#the last sequence
seqs[name] = seq 

file_open.close()

Then, you can print its elements in loop for i in range(int(count_file)):

ADD REPLYlink modified 3 months ago • written 3 months ago by Buffo460
1

Just preference but this would be nicer, the with block is so handy (untested):

with open(sys.argv[1],"r") as file_open:  # Input file
    seqs = {}
    seq = ""
    for line in file_open:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if seq:
            seqs[name] = seq 
            seq = ""
        name = line
    else:
        seq = seq + line
#the last sequence
seqs[name] = seq
ADD REPLYlink modified 3 months ago • written 3 months ago by jrj.healey1.9k

It worked fine in my computer with python 3.4.2 (windows 8.1). However, your code

data_line.startswith("\n"):

means that sequences in the file must end with empty line. I think it does not match the specification of the FASTA format. I recommend you to check your input file "sequence.txt". (Does it really have more than 5 sequences?)

But I recommend much stronger to change the code compatible for any FASTA formatted files.

In addition,

for line in file_open: # Reading the sequences one by one

for data_line in file_open:

I don't like to use same file object in inner loop (But I'm not python person. It may be a common way in python). (It may be the same thing which shenwei356 mentioned.)

ADD REPLYlink modified 3 months ago • written 3 months ago by yamule40

Could you explain what your desired format / output extension is? It might be easier to use seqkit to split the sequence and then write a much simpler shell script to do the post-processing?

ADD REPLYlink written 3 months ago by James Ashmore1.7k

I modified the code and it worked perfectly, thanks for your help.

ADD REPLYlink written 3 months ago by AsoInfo220
3
gravatar for a.zielezinski
3 months ago by
a.zielezinski7.0k
a.zielezinski7.0k wrote:

I followed the logic of your script and tweaked it a little. Now, the script should do what you wanted. However, I just want to let you know that there are other (more efficient and less complicated) - ways to implement this task in Python.

file_open = open("sequence.txt","r")
MAX_FILES = 2    # Number of output files
SEQ_PER_FILE = 5 # Sequences in each output file

sequence_n = 0 # Counter for sequences
file_n = 0  # Counter for output files

for line in file_open:
    if not line.strip(): continue
    if line.startswith('>'):
        if sequence_n:
            if sequence_n%SEQ_PER_FILE == 0:
                file_n += 1
                oh = open('sequence'+str(file_n)+'.fasta', 'w')
                oh.write(fasta)
                oh.close()
                if file_n == MAX_FILES: 
                    break
                fasta = line
            else:
                fasta += line
        else:
            fasta = line
        sequence_n += 1
    else:
        fasta += line

if sequence_n%SEQ_PER_FILE:
    oh = open('sequence'+str(file_n+1)+'.fasta', 'w')
    oh.write(fasta)
    oh.close()
ADD COMMENTlink written 3 months ago by a.zielezinski7.0k

I have tried it and it worked perfectly...Thanks!

ADD REPLYlink written 3 months ago by AsoInfo220

Just curious if this was an assignment? You had done the right thing by posting your own code along with your question.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax30k

No this was not an assignment. Just a small project stuff.

ADD REPLYlink written 12 weeks ago by AsoInfo220
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 461 users visited in the last hour