Question: Filter DNA sequence in FASTA file by minimum length and redirect sequences that don't meet condition to new file using python
gravatar for asidhu
24 months ago by
asidhu10 wrote:

I am using a FASTA file with many sequences that I wish to filter. An example of the file is as follows:


Where the line that starts with ">" is the header and the line below it is a sequence. For each sequence there is a header. I want to be able to filter the sequence based on a minimum length (150 base pairs) that they must have in order to be kept in the main file. If the sequence is < 150 then I want it to be removed from the main file and the header of the removed sequence should go to another file ("bad reads file") where the reason it was excluded is included. Such that the output of the "bad reads" file looks like:

"Excluded because too short"

I'm trying to use the following code to filter the sequences but I can't get it to work:

sequences = {}

with open("seq.fasta") as file:
    header = None
    data = ''

    for line in file:
        if line.startswith(">"):
            if header and data:
                sequences[header] = data
            data = ''
            header = line.rstrip()
            data += line.rstrip()

    if header and data:
        sequences[header] = data  # deal with the last one in the file

for header, data in sequences.items():
    print("{}; {}".format(header, data))

#Creates "sequences" in header:sequences 

seq = np.asarray(list(data))

#Minimum threshold
min_length = 150
for line in data:
    if line >= min_length:
        print("Sequences meeting the minimum threshold are kept in paired.fasta")
        data.removeline with open("output.log", "w")
    print("%i removed because too short" % (header))
sequencing python • 1.6k views
ADD COMMENTlink modified 24 months ago by Heather Ward50 • written 24 months ago by asidhu10
gravatar for Heather Ward
24 months ago by
Heather Ward50
Guelph, ON
Heather Ward50 wrote:

You've got some problems in there - for example, seq = np.asarray(list(data)) seq is only going to contain the last line of the fasta file, not every sequence. Try going back through your code line by line and think about what the data variable gets filled with during each iteration, and why that might be so.

Here is one way to do what you want - there are many. You were getting there with the first half of your program. Keep practicing!

#!/usr/bin/env python3

sequences = {}
good_reads = []
bad_reads = []

with open("seq.fasta", "r") as file:
        for line in file:
                if (line[0] == ">"):
                        header = line
                        sequences[header] = ""
                        data = line
                        sequences[header] += data


# figure out which reads are good/bad
for header in sequences.keys():
        if (len(sequences[header]) > min_length):

# write good reads
with open("good.fasta", "w+") as good_out:
        for header in good_reads:
                good_out.write("{}\n{}\n".format(header, sequences[header]))

# write bad reads
with open("bad.fasta", "w+") as bad_out:
        for header in bad_reads:
                bad_out.write("{}\nExcluded because too short\n".format(header))
ADD COMMENTlink written 24 months ago by Heather Ward50

Thank you so much! This is exactly what I was looking for. I'm still new to python and practicing :)

Just a question, if I wanted to edit the seq.fasta file such that only the good reads remained in it, is that possible? In your solution there is a new good_reads.fasta which works perfectly but if I wanted to continue to use seq.fasta with just the good reads in it could I do that? Could it be possible to overwrite the seq.fasta file like this:

#write good reads
with open("seq.fasta", "w+") as good_out:
        for header in good_reads:
                good_out.write("{}\n{}\n".format(header, sequences[header]))
ADD REPLYlink modified 24 months ago • written 24 months ago by asidhu10

Yes, this would work, however it's generally bad practice because doing this will destroy your input file.

It would be best to save your output as a new file in case something goes wrong in your program and your output isn't what you'd expect. For example, imagine in the codeblock you posted there that you accidentally wrote for header in bad_reads. If you ran this it will overwrite your input file with only the bad reads - this is obviously an easy fix in the python program to change it to good_reads, except that you no longer have any good reads in your seq.fasta file because you overwrote it.

At the very least, if you want the output name to be identical, you should make a new directory and output the filtered file there: with open("outputs/seq.fasta", "w+") as good_out:

ADD REPLYlink modified 24 months ago • written 24 months ago by Heather Ward50
gravatar for rizoic
24 months ago by
rizoic230 wrote:

In case it is a requirement to do the solution without any additional packages then you can use the solution by Heather. However when parsing standard file formats it is always recommended to use specialized packages for handling them. You can use the Biopython package in python for this. An example for filtering with it is as follows

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", format = "fasta"):
    if len(seq_record) < 150:

You can find more details here

ADD COMMENTlink modified 24 months ago • written 24 months ago by rizoic230

Thank you for this! I was instructed not to use Biopython otherwise this would have been the perfect method!

ADD REPLYlink written 24 months ago by asidhu10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1032 users visited in the last hour