Question: Filter DNA sequence in FASTA file by minimum length and redirect sequences that don't meet condition to new file using python
0
gravatar for asidhu
7 days ago by
asidhu10
asidhu10 wrote:

I am using a FASTA file with many sequences that I wish to filter. An example of the file is as follows:

>SRR5533383:0::12:0:0:0:
GTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGAGAAAGTTTGCGGCTCAACCGTAAAATTG$
>SRR5533383:0::19:0:0:0:
GTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTAGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTG$
>SRR5533383:0::17:0:0:0:
GTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGACATGTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTG$

Where the line that starts with ">" is the header and the line below it is a sequence. For each sequence there is a header. I want to be able to filter the sequence based on a minimum length (150 base pairs) that they must have in order to be kept in the main file. If the sequence is < 150 then I want it to be removed from the main file and the header of the removed sequence should go to another file ("bad reads file") where the reason it was excluded is included. Such that the output of the "bad reads" file looks like:

>SRR5533383:0::17:0:0:0:
"Excluded because too short"

I'm trying to use the following code to filter the sequences but I can't get it to work:

sequences = {}

with open("seq.fasta") as file:
    header = None
    data = ''

    for line in file:
        if line.startswith(">"):
            if header and data:
                sequences[header] = data
            data = ''
            header = line.rstrip()
        else:
            data += line.rstrip()

    if header and data:
        sequences[header] = data  # deal with the last one in the file

for header, data in sequences.items():
    print("{}; {}".format(header, data))

#Creates "sequences" in header:sequences 

seq = np.asarray(list(data))
print(seq)

#Minimum threshold
min_length = 150
for line in data:
    if line >= min_length:
        print("Sequences meeting the minimum threshold are kept in paired.fasta")
    else:
        data.removeline with open("output.log", "w")
    print("%i removed because too short" % (header))
sequencing python • 138 views
ADD COMMENTlink modified 7 days ago by Heather Ward50 • written 7 days ago by asidhu10
5
gravatar for Heather Ward
7 days ago by
Heather Ward50
Guelph, ON
Heather Ward50 wrote:

You've got some problems in there - for example, seq = np.asarray(list(data)) seq is only going to contain the last line of the fasta file, not every sequence. Try going back through your code line by line and think about what the data variable gets filled with during each iteration, and why that might be so.

Here is one way to do what you want - there are many. You were getting there with the first half of your program. Keep practicing!

#!/usr/bin/env python3

sequences = {}
good_reads = []
bad_reads = []

with open("seq.fasta", "r") as file:
        for line in file:
                line=line.rstrip()
                if (line[0] == ">"):
                        header = line
                        sequences[header] = ""
                else:
                        data = line
                        sequences[header] += data

min_length=150

# figure out which reads are good/bad
for header in sequences.keys():
        if (len(sequences[header]) > min_length):
                good_reads.append(header)
        else:
                bad_reads.append(header)

# write good reads
with open("good.fasta", "w+") as good_out:
        for header in good_reads:
                good_out.write("{}\n{}\n".format(header, sequences[header]))

# write bad reads
with open("bad.fasta", "w+") as bad_out:
        for header in bad_reads:
                bad_out.write("{}\nExcluded because too short\n".format(header))
ADD COMMENTlink written 7 days ago by Heather Ward50

Thank you so much! This is exactly what I was looking for. I'm still new to python and practicing :)

Just a question, if I wanted to edit the seq.fasta file such that only the good reads remained in it, is that possible? In your solution there is a new good_reads.fasta which works perfectly but if I wanted to continue to use seq.fasta with just the good reads in it could I do that? Could it be possible to overwrite the seq.fasta file like this:

#write good reads
with open("seq.fasta", "w+") as good_out:
        for header in good_reads:
                good_out.write("{}\n{}\n".format(header, sequences[header]))
ADD REPLYlink modified 7 days ago • written 7 days ago by asidhu10

Yes, this would work, however it's generally bad practice because doing this will destroy your input file.

It would be best to save your output as a new file in case something goes wrong in your program and your output isn't what you'd expect. For example, imagine in the codeblock you posted there that you accidentally wrote for header in bad_reads. If you ran this it will overwrite your input file with only the bad reads - this is obviously an easy fix in the python program to change it to good_reads, except that you no longer have any good reads in your seq.fasta file because you overwrote it.

At the very least, if you want the output name to be identical, you should make a new directory and output the filtered file there: with open("outputs/seq.fasta", "w+") as good_out:

ADD REPLYlink modified 7 days ago • written 7 days ago by Heather Ward50
3
gravatar for rizoic
7 days ago by
rizoic190
rizoic190 wrote:

In case it is a requirement to do the solution without any additional packages then you can use the solution by Heather. However when parsing standard file formats it is always recommended to use specialized packages for handling them. You can use the Biopython package in python for this. An example for filtering with it is as follows

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", format = "fasta"):
    if len(seq_record) < 150:
        print(seq_record.description)

You can find more details here

ADD COMMENTlink modified 7 days ago • written 7 days ago by rizoic190

Thank you for this! I was instructed not to use Biopython otherwise this would have been the perfect method!

ADD REPLYlink written 7 days ago by asidhu10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1586 users visited in the last hour