Question: Python script to remove sequences from a fasta file
1
gravatar for T_18
23 months ago by
T_1840
T_1840 wrote:

Dear all,

I am trying to emove fasta sequences from a fasta file based on a txt file with ID's that need to be removed. The fasta file contains 44306447 reads (RNAseq data). I have tried to use the following script from another thread: How To Remove Certain Sequences From A Fasta File

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
remove_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file

remove = set()
with open(remove_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            remove.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        **nam = seq.id
        **nuc = str(seq.seq)
        **if nam not in remove and len(nuc) > 0:
            SeqIO.write([seq], f, "fasta")

But unfortunately using grep ">" does result in the exact same number of reads, so I am concluding there must be a mistake in the code somewhere but I cannot figure out where. Is someone able to help me out here to get a script running that removes reads from a fasta file based on a text file with headers and that can handle LARGE files?

Thanks!

python fasta • 2.4k views
ADD COMMENTlink modified 23 months ago • written 23 months ago by T_1840

In your remove file, do the sequence ids still have the fasta ">" prepended?

ADD REPLYlink written 23 months ago by cschu1812.3k

Why do you use **name, **nuc, **if in the last for loop? Is that meant to be there?

ADD REPLYlink written 23 months ago by goodez480

Hello,

do you really need your own python script? There are several program out there, that do this task for your, e.g. seqkit

$ seqkit -n -v -f ids_to_remove.txt input.fasta > output.fasta

fin swimmer

ADD REPLYlink modified 23 months ago • written 23 months ago by finswimmer13k

While I agree with "don't reinvent the wheel", I also appreciate when people try and solve such things by their own - how else are you supposed to learn?

ADD REPLYlink written 23 months ago by cschu1812.3k
1

Yes, you're right. As I don't know why the OP likes to use python I asked for it. There might be several reasons. One is he/she likes to learn scripting, another is he/she doesn't know it already exists a tool for it. So I made a suggestion and one can choose to use it or not. Even if one like to learn scripting, it is good to know that there other tools that can do the job, so one could use it to validate it's own program.

Some comments on the code of the OP for learning purpose:

remove = set()

Why using a set? Can there be duplicates in the input file?

 

with open(remove_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            remove.add(line)

This can be shorten, by reading the whole file at once and split by line.

 

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

There is no need to use open() here, BioPython do this for you.

 

**nam = seq.id
**nuc = str(seq.seq)
**if nam not in remove and len(nuc) > 0:

As other mention before, what are the ** for? Also copy the id and seq into new variable can be ommited here. In my opinion this doesn't help for code readability (personal opinion). Can there be sequences with no sequence? That would be weired. I guess you can remove the check for it.

 

To summarize the script could look like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
remove_file = sys.argv[2]  # Input wanted file, one gene name per line
result_file = sys.argv[3]  # Output fasta file

with open(result_file, "w") as f, open(remove_file, "r") as r:
    remove = r.read().split("\n")

    for seq in SeqIO.parse(fasta_file, 'fasta'):
        if seq.id not in remove:
            SeqIO.write(seq, f, "fasta")

fin swimmer

ADD REPLYlink modified 23 months ago • written 23 months ago by finswimmer13k
1

Sorry, I didn't want to sound harsh. I agree with that sentiment, OP might not know about available tools, so of course suggesting seqtk makes sense!

Concerning your remark about the use of a set. We don't expect duplicates, but using a list, as you suggest, requires linear search to check for each read name. In order to decrease that, using a set makes the most sense since it uses hashing. A dict would work as well, but then there's the memory overhead associated with that. A frozenset would probably be the best solution.

ADD REPLYlink written 23 months ago by cschu1812.3k

Didn't know that set using hashing. Thanks for pointing out!

ADD REPLYlink written 23 months ago by finswimmer13k

Thanks, all. This indeed solved the issue and it is running fine now!

And why using script, it's fun to get it running!

ADD REPLYlink written 23 months ago by T_1840
1

Hello T_18,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLYlink modified 23 months ago • written 23 months ago by finswimmer13k

Done, thanks.. Never did this before.

ADD REPLYlink written 23 months ago by T_1840

Which comment above resolved the issue? I can them move that to an answer so you can accept it.

ADD REPLYlink written 23 months ago by genomax84k

I used the fix of goodez. Although reading the script of finswimmer must solve the issue also (but did not test).

ADD REPLYlink written 23 months ago by T_1840
1
gravatar for goodez
23 months ago by
goodez480
United States
goodez480 wrote:

Can you post a couple lines of the "remove" file.

I think changing the last loop might fix it.

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        nam = str( seq.id)
        nam = nam.strip()
        nuc = str(seq.seq)
        if nam not in remove and len(nuc) > 0:
            SeqIO.write([seq], f, "fasta")

I'm assuming the main problem was a difference in nam and remove even when they are meant to be the same.

ADD COMMENTlink modified 23 months ago • written 23 months ago by goodez480
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1367 users visited in the last hour