Python script to remove sequences from a fasta file
1
1
Entering edit mode
3.4 years ago
T_18 ▴ 40

Dear all,

I am trying to emove fasta sequences from a fasta file based on a txt file with ID's that need to be removed. The fasta file contains 44306447 reads (RNAseq data). I have tried to use the following script from another thread: How To Remove Certain Sequences From A Fasta File

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
remove_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file

remove = set()
with open(remove_file) as f:
for line in f:
line = line.strip()
if line != "":

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
for seq in fasta_sequences:
**nam = seq.id
**nuc = str(seq.seq)
**if nam not in remove and len(nuc) > 0:
SeqIO.write([seq], f, "fasta")


But unfortunately using grep ">" does result in the exact same number of reads, so I am concluding there must be a mistake in the code somewhere but I cannot figure out where. Is someone able to help me out here to get a script running that removes reads from a fasta file based on a text file with headers and that can handle LARGE files?

Thanks!

Python fasta • 4.4k views
0
Entering edit mode

In your remove file, do the sequence ids still have the fasta ">" prepended?

0
Entering edit mode

Why do you use **name, **nuc, **if in the last for loop? Is that meant to be there?

0
Entering edit mode

Hello,

do you really need your own python script? There are several program out there, that do this task for your, e.g. seqkit

\$ seqkit -n -v -f ids_to_remove.txt input.fasta > output.fasta


fin swimmer

0
Entering edit mode

While I agree with "don't reinvent the wheel", I also appreciate when people try and solve such things by their own - how else are you supposed to learn?

1
Entering edit mode

Yes, you're right. As I don't know why the OP likes to use python I asked for it. There might be several reasons. One is he/she likes to learn scripting, another is he/she doesn't know it already exists a tool for it. So I made a suggestion and one can choose to use it or not. Even if one like to learn scripting, it is good to know that there other tools that can do the job, so one could use it to validate it's own program.

Some comments on the code of the OP for learning purpose:

remove = set()


Why using a set? Can there be duplicates in the input file?

with open(remove_file) as f:
for line in f:
line = line.strip()
if line != "":


This can be shorten, by reading the whole file at once and split by line.

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')


There is no need to use open() here, BioPython do this for you.

**nam = seq.id
**nuc = str(seq.seq)
**if nam not in remove and len(nuc) > 0:


As other mention before, what are the ** for? Also copy the id and seq into new variable can be ommited here. In my opinion this doesn't help for code readability (personal opinion). Can there be sequences with no sequence? That would be weired. I guess you can remove the check for it.

To summarize the script could look like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
remove_file = sys.argv[2]  # Input wanted file, one gene name per line
result_file = sys.argv[3]  # Output fasta file

with open(result_file, "w") as f, open(remove_file, "r") as r:

for seq in SeqIO.parse(fasta_file, 'fasta'):
if seq.id not in remove:
SeqIO.write(seq, f, "fasta")


fin swimmer

1
Entering edit mode

Sorry, I didn't want to sound harsh. I agree with that sentiment, OP might not know about available tools, so of course suggesting seqtk makes sense!

Concerning your remark about the use of a set. We don't expect duplicates, but using a list, as you suggest, requires linear search to check for each read name. In order to decrease that, using a set makes the most sense since it uses hashing. A dict would work as well, but then there's the memory overhead associated with that. A frozenset would probably be the best solution.

0
Entering edit mode

Didn't know that set using hashing. Thanks for pointing out!

0
Entering edit mode

Thanks, all. This indeed solved the issue and it is running fine now!

And why using script, it's fun to get it running!

1
Entering edit mode

Hello T_18,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

0
Entering edit mode

Done, thanks.. Never did this before.

0
Entering edit mode

Which comment above resolved the issue? I can them move that to an answer so you can accept it.

0
Entering edit mode

I used the fix of goodez. Although reading the script of finswimmer must solve the issue also (but did not test).

0
Entering edit mode

Hi,

I tried to run this script but it returns this error:

    fasta_file = sys.argv[1]  # Input fasta file
IndexError: list index out of range


should the fasta file be of special format? mine starts like this:

  >monCan3F9-B-G1013-Map20
ATACAACCACTAACCAAACAACATATAGACTGGACATCACATAGAAGAAACGTTGTCAAA
AAAAGTATCGGAGGCATCTGGTCCAAAATCGCGAGTAAGGGAGTAAGATATACATATGAT

1
Entering edit mode

dear mthm, you're posting a question as an answer to a 2.6y old post. You'd rather post your question as a separate question.

Spoiler: You might want to use your fasta file as input argument to your script ;-)

python script.py seq.fna

0
Entering edit mode

thanks I realized my mistake after posting it, but was too late:) I thought this issue is too minor to be posted separately, sorry.

1
Entering edit mode
3.4 years ago
goodez ▴ 510

Can you post a couple lines of the "remove" file.

I think changing the last loop might fix it.

with open(result_file, "w") as f:
for seq in fasta_sequences:
nam = str( seq.id)
nam = nam.strip()
nuc = str(seq.seq)
if nam not in remove and len(nuc) > 0:
SeqIO.write([seq], f, "fasta")


I'm assuming the main problem was a difference in nam and remove even when they are meant to be the same.