Question

help with deleting multiple fasta sequences using biopython

0

Entering edit mode

4.7 years ago

Bacteria_forever ▴ 10

Hi all, I need a trained python eye for this :)

I need to remove 100's of genes from a proteome file contains 1000s genes. Obviously I do not want to do it manually. I have pulled the python code pasted below from somewhere, which is a few years old. It is supposed to do what I want, but it does not. It just copies all the files from the original file to the output file, ignoring the remove.file. This code requires 3 files which I supplied. File 1; "123.fasta" - the file with my original unedited proteome, file 2; "remove.txt" - the file with the list of gene ID's to be removed. File 3. "new.fasta" - the output file with the edited proteome minus the genes listed in the remove.txt file. Ideally, I would like the code to identify the genes in "123.fasta" by the fasta format sequence ID (eg. >sequence1, >sequence2 etc).

This is the code:

import Bio
from Bio import SeqIO
import sys

fasta_file = ("123.fasta")
remove_file = ("remove.txt")
result_file = ("new.fasta")
remove = set (">")
with open(remove_file) as f:
     for line in f:
         line = line.strip()
         if line != "":
            remove.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file), "fasta")
with open(result_file, "w") as f:
   for seq in fasta_sequences:
       nam = str()
       nam = nam.stripseq.id)
       nuc = str(seq.seq)
       SeqIO.write([seq], f, "fasta")

As I said, no matter what I tweak, it just copies and pastes all of the 123.fasta file into the output file, no deletions. Any of the python people see what may be the problem? I am not a trained python operator , just using it for my work.

genome sequence sequencing python • 1.8k views

ADD COMMENT • link updated 4.7 years ago by Joe 22k • written 4.7 years ago by Bacteria_forever ▴ 10

0

Entering edit mode

I have pulled the python code pasted below from somewhere, which is a few years old. It is supposed to do what I want, but it does not.

Since this question is not about python code you wrote consider faSomeRecords utility from Jim Kent (LINK). After downloading the file add execute permissions (chmod u+x faSomeRecords). Use as follows

usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.

ADD REPLY • link 4.7 years ago by GenoMax 154k

score 2 · Answer 1 · 2021-03-09

On closer inspection there is a lot wrong with the code in your opening post, so its easier just to start from scratch:

from Bio import SeqIO
import sys


with open(sys.argv[1], 'r') as rfh:
    to_remove = [line.strip() for line in rfh]

to_keep = []
for record in SeqIO.parse(sys.argv[2], "fasta"):
    if record.id not in to_remove:
        to_keep.append(record)

SeqIO.write(to_keep, sys.argv[3], "fasta")

Usage: `python scriptname.py list_to_remove.txt input.fasta output.fasta`

Not the worlds most efficient script, but works well enough.

If your IDs aren't simple strings that correspond directly, (e.g. they contain the > character etc) you'll need to do a bit more work to generate the to_remove list.

score 0 · Answer 2 · 2021-03-08

0

Entering edit mode

4.7 years ago

JC 13k

The problem is that you are not checking if the sequence is in the black-list:

for seq in fasta_sequences:
       nam = str()
       nam = nam.stripseq.id()
       if nam in remove:
           continue #sequence is black-listed
       nuc = str(seq.seq)
       SeqIO.write([seq], f, "fasta")

ADD COMMENT • link 4.7 years ago by JC 13k

0

Entering edit mode

Hmmmm, thank you, But your amendment returns this:

Traceback (most recent call last): File "biostarans.py", line 19, in <module> nam = nam.stripseq.id() AttributeError: 'str' object has no attribute 'stripseq' ??

ADD REPLY • link 4.7 years ago by Bacteria_forever ▴ 10

0

Entering edit mode

This is probably just due to a known rendering issue with Biostars. There should be parenthesis between nam.strip and seq.id so: nam.strip () seq.id()

ADD REPLY • link 4.7 years ago by Joe 22k

0

Entering edit mode

that is your code, I just added the check if nam in remove

ADD REPLY • link 4.7 years ago by JC 13k

0

Entering edit mode

Yeah seems weird, the code is not coming out right when I submit in message above, hence your amendment didn't work. Sorry. There are brackets missing, which will not reproduce even I post them here now. But still, when I fix it in my terminal with your addition it still just straight up copying and pasting without editing out the remove list. Damn frustrating!

ADD REPLY • link 4.7 years ago by Bacteria_forever ▴ 10

Usage: python scriptname.py list_to_remove.txt input.fasta output.fasta

Usage: `python scriptname.py list_to_remove.txt input.fasta output.fasta`