help with deleting multiple fasta sequences using biopython
2
0
Entering edit mode
3.1 years ago

Hi all, I need a trained python eye for this :)

I need to remove 100's of genes from a proteome file contains 1000s genes. Obviously I do not want to do it manually. I have pulled the python code pasted below from somewhere, which is a few years old. It is supposed to do what I want, but it does not. It just copies all the files from the original file to the output file, ignoring the remove.file. This code requires 3 files which I supplied. File 1; "123.fasta" - the file with my original unedited proteome, file 2; "remove.txt" - the file with the list of gene ID's to be removed. File 3. "new.fasta" - the output file with the edited proteome minus the genes listed in the remove.txt file. Ideally, I would like the code to identify the genes in "123.fasta" by the fasta format sequence ID (eg. >sequence1, >sequence2 etc).

This is the code:

import Bio
from Bio import SeqIO
import sys

fasta_file = ("123.fasta")
remove_file = ("remove.txt")
result_file = ("new.fasta")
remove = set (">")
with open(remove_file) as f:
     for line in f:
         line = line.strip()
         if line != "":
            remove.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file), "fasta")
with open(result_file, "w") as f:
   for seq in fasta_sequences:
       nam = str()
       nam = nam.stripseq.id)
       nuc = str(seq.seq)
       SeqIO.write([seq], f, "fasta")

As I said, no matter what I tweak, it just copies and pastes all of the 123.fasta file into the output file, no deletions. Any of the python people see what may be the problem? I am not a trained python operator , just using it for my work.

genome sequence sequencing python • 881 views
ADD COMMENT
0
Entering edit mode

I have pulled the python code pasted below from somewhere, which is a few years old. It is supposed to do what I want, but it does not.

Since this question is not about python code you wrote consider faSomeRecords utility from Jim Kent (LINK). After downloading the file add execute permissions (chmod u+x faSomeRecords). Use as follows

usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.
ADD REPLY
2
Entering edit mode
3.1 years ago
Joe 21k

On closer inspection there is a lot wrong with the code in your opening post, so its easier just to start from scratch:

from Bio import SeqIO
import sys


with open(sys.argv[1], 'r') as rfh:
    to_remove = [line.strip() for line in rfh]

to_keep = []
for record in SeqIO.parse(sys.argv[2], "fasta"):
    if record.id not in to_remove:
        to_keep.append(record)

SeqIO.write(to_keep, sys.argv[3], "fasta")

Usage: python scriptname.py list_to_remove.txt input.fasta output.fasta

Not the worlds most efficient script, but works well enough.

If your IDs aren't simple strings that correspond directly, (e.g. they contain the > character etc) you'll need to do a bit more work to generate the to_remove list.

ADD COMMENT
0
Entering edit mode
3.1 years ago
JC 13k

The problem is that you are not checking if the sequence is in the black-list:

for seq in fasta_sequences:
       nam = str()
       nam = nam.stripseq.id()
       if nam in remove:
           continue #sequence is black-listed
       nuc = str(seq.seq)
       SeqIO.write([seq], f, "fasta")
ADD COMMENT
0
Entering edit mode

Hmmmm, thank you, But your amendment returns this:

Traceback (most recent call last): File "biostarans.py", line 19, in <module> nam = nam.stripseq.id() AttributeError: 'str' object has no attribute 'stripseq' ??

ADD REPLY
0
Entering edit mode

This is probably just due to a known rendering issue with Biostars. There should be parenthesis between nam.strip and seq.id so: nam.strip () seq.id()

ADD REPLY
0
Entering edit mode

that is your code, I just added the check if nam in remove

ADD REPLY
0
Entering edit mode

Yeah seems weird, the code is not coming out right when I submit in message above, hence your amendment didn't work. Sorry. There are brackets missing, which will not reproduce even I post them here now. But still, when I fix it in my terminal with your addition it still just straight up copying and pasting without editing out the remove list. Damn frustrating!

ADD REPLY

Login before adding your answer.

Traffic: 1708 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6