Question

Biopython: renaming SeqRecords using dictionary values

8.3 years ago

cdarwin • 0

(Disclaimer: self-teaching; knowledge is minimal)

Hello all.

I am trying to change the record IDs for a batch of sequences using some metadata. I have two files. File 1 is a tab-delimited text file of metadata in a simple format:

proteinID    organismID
string1      string2

File 2: a fasta file with the proteinID as the leading string after the >

>proteinID...

What I want to do is rename the sequences in File 2 using the correspondence from File 1.

>organismID...

So far, I have created a dictionary from file 1 using the protein IDs as the key

id_match_dict = {}
with open('file1.txt') as id_match:
    for line in id_match:
        (key,val) = line.strip("\n").split("\t")
        id_match_dict[str(key)] = val

This has worked well so far. Now I am trying to use this dictionary to modify the ID of the SeqRecord objects using Biopython (record.id). My attempts at this have been really bad and I don't even want to post what I have written. Suffice it to say, I am at a loss at this point. Could anyone help me with this, or even point me in the right direction? I have no clue how to approach this problem.

[Please let me know if I need to provide more information, I am trying to keep this brief]

Thank you in advance!

python biopython fasta • 3.3k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.3 years ago by cdarwin • 0

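from copy import deepcopy

from Bio import SeqIO

# Build the old-ID -> new-ID dictionary from the tab-delimited file
id_mapper = {}
with open("file1") as handle:
    for each_line in handle:
        if each_line.strip():
            old_id, new_id = each_line.rstrip("\n").split("\t")
            id_mapper[old_id] = new_id

# Copy each record, swap in the new ID, and write it out as FASTA.
# "file2_renamed" is a placeholder name for the output file.
with open("file2") as in_handle, open("file2_renamed", "w") as out_handle:
    for record in SeqIO.parse(in_handle, "fasta"):
        record_mod = deepcopy(record)
        record_mod.id = id_mapper[record.id]
        SeqIO.write(record_mod, out_handle, "fasta")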

Try this code...
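
One detail worth knowing with the SeqRecord route: Biopython's FASTA parser keeps the whole title line in record.description, so changing only record.id tends to leave the old ID in the written header (">newID oldID ..."). Below is a minimal sketch of one way around that, not the poster's method. The helper name `rename_records` is hypothetical, it modifies the records in place rather than copying them, and it assumes each record has `.id`, `.name` and `.description` attributes, as the records from `SeqIO.parse(handle, "fasta")` do. IDs missing from the dictionary are left unchanged here, which is an assumption about what you would want.

```python
def rename_records(records, id_map):
    """Yield records with .id/.name replaced using id_map (old ID -> new ID).

    Hypothetical helper: works on any objects exposing .id, .name and
    .description, such as Biopython SeqRecord objects parsed from FASTA.
    """
    for record in records:
        old_id = record.id
        # Assumption: IDs that are not in the table keep their old name
        new_id = id_map.get(old_id, old_id)
        record.id = new_id
        record.name = new_id

        # FASTA-parsed records repeat the ID as the first word of the
        # description; drop it so the old ID is not written back out.
        parts = record.description.split(None, 1)
        if parts and parts[0] == old_id:
            record.description = parts[1] if len(parts) > 1 else ""

        yield record
```

With Biopython installed, something like `SeqIO.write(rename_records(SeqIO.parse("in.fasta", "fasta"), id_mapper), "out.fasta", "fasta")` should then handle the whole file in one pass (the file names are placeholders).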

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.3 years ago by nvijay.1991 • 0

I have a Galaxy tool which would do this nicely for you, https://github.com/peterjc/pico_galaxy/tree/master/tools/seq_rename - written in Python but currently it uses the Galaxy FASTA parser rather than the Biopython one.

ADD REPLY • link 8.3 years ago by Peter 6.0k

8.3 years ago

Matt Shirley 10k

I don't have a Biopython answer, although it should be pretty straightforward. I would suggest using pyfaidx for this:

ADD COMMENT • link 8.3 years ago by Matt Shirley 10k

Excellent! Thank you so much for your response. I have never heard of the pyfaidx module and am glad to discover it.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.3 years ago by cdarwin • 0

Ram · Accepted Answer · 2015-12-16

Something like this, based on your start:

# Load name mapping as a dictionary
id_match_dict = {}
with open('file1.txt') as id_match:
    for line in id_match:
        if line.strip():
            old, new = line.strip("\n").split("\t")
            id_match_dict[old] = new

There are many ways to do the next bit; this one uses plain strings and writes the FASTA file with no line wrapping:

from Bio.SeqIO.FastaIO import SimpleFastaParser

in_filename = "old_names.fas"
out_filename = "new_names.fas"

with open(in_filename) as in_handle:
    with open(out_filename, "w") as out_handle:
        for title, seq in SimpleFastaParser(in_handle):
            # fields is [name] or [name, description]; some titles have no description
            fields = title.split(None, 1)
            fields[0] = id_match_dict[fields[0]]
            out_handle.write(">%s\n%s\n" % (" ".join(fields), seq))

NOTE: This will raise a KeyError if it finds a name that is not in your table. What would you want to happen? Leave the old name as is?
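
If leaving the old name in place is the behaviour you want, here is a minimal sketch of how the title handling could be pulled into a small function. The name `rename_title` and the `keep_unmapped` flag are hypothetical, not part of Biopython. It uses only plain strings, so it can be called on each title that SimpleFastaParser yields.

```python
def rename_title(title, id_map, keep_unmapped=True):
    """Return a FASTA title with its first word replaced via id_map.

    Hypothetical helper. With keep_unmapped=True, names missing from
    id_map are left as they are; with False, a KeyError is raised.
    """
    fields = title.split(None, 1)  # [name] or [name, description]
    if not fields:
        return title  # empty title line: nothing to rename

    old_name = fields[0]
    if old_name in id_map:
        fields[0] = id_map[old_name]
    elif not keep_unmapped:
        raise KeyError(old_name)

    return " ".join(fields)
```

Inside the loop, the write would then become `out_handle.write(">%s\n%s\n" % (rename_title(title, id_match_dict), seq))`. Passing `keep_unmapped=False` keeps the strict behaviour, which can be useful for spotting typos in the table.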