Question

Changing the record id in a FASTA file using BioPython

7

8.6 years ago

Jeroen Van Goey 2.3k

I have the following FASTA file, original.fasta:

>foo
GCTCACACATAGTTGATGCAGATGTTGAATTCACTATGAGGTGGGAGGATGTAGGGCCA

I need to change the record id from foo to bar, so I wrote the following code:

from Bio import SeqIO

original_file = r"path\to\original.fasta"
corrected_file = r"path\to\corrected.fasta"

with open(original_file) as original, open(corrected_file, 'w') as corrected:
    records = SeqIO.parse(original, 'fasta')
    for record in records:
        print(record.id)            # prints 'foo'
        if record.id == 'foo':
            record.id = 'bar'
        print(record.id)            # prints 'bar' as expected
        SeqIO.write(record, corrected, 'fasta')

We print the record id before and after the change and get the expected result. We can even double-check by reading the corrected file back in with BioPython and printing the record id:

with open(corrected_file) as corrected:
    for record in SeqIO.parse(corrected, 'fasta'):
        print(record.id)                 # prints 'bar', as expected

However, if we open the corrected file in a text editor, we see that the record id is not bar but bar foo:

>bar foo
GCTCACACATAGTTGATGCAGATGTTGAATTCACTATGAGGTGGGAGGATGTAGGGCCA

We can confirm that this is what is written to the file if we read the file using plain Python:

with open(corrected_file) as corrected:
    print(corrected.readlines()[0][1:])  # prints 'bar foo'

Is this a bug in BioPython? And if not, what did I do wrong and how do I change the record id in a FASTA file using BioPython?

BioPython • 17k views

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.6 years ago by Jeroen Van Goey 2.3k

0

6.9 years ago

lakshmi.bioinformatics ▴ 30

I want to change the names of 40k FASTA files using a dictionary that contains the names to be changed. How can I do this in Biopython?

ADD COMMENT • link 6.9 years ago by lakshmi.bioinformatics ▴ 30

0

A: Replace names in FASTA file with a known character string from a text file

A: Renaming fasta headers according to a matching name list

ADD REPLY • link 6.9 years ago by shenwei356 8.4k

0

See the suggestions of shenwei356, and if those don't help, it's probably more appropriate to open a separate thread. But please be more informative; your one-line question doesn't tell us all we need to help.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

Ram · Accepted Answer · 2015-09-01

7

8.6 years ago

samuelmiver ▴ 440

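You also have to change record.description to achieve your goal, because the FASTA writer uses both record.id and record.description when it builds the header line: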

from Bio import SeqIO

original_file = "./original.fasta"
corrected_file = "./corrected.fasta"

with open(original_file) as original, open(corrected_file, 'w') as corrected:
    records = SeqIO.parse(original, 'fasta')
    for record in records:
        print(record.id)
        if record.id == 'foo':
            record.id = 'bar'
            record.description = 'bar' # <- Add this line
        print(record.id)
        SeqIO.write(record, corrected, 'fasta')
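
Why both fields matter: as far as I understand Biopython's FASTA writer, the header is the description alone when the description's first word equals the id, and otherwise the id followed by the description. The sketch below models that rule in plain Python. `fasta_title` is a hypothetical helper for illustration only, not a Biopython function, and the rule itself is my reading of the writer's behaviour.

```python
def fasta_title(record_id, description):
    """Model (an assumption, not Biopython code) of how the header is built."""
    words = description.split(None, 1)
    if words and words[0] == record_id:
        # The description already starts with the id, so it is used alone.
        return description
    if words:
        # Otherwise the id is put in front of the description.
        return f"{record_id} {description}"
    # Empty description: the header is just the id.
    return record_id


print(fasta_title("bar", "foo"))  # bar foo  (only the id was changed)
print(fasta_title("bar", "bar"))  # bar      (id and description both changed)
print(fasta_title("bar", ""))     # bar      (description cleared)
```

Under this model, setting `record.description = ''` would be another way to end up with a header that contains only the new id.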

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.6 years ago by samuelmiver ▴ 440

0

What happens if the original sequences have descriptions?

ADD REPLY • link 8.6 years ago by jrbustosm • 0

0

In a FASTA file, the description is the line distinguished from the sequence data by a greater-than (">") symbol in the first column. In the problem case, 'foo' is the description, and that is the word we want to change, so you need to set record.description as well to convert it to 'bar'.
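
For records whose header has text after the id, one option is to swap only the leading id inside the description and leave the rest alone. This is a minimal sketch, assuming the parsed description begins with the old id followed by the free text, which is what Biopython's FASTA parser normally gives. `renamed_description` is a hypothetical helper and the example header is invented.

```python
def renamed_description(description, old_id, new_id):
    """Return the description with a leading old_id swapped for new_id."""
    parts = description.split(None, 1)
    if parts and parts[0] == old_id:
        # Keep whatever followed the old id.
        rest = parts[1] if len(parts) > 1 else ""
        return f"{new_id} {rest}".strip()
    # The description did not start with the old id; leave it untouched.
    return description


print(renamed_description("foo Homo sapiens mRNA", "foo", "bar"))  # bar Homo sapiens mRNA
print(renamed_description("foo", "foo", "bar"))                    # bar
```

Inside the loop this would be used as `record.description = renamed_description(record.description, 'foo', 'bar')` next to `record.id = 'bar'`.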

ADD REPLY • link 8.6 years ago by samuelmiver ▴ 440

0

I am having a similar problem: only when I modify both the ID and the description in the same way (as suggested in the answer) do I get the name change I need. So the code below works:

for record in records:
    record.id = "NEW_TEXT" + record.id
    record.description = "NEW_TEXT" + record.description

My ID is the accession number from the ENA database, and the description is initially identical apart from some text describing the sequence origin. However, if I do:

for record in records:
    record.id = "NEW_TEXT" + record.id
    record.description = "NEW_TEXT" + record.description.replace(" ", "_")

My output file duplicates the accession (presumably it is using both the ID and the description as the new FASTA header). Adding the .replace(" ", "_") to the altered ID does not solve the problem.

Possibly this needs a new question and a proper description?
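
A possible explanation, sketched in plain Python: once the spaces are replaced, the description is one long token, so its first word no longer equals the ID, and the writer then appears to emit both. The accession `AB123` and the text `some origin` below are made-up placeholders, not values from this thread.

```python
old_id = "AB123"                       # hypothetical accession
old_description = "AB123 some origin"  # hypothetical parsed description

new_id = "NEW_TEXT" + old_id
new_description = "NEW_TEXT" + old_description.replace(" ", "_")

# The description is now a single token, so its first word is not the id.
first_word = new_description.split(None, 1)[0]
print(first_word == new_id)  # False

# One way to keep the two in step: use the same token for both fields.
token = "NEW_TEXT" + old_description.replace(" ", "_")
fixed_id = fixed_description = token
print(fixed_description.split(None, 1)[0] == fixed_id)  # True
```

With the ID and the description identical, only one copy of the name should end up in the header.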

ADD REPLY • link updated 18 months ago by Ram 43k • written 3.9 years ago by __mark- ▴ 10

0

I think I had a similar problem where I wanted to change the FASTA header because of a whitespace, i.e., from "Barcode a" to "Barcode_a", and got an additional line in the header.

I overcame this by doing:

for record in records:
    record.description = record.description.replace(" a", "_a")
    record.id = record.description

ADD REPLY • link updated 18 months ago by Ram 43k • written 3.8 years ago by LG • 0