Question: Changing the record id in a FASTA file using BioPython
Jeroen Van Goey (Ghent, Belgium) wrote, 3.9 years ago:

I have the following FASTA file, `original.fasta`:

    >foo
    GCTCACACATAGTTGATGCAGATGTTGAATTCACTATGAGGTGGGAGGATGTAGGGCCA

I need to change the record id from `foo` to `bar`, so I wrote the following code:

    from Bio import SeqIO

    original_file = r"path\to\original.fasta"
    corrected_file = r"path\to\corrected.fasta"

    with open(original_file) as original, open(corrected_file, 'w') as corrected:
        records = SeqIO.parse(original, 'fasta')
        for record in records:
            print(record.id)            # prints 'foo'
            if record.id == 'foo':
                record.id = 'bar'
            print(record.id)            # prints 'bar' as expected
            SeqIO.write(record, corrected, 'fasta')

Printing the record id before and after the change gives the expected result. We can even double-check by reading the corrected file back in with BioPython and printing the record id:

    with open(corrected_file) as corrected:
        for record in SeqIO.parse(corrected, 'fasta'):
            print(record.id)            # prints 'bar', as expected

However, if we open the corrected file in a text editor, we see that the header is not `bar` but `bar foo`:

    >bar foo
    GCTCACACATAGTTGATGCAGATGTTGAATTCACTATGAGGTGGGAGGATGTAGGGCCA

We can confirm that this is what is written to the file if we read the file using plain Python:

    with open(corrected_file) as corrected:
        print(corrected.readlines()[0][1:])  # prints 'bar foo'

Is this a bug in BioPython? And if not, what did I do wrong and how do I change the record id in a FASTA file using BioPython?

samuelmiver (Centre for Genomic Regulation, Barcelona, Spain) wrote, 3.9 years ago:

You also have to change `record.description`. When Biopython writes a FASTA record, it builds the header from both `record.id` and `record.description`, and after parsing the description still holds the old header text (`foo`):

    from Bio import SeqIO

    original_file = "./original.fasta"
    corrected_file = "./corrected.fasta"

    with open(original_file) as original, open(corrected_file, 'w') as corrected:
        records = SeqIO.parse(original, 'fasta')
        for record in records:
            print(record.id)
            if record.id == 'foo':
                record.id = 'bar'
                record.description = 'bar'  # <- Add this line
            print(record.id)
            SeqIO.write(record, corrected, 'fasta')

 


jrbustosm replied, 3.9 years ago:

What happens if the original sequences have descriptions?

samuelmiver replied, 3.9 years ago:

In a FASTA file, the header is the line distinguished from the sequence data by a greater-than (">") symbol in the first column. Biopython stores the first word of that line as `record.id` and the whole line as `record.description`. In the problem case the header is just 'foo', so the description is also 'foo', and that is the word we want to change: you need to set `record.description` to 'bar' as well. Note that if the header carried more text after the id, overwriting the description this way would discard it.
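
If the headers do carry extra text after the id, one option is to strip the old id from the front of the description and keep the rest. The following is a minimal sketch under that assumption; `rename_record` is a hypothetical helper (not part of Biopython), and the in-memory `StringIO` objects stand in for real files.

```python
from io import StringIO

from Bio import SeqIO


def rename_record(record, new_id):
    """Give `record` a new id while keeping the rest of its header text."""
    # After parsing FASTA, record.description holds the whole header line,
    # so split off the leading old id before changing it.
    parts = record.description.split(None, 1)
    if parts and parts[0] == record.id:
        rest = parts[1] if len(parts) > 1 else ""
    else:
        rest = record.description
    record.id = new_id
    record.description = rest  # with an empty description only the id is written
    return record


# In-memory stand-in for a FASTA file whose header carries extra text.
source = StringIO(">foo some free text\nGCTCACACATAGTTGATG\n")
output = StringIO()
for record in SeqIO.parse(source, "fasta"):
    if record.id == "foo":
        rename_record(record, "bar")
    SeqIO.write(record, output, "fasta")

print(output.getvalue())
```

With this approach the header should come out as `>bar some free text`, and a header that contained only `foo` should come out as plain `>bar`.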
lakshmi.bioinformatics (India) wrote, 2.2 years ago:

I want to change the names of 40k FASTA files using a dictionary containing the names to be changed. How can I do this in Biopython?


WouterDeCoster replied, 2.2 years ago:

See the suggestions of shenwei356. If those don't help, it's probably more appropriate to open a separate thread. But please be more informative: your one-line question doesn't tell us all we need to know to help.
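
For reference, dictionary-based renaming with Biopython might look roughly like the sketch below. It assumes the goal is to rename record ids inside a FASTA file; the `new_ids` mapping, the `renamed` helper, and the sample sequences are all made up for illustration, and `StringIO` stands in for real files.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical mapping of old record ids to new record ids.
new_ids = {"foo": "bar", "seq2": "sample_B"}


def renamed(records, mapping):
    """Yield records, replacing any id that appears in `mapping`."""
    for record in records:
        if record.id in mapping:
            record.id = mapping[record.id]
            # Clear the description so the old header text is not written;
            # this also drops any extra text that followed the old id.
            record.description = ""
        yield record


source = StringIO(">foo\nACGT\n>seq2\nTTGA\n>other\nCCCC\n")
output = StringIO()
# SeqIO.write accepts any iterable of records and returns how many it wrote.
count = SeqIO.write(renamed(SeqIO.parse(source, "fasta"), new_ids), output, "fasta")

print(count)
print(output.getvalue())
```

Because `renamed` is a generator, records are streamed one at a time, so the same pattern should work on large files without loading everything into memory. Records whose id is not in the mapping pass through unchanged.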