Question: Biopython script to change file formats and headers
3.2 years ago, skbrimer (United States) wrote:

Hi group, 

I'm trying to write a Python script that converts files from FASTQ to FASTA. That part I have working:

from Bio import SeqIO
import sys 

# grabbing the file and the name 
seq_file = sys.argv[1]
labels = seq_file.split(".")

# converting the file from fastq to fasta
SeqIO.convert(seq_file,"fastq",labels[0]+".fasta","fasta")

No problem there, but now I would like to change the header of the FASTA file in the same script, and I'm stuck. When I add the SeqIO.parse function like this:

for seq_record in SeqIO.parse(labels[0]+".fasta","fasta"):
    seq_record.id = labels[0] # renaming the pseudogene with the lab id
    SeqIO.write(seq_record,labels[0]+".fasta","fasta")

I get an error saying I didn't define seq_record, which I thought I did, and the script fails. I expected the script to convert the file, creating the new FASTA file (which it does when the parse loop is not in there), and then parse that file.

So now I'm wondering whether it is in fact producing that file, since the conversion is no longer the end of the script. Do I need to make a temp file in order to do both actions in one script?
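
A temp file should not be needed for this. As a minimal sketch (not the poster's script), the records can be renamed while they are being converted, so the FASTA file is written only once. The function name `convert_and_rename`, the `sample_id + "_" + read id` naming scheme, and the in-memory example reads are made up for illustration; `SeqIO.parse` and `SeqIO.write` are the same Biopython functions used above.

```python
import io

from Bio import SeqIO


def convert_and_rename(fastq_source, fasta_target, sample_id):
    """Read FASTQ records, rename each one, and write them as FASTA in one pass.

    fastq_source and fasta_target may be file names or open text handles.
    Returns the number of records written.
    """
    def renamed(records):
        for record in records:
            # Hypothetical naming scheme: sample id, underscore, original read id.
            record.id = sample_id + "_" + record.id
            # Clear the description so the old title is not written after the new id.
            record.description = ""
            yield record

    return SeqIO.write(renamed(SeqIO.parse(fastq_source, "fastq")), fasta_target, "fasta")


# Small in-memory example (made-up reads) so the sketch runs without any files.
example_fastq = "@read1 extra text\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n"
fasta_out = io.StringIO()
written = convert_and_rename(io.StringIO(example_fastq), fasta_out, "S1")
print(written)
print(fasta_out.getvalue())
```

With real files, the two handles would be replaced by the input and output file names, which must be different files.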

****EDIT****

Well, it works now, so if anyone would like to do something similar, here is my solution:

# this script is used to convert fastq files to fasta files 
# then to rename the fasta ID with the sample ID from the lab

from Bio import SeqIO
import sys 

# grabbing the file and the name 
seq_file = sys.argv[1]
labels = seq_file.split(".")

# converting the file from fastq to fasta
SeqIO.convert(seq_file,"fastq",labels[0]+".fasta","fasta")

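# taking the converted file and then changing the fasta header
records = list(SeqIO.parse(labels[0]+".fasta","fasta")) # read everything first
for seq_record in records:
    seq_record.id = labels[0] # renaming the pseudogene with the lab id

# write once, after parsing, so the file is not overwritten while it is read
SeqIO.write(records, labels[0]+".fasta","fasta")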

biopython processing • 1.8k views
modified 3.2 years ago • written 3.2 years ago by skbrimer
3.2 years ago, skbrimer (United States) wrote:

Here is the final script :D

# this script is used to convert fastq files to fasta files 
# then to rename the fasta ID with the sample ID from the lab

from Bio import SeqIO
import sys 

# grabbing the file and the name 
seq_file = sys.argv[1]
labels = seq_file.split(".")

# converting the file from fastq to fasta
SeqIO.convert(seq_file,"fastq",labels[0]+".fasta","fasta")

# taking the converted file and then changing the fasta header
renamed_records = []

with open(labels[0]+".fasta","r") as handle:
    for seq_record in SeqIO.parse(handle,"fasta"):
        old_header = seq_record.id
        new_header = labels[0]
        # renaming the pseudogene with the lab id and the reference used
        seq_record.id = new_header + "_" + old_header
        seq_record.description = "" # this strips the old header out
        renamed_records.append(seq_record)

# write all records once, after the input is closed, so the file is not
# overwritten while it is still being parsed
SeqIO.write(renamed_records, labels[0]+".fasta","fasta")
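
One thing to watch with `seq_file.split(".")`: it cuts at the first dot anywhere in the argument, so a path such as `./data/sample1.fastq` gives an empty `labels[0]`. A possible alternative, sketched here with the standard-library `pathlib`, removes only the final extension. The helper name `fasta_path_and_label` and the example path are hypothetical. Note that for a name like `sample1.trimmed.fastq` this keeps `sample1.trimmed` as the label, which differs from what `split(".")[0]` returns.

```python
from pathlib import Path


def fasta_path_and_label(fastq_path):
    """Return (output FASTA path, sample label) derived from a FASTQ path."""
    path = Path(fastq_path)
    # with_suffix swaps only the last extension; stem is the file name without it.
    return path.with_suffix(".fasta"), path.stem


# Made-up path for illustration.
example = "./data/sample1.fastq"
naive_label = example.split(".")[0]          # empty string, because of the leading "./"
fasta_path, label = fasta_path_and_label(example)
print(repr(naive_label), fasta_path.name, label)
```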

written 3.2 years ago by skbrimer
3.2 years ago, thefirstrealace (Germany) wrote:

Hello,

Your second code block looks a little bit weird to me. Normally you need to define a file handle for the SeqIO parser, as shown below:


handle = open(labels[0] + ".fasta", "r")
records = []
for seq_record in SeqIO.parse(handle, "fasta"):
    seq_record.id = labels[0] # renaming the pseudogene with the lab id
    records.append(seq_record)

handle.close()
# write after closing the input so the file is not overwritten mid-read
SeqIO.write(records, labels[0]+".fasta", "fasta")

I am not completely sure whether the SeqIO parser can actually work without a file handle, but maybe you can try it out and see if my version above already fixes your problem.

 

### EDIT ###

I didn't see your edit, so if it works then just ignore my post :D

modified 3.2 years ago • written 3.2 years ago by thefirstrealace

No worries! Thank you for your help, and you are correct if you are using an older version. The latest version of Biopython, at least for the SeqIO.parse function, doesn't require the handle anymore.
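
For reference, a small sketch showing both call styles side by side. It assumes a Biopython release in which `SeqIO.parse` accepts a file name as well as an open handle (the thread does not say which release introduced this); the temporary file and its contents are invented for the example.

```python
import os
import tempfile

from Bio import SeqIO

with tempfile.TemporaryDirectory() as tmp_dir:
    # Made-up FASTA file so the example is self-contained.
    path = os.path.join(tmp_dir, "example.fasta")
    with open(path, "w") as out:
        out.write(">seq1\nACGT\n>seq2\nGGCC\n")

    # Style 1: pass the file name and let SeqIO open the file itself.
    ids_from_path = [record.id for record in SeqIO.parse(path, "fasta")]

    # Style 2: open the file yourself; the with-block closes it automatically.
    with open(path) as handle:
        ids_from_handle = [record.id for record in SeqIO.parse(handle, "fasta")]

print(ids_from_path)
print(ids_from_handle)
```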

Now I'm just trying to figure out why it's renaming the header and keeping the old header as well.

Thank you again. 

written 3.2 years ago by skbrimer

I also added your suggestion; it makes for better file control. I'm really bad about remembering to use open and close file commands since my stuff is small. I just need to be more vigilant, and the handle helps with that.

I also found a post that said that, to completely remove the old header, I need to edit the old description. I will paste my final code below.
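
As far as I can tell, this comes from how Biopython's FASTA writer builds the title line: it writes the record's `id` followed by its `description`, and the parser stores the whole original title in `description`. A short sketch with a made-up record (`ref1` and `sample1` are placeholder names, and `fasta_title` is a hypothetical helper) shows the effect of changing only the id versus also clearing the description.

```python
import io

from Bio import SeqIO


def fasta_title(record):
    """Return the title line this record would get when written as FASTA."""
    return record.format("fasta").splitlines()[0]


# Made-up single-record FASTA input.
record = next(SeqIO.parse(io.StringIO(">ref1 old description\nACGT\n"), "fasta"))

before = fasta_title(record)        # title as parsed, unchanged

record.id = "sample1"
renamed_only = fasta_title(record)  # new id, but the old title still follows it

record.description = ""
cleared = fasta_title(record)       # only the new id remains

print(before)
print(renamed_only)
print(cleared)
```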

Thank you again. 

written 3.2 years ago by skbrimer