Question: Bad Genbank format file from Vector NTI - Convert to FASTA
4.5 years ago by
st.ph.n2.4k
Philadelphia, PA
st.ph.n2.4k wrote:

I have several .gb files from Vector NTI that I need to convert to FASTA format. I figured it would be easy using Biopython. However, as we all know, there's always something.

Here's the first few lines of sequence from the .gb file:

ORIGIN
GTTGACATTGATTATTGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATA
GCCCATATATGGAGTTCCGCGTTACATAACTTACGGTAAATGGCCCGCCTGGCTGACCGC
CCAACGACCCCCGCCCATTGACGTCAATAATGACGTATGTTCCCATAGTAACGCCAATAG

And here are the first few lines of sequence from a Googled example:

ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa

Here's the python code to convert:

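#!/usr/bin/env python

from Bio import SeqIO
import sys

inp = sys.argv[1]     # path to the input GenBank file
out = inp + ".fasta"  # output path: input name plus ".fasta"

# Parse every GenBank record and write them all out as FASTA.
# The handles are closed automatically when the with block ends.
with open(inp, "r") as input_handle, open(out, "w") as output_handle:
    sequences = SeqIO.parse(input_handle, "genbank")
    count = SeqIO.write(sequences, output_handle, "fasta")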

And here's the error:

Traceback (most recent call last):
  File "abi2fastq.py", line 3, in <module>
    from Bio import SeqIO
  File "/usr/lib64/python2.6/site-packages/Bio/SeqIO/__init__.py", line 362, in <module>
    from . import InsdcIO  # EMBL and GenBank
  File "/usr/lib64/python2.6/site-packages/Bio/SeqIO/InsdcIO.py", line 37, in <module>
    from Bio.GenBank.Scanner import GenBankScanner, EmblScanner, _ImgtScanner
  File "/usr/lib64/python2.6/site-packages/Bio/GenBank/__init__.py", line 52, in <module>
    from .Scanner import GenBankScanner
  File "/usr/lib64/python2.6/site-packages/Bio/GenBank/Scanner.py", line 38
    different in layout to those produced by GenBank/DDBJ."""
                                                            ^
IndentationError: expected an indented block

Does anyone know where the indentation error is here? My .gb file doesn't have position numbers at the beginning of each sequence line, and the body of the sequence isn't split into spaced blocks either. Could both be the problem? I've found reports of other problems with the Scanner.py source code in Biopython, but I updated to the newest release and am now getting this error. I could just copy and paste the sequence into a new file with a header, but I have several files in several directories to convert, so this first one is just a test. Btw, all of the Vector NTI .gb files look the same.

All help is appreciated.

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by st.ph.n2.4k

There is clearly a formatting problem. Did you use Vector NTI on a Windows platform to produce your .gb files? If so, you just have to run the dos2unix command ( dos2unix winFile.gb unixFile.gb ) to convert the file properly.

 

 

ADD REPLYlink written 4.5 years ago by Juke-341.8k
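For anyone without dos2unix at hand, the same line-ending conversion can be sketched in Python. This is only a minimal sketch of the CRLF-to-LF step, not a replacement for the tool; the function names are made up for illustration.

```python
def has_windows_newlines(data: bytes) -> bool:
    """Return True if the data contains at least one CRLF line ending."""
    return b"\r\n" in data


def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")
```

To use it, read the file in binary mode (open(path, "rb")), so that Python does not translate the line endings itself, and write the converted bytes back with "wb".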

I did not produce the file, but it was most likely produced on Windows. I tried dos2unix and still got the same error as above. I've found forum threads from as far back as 2005 discussing this issue, but can't find a fix. I've come across some snippets of code for editing the Scanner.py source, but it seems those edits have already been incorporated into newer releases. So I'm not quite sure what the problem is here.

ADD REPLYlink written 4.5 years ago by st.ph.n2.4k

File formats should be followed. It is not surprising that a parser crashes when something is missing from the file. So it seems that things like the position numbers at the beginning of each sequence line must be present in GenBank format.

If a correctly formatted GenBank file works with your Python script, it means that the problem comes from the files created by Vector NTI. I think the only way forward is to write a script that formats your files correctly.

Nevertheless ... it seems there is a problem in your code. It should be:

sequences = SeqIO.parse(input_handle, "genbank")   instead of sequences = SeqIO.parse(input_handle, "fasta")

ADD REPLYlink written 4.5 years ago by Juke-341.8k

That was a copy/paste error (I changed it for something else and forgot that portion). In my script it reads genbank. I'll see if I can write something to format it properly.

ADD REPLYlink written 4.5 years ago by st.ph.n2.4k

I tested it on this file (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) and still got the same error. I'm not sure whether it's a file issue or a source code issue.

ADD REPLYlink written 4.5 years ago by st.ph.n2.4k
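One way to separate the two possibilities: the traceback in the question is raised at "from Bio import SeqIO", before the script opens any input file, so the import can be tested on its own. Below is a small sketch using only the standard library; check_import is a hypothetical helper name, not part of Biopython.

```python
import importlib
import sys


def check_import(module_name):
    """Try to import a module; return (ok, message) instead of raising."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # covers ImportError, SyntaxError, IndentationError
        return False, "%s: %s" % (type(exc).__name__, exc)
    version = getattr(module, "__version__", "unknown")
    return True, "imported %s (version %s)" % (module_name, version)


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    # No data file is involved here, so a failure points at the installation.
    for name in ("Bio", "Bio.SeqIO"):
        print(check_import(name))
```

If the import alone fails, the .gb file cannot be the cause of that particular error, whatever else may be wrong with it.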

Hi, I tested the file (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) with your code (above) and it worked perfectly well for me. I used Python 2.7 and Biopython 1.64. Which version of Biopython are you using?

ADD REPLYlink written 4.5 years ago by Juke-341.8k

Initially I was using a machine with python 2.6 and biopython 1.64. It did not work with the test file. When I used another machine with python 2.7.6 and biopython 1.64, it worked on the test file. However, neither worked with the Vector NTI file. The solution I came up with is below, since all I was interested in was creating a FASTA format file with the sequence, and I needed to do it for a lot of directories, containing several .gb files each.
 

ADD REPLYlink written 4.5 years ago by st.ph.n2.4k
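Since the conversion has to run over many directories, the file discovery can be sketched separately from the conversion itself. This is a minimal sketch, assuming every input file name ends in .gb; find_gb_files is an illustrative name, not something from the thread.

```python
import os


def find_gb_files(root):
    """Yield paths of all .gb files under root, searching subdirectories too."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort makes the traversal order deterministic
        for name in sorted(filenames):
            if name.lower().endswith(".gb"):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_gb_files(root):
        print(path)
```

Each path it yields could then be handed to whichever converter ends up working for the Vector NTI files.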

As shown, that is in no way a GenBank formatted file, and it is not surprising that asking Biopython to parse it as a GenBank file fails. It doesn't even have a LOCUS line at the start, and as you noted, the sequence is not laid out in GenBank style either. It looks more like a FASTA file missing the special ">" character.

ADD REPLYlink written 4.5 years ago by Peter5.7k
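To make those points concrete, the sketch below checks a few of the structural features mentioned in this thread: a LOCUS first line, an ORIGIN line, a terminating "//", and sequence lines that start with a position number. It is a rough sanity check under those assumptions, not a validator for the GenBank format, and genbank_problems is a made-up name.

```python
import re

# A GenBank-style sequence line: position number, then blocks of letters.
SEQ_LINE = re.compile(r"^\s*\d+(\s+[A-Za-z]+)+\s*$")


def genbank_problems(text):
    """Return a list of human-readable problems found in a GenBank-like record."""
    problems = []
    nonblank = [ln for ln in text.splitlines() if ln.strip()]
    if not nonblank or not nonblank[0].startswith("LOCUS"):
        problems.append("first line does not start with LOCUS")
    if not any(ln.strip() == "//" for ln in nonblank):
        problems.append("no terminating // line")
    in_origin = False
    seen_origin = False
    for ln in nonblank:
        if ln.startswith("ORIGIN"):
            in_origin = seen_origin = True
        elif ln.strip() == "//":
            in_origin = False
        elif in_origin and not SEQ_LINE.match(ln):
            problems.append("sequence line without position number: %r" % ln[:20])
            break  # one example is enough
    if not seen_origin:
        problems.append("no ORIGIN line")
    return problems
```

Running it on a whole file, for example genbank_problems(open(path).read()), gives an empty list when none of these particular checks fail.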

There are 'LOCUS' and 'FEATURES' sections, which I did not show. There seemed to be spacing errors in those sections as well. The file is named *.gb. It is not supposed to be FASTA format.

ADD REPLYlink written 4.5 years ago by st.ph.n2.4k

Next time at least show the first line and a ... indicating where you removed large chunks. Can you share the entire file, e.g. via gist.github.com or similar? We have in the past tweaked Biopython's GenBank parser to accept some broken GenBank files (with warnings), but further changes would depend on just how broken your Vector NTI output is.

ADD REPLYlink written 4.5 years ago by Peter5.7k
4.5 years ago by
st.ph.n2.4k
Philadelphia, PA
st.ph.n2.4k wrote:

As Juke-34 said, it's definitely a formatting issue. I tried to reformat the GenBank file with some Python code, which worked, but there were still some spacing issues that Biopython didn't like. Also, my 'LOCUS' header was too long. So I wrote this bit of code to simply take the sequence information out and write it to another file. This is much simpler than producing a 'normal' GenBank file, where there are position numbers at the beginning of each line and the bases are split into blocks of 10 bp, with only 60 per line. I was able to write code to produce such a GenBank file, but again, Biopython was cranky; if anyone wants me to post that code, I can. For now, here's the quick solution I came up with to simply pull out the sequence and put in a suitable identifier as the header.

#!/usr/bin/env python

import os
import sys

inp = sys.argv[1]  # usage: python gbtofa.py <input_file>
prefix = os.path.splitext(inp)[0]  # drop the file extension; ".fasta" is appended below

seqs = []
first_line = None

with open(inp, "r") as f:  # open input
    copy = False
    for line in f:
        line = line.strip()
        if first_line is None:
            first_line = line  # LOCUS line, used for the FASTA header
        if line == "ORIGIN":  # sequence starts on the line after ORIGIN
            copy = True
        elif line == "//":  # sequence ends on the line above '//'
            copy = False
        elif copy:
            seqs.append(line)

# Use a portion of the LOCUS name (the part after the first '_') as the header
locus = first_line.split()
seq_id = locus[1].split('_')[1]

with open(prefix + ".fasta", "w") as out:
    out.write('>' + seq_id + '\n')  # write header to new file
    for s in seqs:
        out.write(s + '\n')  # write sequence lines to new file
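For reference, the 'normal' ORIGIN layout described in this answer (a position number, then blocks of 10 bases, 60 bases per line) can be sketched as below. This is not the reformatting code mentioned above, which was not posted; format_origin is a hypothetical name, and the 9-character width of the position column is an assumption taken from the Googled example in the question.

```python
def format_origin(sequence, block=10, per_line=60):
    """Lay out a raw sequence as GenBank-style ORIGIN lines."""
    seq = "".join(sequence.split()).lower()  # drop whitespace and newlines
    lines = ["ORIGIN"]
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        # Position of the first base on the line, right-justified to 9 columns.
        lines.append("%9d %s" % (start + 1, " ".join(blocks)))
    lines.append("//")
    return "\n".join(lines)
```

This covers only the sequence block. The LOCUS and FEATURES sections would need their own fixes before a parser is likely to accept the file.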

 

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by st.ph.n2.4k