Question: Biopython Not Indexing (Or Parsing) Full .Sff File, Why?
Matt (United States) wrote 5.6 years ago:

Please examine the following code (I've changed some things for privacy).

>>> from Bio import SeqIO
>>> reads = SeqIO.index("/somefile.sff", "sff")
>>> print(len(reads))
81234

This shows that I have 81,234 records indexed.

However, the run statistics from the lab split the sff file into two sections. The first section, region 1, has 81,234 reads. The second section, regions 7-9, has 49,876 reads.

When I try to read the file into a dictionary, I get this:

>>> reads = SeqIO.to_dict(SeqIO.parse("somefile.sff", "sff"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../python2.7/site-packages/Bio/SeqIO/__init__.py", line 672, in to_dict
    for record in sequences:
  File ".../python2.7/site-packages/Bio/SeqIO/__init__.py", line 541, in parse
    for r in i:
  File ".../python2.7/site-packages/Bio/SeqIO/SffIO.py", line 882, in SffIterator
    raise ValueError("Additional data at end of SFF file")
ValueError: Additional data at end of SFF file

The only thing I can think of is that perhaps Biopython is expecting regions 2-6 to be there, and since they don't appear to exist in this file, it just blows up. I do have the .fna and .qual files to work with too, if need be. But I would really like to just use the .sff file.


Sounds strange, as though maybe two SFF files have been blindly concatenated together (Biopython would read the first file and then complain about the unexpected second bit). SFF files should be merged with the Roche tools, not simply concatenated.

Can you share the SFF file (privately)?

written 5.6 years ago by Peter

Hi, Peter. Thanks for the input. The data isn't mine, so I'd have to ask the PI I'm working with about that, but I think sharing is unlikely. When I look up the header information I get: header_length 1640, index_offset 437421456, index_length 1690220, number_of_reads 81234, number_of_flows_per_read 1600. If there are multiple regions, will the header show the number of records for only the first, or for all of them?

written 5.6 years ago by Matt

There shouldn't be "multiple regions"; there should be one and only one index block. From the description, I think your SFF file is invalid and was formed by concatenating multiple files. If you could send me the file privately, I would be able to verify that and split the file into two self-contained SFF files which could be read individually.
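The header fields Matt quotes can also be read straight from the file with the standard library, which gives a quick check on where a single SFF file should stop. This is only a sketch: `read_sff_header` and `end_of_index` are illustrative names, not Biopython functions, and the check assumes the index is the last block in the file, as it appears to be here.

```python
import struct

# Fixed part of the SFF common header, big-endian, 31 bytes:
# magic, 4 version bytes, index_offset, index_length, number_of_reads,
# header_length, key_length, number_of_flows_per_read, flowgram_format_code
SFF_HEADER_FMT = ">4s4BQIIHHHB"
SFF_HEADER_SIZE = struct.calcsize(SFF_HEADER_FMT)


def read_sff_header(data):
    """Decode the fixed part of an SFF common header from raw bytes."""
    if len(data) < SFF_HEADER_SIZE:
        raise ValueError("Too short to hold an SFF header")
    fields = struct.unpack(SFF_HEADER_FMT, data[:SFF_HEADER_SIZE])
    magic, version = fields[0], fields[1:5]
    if magic != b".sff" or version != (0, 0, 0, 1):
        raise ValueError("Not an SFF version 1 header")
    names = ("index_offset", "index_length", "number_of_reads",
             "header_length", "key_length", "number_of_flows_per_read",
             "flowgram_format_code")
    return dict(zip(names, fields[5:]))


def end_of_index(header):
    """Offset just past the index block (hypothetical helper).

    Assumes the index is the final block of the file; anything much
    beyond this offset (apart from padding) is unexpected extra data.
    """
    return header["index_offset"] + header["index_length"]
```

To use it, open the file in binary mode, pass `handle.read(SFF_HEADER_SIZE)` to `read_sff_header`, and compare `end_of_index(header)` with `os.path.getsize` of the file. With the values quoted above, 437421456 + 1690220 = 439111676, so a file that is substantially longer than that carries additional data after the first index.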

written 5.6 years ago by Peter

Peter, thanks for offering to help us out. I'll send you a private message.

written 5.6 years ago by Matt
Peter (Scotland, UK) wrote 5.5 years ago:

Thanks, Matt, for sharing a sample file with me. It confirmed my guess that the file is actually several SFF files concatenated together, which is not allowed under the original SFF definition. Perhaps Roche are extending the format?

$ strings example.sff | grep -c "\.sff"
4
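The same check can be done in Python when `strings` is not available. The sketch below searches for the ".sff" magic number followed by the version 1 bytes, which is stricter than the grep above (the grep counts any line of text containing ".sff", so its count need not equal the number of embedded files). `find_sff_headers` is an illustrative name, not part of Biopython.

```python
# Magic number ".sff" followed by the four version bytes 0, 0, 0, 1
SFF_MAGIC = b".sff\x00\x00\x00\x01"


def find_sff_headers(data):
    """Return every offset at which an SFF magic number plus version occurs.

    `data` can be a bytes object or anything else with a bytes-style
    find() method, such as a read-only mmap of a large file.
    """
    offsets = []
    pos = data.find(SFF_MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(SFF_MAGIC, pos + 1)
    return offsets
```

A valid single SFF file should give exactly one offset, 0. More than one suggests concatenation, although a chance match inside the read data is possible, so each hit is only a candidate header.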

The next Biopython release, 1.63, will give clearer error messages (pending any further information about why these files exist).

For example,

>>> from Bio import SeqIO
>>> d = SeqIO.index("example.sff", "sff")
Traceback (most recent call last):
...
ValueError: Your SFF file is invalid, post index 4 byte null padding region ended '.sff' which could be the start of a concatenated SFF file? See offset 439111676

And,

>>> from Bio import SeqIO
>>> count = 0
>>> for r in SeqIO.parse("example.sff", "sff"): count += 1
Traceback (most recent call last):
...
ValueError: Your SFF file is invalid, post index 4 byte null padding region ended '.sff' which could be the start of a concatenated SFF file? See offset 439111676
>>> count
84475

If there is any clear information about this from Roche and it is a deliberate extension to the file format, then I'd hope to extend the Biopython SFF support to handle this. In the short term, you must divide the file into traditional separate individual SFF files to parse them (by looking for the marker string ".sff").
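A minimal sketch of that short-term workaround is below. It assumes every embedded file starts with the ".sff" magic number followed by the version 1 bytes, and that this byte pattern does not occur by chance inside the read data. Both function names and the ".partN.sff" output naming are made up for illustration.

```python
def split_concatenated_sff(data):
    """Split raw bytes into chunks, each starting at an SFF magic number."""
    magic = b".sff\x00\x00\x00\x01"  # magic number plus version 1
    starts = []
    pos = data.find(magic)
    while pos != -1:
        starts.append(pos)
        pos = data.find(magic, pos + len(magic))
    # Each chunk runs up to the next header, the last one to end of data
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]


def write_parts(path):
    """Write each chunk of `path` to path.part1.sff, path.part2.sff, ..."""
    with open(path, "rb") as handle:
        data = handle.read()  # reads the whole file into memory
    names = []
    for i, chunk in enumerate(split_concatenated_sff(data), start=1):
        name = "%s.part%d.sff" % (path, i)
        with open(name, "wb") as out:
            out.write(chunk)
        names.append(name)
    return names
```

Each output file can then be tried with `SeqIO.parse(name, "sff")` as in the examples above. If a part still fails to parse, the split point was probably a false match and that part needs to be rejoined with its neighbour.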
