Question: Parser for Illumina NovaSeq metadata
23 months ago, BCArg (60) wrote:

I am trying to find a parser for the metadata generated by the Illumina NovaSeq (the .bin files written to the InterOp directory). Ultimately, I want to extract parameters such as the percentage of clusters passing filter, the percentage of reads with a Q score >= 30, the error rate, and so on.

I have found an R package called savR, but I am afraid it does not handle the output from NovaSeq runs correctly.

I have also found the InterOp Python library. Its authors say it supports NovaSeq output as well, but I have not tried that yet; so far I have only used it with their own MiSeq example files.

This library appears to do the job, but I cannot find proper documentation for it. The GitHub page shows a few examples of its functionality, yet I cannot find complete documentation covering everything the library can do.

Other than these two libraries/packages, would anyone recommend another parser for Illumina NovaSeq output data?

23 months ago, Paul (1.4k, European Union) wrote:

Hi, have a look at the Illumina InterOp library. You can install it with Anaconda.

The syntax for data extraction is very simple, and you can then plot the data with gnuplot or R.

Example to get a table (CSV):

interop_summary runfolder > /path/to/my/output

Example for plotting:

interop_plot_qscore_heatmap runfolder | gnuplot

Note: Just tested on our NovaSeq runs (WGS) and works perfectly fine.
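For anyone who prefers to drive these tools from a Python script rather than the shell, here is a minimal sketch using `subprocess`. It assumes `interop_summary` is on the `PATH`, takes the run folder as its only argument, and writes its table to standard output (as the redirect in the example above implies). The function name `run_interop_summary` is made up for illustration.

```python
import shutil
import subprocess


def run_interop_summary(run_folder, tool="interop_summary"):
    """Run the InterOp summary tool and return its standard output as text.

    Assumption: the tool is called as `<tool> <run_folder>` and prints
    the summary table to stdout, as in the shell example above.
    """
    exe = shutil.which(tool)
    if exe is None:
        raise FileNotFoundError(f"{tool} not found on PATH")
    result = subprocess.run(
        [exe, run_folder],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling `run_interop_summary("/path/to/runfolder")` (a placeholder path) would return the same text that the shell redirect writes to the output file, ready to be split into lines or passed to a CSV reader.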


Those commands do work, though they are shell commands. Do they have anything to do with the Python module that I installed (interop)? Also, the syntax for calling interop_summary, for instance, is:

interop_summary run_folder > path/to/my/output (source: interop_summary -help)

modified 23 months ago • written 23 months ago by BCArg (60)

Just fixed the typo in my example, thanks. Yes, those are shell commands, offered as an alternative to parsing the SAV data yourself.

written 23 months ago by Paul (1.4k)

great, thanks very much, that does exactly what I wanted.

written 23 months ago by BCArg (60)

Glad to help you (I was working on the same task last week :-))!!!

written 23 months ago by Paul (1.4k)

At work we have another server on which Anaconda is not installed, so we installed the interop library with pip install interop. Although we can import the interop module within Python, the shell commands you showed (which did the job for me) cannot be called from the shell. Any idea how, or whether, I can get the shell commands (e.g. interop_summary) without conda install -c bioconda illumina-interop, i.e. with pip install only? Thanks again.

written 21 months ago by BCArg (60)
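A quick way to check from Python whether (and where) the command-line tools are installed is sketched below. It uses only the standard library. Whether the pip package ships these binaries at all is not confirmed anywhere in this thread, so treat this as a diagnostic, not a fix. The helper name `find_tool` is made up for illustration.

```python
import shutil
import sysconfig


def find_tool(name):
    """Return the full path to a command-line tool, or None if it is not found.

    Looks on PATH first, then in the scripts directory of the running Python
    installation, which is typically where pip places executables and which
    is not always on PATH.
    """
    found = shutil.which(name)
    if found:
        return found
    scripts_dir = sysconfig.get_path("scripts")
    return shutil.which(name, path=scripts_dir)


for tool in ("interop_summary", "interop_plot_qscore_heatmap"):
    print(tool, "->", find_tool(tool))
```

If both lines print `None`, the executables are not visible to this Python installation, and the conda package (or a build from source) is probably still needed for the shell commands.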
23 months ago, genomax (89k, United States) wrote:

You can use the Sequence Analysis Viewer (SAV) from Illumina (note: Windows only) if you have access to the InterOp folder and the .xml files from the original NovaSeq data folder. This is a view-only option.

If you are looking for a programmatic means of parsing this information, Illumina has a set of C++ libraries on their GitHub site. Note: Illumina does not provide technical support for their open source software.


Yes, I am trying to parse the NovaSeq metadata programmatically in Python, so the Illumina Sequence Analysis Viewer is not really an alternative.

modified 23 months ago • written 23 months ago by BCArg (60)

The library I linked above is written in C++. Its documentation specifically notes that it supports NovaSeq and all other Illumina sequencers (except the oldest GA).

You could also parse the summary files that can be found in a processed NovaSeq flowcell in FCID/Unaligned/Stats if you are looking to populate this information in a user/LIMS-like application.

modified 23 months ago • written 23 months ago by genomax (89k)
5 weeks ago, brycefoster (0) wrote:

In Python you can read the files as binary files.

for the tile metrics:

    import os
    import struct

    ENDIAN = "little"  # InterOp files are little-endian
    tile_file = "InterOp/TileMetricsOut.bin"  # assumed path; adjust to your run folder
    target_lane = 0  # assumption: 0 means "report all lanes"

    file_size = os.path.getsize(tile_file)  # in bytes

    fh = open(tile_file, 'rb')
    file_ver = int.from_bytes(fh.read(1), ENDIAN)  # version number == 2 or 3
    recordlen = int.from_bytes(fh.read(1), ENDIAN)  # length of each record == 10 (for v2 TileMetrics)

    if file_ver == 2:
        header_len = 2  # version + record length (16 bits)
    elif file_ver == 3:
        header_len = 6  # version + record length + 4 more header bytes (48 bits)
    else:
        raise ValueError("- error: unhandled file version: %s" % file_ver)
    fh.seek(header_len)  # skip the header, which is invariant for this file

    lane_density = 0  # code 100
    lane_density_pf = 0  # code 101
    cluster_cnt = 0  # code 102
    cluster_pf_cnt = 0  # code 103
    align_cnt = 0  # codes 300+
    align = 0  # codes 300+
    tile_cnt = 0

    # read records bytewise per specs in technote_rta_theory_operations.pdf from ILMN
    r = (file_size - header_len) // recordlen

    for i in range(r):

        if file_ver == 2:
            # 2 bytes: lane number (uint16)
            # 2 bytes: tile number (uint16)
            # 2 bytes: code (uint16)
            # 4 bytes: value (float32)
            lane = int.from_bytes(fh.read(2), ENDIAN)
            tile = int.from_bytes(fh.read(2), ENDIAN)
            metric = int.from_bytes(fh.read(2), ENDIAN)
            val = struct.unpack("<f", fh.read(4))[0]

            # 100 = cluster density
            # 101 = cluster density passing filters
            # 102 = number of clusters
            # 103 = clusters passing filters
            if target_lane == 0 or lane == target_lane:
                if metric == 100:
                    lane_density += val
                    tile_cnt += 1
                elif metric == 101:
                    lane_density_pf += val
                elif metric == 102:
                    cluster_cnt += val
                elif metric == 103:
                    cluster_pf_cnt += val
                elif 300 <= metric < 399:
                    align_cnt += 1
                    align += val

        elif file_ver == 3:
            # 2 bytes: lane number (uint16)
            # 4 bytes: tile number (uint32)
            # 1 byte: code (char)
            # if code == 't' (0x74, ASCII 116):
            #     4 bytes: cluster count (float32)
            #     4 bytes: pf cluster count (float32)
            # if code == 'r' (0x72, ASCII 114):
            #     4 bytes: read number (uint32)
            #     4 bytes: percent aligned (float32)

            # example bytes: 0400 (lane/16) 8e060000 (tile/32) 72 (code/8)
            lane = int.from_bytes(fh.read(2), ENDIAN)
            tile = int.from_bytes(fh.read(4), ENDIAN)  # 1678 = 0x068e
            code = int.from_bytes(fh.read(1), ENDIAN)
            # assumption: report_lane mirrors the lane filter used for version 2
            report_lane = target_lane == 0 or lane == target_lane

            if code == 116:  # 't'
                if report_lane:
                    cluster_cnt += struct.unpack("<f", fh.read(4))[0]
                    cluster_pf_cnt += struct.unpack("<f", fh.read(4))[0]
                    tile_cnt += 1
                else:
                    fh.read(8)  # skip both floats to stay aligned with the next record

            elif code == 114:  # 'r'
                # example bytes: 04000000 0000c07f (read number 4, percent aligned = NaN)
                # 0x3eb55829 as float32 = 0.354188233614
                read_num = int.from_bytes(fh.read(4), ENDIAN)  # not used
                a = struct.unpack("<f", fh.read(4))[0]  # NaN for most values
                if a > 0:  # False for NaN
                    align_cnt += 1
                    align += a

            else:
                fh.read(recordlen - 7)  # unknown code: skip the rest of the record

    fh.close()
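The version 2 record layout described in the comments above (uint16 lane, uint16 tile, uint16 code, float32 value, after a 2-byte header) can also be decoded with `struct` format strings instead of field-by-field reads. The sketch below is self-contained: it builds a small synthetic buffer rather than reading a real `TileMetricsOut.bin`, assumes little-endian byte order, and uses made-up tile numbers and cluster counts. The function name `parse_tile_metrics_v2` is hypothetical.

```python
import struct
from collections import defaultdict

HEADER = struct.Struct("<BB")    # version (uint8), record length (uint8)
RECORD = struct.Struct("<HHHf")  # lane, tile, metric code, value (10 bytes)


def parse_tile_metrics_v2(data):
    """Sum the values of each metric code per lane from version 2 tile metrics bytes."""
    version, record_len = HEADER.unpack_from(data, 0)
    if version != 2 or record_len != RECORD.size:
        raise ValueError(f"unsupported version {version} / record length {record_len}")
    totals = defaultdict(float)
    body = data[HEADER.size:]
    usable = len(body) - len(body) % RECORD.size  # ignore a trailing partial record
    for lane, tile, code, value in RECORD.iter_unpack(body[:usable]):
        totals[(lane, code)] += value
    return dict(totals)


# Synthetic data: two tiles on lane 1, codes 102 (clusters) and 103 (clusters PF).
records = [(1, 1101, 102, 1000.0), (1, 1101, 103, 900.0),
           (1, 1102, 102, 2000.0), (1, 1102, 103, 1500.0)]
data = HEADER.pack(2, RECORD.size) + b"".join(RECORD.pack(*rec) for rec in records)

totals = parse_tile_metrics_v2(data)
pct_pf = 100.0 * totals[(1, 103)] / totals[(1, 102)]
print(f"lane 1: {pct_pf:.1f}% of clusters pass filter")  # lane 1: 80.0% of clusters pass filter
```

To try it on a real file, read the bytes with `open(path, "rb").read()` and pass them in. Using the explicit `<` prefix keeps the record at exactly 10 bytes and avoids depending on the machine's native byte order or padding.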


@brycefoster very helpful, thanks. How can I get the total yield that's in Bustard?

modified 4 weeks ago • written 4 weeks ago by genya3530