Question

What do you consider to be raw data from Illumina sequencers? Possible legacy data.

5.4 years ago

Tawny ▴ 180

Today I consider the raw data to be the demultiplexed fastq files in gzip format that come off the sequencer.
I believe many years ago there was a different output of data from sequencers. I have been trying to find an example of that different raw data output.

Does anyone have an example of the raw data output (pre-demultiplexing) that they can share? Or maybe describe what that output looked like?

Can you still get this type of data from Illumina sequencers today?

sequencing • 2.9k views

ADD COMMENT • link updated 5.4 years ago by Gabriel R. ★ 2.9k • written 5.4 years ago by Tawny ▴ 180

score 2 · Answer 1 · 2018-12-19

Can you still get this type of data from Illumina sequencers today?

No, you can't get data in the formats described below. Raw data today would be the bcl files, which you can find in the original data folder from Illumina sequencers.

The original Illumina Genome Analysis Pipeline Software took ".tif" format images produced by the Genome Analyzer (as the sequencers were called back then) and did Image Analysis --> Base Calling --> Sequence Analysis to produce a variety of text and html output files.

There was a goat_analysis.py script (General Oligo Analysis Tool, GOAT) which called subscripts for three Pipeline modules: Firecrest, Bustard (“bustard.py”), and GERALD (Generation of Recursive Analyses Linked by Dependency, “GERALD.pl”). Firecrest was used for image analysis: it identified cluster positions and extracted intensities. Bustard was used for base calling and dealt with spectral cross-talk and phasing. Finally, GERALD did the alignments using ELAND (Efficient Large-Scale Alignment of Nucleotide Databases) and, optionally, ran PhageAlign, an exhaustive aligner.

A typical set of raw data files looked something like this.

A few file types were generally relevant to end-users and they were as follows.

“Raw” sequence file – called s_*_eland_query.txt, where * (a number between 1 and 8) indicated the lane in which the particular sample was run. A typical Illumina 1G flowcell had 8 lanes and ran 8 libraries/samples; this was prior to multiplexing. Note: The data in this file was not “quality” filtered/processed and possibly contained sequence errors in the last 3-5 base pairs (typically for a 35 bp sequencing run).
“ELAND” sequence alignment results file – called s_*_eland_results.txt or s_*_eland_multi.txt, where * (a number between 1 and 8) indicated the lane in which the sample was run. This was considered an intermediate output file since it contained unfiltered ELAND alignment output.
Illumina’s Genome Analyzer pipeline software used ELAND to align the sequence reads to a specified “reference” genome. ELAND searched a set of large DNA files for a large number of short DNA reads, allowing up to 2 errors per match.
Depending on the type of ELAND analysis performed, users received s_*_eland_results.txt (regular eland analysis) or s_*_eland_multi.txt (“eland_extended” analysis, generally for alignments of sequences > 32 bp).
“Filtered” sequence file – called s_*_sequence.txt. Where * (a number between 1 and 8) indicated the lane in which the sample was run. This file contained sequences that passed the ELAND_quality filter criteria. This file was in the “fastq” format by default.
Sequence “Export” file – called s_*_export.txt. This additional file contained some extra information when compared with the “s_*_sequence.txt” file.
“Sorted” sequence file – called s_*_sorted.txt. This output file was similar to s_*_export.txt, except that it contained only entries for reads which passed purity filtering and had a unique alignment in the reference. These reads were sorted by alignment position, which was meant to make it easier to extract ranges of reads for visualization or SNP calling.
“Anomaly” sequence file – called s_*_anomaly.txt. File applicable only in case of “paired-end” runs. This file contained reads that did not align in a “paired-end” analysis.
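
If you ever come across a folder of these legacy files, the naming scheme above is regular enough to sort them out programmatically. Here is a minimal Python sketch; the `classify` function, the `LegacyFile` type and the `LEGACY_KINDS` table are names I made up for illustration, and the descriptions are just short versions of the list above. It assumes the plain `s_<lane>_<type>.txt` pattern only, so variants such as read-numbered paired-end files are not recognised.

```python
import re
from typing import NamedTuple, Optional

# Suffix -> short description, summarised from the file list above.
LEGACY_KINDS = {
    "eland_query": "raw (unfiltered) reads",
    "eland_results": "ELAND alignments (regular analysis)",
    "eland_multi": "ELAND alignments (eland_extended analysis)",
    "sequence": "filtered reads (fastq)",
    "export": "filtered reads with extra information",
    "sorted": "uniquely aligned reads sorted by position",
    "anomaly": "paired-end reads that did not align",
}

# s_<lane>_<type>.txt, with lane restricted to 1-8 as described above.
_PATTERN = re.compile(r"^s_([1-8])_(\w+)\.txt$")


class LegacyFile(NamedTuple):
    lane: int
    kind: str
    description: str


def classify(filename: str) -> Optional[LegacyFile]:
    """Return lane and file type for a legacy pipeline file name, or None."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    kind = match.group(2)
    if kind not in LEGACY_KINDS:
        return None
    return LegacyFile(int(match.group(1)), kind, LEGACY_KINDS[kind])


if __name__ == "__main__":
    for name in ["s_3_eland_query.txt", "s_8_sequence.txt", "notes.txt"]:
        print(name, "->", classify(name))
```

Anything that returns `None` is either not a pipeline output or uses a naming variant this sketch does not cover.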

Someone may still have data from this long-gone era, but it may not be readily available on the web.

score 2 · Answer 2 · 2018-12-19

5.4 years ago

Gabriel R. ★ 2.9k

There are different layers of raw data way before demultiplexing:

1. The files at the most basic layer are raw images; those are massive.
2. After image analysis, you have an intensity for each cluster, normally in .cif format. These files do not contain bases, only raw intensities.
3. After basecalling, files are stored in bcl format, which contains the bases plus quality scores.
4. Then you can have the raw fastq containing all cycles on one line.

Research groups usually store #3 or, if they want to be more exhaustive, #2. I have never seen any group making copies of #1.
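
To give a feel for what the bcl layer looks like, here is a rough Python sketch of decoding one uncompressed per-cycle bcl file. It rests on my understanding of the classic layout from Illumina's bcl2fastq documentation: a 4-byte little-endian cluster count, then one byte per cluster with the base in the low 2 bits (A, C, G, T) and the quality in the upper 6 bits, where an all-zero byte means no call. Treat that layout as an assumption and check it against the documentation for your instrument; gzipped and newer compressed variants are not handled. `decode_bcl` is a hypothetical helper name.

```python
import struct
from typing import List, Tuple

BASES = "ACGT"  # assumed encoding: 0=A, 1=C, 2=G, 3=T


def decode_bcl(data: bytes) -> List[Tuple[str, int]]:
    """Decode the bytes of one per-cycle bcl file into (base, quality) pairs."""
    if len(data) < 4:
        raise ValueError("too short to contain the cluster count header")
    # Header: number of clusters as an unsigned 32-bit little-endian integer.
    (n_clusters,) = struct.unpack_from("<I", data, 0)
    body = data[4:4 + n_clusters]
    if len(body) != n_clusters:
        raise ValueError("file is shorter than the header claims")
    calls = []
    for byte in body:
        if byte == 0:
            calls.append(("N", 0))  # all bits zero is treated as a no-call
        else:
            calls.append((BASES[byte & 0b11], byte >> 2))
    return calls


if __name__ == "__main__":
    # Synthetic example: a no-call, then C with Q30, then T with Q40.
    demo = struct.pack("<I", 3) + bytes([0, (30 << 2) | 1, (40 << 2) | 3])
    print(decode_bcl(demo))
```

Each such file holds one cycle for all clusters in a tile, which is why you need the whole run folder, and the conversion step, to get per-read fastq records.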

ADD COMMENT • link 5.4 years ago by Gabriel R. ★ 2.9k

@Gabriel R This is very helpful, I appreciate you taking the time to provide your response.

ADD REPLY • link 5.4 years ago by Tawny ▴ 180

One more comment: @genomemax is correct with his graph; it presents some of the old file formats. Illumina has changed the way it represents its internal data a few times. Good luck!

ADD REPLY • link 5.4 years ago by Gabriel R. ★ 2.9k