Question

Is there a standard yaml file to describe Illumina runs: sample_name, barcode, lane, flowcell?

7.9 years ago

14134125465346445 ★ 3.6k

Is there a format to describe sample names and their associated flowcell(s), lane(s) and barcode(s) from Illumina sequencing experiments?

The Illumina documentation describes the following naming scheme for multiplexed and non-multiplexed runs:

Naming
Illumina FASTQ files use the following naming scheme:
<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz
For example, the following is a valid FASTQ file name:
NA10831_ATCACG_L002_R1_001.fastq.gz
In the case of non-multiplexed runs, <sample name> will be replaced with the lane numbers (lane1, lane2, ..., lane8) and <barcode sequence> will be replaced with "NoIndex".
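The naming scheme quoted above can be turned into a small parser. This is only a sketch: the regular expression, the `FastqName` type and the `parse_fastq_name` function are illustrative names that do not come from Illumina, and the pattern assumes a single-index barcode made of A/C/G/T/N characters (or the literal "NoIndex").

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern for <sample>_<barcode>_L<lane>_R<read>_<set>.fastq.gz.
# The sample name may itself contain underscores, so it is matched greedily
# and the remaining fields are anchored to the end of the file name.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN]+|NoIndex)"
    r"_L(?P<lane>\d{3})_R(?P<read>\d+)_(?P<set>\d{3})\.fastq\.gz$"
)


class FastqName(NamedTuple):
    """Fields recovered from a FASTQ file name (hypothetical helper type)."""
    sample: str
    barcode: str
    lane: int
    read: int
    set_number: int


def parse_fastq_name(filename: str) -> Optional[FastqName]:
    """Return the parsed fields, or None if the name does not match."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    return FastqName(
        sample=match.group("sample"),
        barcode=match.group("barcode"),
        lane=int(match.group("lane")),
        read=int(match.group("read")),
        set_number=int(match.group("set")),
    )


if __name__ == "__main__":
    print(parse_fastq_name("NA10831_ATCACG_L002_R1_001.fastq.gz"))
    print(parse_fastq_name("lane1_NoIndex_L001_R1_001.fastq.gz"))
```

Note that the file name alone does not carry the flowcell ID, so it cannot fully answer the question; the flowcell has to come from somewhere else (a directory name, a sample sheet, or a separate metadata file).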

I have also seen that bcbio has code and example YAML files that describe some of this, and SciLifeLab seems to have adopted it:

http://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html
https://github.com/SciLifeLab/scilifelab/blob/e5f4be45e2e9ff6c0756be46ad34dfb7d20a4b4a/scilifelab/bcbio/flowcell.py

What I am looking for is a standard or something close to a standard that people have adopted for this.

Does anything like this exist? Does the Common Workflow Language (CWL) deal with this? Galaxy? Genologics?

illumina flowcell run lane barcode • 2.5k views

score 0 · Answer 1 · 2016-06-03

Illumina CASAVA generates some XML files after demultiplexing:

<DemultiplexConfig>
  <Software Version="CASAVA-1.8.2" CmdAndArgs="...">
  <FlowcellInfo ID="C3FGGACXX" Operator="x" Recipe="" Desc="">
    <Lane Number="1">
      <Sample ProjectId="P1" Control="N" Index="CTTGTA" SampleId="S1"  />
      <Sample ProjectId="P2" Control="N" Index="GCCAAT" SampleId="S1" />
(...)
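As a rough sketch of how such a file could be flattened into one row per flowcell, lane and sample, the snippet below uses Python's standard `xml.etree.ElementTree`. The embedded XML is a hypothetical, completed version of the truncated excerpt above: the closing tags are assumed, and the `Software` element is left out because its full form is not shown. The function name `samples_from_demultiplex_config` is illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the truncated excerpt.
# Closing tags are assumed; the <Software> element is omitted.
EXAMPLE_XML = """<DemultiplexConfig>
  <FlowcellInfo ID="C3FGGACXX" Operator="x" Recipe="" Desc="">
    <Lane Number="1">
      <Sample ProjectId="P1" Control="N" Index="CTTGTA" SampleId="S1" />
      <Sample ProjectId="P2" Control="N" Index="GCCAAT" SampleId="S1" />
    </Lane>
  </FlowcellInfo>
</DemultiplexConfig>"""


def samples_from_demultiplex_config(xml_text):
    """Return one dict per <Sample>, tagged with its flowcell and lane."""
    root = ET.fromstring(xml_text)
    rows = []
    for flowcell in root.iter("FlowcellInfo"):
        for lane in flowcell.iter("Lane"):
            for sample in lane.iter("Sample"):
                rows.append({
                    "flowcell": flowcell.get("ID"),
                    "lane": int(lane.get("Number")),
                    "project": sample.get("ProjectId"),
                    "sample_name": sample.get("SampleId"),
                    "barcode": sample.get("Index"),
                })
    return rows


if __name__ == "__main__":
    for row in samples_from_demultiplex_config(EXAMPLE_XML):
        print(row)
```

The resulting rows carry the four fields asked about in the question (sample name, barcode, lane, flowcell) and could be serialized to YAML or any other format.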