BCL data for bcl2fastq 2.20
3 months ago
Eros • 0

Hello,

I am a Data Engineering Intern developing an automated pipeline on Kubernetes for processing BCL files (bcl2fastq 2.20 -> fastqc -> multiqc). I have a lot of real data produced by an Illumina NovaSeq 6000, but I would like to find small, compatible BCL datasets for testing, and I am struggling to find any. I have a BaseSpace account, but from the website I can only download the run info of the demo datasets, not the full data including the BCL files and sample sheet.

Do you know how I can download them, or where I can find what I am looking for?

BCL bcl2fastq • 477 views
1

For what it's worth, Illumina documents their file formats surprisingly openly and thoroughly, so even if there are no official example raw run directories for testing (though that'd be nice!) you could create your own with dummy data if needed. This guide gives details like the structure of the binary files (bcl, stats, etc.) along with bcl2fastq usage:

https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl2fastq/bcl2fastq2-v2-20-software-guide-15051736-03.pdf

For example, last fall I had a similar goal to yours (a minimal raw run directory structure for testing my own software) and converted a handful of reads backwards into bcl for a tiny run directory (72 KB compressed):

https://github.com/ShawHahnLab/igseq/blob/dev/igseq/data/examples/runs/YYMMDD_M05588_0232_000000000-CLL8M.tgz
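As a rough illustration of what "converting reads backwards into bcl" involves, here is a minimal sketch of encoding one sequencing cycle in the uncompressed per-cycle .bcl layout described in the guide above: a 4-byte little-endian cluster count, then one byte per cluster with the base in the low two bits and the quality score in the upper six, and an all-zero byte for a no-call. The function name `encode_bcl_cycle` is made up for this example, and this is not how the linked tarball was necessarily produced. It also does not cover the CBCL format that the NovaSeq writes, or the other files a run directory needs (RunInfo.xml, filter files, position files).

```python
import struct

# Two-bit base codes used by the per-cycle .bcl layout.
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_bcl_cycle(bases, quals):
    """Encode one cycle for every cluster of a tile as uncompressed BCL bytes.

    bases: string with one base call per cluster ("A", "C", "G", "T" or "N")
    quals: list of integer quality scores (0-63), one per cluster
    """
    if len(bases) != len(quals):
        raise ValueError("need exactly one quality score per base")
    body = bytearray()
    for base, qual in zip(bases, quals):
        if not 0 <= qual < 64:
            raise ValueError("quality must fit in six bits")
        if base == "N" or qual == 0:
            # A byte with all bits zero is reserved for a no-call.
            body.append(0)
        else:
            body.append((qual << 2) | BASE_CODES[base])
    # Header: number of clusters as an unsigned 32-bit little-endian integer.
    return struct.pack("<I", len(bases)) + bytes(body)
```

Writing the returned bytes to one file per cycle (and gzip-compressing them if the run uses .bcl.gz) would give the base-call part of a dummy run; the file naming and directory layout depend on the instrument, so check the guide for the platform you want to imitate.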

Alternatively you might be able to chop down a real run directory to a manageable size by just keeping, say, one lane and one tile. bcl2fastq has --tiles that could help there, and a few options for ignoring missing files too.
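A minimal sketch of how a pipeline might assemble such a restricted bcl2fastq call is below. The helper `bcl2fastq_command`, the paths, and the tile name `s_1_1101` are placeholders for illustration. `--runfolder-dir`, `--output-dir`, `--sample-sheet` and `--tiles` are documented bcl2fastq 2.20 options, but the tile pattern has to match what your run actually contains.

```python
def bcl2fastq_command(run_dir, out_dir, sample_sheet, tiles):
    """Build an argument list for bcl2fastq limited to the given tiles.

    tiles: list of tile patterns; bcl2fastq takes them comma-separated.
    """
    return [
        "bcl2fastq",
        "--runfolder-dir", run_dir,
        "--output-dir", out_dir,
        "--sample-sheet", sample_sheet,
        "--tiles", ",".join(tiles),
    ]


# Hypothetical paths; pass the list to subprocess.run() to execute it.
cmd = bcl2fastq_command("/data/run", "/data/fastq", "/data/run/SampleSheet.csv", ["s_1_1101"])
```

Building the command as a list rather than a shell string avoids quoting problems when it is handed to `subprocess.run` or written into a Kubernetes job spec.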

GenoMax is probably right that this is a niche area, so you might not find much available online, but hopefully some of these options might help.

0

Thanks for your reply! I looked at your data, but it does not contain a SampleSheet.csv. I also need a sample sheet, because from it I can easily deduce the names of the fastq files that will be produced before launching bcl2fastq. After bcl2fastq finishes, I can then automatically launch the right number of fastqc jobs, one per fastq file. I could adjust the pipeline to work with this data, but then I would end up with two different pipelines, one for testing and one for production, which would make the test useless.

Regarding the --tiles option, it could work, but I would again need a way to filter out of the sample sheet the fastq file names that will not be produced, so that the fastqc jobs are spawned correctly. That requires modifying the code that extracts the list of names from the sample sheet, which leads back to the situation described above.
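One way to avoid a separate test code path is to make the lane list a parameter of the name-prediction step, so the same code serves both full and tile-restricted runs. The sketch below is an assumption about how that could look, not the pipeline described above: `predict_fastq_names` is a hypothetical helper, and it assumes bcl2fastq's default naming of `<name>_S<n>_L<lane>_R<read>_001.fastq.gz`, with samples numbered in order of first appearance in the [Data] section and Sample_Name used when it is filled in. It ignores Sample_Project subdirectories, index reads, Undetermined files, and options such as --no-lane-splitting.

```python
import csv
import io


def predict_fastq_names(sample_sheet_text, lanes, reads=(1, 2)):
    """Predict FASTQ file names for the lanes that will actually be processed."""
    lines = sample_sheet_text.splitlines()
    # The [Data] section is a CSV header row followed by one row per sample.
    start = next(i for i, line in enumerate(lines) if line.strip().startswith("[Data]"))
    rows = csv.DictReader(io.StringIO("\n".join(lines[start + 1:])))
    sample_numbers = {}  # Sample_ID -> S number, in order of first appearance
    names = []
    for row in rows:
        sample_id = (row.get("Sample_ID") or "").strip()
        if not sample_id:
            continue  # skip padding rows such as ",,,,"
        number = sample_numbers.setdefault(sample_id, len(sample_numbers) + 1)
        prefix = (row.get("Sample_Name") or "").strip() or sample_id
        lane_field = (row.get("Lane") or "").strip()
        # Without a Lane column the sample is assumed to apply to every lane.
        row_lanes = [int(lane_field)] if lane_field else list(lanes)
        for lane in row_lanes:
            if lane not in lanes:
                continue  # lane excluded, e.g. by a --tiles restriction
            for read in reads:
                names.append(f"{prefix}_S{number}_L{lane:03d}_R{read}_001.fastq.gz")
    return names
```

In production the function would be called with all lanes of the flow cell, and in the test with only the lane kept by --tiles. Alternatively, listing the `*.fastq.gz` files in the output directory after bcl2fastq finishes sidesteps prediction entirely, at the cost of not knowing the job count up front.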

1

You can also generate test sample sheets using Illumina Experiment Manager (LINK). Note: this software is Windows-only.

0

Ok, I will try this one, thanks :)

1

Whoops, you're right. I'd overlooked the fact that part of my procedure in that tool is generating a sample sheet automatically, so I never included one in the run tarball. But as GenoMax said, you can make one. If you use that example as-is, just note that it's a bit of a weird case because of our protocol (single index, nonstandard barcoding), but you could make your own run directory in roughly the same way if needed.

0

Your best bet is to ask the people you are working for to provide you with a smaller dataset, say from a MiSeq. Raw data folders are rarely needed outside of the initial bcl2fastq processing, so your chances of finding a public download are not great.