Question

How to automate bcl2fastq so that it launches whenever a run finishes on an Illumina sequencer?

6

4.6 years ago

lakhujanivijay 5.8k

Hi All,

I am wondering what the most widely used method/program/tool is for launching bcl2fastq automatically as soon as a run finishes on the Illumina machine. I am about to start writing my own shell script for this, but before I do, I want to know whether ready-made solutions already exist.

The run folder nomenclature is as follows:

My idea is something like this:

Keep looking for a folder containing the "Instrument Serial No." in its name:
        every 15 minutes:
            Look inside the run log file for something that says "sequencing complete"
            if YES:
                   Launch bcl2fastq
            else:
                   PASS

If someone has a better idea that could be implemented, please let me know.
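
A minimal shell sketch of the polling idea above, not a ready-made tool. The log file name `RunLog.txt`, the completion text, and the flag file `.bcl2fastq_started` are placeholders assumed for illustration; substitute whatever your instrument actually writes.

```shell
#!/usr/bin/env bash
# Sketch only: the names below are placeholders, not Illumina conventions.
LOG_NAME=${LOG_NAME:-RunLog.txt}               # hypothetical run log file name
DONE_TEXT=${DONE_TEXT:-sequencing complete}    # text assumed to mark completion
FLAG_NAME=.bcl2fastq_started                   # hypothetical "already launched" flag

# Print run folders under ROOT whose name contains SERIAL, whose log
# mentions completion, and which have not been launched yet.
find_ready_runs() {
    local root="$1" serial="$2" run
    for run in "$root"/*"$serial"*/; do
        [ -d "$run" ] || continue                # glob matched nothing
        run=${run%/}
        [ -e "$run/$FLAG_NAME" ] && continue     # already handled
        grep -qi -- "$DONE_TEXT" "$run/$LOG_NAME" 2>/dev/null || continue
        printf '%s\n' "$run"
    done
}

# Check every 15 minutes and launch bcl2fastq once per finished run.
poll_forever() {
    local root="$1" serial="$2" run
    while true; do
        find_ready_runs "$root" "$serial" | while IFS= read -r run; do
            touch "$run/$FLAG_NAME"
            bcl2fastq -R "$run" </dev/null
        done
        sleep 900
    done
}

# Example (placeholder path and serial): poll_forever /path/to/output A00123
```

The flag file is there so that a run is not converted again on the next pass; a cron job calling `find_ready_runs` every 15 minutes would work just as well as the endless loop.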

illumina bcl2fastq automation • 4.6k views

ADD COMMENT • link updated 4.5 years ago by Charles Warden 8.2k • written 4.6 years ago by lakhujanivijay 5.8k

2

Look for one of these files in the FC folder: RTAComplete.txt (HiSeq/MiSeq), SequenceComplete.txt, or RTAComplete.txt/CopyComplete.txt (NovaSeq). They signify completion of sequencing, and that would be your signal to start processing/copying.

With MiSeq, SampleSheet.csv would have been provided at run start, so it should already be in the folder. With other sequencers you will need to inject the right SampleSheet.csv into the folder, or source it from your LIMS or another location, to start the analysis.

Note: Some sequencers continue to write data to other directories (HiSeq 4000, possibly NovaSeq) even after these files appear. To be safe, wait another hour before you start copying the data out or analyzing it.
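
As a rough sketch of the marker-file check with the extra waiting period, something like the following could work. The platform labels, the choice of CopyComplete.txt for NovaSeq, and the 60-minute default are assumptions for illustration, and `-mmin`/`-maxdepth` assume GNU or BSD find.

```shell
#!/usr/bin/env bash
# Sketch only: platform labels and the grace period are assumptions.

# Map a platform label to the marker file to wait for.
marker_for() {
    case "$1" in
        hiseq|miseq) echo RTAComplete.txt ;;
        novaseq)     echo CopyComplete.txt ;;
        *)           return 1 ;;
    esac
}

# Succeed only if the marker exists in the run folder and was last
# modified more than GRACE minutes ago (default 60).
run_ready() {
    local run="$1" platform="$2" grace="${3:-60}" marker
    marker=$(marker_for "$platform") || return 1
    [ -n "$(find "$run" -maxdepth 1 -name "$marker" -mmin +"$grace")" ]
}

# Example (placeholder path): run_ready /path/to/run novaseq && echo "safe to start"
```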

ADD REPLY • link 4.6 years ago by GenoMax 141k

0

The CopyComplete.txt and SequenceComplete.txt files are generated in the external location that we provided, but not on the local hard disk installed in the sequencer. Is that expected? Both locations have the RTAComplete.txt file, though.

ADD REPLY • link 4.6 years ago by lakhujanivijay 5.8k

1

If you are going to work from the external storage location then yes. We also do something similar.

ADD REPLY • link 4.6 years ago by GenoMax 141k

1

I am in the same situation, and was thinking about a similar solution. So curious about the replies that you get on this post.

ADD REPLY • link 4.6 years ago by gb ★ 2.2k

0

It is good to know how to automate running bcl2fastq, but I have encountered a few reasons why you may still need to run some base calling manually from the command line:

1) You have multiple library types (such as single-barcode, dual-barcode, 10X samples, custom UMI libraries, etc.). This means you may have to run bcl2fastq more than once, and/or run cellranger mkfastq instead of bcl2fastq.

2) Your original barcode information is not correct. With mixed library types, I think there is a decent chance that I will have to change a barcode after an initial round of base calling with bcl2fastq (before a second round to return user results). However, this may vary between individuals. It tends to happen less often if you don't mix barcode types (such as in a rapid run), but I think it can also happen more often if you have >50-100 samples of a given type in a run.

For example, I have withdrawn at least 1 record from the SRA because it was actually a mix of samples from different labs (one sample had the wrong barcode and was mixed into the sample that had the right barcode).

3) You might realize that you need to change some base calling parameters, or you might prefer non-default parameters (such as not allowing any barcode mismatches).
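
To illustrate points 1 and 3, here is a small sketch that builds one bcl2fastq command per sample sheet with zero barcode mismatches allowed. The paths and sheet names are placeholders; the function only prints the command so it can be reviewed before anything is run.

```shell
#!/usr/bin/env bash
# Sketch only: all paths are placeholders. Prints the command rather
# than running it.
demux_cmd() {
    local run="$1" sheet="$2" out="$3" mismatches="${4:-0}"
    printf 'bcl2fastq --runfolder-dir %s --sample-sheet %s --output-dir %s --barcode-mismatches %s\n' \
        "$run" "$sheet" "$out" "$mismatches"
}

# Example: one call per library type, each with its own sample sheet.
# demux_cmd /path/to/run single_index.csv /path/to/out/single
# demux_cmd /path/to/run dual_index.csv   /path/to/out/dual
```

Keeping one sample sheet and one output directory per library type makes it easier to rerun a single subset after correcting a barcode.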

It is not directly relevant to the automation question, but I have started a discussion about possible QC flags. The solution or explanation can vary, and a flag might indicate a need to slow down and process fewer samples more carefully, but I am mostly putting some ideas out there to discuss:

Calling Single-Barcode Samples from Mixed Runs as Dual-Barcode Samples | Possible Illumina Run QC Flags?

This is where I mention that I use a non-default setting of allowing 0 barcode mismatches. However, to be clear, I am not advocating allowing more mismatches (or changing parameters to artificially increase the number of reads provided for a sample) - in contrast, I am trying to better understand when runs and/or lanes need to be thrown out due to quality concerns.

ADD REPLY • link 4.5 years ago by Charles Warden 8.2k

1

@Charles: This is not an answer to the original question. You should consider moving this to a comment on the original post.

Edit: Adding some more thoughts.

While you bring up valid exceptions, it would be reasonable to expect that someone trying to automate bcl2fastq runs will have back-end infrastructure (e.g. a LIMS) that is used to track samples and orchestrate the management and analysis.

We do run a similar system and yes there are errors at times but they generally can be dealt with after the automated analysis runs. Exceptions like cellranger demux runs could also be programmed, if you do enough of them to warrant the additional work that would be needed to account for them.

ADD REPLY • link 4.5 years ago by GenoMax 141k

0

Thank you for the suggestion - I have accordingly converted the answer to a comment.

ADD REPLY • link 4.5 years ago by Charles Warden 8.2k

3

4.6 years ago

drkennetz ▴ 560

I wrote a more generalized shell script a while back to do the same thing. It runs stat on the run directory, which displays the status of a file or filesystem. I used stat rather than looking for a marker file because we transfer our data from the sequencer to an HPC filesystem, and files like RTAComplete.txt would sometimes show up before everything had finished copying over. The generalized script is as follows:

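#!/usr/bin/env bash
# Usage: ./dir_checker.sh /path/to/run/ [extra bcl2fastq options]
RUN_PATH=${1%/}; shift            # drop a trailing slash; remaining args go to bcl2fastq
cd "$RUN_PATH" || exit 1
RUN_PATH=$PWD                     # absolute path, so a relative argument still works

OLD_STAT="initial"

# Compare the directory's stat output between checks, 2 hours apart.
# stat only reports the top-level directory's own metadata, so writes
# inside nested subfolders may go unnoticed.
while true; do
    NEW_STAT=$(stat -t "$RUN_PATH")
    if [ "$OLD_STAT" != "$NEW_STAT" ]; then
        echo 'Directory is still updating'
        sleep 2h
        OLD_STAT=$NEW_STAT
        echo 'Checking again.'
    else
        echo 'Directory is done updating. Move on.'
        break
    fi
done
echo 'The Loop Has Been Left.'

bcl2fastq -R "$RUN_PATH" -r 6 -w 6 -p 8 "$@"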

It runs a status check on the run directory and stores the results, then checks every 2 hours. If the stat does not change, that means the directory is done updating and you can kick off bcl2fastq. Otherwise, you stay in the while loop.

To kick it off you just do ./dir_checker.sh /path/to/run/ and you may want to pass it to the background because it will occupy a terminal until it is done.

ADD COMMENT • link 4.6 years ago by drkennetz ▴ 560

3

You might also consider rolling a solution using inotify or similar. There are lots of bindings for it for CLI use now, e.g: https://github.com/dsoprea/PyInotify
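
A rough sketch of the inotify approach, using the `inotifywait` command from inotify-tools rather than the PyInotify bindings linked above. It assumes inotify-tools is installed and that CopyComplete.txt is the marker you care about; both are assumptions. Note also that inotify generally does not report changes made by another host on a network filesystem such as NFS, so this is only likely to work where the watcher runs on the machine doing the writes.

```shell
#!/usr/bin/env bash
# Sketch only: assumes inotify-tools is installed and CopyComplete.txt
# is the completion marker for your instrument.

# Given the path of a newly created file, print its run folder if the
# file is the completion marker; print nothing otherwise.
run_for_marker() {
    local path="$1"
    if [ "${path##*/}" = "CopyComplete.txt" ]; then
        printf '%s\n' "${path%/*}"
    fi
}

# Watch ROOT recursively and launch bcl2fastq when a marker appears.
watch_root() {
    local root="$1" path run
    inotifywait -m -r -q -e create -e moved_to --format '%w%f' "$root" |
    while IFS= read -r path; do
        run=$(run_for_marker "$path")
        if [ -n "$run" ]; then
            bcl2fastq -R "$run" </dev/null
        fi
    done
}

# Example (placeholder path): watch_root /path/to/output
```

A recursive watch on a sequencer output tree can mean a very large number of watches, so polling may still be the simpler option.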

ADD REPLY • link 4.6 years ago by Joe 21k

0

Thanks drkennetz

I will try that out

ADD REPLY • link 4.6 years ago by lakhujanivijay 5.8k

score 3 · Accepted Answer · 2019-11-06

Hi

Coming back to my own question with an answer. After a lot of careful consideration and talking to people at Illumina, it turned out that the creation of CopyComplete.txt (NovaSeq) is a good trigger to start bcl2fastq. Here is a little more information on that:

SequenceComplete.txt indicates that the sequencing run has finished.
CopyComplete.txt is created by the Universal Copy Service (UCS) when all files have been copied to their destinations and the run completion signal has been triggered.
RTAComplete.txt indicates that the images generated by the system have been converted to base calls. The base calls are stored as .bcl files, which can then be used as input for bcl2fastq to produce the FASTQ files.
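
For checking a run folder by hand, a small sketch like the one below reports which of the three files are present and treats CopyComplete.txt as the trigger. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative; the file names are the
# three listed above.

# Print one line per marker file saying whether it is present.
marker_report() {
    local run="$1" name
    for name in SequenceComplete.txt RTAComplete.txt CopyComplete.txt; do
        if [ -e "$run/$name" ]; then
            echo "$name present"
        else
            echo "$name missing"
        fi
    done
}

# Succeed only once CopyComplete.txt exists in the run folder.
ready_for_bcl2fastq() {
    [ -e "$1/CopyComplete.txt" ]
}

# Example (placeholder path):
# ready_for_bcl2fastq /path/to/run && bcl2fastq -R /path/to/run
```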