Audit trail for Bioinformatics software tools
2.8 years ago
I analyze several samples every day for variant analysis using an align-to-reference approach. For this I use various bioinformatics tools such as Bowtie2/BWA, Samtools, and Freebayes. Is there a way to know which version of each tool was used to process a particular sample? It should work like an audit trail, recording, say, that Sample1 was aligned using Bowtie2 vX.X.X, Sample2 was analysed using Bowtie2 vX.X.Y, and so on.

For example, the command bowtie2 --version reports the version of Bowtie2 installed on the system:

/usr/local/bin/bowtie2-align-s version 2.2.2

Another approach is to inspect the BAM header:

samtools view -H sample.sorted.bam

@HD VN:1.0 SO:coordinate
@SQ SN: reference LN:
@PG ID:bowtie2 PN:bowtie2 VN:2.2.2 CL:"/usr/local/bin/bowtie2-align-s --wrapper basic-0 -x -I 0 -X 1000 --fr -p 16 --local --passthrough -1 /tmp/40466.inpipe1 -2 /tmp/40466.inpipe2"

Neither of these commands gives me a record stating that Sample1 was processed using one version of Bowtie2, Sample2 using another, and so on.

I would like an audit trail that records, for each tool, which version was used to process which sample.


You could capture this information (bowtie2 --version) in the master analysis logs for your projects. The Unix command script can capture all interactive dialogue from a terminal session. The standard error and standard output logs captured from the analysis should include this information and can be saved.
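As a minimal sketch of that idea, the helper below appends one line per tool to a per-sample log. The function name log_versions, the log file name, and the tab-separated layout are hypothetical choices for illustration. It also assumes each tool answers to --version, which holds for bowtie2, samtools and freebayes as far as I know, but not for every tool (BWA, for instance, prints its version in its usage message instead).

```shell
# Append "timestamp<TAB>sample<TAB>tool<TAB>first line of --version output"
# to a log file, one row per tool. All names here are illustrative.
log_versions() {
    local sample="$1" log="$2"
    shift 2
    local tool version
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            # Assumption: the tool supports --version; keep only the first line.
            version="$("$tool" --version 2>&1 | head -n 1)"
        else
            version="NOT FOUND"
        fi
        printf '%s\t%s\t%s\t%s\n' \
            "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$sample" "$tool" "$version" >> "$log"
    done
}

# Example call at the start of a per-sample run (not executed here):
# log_versions Sample1 Sample1.audit.log bowtie2 samtools freebayes
```

Calling this at the top of the script that processes each sample ties the versions to the sample name and the time of the run, rather than to whatever happens to be installed when you check later.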

You could also use a workflow system such as Snakemake to automate these steps and log each action.
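Independently of any workflow system, the @PG header lines shown in the question already record the program name and version inside each BAM file, so a per-sample table can be assembled after the fact. The sketch below is one way to do that. The function name pg_versions and the output file audit_trail.tsv are hypothetical, and it assumes the header is tab-delimited as the SAM format specifies and that the BAM files are named <sample>.sorted.bam.

```shell
# Print one "sample<TAB>program<TAB>version" row per @PG header line.
# Reads SAM header text on stdin; the sample label is the first argument.
pg_versions() {
    local sample="$1"
    awk -F'\t' -v sample="$sample" '
        /^@PG/ {
            prog = ""; ver = "unknown"; id = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ID:/) id   = substr($i, 4)
                if ($i ~ /^PN:/) prog = substr($i, 4)
                if ($i ~ /^VN:/) ver  = substr($i, 4)
            }
            # Fall back to the ID tag when no PN tag is present.
            if (prog == "") prog = id
            print sample "\t" prog "\t" ver
        }'
}

# Example loop (not executed here; assumes samtools is on PATH):
# for bam in *.sorted.bam; do
#     samtools view -H "$bam" | pg_versions "${bam%.sorted.bam}"
# done >> audit_trail.tsv
```

This only covers tools that write a @PG line into the file, so it complements a run-time log rather than replacing it.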

Indeed, for audit trails in corporate and clinical settings, I produce a log for each sample that looks something like:

Beginning analysis script on Wed  6 Sep 11:56:47 UTC 2017, run by KevinBlighe with the following parameters:
    1   /home/ubuntu/Placa4.tmp/71/Files/71_S2_L001_R1_001.fastq.gz
    2   /home/ubuntu/Placa4.tmp/71/Files/71_S2_L001_R2_001.fastq.gz
    3   /home/ubuntu/reference/hg38.fasta
    4   Placa4
    5   GNT071
    6   /home/ubuntu/pipeline/BED/Versao1_Sorted.hg38.bed
    7   NULL
    8   GNT081
    9   1.333333
    10  0.666667
    11  20
    12  70
    13  illumina
    14  18
    15  50
    16  relaxed
    17  /home/ubuntu/pipeline/validation/Full/
    18  KevinBlighe
Beginning analysis step 1 (adaptor and read quality trimming) on Wed  6 Sep 11:56:47 UTC 2017
Beginning analysis step 2 (alignment) on Wed  6 Sep 11:57:20 UTC 2017
Beginning analysis step 3 (marking and removing PCR duplicates) on Wed  6 Sep 11:58:13 UTC 2017
Beginning analysis step 4 (remove low mapping quality reads) on Wed  6 Sep 11:58:28 UTC 2017
Beginning analysis step 5 (QC) on Wed  6 Sep 11:58:31 UTC 2017
Beginning analysis step 6 (downsampling / random read sampling) on Wed  6 Sep 11:58:46 UTC 2017
Beginning analysis step 7 (variant calling) on Wed  6 Sep 11:58:53 UTC 2017
Beginning analysis step 8 (annotation) on Wed  6 Sep 12:00:26 UTC 2017
Skipping analysis step 9 (PCR results and CNV analysis) - no results file provided
Beginning analysis step 10 (customising VCF for haplotype identification) on Wed  6 Sep 12:03:02 UTC 2017
Beginning post-analysis tidy-up on Wed  6 Sep 12:03:02 UTC 2017
Analysis script finished on Wed  6 Sep 12:03:02 UTC 2017
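A log of this shape can be produced with a couple of small helpers called from the analysis script. This is only a sketch, not the script that generated the log above: the function names log_start and log_step are hypothetical, and the user name is taken from id -un here, which is an assumption about how it would be obtained.

```shell
# Record the start of a run, who ran it, and the numbered parameters.
log_start() {
    local log="$1"
    shift
    printf 'Beginning analysis script on %s, run by %s with the following parameters:\n' \
        "$(date -u)" "$(id -un)" >> "$log"
    local i=1 param
    for param in "$@"; do
        printf '    %s   %s\n' "$i" "$param" >> "$log"
        i=$((i + 1))
    done
}

# Write a timestamped line marking the start of one analysis step.
log_step() {
    local log="$1" step="$2" description="$3"
    printf 'Beginning analysis step %s (%s) on %s\n' \
        "$step" "$description" "$(date -u)" >> "$log"
}

# Example calls inside a pipeline script (not executed here):
# log_start "$sample.log" "$@"
# log_step  "$sample.log" 2 "alignment"
```

Keeping one log file per sample means the parameters, the operator, and the timing of every step can be retrieved for any individual result.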

The versions of the programs used are stored elsewhere, and there is also a standard operating procedure, which is itself versioned and carries a date for its next review.


