Hi BioStar Leaders,
In our lab, I have designed an exome pipeline based on BWA and GATK HaplotypeCaller, and it conforms to the 2016/2017 guidelines from the College of American Pathologists (CAP) and the Association for Molecular Pathology (AMP).
In 2018, AMP and CAP published updated guidelines (PMID: 29154853). Their Recommendation 10 states that sample identity should be preserved with 4 identifiers (Sample Id, Patient Id, Run Id, Lab_Location Id) inside the metadata of the FASTQ, BAM and VCF files.
For VCF files, I was able to easily add the 4 identifiers to the file's meta-information lines (the lines starting with ##) using "bcftools annotate -h".
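For anyone wanting to reproduce the VCF step, here is a minimal Python sketch that only generates the ## lines to append. The key names (sampleId, patientId, runId, labLocationId) and the example values are my own assumptions for illustration; neither the guideline nor the VCF specification prescribes them.

```python
def make_vcf_meta_lines(identifiers):
    """Return one VCF meta-information line (##key=value) per identifier."""
    lines = []
    for key, value in identifiers.items():
        value = str(value)
        # A meta-information line must stay on a single line.
        if "\n" in key or "\n" in value:
            raise ValueError(f"newline not allowed in {key!r}")
        lines.append(f"##{key}={value}")
    return lines


# Hypothetical identifier names and values, for illustration only.
ids = {
    "sampleId": "S001",
    "patientId": "P123",
    "runId": "RUN42",
    "labLocationId": "LabA",
}
header_lines = make_vcf_meta_lines(ids)
print("\n".join(header_lines))
```

The printed lines can be saved to a text file and passed to "bcftools annotate -h" as described above.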
For BAM files, I used the following read group (@RG) tags to insert 3 of the 4 identifiers: the CN tag for Lab_Location Id, the ID tag for Run Id, and the SM tag for Sample Id. In the SAM/BAM format specification I could not find a placeholder (tag) for Patient Id. Where can one put the Patient Id in a SAM/BAM file? Could the DS (description) tag be used as a substitute to hold it?
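To make the read group mapping concrete, here is a small Python sketch that assembles such an @RG header line. Storing the Patient Id in DS as "PatientId=..." is only the workaround I am asking about, not something the SAM specification defines, and the example values are made up.

```python
def make_read_group_line(run_id, sample_id, lab_location_id, patient_id):
    """Build a SAM @RG header line carrying the four identifiers."""
    fields = [
        ("ID", run_id),                      # read group ID, used here for Run Id
        ("SM", sample_id),                   # sample name
        ("CN", lab_location_id),             # sequencing center, used for Lab_Location Id
        ("DS", f"PatientId={patient_id}"),   # assumption: free-text description holds Patient Id
    ]
    for tag, value in fields:
        # Header fields are tab-separated, so values cannot contain tabs or newlines.
        if "\t" in value or "\n" in value:
            raise ValueError(f"tab or newline not allowed in {tag} value")
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# Hypothetical values, for illustration only.
rg_line = make_read_group_line("RUN42", "S001", "LabA", "P123")
print(rg_line)
```

Each read would then reference this read group through its RG tag, so the identifiers are stored once in the header rather than once per read.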
For FASTQ files, the format is rigid and Run Id is the only identifier that goes into the "Sequence Identifier" line. Is it possible to insert Sample Id, Patient Id, and Lab_Location Id into the FASTQ files as well?
It would be great to hear ideas from folks who have tried to make their pipelines comply with the 2018 guidelines.
Thanks, gsr
Use an unsorted/unmapped BAM (uBAM) to store the reads with a read group.
You could put that information in the FASTQ read header, but then the header would no longer resemble the standard ones produced by current sequencers (though there is no common format). You would also have to replicate that information in the header of every read, which would be an incredible waste of disk space.
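To put a rough number on that waste, here is a minimal Python sketch that appends key=value pairs to a FASTQ header line and estimates the extra uncompressed bytes. The key=value layout, the example header, and the read count are assumptions for illustration, not an established convention.

```python
def annotate_fastq_header(header, identifiers):
    """Append key=value pairs to the end of a FASTQ '@' header line."""
    if not header.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    extra = " ".join(f"{key}={value}" for key, value in identifiers.items())
    return f"{header} {extra}"


def overhead_bytes(identifiers, n_reads):
    """Estimate the uncompressed bytes added when every read repeats the identifiers."""
    extra = " " + " ".join(f"{key}={value}" for key, value in identifiers.items())
    return len(extra.encode("ascii")) * n_reads


# Hypothetical identifiers, header and read count, for illustration only.
ids = {"SampleId": "S001", "PatientId": "P123", "LabLocationId": "LabA"}
header = "@RUN42:1:1101:1000:2000 1:N:0:1"
new_header = annotate_fastq_header(header, ids)
print(new_header)
print(overhead_bytes(ids, 100_000_000), "extra bytes for 100 million reads")
```

Compression would shrink the repeated text considerably, but a read group in a BAM header stores the same identifiers only once.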