How to preserve Sample Identity within the metadata of FASTQ, BAM and VCF files; CAP AMP 2018 Guidelines - Recommendation 10
5.6 years ago
gsr9999 ▴ 300

Hi BioStar Leaders,

In our lab, I have designed an exome pipeline based on BWA and GATK HaplotypeCaller, and it conforms to the 2016/2017 guidelines from the College of American Pathologists (CAP) and the Association for Molecular Pathology (AMP).

In 2018, AMP and CAP published updated guidelines (PMID: 29154853). Their Recommendation 10 states that sample identity should be preserved with 4 identifiers (Sample Id, Patient Id, Run Id, Lab_Location Id) inside the metadata of the FASTQ, BAM and VCF files.

For VCF files, I was able to add the 4 identifiers to the VCF file's meta-information lines (the lines starting with ##) using "bcftools annotate -h".
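As a rough illustration of the VCF approach, here is a minimal sketch that builds the extra `##key=value` header lines. The key names (`sampleId`, `patientId`, `runId`, `labLocationId`) and the function name are hypothetical choices for this example; neither the guidelines nor the VCF specification prescribe them.

```python
def vcf_identity_header_lines(sample_id, patient_id, run_id, lab_location_id):
    """Build VCF meta-information lines (##key=value) for the four identifiers.

    The key names below are hypothetical; pick whatever your lab standardizes on.
    """
    fields = [
        ("sampleId", sample_id),
        ("patientId", patient_id),
        ("runId", run_id),
        ("labLocationId", lab_location_id),
    ]
    lines = []
    for key, value in fields:
        # A header line must stay on a single line.
        if "\n" in value or "\r" in value:
            raise ValueError(f"{key} must not contain line breaks")
        lines.append(f"##{key}={value}")
    return lines


if __name__ == "__main__":
    # Print the lines so they can be redirected into a header file.
    print("\n".join(vcf_identity_header_lines("S1", "P1", "R1", "LabA")))
```

The printed lines could be saved to a file (say `identity_header.txt`, an assumed name) and appended to the VCF header with `bcftools annotate -h identity_header.txt`.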

For BAM files, I used the following read group (@RG) tags to insert 3 of the 4 identifiers: the CN tag for Lab_Location Id, the ID tag for Run Id, and the SM tag for Sample Id. In the SAM/BAM format specification I could not find a placeholder (i.e., a tag) for Patient Id, so I wonder where one can put the Patient Id in a SAM/BAM file. Could the DS (description) tag be used as a substitute to hold the Patient Id?
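To make the read group mapping concrete, here is a minimal sketch that assembles an `@RG` header line with the tag assignment described above. Putting the Patient Id into DS as `patient_id=...` is only the idea under discussion, not an established convention, and the function name is made up for this example.

```python
def read_group_header(run_id, sample_id, lab_location_id, patient_id):
    """Build a SAM @RG header line carrying the four identifiers.

    ID -> Run Id, SM -> Sample Id, CN -> Lab_Location Id.
    DS -> free-text description; storing the Patient Id here is an assumption.
    """
    fields = [
        ("ID", run_id),
        ("SM", sample_id),
        ("CN", lab_location_id),
        ("DS", f"patient_id={patient_id}"),
    ]
    for tag, value in fields:
        # Header fields are tab-separated, so values cannot contain tabs or newlines.
        if "\t" in value or "\n" in value:
            raise ValueError(f"{tag} value must not contain tabs or newlines")
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


if __name__ == "__main__":
    print(read_group_header("R1", "S1", "LabA", "P1"))
```

The resulting line has the same shape as the read group string that aligners such as `bwa mem` accept through `-R`.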

For FASTQ files, the format is rigid, and 'Run Id' is the only identifier that goes into the "Sequence Identifier" row. I wonder whether it is possible to insert Sample Id, Patient Id, and Lab_Location Id into the FASTQ files?

It would be great to hear ideas from folks who have tried to make their pipelines comply with the 2018 guidelines.

Thanks, gsr

next-gen alignment sequencing • 1.6k views

For FASTQ files, the format is rigid, and 'Run Id' is the only identifier that goes into the "Sequence Identifier" row. I wonder whether it is possible to insert Sample Id, Patient Id, and Lab_Location Id into the FASTQ files?

Use an unsorted/unmapped BAM (uBAM) to store the reads together with a read group.
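To show what the uBAM suggestion amounts to, here is a minimal sketch that turns single-end FASTQ text into unaligned SAM text with a read group attached to every record. It assumes single-end reads and four-line FASTQ records, and the function name is hypothetical. In practice a dedicated tool such as Picard FastqToSam is the safer route; the text produced here would still need converting to BAM, for example with `samtools view -b`.

```python
def fastq_to_unaligned_sam(fastq_text, rg_line, rg_id):
    """Convert single-end FASTQ text to unaligned SAM text tagged with a read group.

    rg_line is a complete @RG header line; rg_id must match its ID field.
    """
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")

    out = ["@HD\tVN:1.6\tSO:unsorted", rg_line]
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        qname = header[1:].split()[0]  # read name without the trailing comment
        # FLAG 4 = segment unmapped; reference, position and CIGAR are left unset.
        record = [qname, "4", "*", "0", "0", "*", "*", "0", "0", seq, qual, f"RG:Z:{rg_id}"]
        out.append("\t".join(record))
    return "\n".join(out) + "\n"
```

Because the identifiers live once in the `@RG` header line and each read only carries the short `RG:Z:` tag, nothing has to be squeezed into the FASTQ read names.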


I wonder whether it is possible to insert Sample Id, Patient Id, and Lab_Location Id into the FASTQ files?

You could put that information in the read header, but then the header would no longer resemble the standard ones produced by current sequencers (though there is no common format). You would also have to replicate that information in every read header, which would be an incredible waste of disk space.
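To put a rough number on that overhead, here is a minimal sketch that appends the identifiers as a `key=value` comment to each FASTQ read header and estimates the extra uncompressed bytes. The comment layout and both function names are hypothetical, and compressed files would grow by less than this estimate.

```python
def _identity_comment(identifiers):
    """Join the identifiers into a single 'key=value key=value' comment string."""
    return " ".join(f"{key}={value}" for key, value in identifiers.items())


def annotate_fastq_headers(fastq_text, identifiers):
    """Append the identifier comment to every read header (first line of each record)."""
    comment = _identity_comment(identifiers)
    lines = fastq_text.strip().splitlines()
    out = []
    for i, line in enumerate(lines):
        # Every fourth line, starting with the first, is a read header.
        out.append(f"{line} {comment}" if i % 4 == 0 else line)
    return "\n".join(out) + "\n"


def header_overhead_bytes(identifiers, n_reads):
    """Estimate extra uncompressed bytes: one separating space plus the comment, per read."""
    return (len(_identity_comment(identifiers).encode()) + 1) * n_reads
```

For example, a 29-byte comment costs 30 bytes per read, which is about 3 GB of extra uncompressed text for 100 million reads.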
