I have N individuals (say, ind01, ..., ind96), all sequenced on the same lane, say the third one the flowcell, and I followed the htslib workflow:
- First, I demultiplexed the raw reads to get one fastq file per individual.
- Then, I aligned them on the reference genome with bwa using option -R '@RG\tID:3\tSM:ind01' for the first individual, and so on. (I then used samtools fixmate, and samtools sort.)
Thus, I have one bam file per individual (N in total for the lane).
I now would like to do local indel realignment as well as base quality score recalibration with GATK. As the tools work per lane, I would like to merge all bam files obtained at the end of step 2 above. But I don't know how to handle the @RG tags.
According to the SAM specification, each @RG line must have a unique ID. As BaseRecalibrator is read-group-aware, the ID should be "3". But I will then loose the information of which reads come from which individual (the sample tag, SM). I read the documentation on "samtools merge", but it doesn't look like it can handle my case.
Of course, once this 3rd step is done, I will want to split again the bam file into one bam file per individual. Moreover, as the same individual may be sequenced on several lanes, I will want to merge the bam files per individual across lanes before doing the SNP and genotype calling.