I am trying to run GATK DepthOfCoverage on some BAM files that I have merged from two original files (the same sample was sequenced on two lanes to maximize the number of reads). I realized after the fact that my merged file has reads with different read groups (as reflected by the RG field of each read), and that the header of my two original files differed in their @RG fields.
I have tried running samtools reheader adding a new @RG field in the header, but when I merge the two files each read group is based on the name of the two BAM files, not on the name of the @RG in the headers of the two BAM files.
For example, my two starting samples are:
but when I merge them using samtools merge
samtools merge merged.bam 27163.pe.markdup.bam 27091.pe.markdup.bam
The resulting merged.bam has the same @RG field in the header as only one of the two, and each of the reads has a read name based on the name of the file it came from as such:
Read 1 will have RG:Z:27091.pe.markdup
Read 2 will have RG:Z:27163.pe.markdup
etcetera for the rest of the reads in the BAM
Am I doing something wrong? Should I rehead each of the original files before merging? Or simply rehead after merging to something that is compatible with GATK? It seems like no matter what the @RG field in the header is before merging, the merged file will always have reads with different RGs based on the name of the two input files.
I am also not sure what does GATK DepthOfCoverage want as input in terms of read groups. Does it want a single RG for all reads? In that case, should I use something different than samtools merge?
Thanks in advance for any help you can give me.