Should the LB field in the SAM specification refer to the library preparation for the sample, or to the library preparation carried out by the sequencing centre? Say I have a sample sequenced on multiple lanes of a single flowcell/machine: should the reads from all lanes have the same library name? Or suppose a sample was sequenced on one lane/flowcell/machine on a certain date and then sequenced again on a different lane/flowcell/machine: would the reads from these two runs have the same library name?
My question arises because, when I want to remove duplicates from multiplexed samples (all sequenced on the same machine and date), I normally align the FASTQ files separately, merge the BAM files belonging to the same sample, and run MarkDuplicates on the merged BAM. However, I recently asked on the GATK forum whether read group information is necessary in this context, and the answer was yes (http://gatkforums.broadinstitute.org/gatk/discussion/9310/read-group-information-required-for-markduplicates).
This confused me: if the sample was produced from a single library, shouldn't merging followed by duplicate removal based on the 5' position alone remove all duplicates (both optical and library)?
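
To make my reasoning concrete, here is a toy sketch in Python. It assumes duplicates are grouped by library, chromosome, 5' position and strand, which is only my simplified mental model and not MarkDuplicates' actual implementation. The read names, read group IDs (`lane1`, `lane2`) and library names (`libA`, `libB`) are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy reads: (read name, read group ID, chromosome, 5' position, strand).
READS = [
    ("r1", "lane1", "chr1", 100, "+"),
    ("r2", "lane2", "chr1", 100, "+"),
    ("r3", "lane1", "chr1", 250, "-"),
]


def mark_duplicates(reads, rg_to_library):
    """Return the names of reads flagged as duplicates.

    Simplified assumption: reads are duplicates when they share library,
    chromosome, 5' position and strand; the first read seen is kept.
    """
    seen = defaultdict(list)
    for name, rg, chrom, pos, strand in reads:
        key = (rg_to_library[rg], chrom, pos, strand)
        seen[key].append(name)
    # Everything after the first read in each group is a duplicate.
    return sorted(n for names in seen.values() for n in names[1:])


# Both lanes carry the same library name (LB).
same_library = {"lane1": "libA", "lane2": "libA"}
# Each lane is given its own library name.
different_library = {"lane1": "libA", "lane2": "libB"}

print(mark_duplicates(READS, same_library))       # ['r2']
print(mark_duplicates(READS, different_library))  # []
```

Under this model, giving both lanes the same LB behaves exactly like removing duplicates on 5' position alone, while giving each lane its own LB would let the cross-lane duplicate through. That is why I am unsure what LB is meant to encode.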