Hi, I was wondering if somebody could give me more details about read groups (RG) and how programs (Picard, samtools) use them.
What does a read group exactly describe? When should I make a different read group?
As far as I understand, a read group should be assigned to a particular library sequenced at a particular time. That means two libraries will always have different RG IDs, but the same library could have several different RGs. Is that right? Should anything else trigger the creation of a new RG?
When programs use RG information (such as Picard when marking duplicate reads), do they actually compare the libraries? So if I merge files with, let's say, 3 different libraries and 10 different RG IDs and then mark duplicates, does it look for duplicates only within a given library? Or across all reads? Or only within a given RG ID?
If you have links to sites with a detailed explanation, that would be very useful.
I'm trying to find confirmation that MarkDuplicates actually marks duplicates on a per-library basis. Does anyone know if this is documented somewhere? It isn't obvious from the Picard manual.
"MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes." https://software.broadinstitute.org/gatk/documentation/article.php?id=6472
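To make the quoted rule a bit more concrete, here is a minimal Python sketch (not Picard's code) that groups the @RG lines of a SAM header by their LB tag. The read group IDs, sample name and library names are made up for illustration, and the "<no LB>" placeholder for a missing LB tag is my own assumption, not a statement about what Picard does in that case.

```python
from collections import defaultdict


def read_groups_by_library(header_lines):
    """Map each LB (library) value to the RG IDs that declare it."""
    libraries = defaultdict(list)
    for line in header_lines:
        if not line.startswith("@RG"):
            continue  # skip @HD, @SQ, @PG, ... lines
        # After the record type, @RG fields are tab-separated TAG:VALUE pairs.
        # Split on the first ':' only, since values may contain colons.
        fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        # "<no LB>" is a placeholder chosen for this sketch (assumption).
        libraries[fields.get("LB", "<no LB>")].append(fields["ID"])
    return dict(libraries)


# Hypothetical header: one library sequenced on two lanes, plus a second library.
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:run1.lane1\tSM:sampleA\tLB:libA\tPL:ILLUMINA",
    "@RG\tID:run1.lane2\tSM:sampleA\tLB:libA\tPL:ILLUMINA",
    "@RG\tID:run2.lane1\tSM:sampleA\tLB:libB\tPL:ILLUMINA",
]

groups = read_groups_by_library(header)
for library, rg_ids in groups.items():
    print(library, rg_ids)
```

Going by the quoted documentation, read groups that end up under the same LB here (run1.lane1 and run1.lane2) are the ones that "might contain molecular duplicates", while the RG ID on its own only tells the lanes apart.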