Read Groups in SAM/BAM Files: What Exactly Do They Describe?
3
38
10.3 years ago

Hi, I was wondering if somebody could give me more details about read groups (RG) and how programs (Picard, samtools) use them.

What does a read group exactly describe? When should I make a different read group?

As far as I understand, a read group should be assigned to a certain library, sequenced at a certain time. That means that two libraries will always have different RG IDs, but the same library could have several different RGs. Is that right? Should anything else trigger the creation of a new RG?

When programs use RG information (such as Picard when marking duplicate reads), do they actually compare the libraries? So if I merge files with, let's say, 3 different libraries and 10 different RG IDs and then mark duplicates, do they look for duplicates only within a given library? Or across all reads? Or only within a given RG ID?

If you have links to sites with a detailed explanation, that would be very useful.

Thanks

bam samtools next-gen • 38k views
24
10.3 years ago

Quoting from the GATK FAQ:

Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.

...a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model...GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample.

My understanding is that a read group means, roughly, "a set of reads that were all the product of a single sequencing run on one lane". If you have multiplexed samples in a single lane, you will get multiple samples in a single read group. If you sequenced the same sample in several lanes, you will have multiple read groups for the same sample.
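As a rough illustration of the "same sample in several lanes" case, here is a minimal Python sketch that writes one `@RG` header line per lane. The tags (`ID`, `SM`, `LB`, `PU`, `PL`) come from the SAM spec; the function name `rg_header_line` and all the values (`FLOWCELL1`, `sampleA`, `libA`) are made up for the example.

```python
def rg_header_line(rg_id, sample, library, platform_unit, platform="ILLUMINA"):
    """Format one @RG header line; fields are tab-separated TAG:value pairs."""
    fields = [
        ("ID", rg_id),          # read group identifier, unique within the file
        ("SM", sample),         # sample name
        ("LB", library),        # library
        ("PU", platform_unit),  # platform unit, e.g. flowcell plus lane
        ("PL", platform),       # sequencing platform
    ]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# Hypothetical example: one sample, one library, two lanes of the same flowcell.
lanes = [1, 2]
header = [
    rg_header_line(
        rg_id=f"FLOWCELL1.{lane}",
        sample="sampleA",
        library="libA",
        platform_unit=f"FLOWCELL1.{lane}",
    )
    for lane in lanes
]
print("\n".join(header))
```

Both lines carry `SM:sampleA`, so a tool that groups by sample sees one sample, while the two `ID` values keep the lanes apart.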

8
10.3 years ago
Wen.Huang ★ 1.2k

I am pretty sure MarkDuplicates does the marking per "Library". This is why the RG (which contains the library info) is critical information, especially if you have multiple libraries in the same BAM. If you have multiple RGs within a library (which is not uncommon), all of those RGs are considered together.

0

I'm trying to find confirmation that MarkDuplicates actually does marking on a per "Library" basis. Does anyone know if this is documented somewhere? It doesn't seem obvious from the Picard manual.

0

"MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes." https://software.broadinstitute.org/gatk/documentation/article.php?id=6472
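To see which read groups would be pooled under that rule, one can group the `@RG` header lines by their `LB` value. The sketch below only illustrates the grouping idea from the quote; it is not Picard's implementation. The helper names, the example header, and the `"unknown"` bucket for read groups without `LB` are all assumptions made for the example.

```python
from collections import defaultdict


def parse_rg_line(line):
    """Turn one @RG header line into a dict of tag -> value."""
    fields = line.rstrip("\n").split("\t")[1:]  # drop the leading "@RG"
    return dict(field.split(":", 1) for field in fields)


def read_groups_by_library(header_lines):
    """Map each LB value to the sorted list of read group IDs that carry it."""
    by_library = defaultdict(list)
    for line in header_lines:
        if line.startswith("@RG"):
            rg = parse_rg_line(line)
            # Assumption for this sketch: read groups without LB share one bucket.
            by_library[rg.get("LB", "unknown")].append(rg["ID"])
    return {lib: sorted(ids) for lib, ids in by_library.items()}


# Hypothetical header: libA was sequenced on two lanes, libB on one.
example_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:fc1.1\tSM:s1\tLB:libA",
    "@RG\tID:fc1.2\tSM:s1\tLB:libA",
    "@RG\tID:fc1.3\tSM:s1\tLB:libB",
]
groups = read_groups_by_library(example_header)
print(groups)
```

Here `fc1.1` and `fc1.2` end up under the same library, so they are the read groups that "might contain molecular duplicates" of each other in the sense of the quote.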

4
6.3 years ago
bshifaw ▴ 50

The GATK website has a detailed explanation for read groups found in their dictionary. Here is an excerpt:

"Read groups are identified in the SAM/BAM/CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups."

3

To be honest, haha, when I first read that I was just more confused. I think a simpler description is that the RGID is just some text that you can stick on to every read individually. It's supposed to be in a format that uniquely identifies where that read came from. So my RGIDs all look like [sequencer_id].[flowcell_id].[lane_number].[adapter_sequence_or_NA], for example HWI-ST552.C5WGPACXX.1.AGTTC. Note that the order and the dots are totally irrelevant. There is no standard here.
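For what it's worth, that naming scheme is easy to script. A minimal Python sketch, where `make_rgid` is a made-up helper name and the dot-joined layout is just the personal convention described above, not a standard:

```python
def make_rgid(sequencer_id, flowcell_id, lane_number, adapter_sequence=None):
    """Join run details into a read group ID; a missing adapter becomes "NA"."""
    parts = [sequencer_id, flowcell_id, str(lane_number), adapter_sequence or "NA"]
    return ".".join(parts)


rgid = make_rgid("HWI-ST552", "C5WGPACXX", 1, "AGTTC")
print(rgid)  # HWI-ST552.C5WGPACXX.1.AGTTC
```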

So in the same BAM file you can have reads from different lanes/adaptors/etc. but still compute things like quality score bias on each lane/adaptor individually. There's a bit more to it, though: in the header of the BAM there will (hopefully) be a row for each of the RGIDs found in the BAM with some extra data too, like PU/PL/SM, etc. (check the SAM spec), but that stuff is really fairly irrelevant. Sample (SM) is important for GATK, since it won't merge variants unless they come from the same sample, but that's it. If you're worried about space, it's good to choose a short RGID, since it's stored as ASCII text on every single read.
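To make the "stored on every single read" part concrete: in SAM text, each alignment line carries its read group as an optional `RG:Z:<id>` field after the 11 mandatory columns. The Python sketch below counts reads per read group from SAM lines; the function names and the three toy records are invented for the example, and for a real BAM you would first convert to SAM text or use a BAM library.

```python
from collections import Counter


def read_group_of(sam_line):
    """Return the value of the RG:Z: optional field, or None if absent."""
    # Optional fields follow the 11 mandatory SAM columns.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("RG:Z:"):
            return field[len("RG:Z:"):]
    return None


def count_reads_per_group(sam_lines):
    """Count alignment records per read group, skipping header lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header lines start with "@"
            continue
        counts[read_group_of(line)] += 1
    return counts


# Toy records (hypothetical): two reads from lane 1, one from lane 2.
mandatory = "0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
sam_lines = [
    "@RG\tID:fc1.1\tSM:s1",
    "@RG\tID:fc1.2\tSM:s1",
    f"r1\t{mandatory}\tRG:Z:fc1.1",
    f"r2\t{mandatory}\tNM:i:0\tRG:Z:fc1.1",
    f"r3\t{mandatory}\tRG:Z:fc1.2",
]
counts = count_reads_per_group(sam_lines)
print(counts)
```

The same per-read lookup is what lets a tool compute statistics separately for each lane or adaptor within one merged file.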