Question: Read Group In Sam/Bam Files: What Do They Exactly Describe?
Asked 8.6 years ago by Stefano Berri (Cambridge, UK):

Hi, I was wondering if somebody could give me more details about read groups (RG) and how programs (Picard, samtools) use them.

What does a read group exactly describe? When should I make a different read group?

As far as I understand, a read group should be assigned to a certain library, sequenced at a certain time. That means that two libraries will always have different RG IDs, but the same library could have several different RGs. Is that right? Should anything else trigger the creation of a new RG?

When programs use RG information (such as Picard when marking duplicate reads), do they actually compare the libraries? So if I merge files with, let's say, 3 different libraries and 10 different RG IDs and then mark duplicates, do they look for duplicates only within a given library? Or across all reads? Or only within a given RG ID?

If you have links to sites with some detailed explanation, that would be very useful.


next-gen samtools bam • 32k views
modified 4.6 years ago by bshifaw • written 8.6 years ago by Stefano Berri
Answered 8.6 years ago by David Quigley (San Francisco):

Quoting from the GATK FAQ:

Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.

...a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model...GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample.

My understanding is that a read group means, roughly, "a set of reads that were all the product of a single sequencing run on one lane". If you have multiplexed samples in a single lane, you will get multiple samples in a single read group. If you sequenced the same sample in several lanes, you will have multiple read groups for the same sample.
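
As a rough illustration of the last point, here is a minimal Python sketch that writes one `@RG` header line per lane for a single sample. The helper `make_rg_line`, the sample name `sampleA`, the library name `libA` and the flowcell ID `FLOWCELL1` are all made up for the example; only the tag names (ID, SM, LB, PU, PL) come from the SAM specification.

```python
def make_rg_line(rg_id, sample, library, platform_unit, platform="ILLUMINA"):
    """Build one tab-separated @RG header line (hypothetical helper)."""
    fields = [
        ("ID", rg_id),          # read group identifier, unique within the file
        ("SM", sample),         # sample name
        ("LB", library),        # library name
        ("PU", platform_unit),  # platform unit, e.g. flowcell plus lane
        ("PL", platform),       # sequencing platform
    ]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# One sample, one library, sequenced on three lanes of the same flowcell:
# three read groups that all share the same SM and LB values.
lanes = [1, 2, 3]
rg_lines = [
    make_rg_line(f"FLOWCELL1.{lane}", "sampleA", "libA", f"FLOWCELL1.{lane}")
    for lane in lanes
]

for line in rg_lines:
    print(line)
```

All three lines carry the same SM value, which is what lets a tool treat them as data for the same sample while still keeping the lanes apart.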

modified 11 months ago by _r_am • written 8.6 years ago by David Quigley

Answered 8.6 years ago by Wen.Huang:

I am pretty sure MarkDuplicates does the marking per "Library". This is why the RG (which contains library info) is critical information, especially if you have multiple libraries in the same BAM. If you have multiple RGs within a library (which is not uncommon), all of those RGs are considered together.
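
To make the per-library idea concrete, here is a small Python sketch that groups read groups by their LB tag. This is not Picard's code, just an illustration of which read groups would be pooled if duplicates are compared within a library. The function names and the example header lines are hypothetical.

```python
from collections import defaultdict


def parse_rg_line(line):
    """Turn one '@RG<TAB>ID:x<TAB>LB:y...' header line into a dict of tags."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    # Split on the first colon only, since values may contain colons.
    return dict(field.split(":", 1) for field in fields[1:])


def read_groups_by_library(header_lines):
    """Map each library (LB) to the list of read group IDs that belong to it."""
    by_library = defaultdict(list)
    for line in header_lines:
        if line.startswith("@RG"):
            rg = parse_rg_line(line)
            # Read groups without an LB tag are pooled under a placeholder name.
            by_library[rg.get("LB", "unknown")].append(rg["ID"])
    return dict(by_library)


# Made-up header: two libraries, three read groups.
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:rg1\tSM:sampleA\tLB:libA\tPU:FC1.1",
    "@RG\tID:rg2\tSM:sampleA\tLB:libA\tPU:FC1.2",
    "@RG\tID:rg3\tSM:sampleA\tLB:libB\tPU:FC2.1",
]

groups = read_groups_by_library(header)
print(groups)
```

Under the assumption above, reads from `rg1` and `rg2` would be checked against each other for duplicates, while `rg3` would be handled separately.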

modified 8.6 years ago • written 8.6 years ago by Wen.Huang

I'm trying to find confirmation that MarkDuplicates actually does marking on a per "Library" basis. Does anyone know if this is documented somewhere? It doesn't seem obvious from the Picard manual.

Reply written 5.9 years ago by Malachi Griffith

"MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes."

Reply written 18 months ago by Lee Baker

Answered 4.6 years ago by bshifaw (United States):

The GATK website has a detailed explanation for read groups found in their dictionary. Here is an excerpt:

"Read groups are identified in the SAM/BAM/CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups."

written 4.6 years ago by bshifaw

To be honest, haha, when I read that the first time I was just more confused. I think a simpler description is that the RGID is just some text that you can stick onto every read individually. It's supposed to be in a format that uniquely identifies where that read came from. So my RGIDs all look like [sequencer_id].[flowcell_id].[lane_number].[adapter_sequence_or_NA], for example HWI-ST552.C5WGPACXX.1.AGTTC. Note that the order and dots are totally irrelevant. There is no standard here.

So in the same BAM file you can have reads from different lanes/adaptors/etc. but still compute things like quality score bias on each lane/adaptor individually. There's a bit more to it though: in the header of the BAM there will (hopefully) be a row for each of the RGIDs found in the BAM with some extra data there too, like PU/PL/SM, etc. (check the SAM spec), but that stuff is really fairly irrelevant. Sample (SM) is important for GATK, since it won't merge variants unless they come from the same sample, but that's it. If you're worried about space, it's good to choose a short RGID, since it's stored in ASCII for every single read.
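
Here is a quick Python sketch of that idea, working on plain SAM text lines. It builds an RGID in the dotted format described above and then counts reads per read group by looking at each read's `RG:Z:` tag. The function names and the three example reads are made up for illustration; the only things taken from the SAM format are the eleven mandatory columns and the `RG:Z:` optional tag.

```python
from collections import Counter


def make_rgid(sequencer, flowcell, lane, adapter=None):
    """Join the parts with dots; the layout is a personal convention, not a standard."""
    return ".".join([sequencer, flowcell, str(lane), adapter or "NA"])


def read_group_of(sam_line):
    """Return the value of the RG:Z: tag on one alignment line, or None if absent."""
    # Optional tags start after the 11 mandatory SAM columns.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("RG:Z:"):
            return field[len("RG:Z:"):]
    return None


def count_reads_per_group(sam_lines):
    """Count alignment lines per read group, skipping '@' header lines."""
    return Counter(
        read_group_of(line) for line in sam_lines if not line.startswith("@")
    )


rg_lane1 = make_rgid("HWI-ST552", "C5WGPACXX", 1, "AGTTC")
rg_lane2 = make_rgid("HWI-ST552", "C5WGPACXX", 2)

# Three made-up reads: two from lane 1, one from lane 2.
sam_lines = [
    f"@RG\tID:{rg_lane1}\tSM:sampleA",
    f"@RG\tID:{rg_lane2}\tSM:sampleA",
    f"r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tRG:Z:{rg_lane1}",
    f"r2\t0\tchr1\t200\t60\t5M\t*\t0\t0\tTTGCA\tIIIII\tRG:Z:{rg_lane1}",
    f"r3\t0\tchr1\t300\t60\t5M\t*\t0\t0\tGGGTA\tIIIII\tRG:Z:{rg_lane2}",
]

counts = count_reads_per_group(sam_lines)
print(counts)
```

The same per-read lookup is what would let you compute any statistic separately for each lane or adapter within one merged file.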

Reply modified 4.6 years ago • written 4.6 years ago by John