Question: a VERY BASIC question about "add or replace groups" for BAM files
CrazyB (United States) wrote, 5.7 years ago:

Q: What should I put in the read groups in my BAM files?

Yes, I've read about read groups on Biostars (e.g. Picard provides a very useful tool to add or replace them). But I think I've missed something fundamental, because I still don't understand what exactly they are, where I can find them, and, if I can't find them, what I should do so that downstream analyses can proceed. (I guess the answer is to add "some" read groups, but what exactly should I add?)

From what I've found, ID, SM, PL and LB seem to be the important read group fields (for GATK at least). If my BAM files don't have them and I have to add them, can I just assign dummy values to each? Okay, PL probably needs to be specific, like illumina, solid, or another platform, but does it matter whether I write it in lowercase or in all caps? What about the other fields?

For example, if I have only one BAM file, could I simply assign "A", "B", "illumina" and "D" to ID, SM, PL and LB respectively?

And if I have two BAM files, could I simply assign "A1, B1, illumina, D1" for file 1 and "A2, B2, illumina, D2" for file 2?

I found a post on the GATK forum saying that dummy info is okay, so would A, B, C, D as in the examples above be fine? And what exactly is the purpose of these read groups? If they are so essential, why aren't they incorporated by default in the early steps of NGS data processing (or even the first step, e.g. from the fastq)?

Any input on any of the issues in this question will be greatly appreciated. Thank you.


Devon Ryan (Freiburg, Germany) wrote, 5.7 years ago:

Yes, you can assign dummy names for any and all of these. The read group tags are meant to let tools group alignments so they can account for biases arising from things like library preparation, the machine a run was sequenced on, etc.

This is mostly useful when each sample was sequenced multiple times from different libraries: you'd then have alignments with the same SM but different LB values. If you just have a single run of each sample, with all samples done in a single batch, read groups aren't particularly useful.
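To make the "same SM, different LB" case concrete, here is a minimal Python sketch that writes the @RG header lines such a BAM would carry and then groups libraries by sample. The helper names (`rg_header_line`, `parse_rg_line`) and the values rg1, sampleA, lib1 and lib2 are made up for illustration. The sketch upper-cases PL because the SAM specification lists platform values in upper case; whether a particular tool also accepts lower case is not something this checks.

```python
from collections import defaultdict


def rg_header_line(rg_id, sample, platform, library):
    """Return one SAM @RG header line (tab-separated TAG:VALUE fields)."""
    fields = {"ID": rg_id, "SM": sample, "PL": platform.upper(), "LB": library}
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields.items())


def parse_rg_line(line):
    """Parse an @RG header line into a dict mapping tag to value."""
    parts = line.rstrip("\n").split("\t")
    if parts[0] != "@RG":
        raise ValueError("not an @RG line")
    return dict(part.split(":", 1) for part in parts[1:])


# Hypothetical case: one sample sequenced from two different libraries.
header = [
    rg_header_line("rg1", "sampleA", "illumina", "lib1"),
    rg_header_line("rg2", "sampleA", "illumina", "lib2"),
]

# Group libraries by sample, as a tool might when it wants to
# treat each library of the same sample separately.
libraries_by_sample = defaultdict(set)
for line in header:
    rg = parse_rg_line(line)
    libraries_by_sample[rg["SM"]].add(rg["LB"])

for sample, libraries in libraries_by_sample.items():
    print(sample, sorted(libraries))
```

With dummy values the structure is identical; the only thing lost is the ability to tell libraries or runs apart later.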


Thanks A LOT for the clarification!! From a non-techie's perspective, it's still a little odd that this info is NOT recorded in the fastq output. Shouldn't it be generated automatically when the machine does the sequencing? If so, couldn't it be extracted automatically and directly from the fastq output (or whatever the raw sequencing output format is)?

written 5.7 years ago by CrazyB

Oh, I apologize for not doing a more comprehensive search of the Biostars forum (I thought I had) before posting my question. Apparently a similar question was asked 2.8 years ago.

written 5.7 years ago by CrazyB

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 5.7 years ago:

Q: What should I put in the read groups in my BAM files?

* Read groups are used when calling variants: the callers use the read group's sample name (SM) to label the genotype column(s).

* Read groups can be used for QC: "how many reads for this lane/center/sample/etc.?"

* Read groups are used by Picard to remove optical duplicates.

* (...)
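
As an illustration of the QC use above ("how many reads for this sample?"), here is a rough Python sketch that counts alignment records per sample in SAM text by following each read's RG:Z: tag back to the SM field of the matching @RG header line. The function name `reads_per_sample` and the tiny `demo_sam` input are invented for this example; a real BAM would first need converting to SAM text (or reading with a BAM library), which is not shown.

```python
from collections import Counter


def reads_per_sample(sam_lines):
    """Count alignment records per sample using each read's RG:Z: tag."""
    sample_of = {}  # read group ID -> sample name, taken from @RG lines
    counts = Counter()
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("@"):
            # Header line: remember which sample each read group belongs to.
            if fields[0] == "@RG":
                tags = dict(f.split(":", 1) for f in fields[1:])
                sample_of[tags["ID"]] = tags.get("SM", "unknown")
            continue
        # Alignment line: optional tags start after the 11 mandatory columns.
        rg = next((f[5:] for f in fields[11:] if f.startswith("RG:Z:")), None)
        # Reads with a missing or undeclared read group are counted separately.
        counts[sample_of.get(rg, "no_read_group")] += 1
    return counts


# Made-up SAM text: two read groups for sampleA, one for sampleB,
# and one read that carries no read group at all.
demo_sam = [
    "@HD\tVN:1.6",
    "@RG\tID:rg1\tSM:sampleA\tLB:lib1",
    "@RG\tID:rg2\tSM:sampleA\tLB:lib2",
    "@RG\tID:rg3\tSM:sampleB\tLB:lib3",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:rg1",
    "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:rg2",
    "r3\t0\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:rg3",
    "r4\t0\tchr1\t400\t60\t4M\t*\t0\t0\tACGT\tIIII",
]

for sample, n in sorted(reads_per_sample(demo_sam).items()):
    print(sample, n)
```

The same lookup is what lets a caller label a genotype column by sample: the read only stores the read group ID, and the header maps that ID to SM.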
