Q: What should I put in the read groups in my BAM files?
I've read about read groups on Biostars (e.g. that Picard provides a very useful tool to add or replace them), but I think I've missed something fundamental: I still don't understand what exactly they are, where I can find them, and, if I cannot find them, what I should do so that downstream analyses can proceed. (I guess the answer is to add "some" read groups, but what exactly should I add?)
From what I've found, ID, SM, PL and LB seem to be the important read group fields (for GATK at least). If my BAM files don't have them and I have to add them, can I just assign dummy values to each? PL probably needs to be specific, e.g. illumina, solid, or another platform, but does it matter whether I write it in lowercase or in all caps? And what about the other read group fields?
For example, if I have only one BAM file to add or replace read groups in, could I simply assign "A", "B", "illumina" and "D" to ID, SM, PL and LB respectively?
And if I have two BAM files, could I simply assign "A1, B1, illumina, D1" for file 1 and "A2, B2, illumina, D2" for file 2?
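To make the two-file case concrete, here is a minimal Python sketch of the `@RG` header lines I have in mind. The helper name `rg_header_line` is my own and not part of any tool, the values are just the dummy ones from above, and I'm assuming the SAM convention of tab-separated `TAG:value` pairs on an `@RG` line.

```python
def rg_header_line(rg_id, sample, platform, library):
    """Format one SAM @RG header line from the four fields in question.

    Hypothetical helper for illustration only; it does not touch a BAM file.
    """
    fields = [("ID", rg_id), ("SM", sample), ("PL", platform), ("LB", library)]
    # SAM header fields are tab-separated TAG:value pairs.
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# Dummy values, one read group per BAM file.
file1_rg = rg_header_line("A1", "B1", "illumina", "D1")
file2_rg = rg_header_line("A2", "B2", "illumina", "D2")

print(file1_rg)
print(file2_rg)
```

Is this roughly what each file's header should end up containing?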
I found a mention on the GATK forum that dummy info is okay, so would values like A, B, C, D in the examples above be fine? And what exactly is the purpose of these read groups? If they are so essential, why aren't they added by default in the early steps of NGS data processing (or even in the first step, e.g. from FASTQ)?
Any input on any of the issues in this question will be greatly appreciated. Thank you.