Question: How to set the RG header for TableRecalibration?
6.8 years ago by
PeterPan30
ShangHai
PeterPan30 wrote:

Hi everyone, I have recently started using GATK, and I use Picard's AddOrReplaceReadGroups to add the RG header. I have read http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups, but I still don't understand what the individual RG header fields are used for.

For example, I pooled several samples for sequencing. I have a BAM file from sample "NA006". This sample belongs to library "TOTAL", it was sequenced in lane "NO1", its barcode is "ATCG", and the sequencing platform is "Illumina".

How should this information be mapped onto RG header fields such as RGID, RGLB, RGPU and RGSM?

I think this information is useful in the TableRecalibration step, because batch effects exist.

Thanks!

gatk picard • 2.5k views
6.8 years ago by
Vikas Bansal2.3k
Berlin, Germany
Vikas Bansal2.3k wrote:

A very good example is given at Galaxy.

Example of Read Group usage

Suppose we have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, we would create 12 BAM files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:illumina     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:illumina     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:illumina     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:illumina     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:illumina     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:illumina     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:illumina     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:illumina     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:illumina     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:illumina     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:illumina     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:illumina     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).
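
To make that hierarchy concrete, here is a minimal Python sketch that rebuilds the twelve @RG lines from the example and groups the read group IDs by sample and library. The helper name parse_rg and the way the lines are regenerated are only illustrative assumptions; this is not how GATK or Picard read the header internally.

```python
from collections import defaultdict

def parse_rg(line):
    """Split one @RG header line into a {tag: value} dict (hypothetical helper)."""
    fields = line.split()
    if fields[0] != "@RG":
        raise ValueError("not an @RG line: " + line)
    return dict(field.split(":", 1) for field in fields[1:])

# Rebuild the twelve header lines of the trio example:
# (sample, flowcell number, first lane used by that sample)
header = []
for sample, flowcell, first_lane in [("DAD", 1, 1), ("MOM", 1, 5), ("KID", 2, 1)]:
    for i in range(4):
        lib = 1 if i < 2 else 2            # two lanes per library
        insert = 200 if lib == 1 else 400  # PI values as listed in the example
        header.append(
            f"@RG\tID:FLOWCELL{flowcell}.LANE{first_lane + i}\tPL:illumina\t"
            f"LB:LIB-{sample}-{lib}\tSM:{sample}\tPI:{insert}"
        )

read_groups = [parse_rg(line) for line in header]

# sample -> library -> list of read group IDs
tree = defaultdict(lambda: defaultdict(list))
for rg in read_groups:
    tree[rg["SM"]][rg["LB"]].append(rg["ID"])

for sample, libs in tree.items():
    for lib, ids in libs.items():
        print(sample, lib, ids)
```

Each sample ends up with two libraries, and each library with two read groups, one per lane.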

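So for your example, I would guess it should be:

@RG     ID:FLOWCELL1.LANE1.NA006      PL:illumina     LB:TOTAL   SM:NA006      PU:ATCG

I chose the RG ID arbitrarily. You can pick your own, but it must be unique among the read groups in the file. Strictly speaking, PU is meant to identify the platform unit (for Illumina, typically the flowcell and lane, optionally with the sample barcode), so a value such as NO1.ATCG would describe the run more fully than the barcode alone.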
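
As a rough sketch of how these values map onto Picard's RGID, RGLB, RGPL, RGPU and RGSM options, the Python below formats the @RG line and builds the matching KEY=VALUE arguments for AddOrReplaceReadGroups. The function names and the file names NA006.bam and NA006.rg.bam are hypothetical, the library name is taken as written in the question, and the arguments are only assembled, not executed, because how the tool is launched depends on your Picard version.

```python
def rg_header_line(rgid, sample, library, platform, platform_unit):
    """Format one @RG header line as tab-separated TAG:VALUE fields."""
    tags = [("ID", rgid), ("PL", platform), ("LB", library),
            ("SM", sample), ("PU", platform_unit)]
    return "\t".join(["@RG"] + [f"{tag}:{value}" for tag, value in tags])

def picard_rg_args(in_bam, out_bam, rgid, sample, library, platform, platform_unit):
    """Build KEY=VALUE arguments for Picard's AddOrReplaceReadGroups."""
    return [
        f"INPUT={in_bam}",
        f"OUTPUT={out_bam}",
        f"RGID={rgid}",           # read group ID, must be unique
        f"RGLB={library}",        # library
        f"RGPL={platform}",       # sequencing platform
        f"RGPU={platform_unit}",  # platform unit
        f"RGSM={sample}",         # sample name
    ]

# Values from the question and the answer above; file names are placeholders.
values = dict(rgid="FLOWCELL1.LANE1.NA006", sample="NA006", library="TOTAL",
              platform="illumina", platform_unit="ATCG")

line = rg_header_line(**values)
args = picard_rg_args("NA006.bam", "NA006.rg.bam", **values)

print(line)
print(" ".join(args))
```

The printed arguments can be appended to whatever command you normally use to run AddOrReplaceReadGroups.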


Thanks very much, Vikas.

written 6.8 years ago by PeterPan30