How to set RG headers for TableRecalibration?
11.9 years ago
PeterPan ▴ 30

Hi everyone, I have recently started using GATK, and I use Picard's AddOrReplaceReadGroups to add the RG header. I have read http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups, but I still don't understand what the RG header fields are used for.

For example, I pooled several samples for sequencing. I have a BAM file sequenced from sample "NA006". This sample belongs to library "TOTAL", was sequenced in lane "NO1" with barcode "ATCG", and the sequencing platform is "Illumina".

How should this information be entered in the RG header fields such as RGID, RGLB, RGPU and RGSM?

I think this information is useful in the TableRecalibration step, because batch effects exist.

Thanks!

picard gatk • 3.8k views
11.9 years ago
Vikas Bansal ★ 2.4k

A very good example is given at Galaxy.

Example of Read Group usage

Suppose we have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, we would create 12 BAM files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:illumina     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:illumina     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:illumina     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:illumina     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:illumina     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:illumina     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:illumina     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:illumina     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:illumina     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:illumina     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:illumina     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:illumina     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship from read groups (unique for each lane) to libraries (each sequenced on two lanes) to samples (each spread across four lanes, two lanes per library).
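To make that hierarchy concrete, here is a minimal Python sketch that groups @RG lines by sample and then by library. It only parses plain header text and is not part of Picard or GATK. The function names are made up for illustration, and it assumes the fields are tab-separated, as they are in a real SAM header. Only Dad's four lines are included to keep it short.

```python
from collections import defaultdict

# Dad's @RG lines from the example above, written with explicit tabs.
HEADER = "\n".join([
    "@RG\tID:FLOWCELL1.LANE1\tPL:illumina\tLB:LIB-DAD-1\tSM:DAD\tPI:200",
    "@RG\tID:FLOWCELL1.LANE2\tPL:illumina\tLB:LIB-DAD-1\tSM:DAD\tPI:200",
    "@RG\tID:FLOWCELL1.LANE3\tPL:illumina\tLB:LIB-DAD-2\tSM:DAD\tPI:400",
    "@RG\tID:FLOWCELL1.LANE4\tPL:illumina\tLB:LIB-DAD-2\tSM:DAD\tPI:400",
])


def parse_rg_line(line):
    """Split one @RG header line into a {tag: value} dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG line: " + line)
    return dict(field.split(":", 1) for field in fields[1:])


def build_hierarchy(header_text):
    """Return {sample: {library: [read group IDs]}} (hypothetical helper)."""
    tree = defaultdict(lambda: defaultdict(list))
    for line in header_text.splitlines():
        if line.startswith("@RG\t"):
            rg = parse_rg_line(line)
            tree[rg["SM"]][rg["LB"]].append(rg["ID"])
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {sample: dict(libs) for sample, libs in tree.items()}


hierarchy = build_hierarchy(HEADER)
for sample, libs in hierarchy.items():
    for lib, ids in libs.items():
        print(sample, lib, ids)
```

Each sample maps to its libraries, and each library maps to the read groups (lanes) it was sequenced on.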

So for your example, I guess it should be:

@RG     ID:FLOWCELL1.LANE1.NA006      PL:illumina     LB:Total   SM:NA006      PU:ATCG

I chose the RG ID arbitrarily. You can pick your own, but it must be unique.
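If it helps, here is a rough Python sketch of how the tags in that @RG line correspond to the AddOrReplaceReadGroups options from the question (RGID, RGLB, RGPU, RGSM, plus RGPL for the platform). The helper names are hypothetical and the values are simply the ones from the line above. The sketch only builds strings and does not run Picard.

```python
# Read-group tags for the example sample, copied from the @RG line above.
RG = {
    "ID": "FLOWCELL1.LANE1.NA006",  # arbitrary, but must be unique
    "PL": "illumina",
    "LB": "Total",
    "SM": "NA006",
    "PU": "ATCG",
}


def rg_header_line(rg):
    """Format the tags as a tab-separated @RG header line (hypothetical helper)."""
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in rg.items())


def picard_options(rg):
    """Map each SAM tag XX to the Picard-style option RGXX=value (hypothetical helper)."""
    return [f"RG{tag}={value}" for tag, value in rg.items()]


header_line = rg_header_line(RG)
options = picard_options(RG)
print(header_line)
print(" ".join(options))
```

The printed options would go on the AddOrReplaceReadGroups command line together with your own input and output BAM paths.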


Thanks very much, Vikas.
