I have 60 samples (samp1...samp60). Each one was barcoded and then pooled (10 samples/pool, 6 pools).
Each pool was sequenced in 9 lanes.
This leads to 1080 fastq files (60 samples * 9 lanes * 2 for paired-end) and 540 bam files (60 samples * 9 lanes).
I want to do variant calling with GATK.
I went through these two very informative posts:
Accordingly, I am trying to define the read groups for each bam file, as follows.
- ID: flowcell ID and lane ID (e.g. HNTW5BBXX_1)
- SM: the name of the sample (e.g. samp31)
- PL: ILLUMINA
- LB: lib_samp31
- PI: insert size (e.g. 200)
- PU: flowcell ID, lane ID and sample ID (e.g. HNTW5BBXX_1_samp31)
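To make the scheme concrete, here is a minimal Python sketch of the @RG header line this would produce for one bam file. The function name `read_group_line` and its arguments are only illustrative (they are not part of GATK or any other tool), and the values are the examples from the list above.

```python
def read_group_line(flowcell: str, lane: int, sample: str, insert_size: int = 200) -> str:
    """Build a tab-separated @RG header line using the field scheme listed above.

    Hypothetical helper for illustration only; not a GATK or samtools function.
    """
    fields = {
        "ID": f"{flowcell}_{lane}",            # flowcell ID + lane ID
        "SM": sample,                          # sample name
        "PL": "ILLUMINA",                      # sequencing platform
        "LB": f"lib_{sample}",                 # library name
        "PI": str(insert_size),                # insert size
        "PU": f"{flowcell}_{lane}_{sample}",   # flowcell ID + lane ID + sample ID
    }
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields.items())


if __name__ == "__main__":
    print(read_group_line("HNTW5BBXX", 1, "samp31"))
```

The resulting string is what I would pass as the read-group header when aligning, or add to an existing bam file afterwards.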
I would like to clarify the following:
- Did I get something wrong in interpreting the fields?
- Could I exclude PU, since it is not required by GATK according to the link above? Do you usually include it anyway?
Thanks in advance!