OK, so I have some Illumina whole-genome BAMs. They were aligned by Illumina using CASAVA, but we wanted to re-run the alignments ourselves using bwa, so I used bam2fastq to extract the paired-end sequences. These were then successfully aligned using bwa mem to produce a SAM file, which was converted to a new sorted and indexed BAM file.
So far so good.
I want to use tools from the GATK and will need to insert read group data in order to do so. Each BAM I have represents a single sample from a single library prep, but each was run on multiple lanes (typically 3), as indicated by the fastq files and by the QNAME field in the SAM file.
For example: HS2000-1259_127:3:1210:15640:52255
With, I believe, 3 being the flowcell lane.
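To make that concrete, here is a minimal Python sketch of how the lane could be pulled out of a read name. It assumes the QNAME is colon-separated with the lane as the second field, which is my reading of the example above rather than something I have confirmed; `lane_from_qname` is just an illustrative name.

```python
def lane_from_qname(qname: str) -> str:
    """Return the second colon-separated field of a read name.

    Assumption: the read name looks like
    instrument_run:lane:tile:x:y, so field 2 is the flowcell lane.
    """
    fields = qname.split(":")
    if len(fields) < 5:
        raise ValueError(f"unexpected read name layout: {qname!r}")
    return fields[1]


print(lane_from_qname("HS2000-1259_127:3:1210:15640:52255"))  # prints 3
```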
As the sample is the same and it comes from the same library prep, can I ignore the fact that the reads were run on different lanes, or is it necessary to tag each read individually with a read group according to its flowcell lane?
If it is necessary to tag each read separately, am I correct in thinking that Picard's AddOrReplaceReadGroups is not capable of doing this (or at least not without splitting the BAM up first, running Picard on each piece, then re-merging)? And if it isn't, has someone already written something to carry out this task? (I'm sure I could probably whip something up, but there's no point reinventing the wheel!)
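For reference, the kind of thing I would otherwise whip up is sketched below in plain Python working on SAM text. It is only a rough sketch under several assumptions: the lane is the second colon-separated field of the QNAME, the set of lanes is known up front, and the function name, the read group ID scheme (`sample.Llane`), and the sample and library names are all placeholders I made up. It writes only the ID, SM, LB and PL fields; a PU value would need the real flowcell identifier, which I have left out rather than guess.

```python
def add_lane_read_groups(lines, lanes, sample, library, platform="ILLUMINA"):
    """Yield SAM lines with one @RG header per lane and an RG:Z tag per read.

    Assumption: QNAME is colon-separated and its second field is the lane.
    """
    header_done = False
    for line in lines:
        line = line.rstrip("\n")
        if not header_done and line.startswith("@"):
            yield line  # pass existing header lines through unchanged
            continue
        if not header_done:
            # First alignment record reached: emit one @RG line per lane.
            for lane in lanes:
                yield "\t".join([
                    "@RG",
                    f"ID:{sample}.L{lane}",
                    f"SM:{sample}",
                    f"LB:{library}",
                    f"PL:{platform}",
                ])
            header_done = True
        fields = line.split("\t")
        lane = fields[0].split(":")[1]
        # Keep the 11 mandatory columns; drop any RG tag already present.
        fields = [f for i, f in enumerate(fields)
                  if i < 11 or not f.startswith("RG:Z:")]
        fields.append(f"RG:Z:{sample}.L{lane}")
        yield "\t".join(fields)


# Tiny in-memory demo with a made-up record (placeholder sample/library names).
demo = [
    "@HD\tVN:1.4\tSO:coordinate",
    "HS2000-1259_127:3:1210:15640:52255\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII",
]
for out in add_lane_read_groups(demo, lanes=["3"], sample="sample1", library="lib1"):
    print(out)
```

In practice the input lines would presumably come from something like `samtools view -h` and the output would be piped back into `samtools view -b`, but I have not tested this on a full genome.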
Thanks in advance.