I searched for the definitive way to fill in @RG information in a BAM file. This is what I have so far. Please see the questions I still have at the end.
Let's say I have two biological samples (SAMPLE01 and SAMPLE02). A library for each sample was built and sequenced twice using multiplexing on a MiSeq instrument. I then decided I wanted an extra sequencing run, but after building a new set of libraries. This is what I think should be the proper way to fill in @RG.
Run 1 using libraries 01:
@RG ID:SAMPLE01.R01 SM:SAMPLE01 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE01.L01
@RG ID:SAMPLE02.R01 SM:SAMPLE02 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE02.L01
Run 2 using libraries 01:
@RG ID:SAMPLE01.R02 SM:SAMPLE01 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE01.L01
@RG ID:SAMPLE02.R02 SM:SAMPLE02 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE02.L01
Run 3 using libraries 02:
@RG ID:SAMPLE01.R03 SM:SAMPLE01 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE01.L02
@RG ID:SAMPLE02.R03 SM:SAMPLE02 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE02.L02
My question is: did I get PU right? I'm using FLOWCELL_ID:LANE_NUMBER here. Should PU be unique among runs? Should I include the instrument run number to make it unique?
I guess I should also ask: can this be improved, or is this all that GATK needs?
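For reference, here is a minimal sketch of how the header lines above could be generated. The helper name `format_rg` and the ID/LB naming scheme are just the conventions used in this post, not anything GATK requires; the only thing taken from the SAM format itself is that header fields are tab-separated. The flowcell ID is reused for all three runs exactly as in the example above, which is part of what the question is about.

```python
def format_rg(sample, run, library, flowcell, lane, platform="ILLUMINA"):
    """Return one tab-separated @RG header line (hypothetical helper)."""
    fields = [
        "@RG",
        f"ID:{sample}.{run}",          # one ID per sample/run combination
        f"SM:{sample}",                # biological sample
        f"PL:{platform}",              # sequencing platform
        f"PU:{flowcell}:{lane}",       # FLOWCELL_ID:LANE_NUMBER, as in the post
        f"LB:{sample}.{library}",      # library preparation
    ]
    return "\t".join(fields)


# (run, library) pairs from the example: runs 1 and 2 share libraries 01.
runs = [("R01", "L01"), ("R02", "L01"), ("R03", "L02")]
samples = ["SAMPLE01", "SAMPLE02"]

header_lines = [
    format_rg(sample, run, library, "000000000-A6NRF", 1)
    for run, library in runs
    for sample in samples
]

for line in header_lines:
    print(line)
```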
I think the RG is stored as a string for each read. So the shorter your ID is (ID:1, ID:2), the smaller your BAM will be.
That's true, but I'm working with what is going to be a very large data set. We already track samples with unique IDs, so coming up with a separate unique ID for RGs seems redundant. And with the size of our dataset, even if I number serially starting at 1, the IDs will get long at some point. I feel it is cleaner to have an easy translation from RG.ID to SAMPLE.ID.
Everything seems fine to me. Most downstream analysis tools only use information from the LB, SM and RG tags, and in a very few cases PL too. I have never seen any tool use information from the PU tag, but it's always good to have all the tags listed properly.
That's interesting, because in a video from the Broad (sorry, I don't have the link handy) explaining the importance of properly filling in RGs, they specifically mention modeling lane errors as an example. I would guess that GATK uses PU to tell which reads come from the same lane. This is also why I think PU might need to be unique among runs.
Am I wrong?
I think GATK uses read groups to determine whether reads come from the same lane or not. Reads belonging to the same lane should have the same RG ID.
That's not my understanding. I think each RG ID needs to be unique.
Yes, RG IDs should be unique. Sequencers can sequence a sample using different lanes. Let's assume you ran a library in lane 1, then reran the same library in the same lane or some other lane. Read numbers (read header numbers) are assigned from a finite set. It may happen that some reads from the first run have the same read IDs as reads from the second run. Similarly, runs in different lanes can produce reads with the same read IDs. If you merge these two FASTQ files and align them, you may get an error, and if you merge the individual BAM files you may get an error saying the read ID already exists. So you need to provide unique RG IDs for these two runs, or for reads belonging to different lanes. That was the primary purpose of RG IDs, but GATK now also uses them for BQSR.
I'm sorry, I can't completely follow your thoughts. But one thing I'm sure you got wrong is that there won't ever be two reads with the same read ID. See this link: http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers
The read IDs include the instrument's run number, a unique instrument identifier, the lane number and the position within the lane. This combination makes read IDs unique even if you merge FASTQ files from different runs or even different instruments.
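To make that concrete, here is a small sketch that splits an Illumina read identifier into its parts and derives a FLOWCELL_ID:LANE_NUMBER string from it. It assumes the Casava 1.8+ layout described on the linked Wikipedia page (instrument:run:flowcell:lane:tile:x:y); older identifier styles would need different parsing. The function names are made up for illustration.

```python
def parse_illumina_id(header):
    """Split a Casava 1.8+ style read header into named fields.

    Assumes the layout instrument:run:flowcell:lane:tile:x:y, optionally
    followed by a space and the read/filter/index description.
    """
    name = header.lstrip("@").split()[0]  # drop '@' and the description part
    instrument, run_number, flowcell, lane, tile, x_pos, y_pos = name.split(":")
    return {
        "instrument": instrument,
        "run_number": int(run_number),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x_pos),
        "y": int(y_pos),
    }


def platform_unit(header):
    """Build a FLOWCELL_ID:LANE_NUMBER string from a read header."""
    fields = parse_illumina_id(header)
    return f"{fields['flowcell']}:{fields['lane']}"


# Example identifier taken from the Wikipedia page linked above.
example = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
print(parse_illumina_id(example))
print(platform_unit(example))
```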
Also, RG IDs only exist to identify different combinations of RG tags. Every time any of the tags changes, you should use a different RG ID. So if you run two samples in the same lane, you have two SM tags and therefore need two RG IDs. This also means you now have two RG IDs for two groups run in the same lane, so there is no way RG IDs alone can be used to identify reads coming from the same lane.
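Here is a toy illustration of that argument. The read groups and flowcell names below are invented, and this says nothing about what GATK does internally; it only shows that with multiplexing the lane can be recovered by grouping on PU, not on ID.

```python
from collections import defaultdict

# Hypothetical read groups: two samples multiplexed in one lane,
# plus a second run of SAMPLE01 on a different flowcell.
read_groups = [
    {"ID": "SAMPLE01.R01", "SM": "SAMPLE01", "PU": "FLOWCELL_A:1"},
    {"ID": "SAMPLE02.R01", "SM": "SAMPLE02", "PU": "FLOWCELL_A:1"},
    {"ID": "SAMPLE01.R02", "SM": "SAMPLE01", "PU": "FLOWCELL_B:1"},
]


def group_ids_by(groups, tag):
    """Map each value of `tag` to the list of RG IDs that carry it."""
    result = defaultdict(list)
    for rg in groups:
        result[rg[tag]].append(rg["ID"])
    return dict(result)


by_lane = group_ids_by(read_groups, "PU")
by_sample = group_ids_by(read_groups, "SM")

# Two different RG IDs share the same lane, so ID alone cannot mean "lane".
print(by_lane)
print(by_sample)
```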
I hope I explained myself better. I'll keep looking around. Maybe I should ask in the GATK forum.
Thanks for helping me understand this issue. Carlos
After reading all the comments in this thread, I'm not sure whether any downstream tools use LB. Which ones?
Second, if we run a sample in 2 lanes, should it have the same ID and a different LB? And again, which downstream tools would use that info?
This is what I understand so far. LB is used to mark duplicates. Each sample should be labeled with SM and belong to a different @RG ID. I have confirmation that @RG ID is used in BQSR.
If you run a sample in 2 lanes, you should have 2 @RG lines. They should both have the same LB if they come from the same library preparation, which is almost always the case. It would look like this.
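As a hypothetical illustration (the flowcell name, lane numbers and ID scheme are all made up here), one sample sequenced in two lanes from a single library could be described with two @RG lines that share SM and LB but differ in ID and PU:

```python
def rg_line(rg_id, sample, library, flowcell, lane):
    """Return one tab-separated @RG header line (hypothetical helper)."""
    return "\t".join([
        "@RG",
        f"ID:{rg_id}",
        f"SM:{sample}",
        "PL:ILLUMINA",
        f"PU:{flowcell}:{lane}",
        f"LB:{library}",
    ])


FLOWCELL = "FLOWCELL_X"  # placeholder, not a real flowcell ID

# Same sample, same library, two lanes: two read groups.
two_lanes = [
    rg_line(f"SAMPLE01.{FLOWCELL}.{lane}", "SAMPLE01", "SAMPLE01.L01", FLOWCELL, lane)
    for lane in (1, 2)
]

for line in two_lanes:
    print(line)
```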
I found point 9 on this page to be very helpful: http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files
The interesting part is that they keep saying '@RG ID' should be set to something like FLOWCELL.ID, both to make it unique among all sequencing in the world and to mark all reads coming out of the same lane for the BQSR error model. The problem is that you cannot do this if you are multiplexing several samples in one lane. I can't imagine why the GATK team overlooked that case.