How To Properly Fill In @RG (Read Groups) Information? Example?
9.2 years ago
Carlos Borroto ★ 2.0k

Hi,

I searched for a definitive way to fill in @RG information in a BAM file. This is what I have so far; the questions I still have are at the end.

Let's say I have two biological samples (SAMPLE01 and SAMPLE02). A library for each sample was built and sequenced twice, multiplexed, on a MiSeq instrument. I then decided I wanted an extra sequencing run, but after building a new set of libraries. This is what I think the proper @RG lines should be.

Run 1 using libraries 01:

@RG ID:SAMPLE01.R01 SM:SAMPLE01 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE01.L01
@RG ID:SAMPLE02.R01 SM:SAMPLE02 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE02.L01


Run 2 using libraries 01:

@RG ID:SAMPLE01.R02 SM:SAMPLE01 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE01.L01
@RG ID:SAMPLE02.R02 SM:SAMPLE02 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE02.L01


Run 3 using libraries 02:

@RG ID:SAMPLE01.R03 SM:SAMPLE01 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE01.L02
@RG ID:SAMPLE02.R03 SM:SAMPLE02 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE02.L02


My question is: did I get PU right? I'm using FLOWCELL_ID:LANE_NUMBER here. Should PU be unique among runs? Should I include the instrument run number to make it unique?

I guess I should also ask: can this be improved, or is this all that GATK needs?
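
To make the PU question concrete, here is a minimal Python sketch that parses @RG lines and lists which read group IDs share the same PU. The helper names (`parse_rg_line`, `ids_by_platform_unit`) are my own, not part of GATK or Picard, and I'm assuming no tag value contains whitespace. Real SAM headers are tab-separated; splitting on any whitespace handles both that and the space-separated form shown in this post.

```python
from collections import defaultdict


def parse_rg_line(line):
    """Parse one '@RG' header line into a dict of tag -> value."""
    fields = line.split()
    if not fields or fields[0] != "@RG":
        raise ValueError("not an @RG line: %r" % line)
    # Split each field on the first ':' only, because PU values contain ':'.
    return dict(field.split(":", 1) for field in fields[1:])


def ids_by_platform_unit(rg_lines):
    """Map each PU value to the sorted list of read group IDs that use it."""
    groups = defaultdict(list)
    for line in rg_lines:
        rg = parse_rg_line(line)
        groups[rg["PU"]].append(rg["ID"])
    return {pu: sorted(ids) for pu, ids in groups.items()}


# The SAMPLE01 lines from the three runs above.
header = [
    "@RG ID:SAMPLE01.R01 SM:SAMPLE01 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE01.L01",
    "@RG ID:SAMPLE01.R02 SM:SAMPLE01 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE01.L01",
    "@RG ID:SAMPLE01.R03 SM:SAMPLE01 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE01.L02",
]
shared = ids_by_platform_unit(header)
for pu, ids in shared.items():
    print(pu, ids)
```

With the lines as I wrote them, all three runs end up under one PU value, which is exactly the situation I'm asking about.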

Thanks, Carlos

gatk picard • 8.7k views

I think the RG is stored as a string for each read. So the shorter your ID is (ID:1, ID:2), the smaller your BAM will be.
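
As a rough back-of-the-envelope sketch of that cost: in the BAM encoding, a string tag such as RG:Z takes a 2-byte tag name, a 1-byte type code, the characters of the value, and a terminating NUL byte. The function names below are made up for illustration, and the numbers are before BGZF compression, which should shrink a value repeated on every read considerably.

```python
def rg_tag_bytes(rg_id):
    """Uncompressed bytes that one RG:Z tag adds to a BAM record.

    Layout: 2-byte tag name, 1-byte type code ('Z'), the ID characters,
    and a terminating NUL byte.
    """
    return 2 + 1 + len(rg_id.encode("ascii")) + 1


def extra_bytes(short_id, long_id, n_reads):
    """Uncompressed size difference between two ID choices over n_reads reads."""
    return (rg_tag_bytes(long_id) - rg_tag_bytes(short_id)) * n_reads


print(rg_tag_bytes("1"))                            # 5
print(rg_tag_bytes("SAMPLE01.R01"))                 # 16
print(extra_bytes("1", "SAMPLE01.R01", 1_000_000))  # 11000000
```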


That's true, but I'm working with what is going to be a very large data set. We already track samples with unique IDs, so coming up with a separate unique ID for RGs seems redundant. And with the size of our dataset, even if I number serially starting at 1, the IDs will get long at some point. I feel it is cleaner to have an easy translation from RG.ID to SAMPLE.ID.


Everything seems fine to me. Most downstream analysis tools only use information from the LB, SM, and RG ID tags, and in a very few cases PL too. I have never seen any tool use the PU tag, but it's always good to have all the tags filled in properly.


That's interesting, because in a video from the Broad (sorry, I don't have the link handy) explaining the importance of properly filling in RGs, they specifically mention modeling lane errors as an example. I would guess the way GATK can tell which reads come from the same lane is by using PU. This is also why I think PU might need to be unique among runs.

Am I wrong?


I think GATK uses read groups to determine whether reads come from the same lane. Reads belonging to the same lane should have the same RG ID.


That's not my understanding. I think each RG ID needs to be unique.


I'm sorry, I can't completely follow your thoughts. But one thing I'm sure you got wrong here is that there won't ever be two reads with the same read ID. See this link: http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers

The read IDs include the instrument's run number, a unique instrument identifier, the lane number, and position information within the lane. This combination makes read IDs unique even if you merge FASTQ files from different runs or even different instruments.

Also, RG IDs only identify different combinations of RG tags. Every time any of the tags changes, you should use a different RG ID. So if you run two samples in the same lane, you have two SM tags and therefore need two RG IDs. This also means you now have two RG IDs for two groups of reads run in the same lane, so there is no way RG IDs can be used to identify reads coming from the same lane.
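
Here is a small Python sketch of what I mean by "one RG ID per combination of tags". The function names and the serial `RG1`, `RG2` numbering are just a hypothetical scheme for illustration, not anything GATK or Picard prescribes.

```python
def assign_read_groups(units):
    """Give every distinct (sample, library, platform unit) tuple its own ID.

    IDs are numbered serially in first-seen order (a hypothetical scheme).
    """
    ids = {}
    for unit in units:
        if unit not in ids:
            ids[unit] = "RG%d" % (len(ids) + 1)
    return ids


def format_rg(rg_id, sample, library, platform_unit, platform="ILLUMINA"):
    """Render one tab-separated @RG header line."""
    return "\t".join([
        "@RG", "ID:" + rg_id, "SM:" + sample, "PL:" + platform,
        "PU:" + platform_unit, "LB:" + library,
    ])


# Two samples multiplexed in the same lane: same PU, different SM and LB.
units = [
    ("SAMPLE01", "SAMPLE01.L01", "000000000-A6NRF:1"),
    ("SAMPLE02", "SAMPLE02.L01", "000000000-A6NRF:1"),
]
ids = assign_read_groups(units)
lines = [format_rg(ids[unit], *unit) for unit in units]
for line in lines:
    print(line)
```

The two samples get two different IDs even though they share a lane, so the ID alone cannot say "same lane"; only PU still carries that information.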

I hope I explained myself better. I'll keep looking around. Maybe I should ask in the GATK forum.

Thanks for helping me understand this issue. Carlos


After reading all the comments in this thread, I'm not sure whether any downstream tools use LB. Which ones?

Second, if we run a sample in 2 lanes, should it have the same ID and a different LB? And again, which downstream tools would use that info?


This is what I understand so far. LB is used to mark duplicates. Each sample should be labeled with SM and belong to a different @RG ID. I have confirmation that @RG ID is used in BQSR.

If you run a sample in 2 lanes, you should have 2 @RG lines. They should both have the same LB if they come from the same library preparation, which is almost always the case. It would look like this:

@RG ID:1.1 SM:1 LB:1
@RG ID:1.2 SM:1 LB:1


The interesting part is they keep saying '@RG ID' should be set to something like FLOWCELL.ID, to make it unique among all sequencing in the world and to mark all reads coming out of the same lane for the BQSR error model. The problem is, you cannot do this if you are multiplexing several samples in one lane. I can't imagine why the GATK team overlooked that case.

9.2 years ago
Carlos Borroto ★ 2.0k

I got my question answered by Appistry. They provided commercial support for GATK in close collaboration with the Broad.

Appistry confirmed GATK does use @RG ID to tell which reads come from the same lane. @ashutoshmits was correct in his comment above. However, this does mean there is no way to mark reads from several samples in a multiplexed run as coming from the same lane. Appistry support mentioned you still get better results from the error models and the different covariates that BQSR uses by running GATK with as much data as possible from the same lane, even if there is no way to tell this is the case from the @RG tags.

I'm still surprised GATK didn't use PU instead. I think that would be the perfect tag to avoid this situation. Picard already made PU required in their AddOrReplaceReadGroups tool. GATK however does not require PU.
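
For reference, here is a minimal Python sketch that only builds (and does not run) an AddOrReplaceReadGroups command with PU filled in. It uses Picard's classic KEY=value argument style; newer Picard releases also accept a dash-prefixed style, so check the documentation for the version you have. The function name, the file names, and the `picard.jar` path are placeholders I made up.

```python
def add_or_replace_rg_command(in_bam, out_bam, rg_id, sample, library,
                              platform_unit, platform="ILLUMINA",
                              picard_jar="picard.jar"):
    """Build a Picard AddOrReplaceReadGroups command as an argument list."""
    return [
        "java", "-jar", picard_jar, "AddOrReplaceReadGroups",
        "I=" + in_bam,
        "O=" + out_bam,
        "RGID=" + rg_id,
        "RGSM=" + sample,
        "RGLB=" + library,
        "RGPL=" + platform,
        "RGPU=" + platform_unit,
    ]


cmd = add_or_replace_rg_command(
    "input.bam", "output.bam",          # hypothetical file names
    rg_id="SAMPLE01.R01",
    sample="SAMPLE01",
    library="SAMPLE01.L01",
    platform_unit="000000000-A6NRF:1",
)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.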

--Carlos

9.2 years ago
Mitch Bekritsky ★ 1.3k

According to the SAM format specification:

PU: Platform unit (e.g., flowcell-barcode.lane for Illumina or slide for SOLiD). Unique identifier.

I see you linked the SAM specification in your question. Have you gone over it yet?


Well, I didn't link the spec. It seems biostar.org did that for me.

But yes, I did look into the spec, and that was the reason I used the flowcell:lane syntax. Do you know if the flowcell ID I see in the FASTQ read header is unique for the instrument or for the run?

This is the kind of read ID I'm seeing: @M00941:81:000000000-A5NM7:1:1101:14552:1574

Here the flowcell ID is "000000000-A5NM7" and the lane number is "1".
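
In case it helps, this is roughly how I would pull a FLOWCELL:LANE value out of such a read name in Python. It is only a sketch: the function name is mine, and it assumes the seven colon-separated fields instrument:run:flowcell:lane:tile:x:y seen in the example, which may not hold for reads from other pipelines.

```python
def platform_unit_from_read_name(read_name):
    """Derive 'FLOWCELL:LANE' from an Illumina-style read name.

    Assumes seven colon-separated fields:
    instrument:run:flowcell:lane:tile:x:y
    """
    # Drop the leading '@' and anything after the first whitespace.
    name = read_name.lstrip("@").split()[0]
    fields = name.split(":")
    if len(fields) != 7:
        raise ValueError("unexpected read name layout: %r" % read_name)
    flowcell, lane = fields[2], fields[3]
    return "%s:%s" % (flowcell, lane)


pu = platform_unit_from_read_name("@M00941:81:000000000-A5NM7:1:1101:14552:1574")
print(pu)  # 000000000-A5NM7:1
```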


Ah...sneaky! I'm pretty sure the read header is unique to the run. For instance, reads coming from the sequencing facility on my campus have headers looking something like (instrument name):(some number):(flowcell name):(lane):(other numbers). Do you have any file or LIMS that associates each sequencing file with a particular flowcell? That may be the easiest way to generate a PU tag.


We are hiring a lab to do the sequencing for us, and my role is to make sure they provide BAM files with the proper information. It's not clear to me how they are going to do it. I would guess they do have a LIMS; they are a pretty big lab.


I would imagine the flowcell ID is unique as well. They are in the sequencing facility on my campus, but I would check with the sequencing lab you'll be working with.