Question

Merge BAM files for a lane with multiplexed samples.

0

Entering edit mode

8.9 years ago

Timothée Flutre ▴ 20

I have N individuals (say, ind01, ..., ind96), all sequenced on the same lane, say the third one the flowcell, and I followed the htslib workflow:

First, I demultiplexed the raw reads to get one fastq file per individual.
Then, I aligned them on the reference genome with bwa using option -R '@RG\tID:3\tSM:ind01' for the first individual, and so on. (I then used samtools fixmate, and samtools sort.)

Thus, I have one bam file per individual (N in total for the lane).

I now would like to do local indel realignment as well as base quality score recalibration with GATK. As the tools work per lane, I would like to merge all bam files obtained at the end of step 2 above. But I don't know how to handle the @RG tags.

According to the SAM specification, each @RG line must have a unique ID. As BaseRecalibrator is read-group-aware, the ID should be "3". But I will then loose the information of which reads come from which individual (the sample tag, SM). I read the documentation on "samtools merge", but it doesn't look like it can handle my case.

Any idea?

Of course, once this 3rd step is done, I will want to split again the bam file into one bam file per individual. Moreover, as the same individual may be sequenced on several lanes, I will want to merge the bam files per individual across lanes before doing the SNP and genotype calling.

multiplexing samtools baserecalibrator gatk • 4.4k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 8.9 years ago by Timothée Flutre ▴ 20

0

Entering edit mode

The per-lane stuff in the "GATK best practices" should be renamed "per-sample per-lane" steps. So there's no reason to merge different individuals from the same lane.

Do you have "ID:3" for all of your samples? That would be unusual.

ADD REPLY • link 8.9 years ago by Devon Ryan 104k

0

Entering edit mode

Should I conclude from your message that one cannot use GATK BaseRecalibrator with multiplexed data?

ADD REPLY • link 8.9 years ago by Timothée Flutre ▴ 20

0

Entering edit mode

No, that would in no way be a fair interpretation of what I wrote.

ADD REPLY • link 8.9 years ago by Devon Ryan 104k

0

Entering edit mode

OK, sorry, I misunderstood your point.

I understand that I shouldn't use the same ID for different individuals. I can easily fix that. But then I don't know how to use BaseRecalibrator on all individuals of the same lane, since all ID will be different.

My problem comes from the fact that GATK documentation says: "The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group".

What you advise is to use a single ID per individual in the lane. But what if a given individual has less than 100M bases? Moreover, I don't expect the "individual" effect to be very strong compare to the "lane" effect, "machine cycle" effect, "dinucleotide" effect, etc. That's why I initially used "ID=3" to remove confounders globally for the whole lane, but this obliges me to loose the information on which read comes from which individuals.

Hope I'm clearer.

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.9 years ago by Timothée Flutre ▴ 20

0

Entering edit mode

GATK tools are read group aware, so you can run the BaseRecalibrator on a merge lane-level BAM file just fine.

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.9 years ago by Sean Davis 26k

0

Entering edit mode

Thanks, it seems indeed that the only possibility is to fit the model in BaseRecalibrator per individual, even if all individuals are merged into a single bam file with one read group per individual.

ADD REPLY • link 8.9 years ago by Timothée Flutre ▴ 20

0

Entering edit mode

"ID" is actually supposed to be a unique identifier for the lane-library-flowcell combination. "LB" is for library, and "SM" is for sample. So, the IDs should be different between each of your 96 individuals.

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.9 years ago by Sean Davis 26k

0

Entering edit mode

In my case (individuals multiplexed on the same lane), would you put the same value LB = SM = ind01, etc, for each individual?

ADD REPLY • link 8.9 years ago by Timothée Flutre ▴ 20

0

Entering edit mode

SM = LB = ind01, SM = LB = ind02, SM=LB=ind03, etc. This is assuming that there is only one library per sample.

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.9 years ago by Sean Davis 26k