Question: Merge BAM files for a lane with multiplexed samples.
0
gravatar for Timothée Flutre
4.3 years ago by
France
Timothée Flutre20 wrote:

I have N individuals (say, ind01, ..., ind96), all sequenced on the same lane, say the third one the flowcell, and I followed the htslib workflow:

  1. First, I demultiplexed the raw reads to get one fastq file per individual.
  2. Then, I aligned them on the reference genome with bwa using option -R '@RG\tID:3\tSM:ind01' for the first individual, and so on. (I then used samtools fixmate, and samtools sort.)

Thus, I have one bam file per individual (N in total for the lane).

I now would like to do local indel realignment as well as base quality score recalibration with GATK. As the tools work per lane, I would like to merge all bam files obtained at the end of step 2 above. But I don't know how to handle the @RG tags.

According to the SAM specification, each @RG line must have a unique ID. As BaseRecalibrator is read-group-aware, the ID should be "3". But I will then loose the information of which reads come from which individual (the sample tag, SM). I read the documentation on "samtools merge", but it doesn't look like it can handle my case.

Any idea?

Of course, once this 3rd step is done, I will want to split again the bam file into one bam file per individual. Moreover, as the same individual may be sequenced on several lanes, I will want to merge the bam files per individual across lanes before doing the SNP and genotype calling.

ADD COMMENTlink modified 4.1 years ago by Biostar ♦♦ 20 • written 4.3 years ago by Timothée Flutre20

The per-lane stuff in the "GATK best practices" should be renamed "per-sample per-lane" steps. So there's no reason to merge different individuals from the same lane.

Do you have "ID:3" for all of your samples? That would be unusual.

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by Devon Ryan91k

Should I conclude from your message that one cannot use GATK BaseRecalibrator with multiplexed data?

ADD REPLYlink written 4.3 years ago by Timothée Flutre20

No, that would in no way be a fair interpretation of what I wrote.

ADD REPLYlink written 4.3 years ago by Devon Ryan91k

Ok, sorry, I misunderstood your point.

I understant that I shouldn't use the same ID for different individuals. I can easily fix that. But then I don't know how to use BaseRecalibrator on all individuals of the same lane, since all ID will be different.

My problem comes from the fact that GATK documentation says: " The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group".

What you advise is to use a single ID per individual in the lane. But what if a given individual has less than 100M bases? Moreover, I don't expect the "individual" effect to be very strong compare to the "lane" effect, "machine cycle" effect, "dinucletide" effect, etc. That's why I initially used "ID=3" to remove confounders globally for the whole lane, but this obliges me to loose the information on which read comes from which individuals.

Hope I'm clearer.

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by Timothée Flutre20

GATK tools are read group aware, so you can run the BaseRecalibrator on a merge lane-level BAM file just fine.  

ADD REPLYlink written 4.3 years ago by Sean Davis25k

Thanks, it seems indeed that the only possibility is to fit the model in BaseRecalibrator per individual, even if all individuals are merged into a single bam file with one read group per individual.

ADD REPLYlink written 4.3 years ago by Timothée Flutre20

"ID" is actually supposed to be a unique identifier for the lane-library-flowcell combination.  "LB" is for library, and "SM" is for sample.  So, the IDs should be different between each of your 96 individuals.

ADD REPLYlink written 4.3 years ago by Sean Davis25k

In my case (individuals multiplexed on the same lane), would you put the same value LB = SM = ind01, etc, for each individual?

ADD REPLYlink written 4.3 years ago by Timothée Flutre20

SM = LB = ind01, SM = LB = ind02, SM=LB=ind03, etc.  This is assuming that there is only one library per sample.

ADD REPLYlink written 4.3 years ago by Sean Davis25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2627 users visited in the last hour