Question

Adding Multiple Read Group Headers to a single BAM file

0

Entering edit mode

8.6 years ago

koen.vdberge ▴ 30

I have a BAM file which consists of sequencing reads derived from pooling 25 individuals and sequencing the pool of DNA. Variant calling, or calculating minor allele frequencies, requires one to specify that the BAM file actually consists of 25 samples, rather than one. This is usually done by specifying read group headers in the BAM header and for each read. For samtools, it seems like specifying any read group to a read is fine, and we just need to make sure the BAM file contains 25 read groups in total.

If I am correct, this would not be the case if I would call variants using, for example, the GATK framework, as stated in their manual: "a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model"

However, the bwa -R command you mention only allows us to add one read group to all reads in a BAM file. Since I have 25 pooled samples within the same BAM, this option would not be possible for me. This is similar for other read group add commands such as the one from the freebayes software: https://github.com/ekg/bamaddrg or the AddOrReplaceReadGroups command from Picard tools.

Therefore, two questions:

(a) How can I add multiple read groups to a single BAM file (simultaneously?), allowing me to do correct variant calling?

(b) Does it really matter whether I specify this for the pileup file? It most likely does when eventually calling variants, but the pileup file does no such thing.

bam samtools • 3.0k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.6 years ago by koen.vdberge ▴ 30

1

Entering edit mode

how do you know which read nº1 should be assigned to the group 'g1' and read nº2 should be assigned to the group 'g2' and ... etc...?

ADD REPLY • link updated 20 months ago by Ram 43k • written 8.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

You basically don't. I think all that matters is specifying that it's a pooled sample, because this blows up the variability in assessing allele frequencies.

ADD REPLY • link 8.6 years ago by koen.vdberge ▴ 30

1

Entering edit mode

Yeah, that's not going to work. Each alignment itself needs to be associated to a particular read group, otherwise they'll just get ignored.

ADD REPLY • link updated 20 months ago by Ram 43k • written 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

What about just modifying the ploidy settings?

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you for your answer. Adjusting ploidy settings is not possible within samtools mpileup, but maybe with bcftools?

ADD REPLY • link updated 20 months ago by Ram 43k • written 8.6 years ago by koen.vdberge ▴ 30

0

Entering edit mode

Possibly, I know GATK has a ploidy option, don't know about bcftools.

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

Ram · Answer 1 · 2015-10-13

1

Entering edit mode

8.6 years ago

Chris Fields ★ 2.2k

AFAIK, you can't add multiple read groups to one BAM file using any tool I know of, for the reasons Devon and Pierre mention. The typical path is to align the samples independently (with read groups added at that step, such as using the 'bwa -R' option, or just after using one of the above tools). The independent alignments are then post-processed (sorted, realigned, etc) and merged prior to variant calling with the read groups in place already.

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.6 years ago by Chris Fields ★ 2.2k

0

Entering edit mode

Thank you for your answer. However, in your approach, there is no cost efficiency of pooling samples, if I am right. I only have one FASTA file for the entire pooled sample.

ADD REPLY • link 8.5 years ago by koen.vdberge ▴ 30