Question: BWA-MEM read groups usage
9 months ago by
magnuskerber10 wrote:

So, I'm working through the process of learning how to perform variant calling, doing some reading/tutorials, etc. I'm using the "BWA-MEM -> samtools mpileup" strategy to do so.

I figured that some people use BWA-MEM with the -R option, which adds read group information to the samples. But I did not get how useful that is in my case, because some people also don't apply the -R option. Basically, I'm analyzing WGS data from plant organisms. The samples are all siblings derived from a cross. It's paired-end data, and my sample nomenclature follows the pattern:


So, I already generated a VCF file containing all the identified SNPs. I ran BWA-MEM without the -R option; no warning was generated, nor did mpileup complain about anything. So, is it best for me to use the -R option, or would that not change much?

Thanks for the help.

alignment read groups bwa-mem
modified 9 months ago • written 9 months ago by magnuskerber10

It doesn't cost much to add a read-group and you might need it in the future...
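To make that concrete, here is a minimal sketch of what such a -R invocation can look like; the sample, flowcell, and lane values below are placeholders, not taken from this thread:

```shell
#!/bin/sh
# Placeholder identifiers - substitute your own sample/flowcell/lane values.
sample="plantA"
flowcell="000000000-A1B2C"
lane="1"

# Build the @RG line; the \t sequences must reach bwa literally,
# so keep the string quoted.
rg="@RG\tID:${flowcell}_${lane}\tSM:${sample}\tLB:${sample}_lib1\tPL:ILLUMINA"

# The actual alignment step would then be (commented out here):
# bwa mem -R "$rg" ref.fa ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz > ${sample}.sam
printf '%s\n' "$rg"
```

The key constraint is that ID must be unique per read group and SM identical for all reads from the same biological sample.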

written 9 months ago by Pierre Lindenbaum 133k

Hi guys, thank you all for the answers. So I figured it's better to add the read groups. I was reading some articles, including this one, but I didn't get how bwa-mem obtains the lane/well/run information for the samples and assigns the groups. I believe this information is encoded in the FASTQ files, is that correct?

modified 9 months ago • written 9 months ago by magnuskerber10

Some of this information is indeed in the FASTQ header for Illumina data. You can find an example here. Aligners don't pick that information up automatically, though. You will need to specify it on your command line.
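As an illustration, the colon-separated fields of an Illumina (CASAVA 1.8+ style) header can be pulled apart like this; the header value is made up:

```shell
#!/bin/sh
# Hypothetical Illumina 1.8+ header layout:
# @instrument:run:flowcell:lane:tile:x:y read:filter:control:index
header="@M00123:55:000000000-A1B2C:1:1101:15589:1332 1:N:0:ATCACG"

instrument=$(echo "$header" | cut -d: -f1 | tr -d '@')
flowcell=$(echo "$header" | cut -d: -f3)
lane=$(echo "$header" | cut -d: -f4)

echo "$instrument $flowcell $lane"
```

The instrument, flowcell, and lane fields are what typically go into the read group ID; bwa never reads them itself, you pass them in via -R.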

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

modified 9 months ago • written 9 months ago by GenoMax 94k

Hi @genomax, I added the read groups successfully. But something odd happened: my sample count dropped. It was supposed to be 295, but after adding the read groups it went down to 291. mpileup now starts with: "[mpileup] 291 samples in 295 input files".

written 9 months ago by magnuskerber10

My first thought would be that your read group IDs are not unique across samples (which is likely why the number of samples decreased), so several files are being recognised as the same sample.

Double check if the read groups you added are all unique.
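One quick way to check, sketched here on simulated @RG lines (with real data you would feed the output of samtools view -H through the same pipe; the sample names below are hypothetical):

```shell
#!/bin/sh
# Simulated @RG lines from several BAM headers; with real data, replace
# the printf with: for b in *.bam; do samtools view -H "$b" | grep '^@RG'; done
rg_lines=$(printf '@RG\tID:run1_1\tSM:plantA\n@RG\tID:run1_2\tSM:plantB\n@RG\tID:run1_3\tSM:plantA\n')

# Any line printed here is a sample name shared by more than one file -
# mpileup will merge those files into a single sample.
dups=$(echo "$rg_lines" | grep -o 'SM:[A-Za-z0-9]*' | sort | uniq -d)
echo "$dups"
```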

written 9 months ago by lieven.sterck 9.4k

Yes, I thought about that. I ran mpileup before on a smaller set of those samples and this shrinking of the sample number didn't happen. I'm using a bash script to extract the read group for those samples, which I got from this topic. The part where I read the .fastq file to get the read group is:

# First header line of the gzipped FASTQ
header=$(gunzip -c "${arr[$i]}" | head -n 1)

# ID = instrument_run_flowcell_lane
id=$(echo "$header" | cut -f 1-4 -d":" | sed 's/@//' | sed 's/:/_/g')

# SM suffix = the index/barcode at the end of the header
sm=$(echo "$header" | grep -Eo "[ATGCN]+$")

echo "Read Group: @RG\tID:$id\tSM:${id}_${sm}\tLB:${id}_${sm}\tPL:ILLUMINA"
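To see how that scheme can collide, here are two made-up headers from the same instrument, run, flowcell, and lane; the extracted ID (and, with identical index barcodes, the SM too) comes out the same for both:

```shell
#!/bin/sh
# Two hypothetical headers that differ only in tile/x/y coordinates:
h1="@M00123:55:000000000-A1B2C:1:1101:100:200 1:N:0:ATCACG"
h2="@M00123:55:000000000-A1B2C:1:2204:300:400 1:N:0:ATCACG"

# Same extraction as in the script above:
id1=$(echo "$h1" | cut -f 1-4 -d":" | sed 's/@//' | sed 's/:/_/g')
id2=$(echo "$h2" | cut -f 1-4 -d":" | sed 's/@//' | sed 's/:/_/g')

echo "$id1"
echo "$id2"
```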
modified 9 months ago by GenoMax 94k • written 9 months ago by magnuskerber10

Hmm, if the header you extract is not unique (I can't tell from what you posted), there is a real chance your read groups will also not be unique.

Can you add more info so we can have a more detailed look at what's going on?

written 9 months ago by lieven.sterck 9.4k

So, I did some digging in my raw reads and found 5 samples with the exact same read group name. Since those 5 samples count as 1, that explains the math of ending up with "4 fewer samples".

I know they came from different individuals because there is a table with the code names for each file/sample. So, what is the technical explanation for that? Why would my sequence provider do such a thing?

written 9 months ago by magnuskerber10

That, I'm afraid, we can't tell you ... the only one who can answer that is your sequence provider ;-)

Perhaps they ran some technical replicates because the initial yield was not high enough?

written 9 months ago by lieven.sterck 9.4k

Sequence providers generally pass sample names along as submitted. Is it possible that the same sample was run multiple times (for whatever reason)? If you know for sure that the samples came from different individuals, then you will want to account for that by changing the names accordingly.

written 9 months ago by GenoMax 94k
9 months ago by
United States
swbarnes2 9.4k wrote:

Many moons ago, I did a lot of variant calling with a simple mpileup-based pipeline. That software can take a list of separate BAM files and call the SNPs all together, so I didn't need read groups. However, when I later wanted to use different software, FreeBayes, I learned that it doesn't take a list of BAMs: it wants all the data combined into one BAM, with read groups distinguishing the samples. So I used Picard tools to add read groups to my BAMs, then combined them, and everything was fine.

So don't worry. If your pipeline works fine, it's fine. If someday you have to add read groups, that can be done on the BAM; you won't have to regenerate it.
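For reference, a sketch of that retrofit with Picard's AddOrReplaceReadGroups (the command is only printed here, not executed; file names and RG values are placeholders, and the exact invocation style varies by installation):

```shell
#!/bin/sh
sample="plantA"   # placeholder sample name

# Build the Picard command line; RGID must be unique per file,
# RGSM identifies the biological sample.
cmd="picard AddOrReplaceReadGroups I=${sample}.bam O=${sample}.rg.bam \
RGID=run1_${sample} RGSM=${sample} RGLB=${sample}_lib RGPL=ILLUMINA RGPU=unit1"

echo "$cmd"
# Afterwards, the per-sample BAMs can be combined, e.g.:
# samtools merge merged.bam *.rg.bam
```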

written 9 months ago by swbarnes2 9.4k

Freebayes at least now takes a list of BAMs.

My worst read group experience was having to match bam read groups and VCF read groups exactly with a program for allele-specific RNA-seq across many samples.

@magnuskerber, there is a program called addrg, in conda I believe (part of freebayes?), with which you can very simply add read groups. As Pierre says, it costs little and potentially adds value to your data (e.g. if doing multisample SNV calling with FreeBayes in the future).

written 9 months ago by colindaven 2.6k


Powered by Biostar version 2.3.0