BWA-MEM read groups usage
1
1
Entering edit mode
4.1 years ago
magnuskerber ▴ 20

So, I'm through the process of learning how to perform variant calling and doing some reading/tutorials, etc. I'm using the "BWA-MEM -> Samtools mpileup" strategy to do so.

I figured that some people use bwa-mem with the -R option, which will add read group information to samples. But I did not get how useful that is in my case because some people also don't apply the -R option. Basically I´m analyzing plant organisms WGS data. The samples are all siblings derived from a crossing. It`s paired end data and my sample nomenclature follows the pattern:

SampleID_plateID_wellID_R1.fastq.gz

So, I arleady generated a VCF file containing all the identified SNPs. I ran BWA-MEM without the -R option. No warning was generated nor mpileup complained about anything. So, it's best for me to use the -R option or that wouldn't change much?

Thanks for the help.

BWA-MEM alignment read groups • 4.9k views
ADD COMMENT
4
Entering edit mode

It doesn't cost much to add a read-group and you might need it in the future...

ADD REPLY
0
Entering edit mode

Hi guys, thank you all for the answers. So I figured its better to add the read groups. I was reading some articles, including this one https://gatk.broadinstitute.org/hc/en-us/articles/360035890671?id=6472 but I didn`t get how does bwa-mem gets the information of lane/well/run of the samples and assign the groups. I believe this information is encoded in the FASTQ files, is that correct?

ADD REPLY
1
Entering edit mode

Some of this information is indeed in the Fastq hear for Illumina data. You can find an example here. Aligners don't get that information automatically though. You will need to specify it on your command line.

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLY
0
Entering edit mode

Hi, @genomax I added the read group successfully. But something odd happened, my samples number kinda downed. It was supposed to be 295 but after adding the read group my samples number downed to 291. The program starts now like: "[mpileup] 291 samples in 295 input files".

ADD REPLY
0
Entering edit mode

my first thought would then be that your read group IDs are not unique among all samples (that is thus likely the reason why the number of samples decreased) and they are thus recognised as the same sample.

Double check if the read groups you added are all unique.

ADD REPLY
0
Entering edit mode

Yes, I thought about that. I ran mpileup before for a smaller set of those samples and this shrinking in the number of samples didn't happened. I'm using a bash script to prospect the read group of those samples, which I got from this topic. The part where I read the .fastq file to get read group is:

header=$(gunzip -c ${arr[$i]} | head -n 1)

id=$(echo $header | head -n 1 | cut -f 1-4 -d":" | sed 's/@//' | sed 's/:/_/g')

sm=$(echo $header | head -n 1 | grep -Eo "[ATGCN]+$")

echo "Read Group read@RG\tID:$id\tSM:$id"_"$sm\tLB:$id"_"$sm\tPL:ILLUMINA"
ADD REPLY
0
Entering edit mode

hmm, if the header you extract is not unique (can't make this up from what you posted) , there is a real chance you read groups will also not be unique.

can you add more info so we can have a more detailed look at what's going on?

ADD REPLY
0
Entering edit mode

So, I did some digging on my raw reads and found 5 samples with the exact same read group naming. Since 5 samples are counting as 1 that explains the math of having "4 less samples" in the end.

I know they came from different individuals because there is a table with the code names for each archive/sample. So, what is the technical explanation for that? Why would my sequence provider do such thing?

ADD REPLY
0
Entering edit mode

that we can't say for you I'm afraid ... the only one that can answer that is your sequence provider ;-)

perhaps they did some technical replicates because the initial yield was not enough?

ADD REPLY
0
Entering edit mode

Sequence providers generally pass names of samples along as submitted. Is it possible that same sample may have been run multiple times (for whatever reason)? If you know for sure that samples came from different individuals then you would want to account for that by changing the names accordingly.

ADD REPLY
2
Entering edit mode
4.1 years ago

Many moons ago, I did a lot of variant calling with a simple mpileup-based pipeline. That software can take a list of separate bam files and call the SNPs all together. I didn't need read groups. However, I learned when I wanted to use a different software, FreeBayes, that that software doesn't take in a list of bams. It wants all the data combined into one bam, with read groups distinguishing the samples. So I used picard tools to add read groups to my bams, and then combined them, and everything was fine.

So don't worry. If your pipeline works fine, it's fine. If someday you have to add read groups, that can be done to the bam; you won't have to regenerate it.

ADD COMMENT
1
Entering edit mode

Freebayes at least now takes a list of BAMs.

My worst read group experience was having to match bam read groups and VCF read groups exactly with a program for allele-specific RNA-seq across many samples.

@magnuskerber, there is a program called addrg in conda I believe (part of freebayes ?) where you can very simply add read groups. As Pierre says, it costs little and potentially adds value to your data (eg if doing multisample SNV calling in future with Freebayes).

ADD REPLY

Login before adding your answer.

Traffic: 2941 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6