So, I'm working through the process of learning how to perform variant calling, doing some reading, tutorials, etc. I'm using the "BWA-MEM -> samtools mpileup" strategy to do so.
I noticed that some people use bwa-mem with the -R option, which adds read group information to the samples. But I didn't get how useful that is in my case, because other people don't apply the -R option at all. Basically I'm analyzing plant WGS data. The samples are all siblings derived from a cross. It's paired-end data and my sample nomenclature follows the pattern:
SampleID_plateID_wellID_R1.fastq.gz
So, I already generated a VCF file containing all the identified SNPs. I ran BWA-MEM without the -R option. No warning was generated, and mpileup didn't complain about anything. So, is it best for me to use the -R option, or wouldn't that change much?
Thanks for the help.
It doesn't cost much to add a read-group and you might need it in the future...
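As a minimal sketch of how cheap it is to add (the file name, sample name and tag values below are made up, following the SampleID_plateID_wellID_R1.fastq.gz pattern from the question), the read-group string can be derived from the file name and passed to bwa mem:

```shell
# Hypothetical example: build an @RG string from the file name and pass it
# to bwa mem with -R. SampleA/P1/A01 and ref.fa are placeholder names.
fq="SampleA_P1_A01_R1.fastq.gz"
base="${fq%_R1.fastq.gz}"            # SampleA_P1_A01 (unique per file)
sample="${base%%_*}"                 # SampleA (the biological sample)
rg="@RG\tID:${base}\tSM:${sample}\tPL:ILLUMINA\tLB:${base}"
printf '%s\n' "$rg"
# bwa mem -R "$rg" ref.fa "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" > "${base}.sam"
```

bwa interprets the literal \t sequences as tab characters when writing the SAM header, so the string is passed exactly as shown.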
Hi guys, thank you all for the answers. So I figured it's better to add the read groups. I was reading some articles, including this one https://gatk.broadinstitute.org/hc/en-us/articles/360035890671?id=6472 but I didn't get how bwa-mem gets the lane/well/run information for the samples and assigns the groups. I believe this information is encoded in the FASTQ files, is that correct?
Some of this information is indeed in the FASTQ headers for Illumina data. You can find an example here. Aligners don't pick that information up automatically, though. You will need to specify it on your command line.
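To make that concrete, here is a sketch of pulling the instrument, run, flowcell and lane out of the first header line of a CASAVA 1.8+ Illumina FASTQ (@instrument:run:flowcell:lane:tile:x:y ...). The example read name is made up:

```shell
# Hypothetical example: split an Illumina read header on ':' to recover
# the fields that typically go into the read group (e.g. PU:flowcell.lane).
header="@M00123:45:000000000-ABCDE:1:1101:15589:1342 1:N:0:ATCACG"
IFS=: read -r instrument run flowcell lane _ <<EOF
${header#@}
EOF
printf 'PU:%s.%s\n' "$flowcell" "$lane"
```

In practice you would read the header from the file itself, e.g. with `zcat sample_R1.fastq.gz | head -1`.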
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

Hi @genomax, I added the read group successfully. But something odd happened: my sample count went down. It was supposed to be 295, but after adding the read groups it dropped to 291. The program now starts with: "[mpileup] 291 samples in 295 input files".
my first thought would be that your read group IDs are not unique across samples; files sharing a read group are recognised as the same sample, which is likely why the number of samples decreased.
Double check if the read groups you added are all unique.
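One way to check this (a sketch; the @RG lines are simulated here with a variable, but in practice you would collect them with `samtools view -H file.bam` for each BAM) is to print any SM value that occurs more than once:

```shell
# Hypothetical example: find duplicated sample names among @RG header lines.
# Fields are tab-separated in real headers; spaces are used here for readability.
rg_lines='@RG ID:S1_P1_A01 SM:S1
@RG ID:S2_P1_A02 SM:S2
@RG ID:S3_P1_A03 SM:S1'
dups=$(printf '%s\n' "$rg_lines" | grep -o 'SM:[^[:space:]]*' | sort | uniq -d)
printf '%s\n' "$dups"
```

Any line printed is a sample name shared by more than one read group, so those files will be merged into one sample by mpileup.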
Yes, I thought about that. I ran mpileup before on a smaller set of those samples and this shrinking of the sample count didn't happen. I'm using a bash script to extract the read group for those samples, which I got from this topic. The part where I read the .fastq file to get the read group is:
hmm, if the header you extract is not unique (I can't tell from what you posted), there is a real chance your read groups will also not be unique.
can you add more info so we can have a more detailed look at what's going on?
So, I did some digging in my raw reads and found 5 samples with exactly the same read group name. Since those 5 samples are counted as 1, that explains the math of ending up with 4 fewer samples.
I know they came from different individuals because there is a table with the code names for each archive/sample. So, what is the technical explanation for this? Why would my sequence provider do such a thing?
that we can't tell you, I'm afraid ... the only one who can answer that is your sequence provider ;-)
perhaps they did some technical replicates because the initial yield was not enough?
Sequence providers generally pass sample names along as submitted. Is it possible that the same sample was run multiple times (for whatever reason)? If you know for sure that the samples came from different individuals, then you would want to account for that by changing the names accordingly.
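If you do need to rename, one way (a sketch; the file name, "rep" suffix and tag values are assumptions, not your actual data) is to build a unique ID/SM per file and rewrite the read group with `samtools addreplacerg`:

```shell
# Hypothetical example: disambiguate a duplicated sample by suffixing a
# replicate number, then rewrite its read group in the BAM.
bam="SampleA_P1_A01.bam"
rep=2                                   # e.g. second occurrence of this sample
sample="${bam%%_*}_rep${rep}"           # SampleA_rep2
newrg="@RG\tID:${bam%.bam}_rep${rep}\tSM:${sample}\tPL:ILLUMINA"
printf '%s\n' "$newrg"
# samtools addreplacerg -r "$newrg" -o "${bam%.bam}.rg.bam" "$bam"
```

This avoids re-running the alignment just to fix the header. Alternatively, `samtools reheader` can patch the @RG lines directly.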