Hi all, I am trying to extract a gene sequence from 50 S. aureus WGS (from https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP098767) in order to test for positively selected sites in dihydropteroate synthase. In brief, I align them using bwa against the reference found in the study, call variants with samtools and bcftools, then extract their consensus sequence using bcftools consensus.
To get the gene I want, I create a blast db from each consensus sequence and blast against it with my reference gene sequence. The problem is: every blast result is a 100% match, 100% identities and no gaps.
It seems fishy to me that there is not a single variant found. My query gene is dihydropteroate synthase, so maybe it is important enough to be little to no variation within a species. But then again, maybe I don't fully understand getting a consensus sequence. The VCF files of each sample share many variants (I think because of the distance each sample is from the reference genome) but do have a few variants unique to itself.
I really want to have some variants in my data or else the positive selection testing is useless. Perhaps I should just grab some samples from S. aureus species closely related to what I already have.
Thoughts on how to best move forward to get many sequences of this gene with some variation (but not too much or else the positive selection testing is useless again)?