Question: Suspiciously Few Variants in BLAST Target Sequence?
0
gravatar for A Soggy Waffle
16 months ago by
A Soggy Waffle0 wrote:

Hi all, I am trying to extract a gene sequence from 50 S. aureus WGS (from https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP098767) in order to test for positively selected sites in dihydropteroate synthase. In brief, I align them using bwa against the reference found in the study, call variants with samtools and bcftools, then extract their consensus sequence using bcftools consensus.

To get the gene I want, I create a blast db from each consensus sequence and blast against it with my reference gene sequence. The problem is: every blast result is a 100% match, 100% identities and no gaps.

It seems fishy to me that there is not a single variant found. My query gene is dihydropteroate synthase, so maybe it is important enough to be little to no variation within a species. But then again, maybe I don't fully understand getting a consensus sequence. The VCF files of each sample share many variants (I think because of the distance each sample is from the reference genome) but do have a few variants unique to itself.

I really want to have some variants in my data or else the positive selection testing is useless. Perhaps I should just grab some samples from S. aureus species closely related to what I already have.

Thoughts on how to best move forward to get many sequences of this gene with some variation (but not too much or else the positive selection testing is useless again)?

sequence • 338 views
ADD COMMENTlink modified 16 months ago by h.mon28k • written 16 months ago by A Soggy Waffle0

That sounds convoluted to me.

I would align the reads you have against the reference (but be careful if reference isn't as close as you think - you may want to de novo assemble).

Once youve done that you can 'slice up' the BAMs or VCFs etc by location to get the gene you're interested in.

No need to mess about with your consensus sequence or anything.

ADD REPLYlink modified 16 months ago • written 16 months ago by Joe15k

As jrj.healey said, stop you workflow at call variants with samtools and bcftools. Then look for proper filters for applying to bacterial population datasets. The VCF file will contain the information you want (the variants) and more, like their frequency, etc. Extract consensus with bcftools consensus will bias towards the more common alleles.

I had a quick look, and novobiocin and oxacillin (the antibiotics used in the study) don't seem to target the gene you are interested in, so one would not expect much variation in it, specially SNPs reaching high frequencies.

ADD REPLYlink written 16 months ago by h.mon28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 654 users visited in the last hour