Question

Suspiciously Few Variants in BLAST Target Sequence?

0

Entering edit mode

5.8 years ago

A Soggy Waffle • 0

Hi all, I am trying to extract a gene sequence from 50 S. aureus WGS (from https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP098767) in order to test for positively selected sites in dihydropteroate synthase. In brief, I align them using bwa against the reference found in the study, call variants with samtools and bcftools, then extract their consensus sequence using bcftools consensus.

To get the gene I want, I create a blast db from each consensus sequence and blast against it with my reference gene sequence. The problem is: every blast result is a 100% match, 100% identities and no gaps.

It seems fishy to me that there is not a single variant found. My query gene is dihydropteroate synthase, so maybe it is important enough to be little to no variation within a species. But then again, maybe I don't fully understand getting a consensus sequence. The VCF files of each sample share many variants (I think because of the distance each sample is from the reference genome) but do have a few variants unique to itself.

I really want to have some variants in my data or else the positive selection testing is useless. Perhaps I should just grab some samples from S. aureus species closely related to what I already have.

Thoughts on how to best move forward to get many sequences of this gene with some variation (but not too much or else the positive selection testing is useless again)?

sequence • 974 views

ADD COMMENT • link updated 5.8 years ago by h.mon 35k • written 5.8 years ago by A Soggy Waffle • 0

0

Entering edit mode

That sounds convoluted to me.

I would align the reads you have against the reference (but be careful if reference isn't as close as you think - you may want to de novo assemble).

Once youve done that you can 'slice up' the BAMs or VCFs etc by location to get the gene you're interested in.

No need to mess about with your consensus sequence or anything.

ADD REPLY • link 5.8 years ago by Joe 21k

0

Entering edit mode

As jrj.healey said, stop you workflow at call variants with samtools and bcftools. Then look for proper filters for applying to bacterial population datasets. The VCF file will contain the information you want (the variants) and more, like their frequency, etc. Extract consensus with bcftools consensus will bias towards the more common alleles.

I had a quick look, and novobiocin and oxacillin (the antibiotics used in the study) don't seem to target the gene you are interested in, so one would not expect much variation in it, specially SNPs reaching high frequencies.

ADD REPLY • link 5.8 years ago by h.mon 35k