Question: Virus mutation definition sequencing
0
gravatar for Anna S
4.4 years ago by
Anna S500
PSU
Anna S500 wrote:

Hello,

What is the customary mutation threshold for considering a site to be mutated in a virus?

I'm working on a project where apparently viruses have different mutations depending on the ethnic background of the host.  When I mapped the virus sequences to the virus genome, I noticed that there were indeed several mutations at >= 20%  (that is, >= 20% of the reads have a different nucleotide at a particular genome position than the reference nucleotide).  I was wondering what is the customary threshold used?  This is a new field for me and I'm trying to read the literature to figure this out, but I know how awesome you biostars are, and so here I am!

Thank you!

Anna

sequencing mutation virus • 1.4k views
ADD COMMENTlink modified 4.4 years ago by pld4.8k • written 4.4 years ago by Anna S500
2
gravatar for pld
4.4 years ago by
pld4.8k
United States
pld4.8k wrote:

You'll want to actually get into the variant calling. Although GATK isn't designed for haploid organisms (ssRNA viruses for me), I've had success using that pipeline to call variants.

ADD COMMENTlink modified 4 months ago by RamRS26k • written 4.4 years ago by pld4.8k

Hi, what do you use for known sites, an empty file? The BaseRecalibrator step "does a by-locus traversal operating only at sites that are not in dbSNP. We assume that all reference mismatches we see are therefore errors and indicative of poor base quality". Is this an appropriate assumption for viruses? Thanks.

ADD REPLYlink modified 4 months ago by RamRS26k • written 4.4 years ago by Anna S500

BaseRecalibrator requires a list of validated variants, such as dbSNP. There are somewhat arbitrary guidelines for creating your own list (see here), but it's probably not relevant for your experiment.

ADD REPLYlink written 4.4 years ago by harold.smith.tarheel4.5k

Thanks. IndelRealigner and BaseRecalibrator are not relevant. So I went directly from remove duplicates to HaplotypeCaller. The command ran successfully with no errors but resulted in 0 variants! Perhaps because a mutation rate of 20% at a genome position is not considered 'significant' for humans, which is what GATK was designed for?

ADD REPLYlink modified 4 months ago by RamRS26k • written 4.4 years ago by Anna S500

I'll try next creating a vcf file with the mutations I identified manually and using this as a "known" vcf db file and see if GATK calls these.

ADD REPLYlink written 4.4 years ago by Anna S500

FYI. Not called even when in 'known' vcf file.I think I need to go the old fashioned way and read the literature!!

ADD REPLYlink modified 4 months ago by RamRS26k • written 4.4 years ago by Anna S500

You have to remember that calling variants isn't as simple as counting the number of bases at a given position that don't match the reference. You have to consider the quality of the read, mapping quality, location of the aligned position within the read and etc.

You wouldn't want to call SNPs off of low quality bases located in the 3' end of the read, they're likely to be

ADD REPLYlink written 4.4 years ago by pld4.8k

Give me a moment, I'll pull up a script I have. It won't translate directly, and include the whole process from raw fq data, but you can see the steps I took. Check my OP.

EDIT: Sorry, it is very long, I'll give you the gist here.

Assuming you've assembled and etc:

  1. Generate raw VCF using samtools mpileup
  2. Use picard to mark dups
  3. GATK, follow their pipeline: RealignerTargetCreator -> IndelRealigner -> BaseRecalibrator -> AnalyzeCovariates (optional, but a good QC step) -> PrintReads -> UnifiedGenotyper -> VariantsToTable

The step VariantsToTable is what give you the final VCF, with properly called variants.

ADD REPLYlink modified 4 months ago by RamRS26k • written 4.4 years ago by pld4.8k

Thank you, Joe!

ADD REPLYlink written 4.4 years ago by Anna S500
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1949 users visited in the last hour