Question: How to discover new mutations in a derived strain relative to a reference genome?
gravatar for vladimir.gritsenko
3.8 years ago by
Tel Aviv University
vladimir.gritsenko70 wrote:

(Warning: newbie question.)

The problem: we have a reference strain (diploid) and three experimental strains derived from it. We have the FASTA sequence of the reference, with both homologs for each chromosome. We also sequenced the three experimental strains (in FASTQ format). We need to know the difference between the experimental strain and the reference strain - what genetic changes did they undergo (if any)?

As I see it, the straightforward way would be to get a list of all SNPs in the reference strain (as a VCF file), then perform SNP calls on each strain, then filter out the reference SNPs from the strain SNPs. I figured out how to do the SNP calls on the strains and I found tools that can filter a VCF against another VCF, but I don't know how to perform the first step - getting a list of SNPs from the reference FASTA. How can this be done? Alternatively, what are other standard ways of discovering mutations?

(There was also a suggestion of comparing the VCFs of the three strains against each other, but that sounds inelegant. What if we only had one derived strain?)

Thanks in advance!

sequencing genome • 1.6k views
ADD COMMENTlink modified 3.7 years ago by Len Trigg1.2k • written 3.8 years ago by vladimir.gritsenko70

if you have your reference strain as fasta, why don't you use it as your actual "reference" in the analysis. I mean, you can for example align your experimental strain fastq reads to it and then perform a snp calling on this alignment. I don't know which programs did you use, but I think this is always possible and it is the most reasonable approach.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Martombo2.4k

I'm a little confused as to how this would work. When I do a SNP call, I do it against a haplotype, not the full diploid reference. Since there's no special meaning to chromosomes in FASTA (or is there?), wouldn't the homologs be counted as different chromosomes, which would jumble the whole thing?

ADD REPLYlink written 3.8 years ago by vladimir.gritsenko70

I see...
sorry, I didn't think about that. 
talking about the mapping step: having both versions of the chromosomes should not be a problem. if a read has a variation that is present in one of the two alleles, it will be mapped only to that one and no mismatch would be observed. But I'm not an expert on SNP calling, so I will let others comment on that step.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Martombo2.4k
gravatar for mark.ziemann
3.7 years ago by
mark.ziemann1.1k wrote:

Varscan has a somatic mutation caller tool. Check this from the varscan page:

java -jar VarScan.jar somatic normal.pileup tumor.pileup output.basename

The above command will report germline, somatic, and LOH events at positions where both normal and tumor samples have sufficient coverage (default: 8). If the --validation flag is set to 1, it also report non-variant (reference) calls at all positions with sufficient coverage, for the purpose of refuting false positives with validation data. 


ADD COMMENTlink written 3.7 years ago by mark.ziemann1.1k

Please correct me if I'm wrong, but this compares between two pileups. In our case, the reference (the parent) strain already has a consensus sequence in the form of a FASTA file. I think that if I figure out how to get a VCF out of a diploid FASTA, I can then compare the VCFs directly.

ADD REPLYlink written 3.7 years ago by vladimir.gritsenko70
gravatar for Len Trigg
3.7 years ago by
Len Trigg1.2k
New Zealand
Len Trigg1.2k wrote:

RTG Core includes a new cell lineage caller that may be a close fit for what you want, since it supports more than two samples (strains) simultaneously.

The lineage command performs Bayesian joint variant calling on all the samples (strains) simultaneously and flags variants within any derived sample as de-novo if they are new with respect to the parent sample (annotated with a de-novo-specific confidence score). You supply a PED file that describes the lineage structure you want to call with respect to (e.g. depending on your experimental setup your three experimental strains may be separately derived from the reference strain, or in a serial progression). (With a simple two-sample original/derived scenario it's quite similar to a somatic caller, but we do also have a special purpose somatic caller for that)

RTG Core is free for non-commercial use, and is very easy to use, so give it a go. Feel free to get in touch to let us know how you get on.


ADD COMMENTlink written 3.7 years ago by Len Trigg1.2k

If I understand correctly, the lineage approach is similar to comparing VCFs of the derived strains. It also requires an SDF file, and I'm unfortunately not familiar with this format, and the pedigree file is perhaps possible to create, but these aren't human cells, but yeast cells, who reproduce asexually.

Regarding the somatic caller, that may work, but we have a consensus sequence FASTA file for the reference genome, not a BAM.

ADD REPLYlink written 3.7 years ago by vladimir.gritsenko70

[The PED file is just a convenient format for specifying the lineage (the user manual describes the interpretation in the case of original/derived as you have, but obviously it also handles relationships from sexual reproduction), and rtg SDFs are created with rtg format to provide random access, avoid repeated parsing, etc (think of it analogous to bwa index).]

The lineage approach is conceptually better than comparing VCFs as it incorporates all the data jointly when making the calls and produce calls that overall more consistent.

Your main difficulty here is that you are using reference containing both homologs, and current variant calling tools (and even mappers really) assume that the reference is a haploid representation. Mapping to a reference containing both homologs should (ignoring sequencing error) appropriately assign reads to the correct homolog at sites that are variant in the reference. However it depends on the mapper as to what it would do with reads that align equally well to both homologs (and this is highly likely to happen with reads that are de novo variant in the strains). Generally they will all assign low mapq to the mappings, leading variant callers to essentially disregard those mappings.

One quick way that might work would be to subset your reference (R), taking one of the homologs as "canonical" to make a haploid reference (R'), then simulate reads from R, mapping, variant calling (possibly joint lineage) all the samples (both the simulated reference reads and the real strain data) with respect to R'. (rtg includes all the components you would need to do that). It's probably quick and easy enough to give it a try anyway.

Another approach is to look into assembly graph construction techniques that you could feed your diploid reference into in order to construct a reference graph, and then experiment with bleeding edge reference graph mappers/variant callers (I haven't tried it, but vg may be useful).

Hope this helps!


ADD REPLYlink written 3.7 years ago by Len Trigg1.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2444 users visited in the last hour