Question: NGS data simulation: VarSim or BAMSurgeon?
user230613 wrote:

Hi there,

I want to generate NGS data to run some tests and benchmarks for both germline and somatic variant calling. I've read a lot of papers about different tools and their benchmarks, but I would like your feedback. After reading the papers, I have narrowed it down to two tools: VarSim and BAMSurgeon.

  • BAMSurgeon takes pre-existing BAM files and adds new variants to them. It has been widely used in the DREAM challenge for testing variant calling algorithms, so I assume that it works well. The advantage of starting from pre-existing BAM files is that you can use real data and then introduce new variants for the benchmarking.
  • VarSim, on the other hand, generates read files from a reference genome and a set of variants. All the data here is purely simulated (well, the variants can be random or previously described ones), and the advantage is that you can control different types of error (such as sequencing errors). Also, because the output is FASTQ files, it is possible to test a full alignment + variant calling workflow.
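
To make the reference-based approach concrete, here is a minimal toy sketch in Python of what a read simulator does in principle: plant SNVs into a reference sequence, sample fixed-length reads from it, and add substitution errors. It is not how VarSim or ART are implemented. The function name, the `variants` dictionary format and the uniform per-base error model are all assumptions made for illustration; real simulators use empirical, position-dependent error profiles and also emit base qualities.

```python
import random


def simulate_reads(reference, variants, n_reads, read_len, error_rate, seed=0):
    """Sample fixed-length reads from a haplotype carrying the given SNVs.

    variants: dict mapping 0-based position -> alternate base
              (hypothetical format, chosen for this sketch).
    error_rate: per-base substitution probability (uniform; a simplification).
    Returns a list of (start, sequence) tuples.
    """
    rng = random.Random(seed)

    # Build the mutated haplotype by applying each SNV to the reference.
    haplotype = list(reference)
    for pos, alt in variants.items():
        haplotype[pos] = alt
    haplotype = "".join(haplotype)

    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(haplotype) - read_len + 1)
        bases = []
        for base in haplotype[start:start + read_len]:
            if rng.random() < error_rate:
                # Sequencing error: substitute one of the other bases.
                base = rng.choice([b for b in "ACGT" if b != base])
            bases.append(base)
        reads.append((start, "".join(bases)))
    return reads
```

Because every read is drawn from a known haplotype, the planted positions double as the truth set.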

In the end, what I would like to have is a set of tumor/normal paired FASTQ files with a true.vcf truth set, and then be able to adjust different parameters such as _clonality, heterogeneity, contamination, sequencing error..._
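
As a rough illustration of how these parameters interact, the sketch below computes the allele fraction one would expect for a somatic variant under a simple mixture model. It assumes that "contamination" means normal cells in the tumor sample (1 - purity) and that "clonality" is the fraction of tumor cells carrying the variant. The function and parameter names are hypothetical and do not come from VarSim or BAMSurgeon.

```python
def expected_vaf(purity, ccf, mutated_copies=1, tumor_cn=2, normal_cn=2):
    """Expected variant allele fraction under a simple mixture model.

    purity:         fraction of cells in the sample that are tumor cells
                    (normal contamination = 1 - purity).
    ccf:            fraction of tumor cells carrying the variant (clonality).
    mutated_copies: copies of the variant allele per mutated tumor cell.
    tumor_cn:       total copy number at the locus in tumor cells.
    normal_cn:      total copy number at the locus in normal cells.
    """
    if not (0.0 <= purity <= 1.0 and 0.0 <= ccf <= 1.0):
        raise ValueError("purity and ccf must be between 0 and 1")
    # Variant alleles contributed by mutated tumor cells.
    mutant = purity * ccf * mutated_copies
    # All alleles at the locus, from tumor and normal cells together.
    total = purity * tumor_cn + (1.0 - purity) * normal_cn
    return mutant / total
```

For example, a heterozygous variant present in half of the tumor cells of a 60% pure diploid sample gives `expected_vaf(0.6, 0.5)`, i.e. 0.15. A value like this is what you would plant for each variant in a simulator that accepts per-variant allele fractions.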

Sorry if the question is too open or broad. I'd like to hear suggestions and personal experiences about the best way to generate this kind of data. If it is specific to exome/targeted sequencing, even better.

Thank you in advance,

bamsurgeon simulation varsim • 2.6k views
modified 2.8 years ago by d-cameron • written 2.8 years ago by user230613
d-cameron wrote:

For somatic SV simulation, I'm yet to find a tool that can generate realistic data. The problem with simulating reads from the reference genome is that you present your variant caller with a much easier problem than actual data does. Real data is much messier (especially in repetitive sequence), and by simulating reads from the reference you will overestimate your variant caller's performance.

BAMSurgeon probably comes the closest to realistic data since it uses existing sequencing data, but the types of SV events it can simulate are very limited and it does not handle some important classes of cancer driver mutations such as inter-chromosomal gene fusions. Additionally, the alignment-based event insertion approach taken by BAMSurgeon is not appropriate for repetitive regions, because it assumes that the reads originating from the region where the event is to be simulated are correctly mapped to that region.

That said, I've used ART for SV simulation off hg19, but as you can see from my benchmarking results (http://shiny.wehi.edu.au/cameron.d/sv_benchmark/), ROC curves for the simulated variants are vastly better than the ROC curves for real data. The simulations are useful for determining best-case variant caller performance (e.g. the smallest event size detectable by SV caller X), but should not be taken as reflecting performance on actual data.
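
For reference, this kind of curve can be computed from a truth set and a scored call set roughly as follows. This is a minimal sketch, not the code behind the benchmark above. It assumes variants are matched by an exact key such as (chrom, pos, ref, alt), which is workable for SNVs; SV benchmarking would instead need breakpoint matching with some tolerance. All names are hypothetical.

```python
def precision_recall_curve(calls, truth, thresholds):
    """Precision and recall of a call set at each quality threshold.

    calls:      list of (variant_key, quality) tuples from the caller.
    truth:      set of variant_key values that were simulated.
    thresholds: iterable of minimum quality values to evaluate.
    Returns a list of (threshold, precision, recall) tuples.
    """
    points = []
    for t in thresholds:
        # Keep only calls at or above the current quality threshold.
        kept = {key for key, qual in calls if qual >= t}
        tp = len(kept & truth)   # called and simulated
        fp = len(kept - truth)   # called but not simulated
        fn = len(truth - kept)   # simulated but not called
        # With no calls kept, precision is reported as 1.0 by convention.
        precision = tp / (tp + fp) if kept else 1.0
        recall = tp / (tp + fn) if truth else 0.0
        points.append((t, precision, recall))
    return points
```

Running the same function on a simulated and a real-data truth set makes the gap between the two directly visible.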

These issues may be less problematic for SNV and small indel variants.

modified 2.8 years ago • written 2.8 years ago by d-cameron

Do you mean VarSim+ART when you say that you used ART?

modified 2.2 years ago • written 2.2 years ago by yeinhorn

Just ART from FASTA files. I created a script to generate the FASTA files since VarSim only supports simple ins/del/dup/inv SVs.

Entire classes of somatic mutations (gene fusion, chromoplexy/chromothripsis/breakage-fusion-bridge, double minutes, ...) were missing from the simulators the last time I checked. By far the biggest issue I had with somatic simulations was the lack of aneuploidy and inter-chromosomal rearrangements. The majority of the cancers I've analysed were most definitely not simple diploid genomes with some SNVs and simple local rearrangements thrown in. 50+ copies of an unmutated oncogene is not unexpected for cancers showing signs of chromothripsis/breakage-fusion-bridge.
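
For readers who want to try the FASTA route, here is a minimal sketch of building rearranged contigs that a FASTA-based read simulator could then sample from. It is not the script mentioned above. The helper names are hypothetical, the fusion only joins two forward-strand pieces, and the tandem amplification is a deliberate simplification of real amplification mechanisms.

```python
def fuse(contigs, left, left_end, right, right_start):
    """Inter-chromosomal fusion: contigs[left][:left_end] joined to
    contigs[right][right_start:] (forward orientation only)."""
    return contigs[left][:left_end] + contigs[right][right_start:]


def amplify(seq, start, end, copies):
    """Repeat seq[start:end] in tandem so that the segment is present
    `copies` times in total."""
    return seq[:start] + seq[start:end] * copies + seq[end:]


def write_fasta(records, handle, width=60):
    """Write a {name: sequence} mapping as line-wrapped FASTA."""
    for name, seq in records.items():
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

The breakpoint coordinates passed to `fuse` and `amplify` are the ground truth for the simulated events, so they should be recorded alongside the FASTA.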

modified 2.2 years ago • written 2.2 years ago by d-cameron

I'm wondering whether http://shiny.wehi.edu.au/cameron.d/sv_benchmark/ is still available. I'm not able to see the results.

written 12 months ago by tingting.gong

Unfortunately not. We do have a benchmarking paper with more comprehensive results coming out soon.

written 12 months ago by d-cameron
Joseph Hughes wrote:

Here is a recent paper that reviews different NGS read simulators. I think the decision tree figure is useful.

I had a related question and ended up using ART.

modified 22 months ago • written 2.8 years ago by Joseph Hughes