I'm currently working on a project comparing currently available tools/algorithms for ctDNA analysis (circulating tumour DNA, the small tumour-derived fraction of the cell-free DNA present in plasma). I have plasma, buffy coat and tumour data for each of my patients, but I cannot think of a good way to assess the results of the tools I am using to analyse the plasma for tumour content. I am focusing on CNAs so far, but due to tumour heterogeneity I am hesitant to only "confirm" findings in plasma that are also present in the tumour samples. Another suggestion from my research team was to use the results of their current preferred analytical tool, ichorCNA, as the ground truth and benchmark all other tools against it.
So my current plan is to run a few different popular tools on the tumour data, identify a couple of large amplifications/deletions, and somehow synthetically embed them into the buffy coat data. This could be done, for example, by sampling BAF and logR values at several non-consecutive heterozygous SNPs within the CNA events and substituting them for the values at consecutive het SNPs in the buffy coat data. I could then admix the resulting synthetic tumour samples with the original "clean" buffy coat data at known ratios to simulate the small but varying amounts of tumour-derived DNA found in plasma samples. Unfortunately this would not capture the fragment-length differences that are characteristic of cfDNA, but it would at least give me a solid ground-truth dataset as a basis for comparison.
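For reference, the expected logR and BAF at a given admixture ratio follow directly from the mixing proportions (the standard allele-specific copy-number mixture model). A minimal sketch; the copy numbers in the example are made up for illustration:

```python
# Expected logR and BAF at heterozygous SNPs inside a CNA event, for a
# mixture of tumour-derived DNA (fraction f) with normal diploid DNA.
# total_cn = total copy number in the tumour, minor_cn = minor-allele copies.
from math import log2

def expected_logr(total_cn: int, f: float) -> float:
    # average copy number of the mixture, relative to diploid
    return log2((f * total_cn + (1 - f) * 2) / 2)

def expected_baf(total_cn: int, minor_cn: int, f: float) -> float:
    # fraction of reads carrying the minor allele at a het SNP
    return (f * minor_cn + (1 - f) * 1) / (f * total_cn + (1 - f) * 2)

# e.g. a one-copy deletion (total 1, minor 0) at 10% tumour fraction:
print(round(expected_logr(1, 0.10), 3))    # -0.074
print(round(expected_baf(1, 0, 0.10), 3))  # 0.474
```

This also makes it easy to sanity-check how subtle the signal becomes at realistic ctDNA fractions (a few percent), which is worth knowing before choosing the dilution points.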
My question is: does anyone know how to go about this? The few articles I have found that attempted something similar have extremely vague methods sections that were of no help at all. One article mentioned using BAMsurgeon, but beyond that I am at a bit of a loss.
Any help or advice is highly appreciated.
Thanks for your reply. Yes, I suppose you're right about using the tumour sequences. I was hoping to avoid the effects of normal-cell contamination in the solid tumour biopsies, but those will of course still be present if I sample CNA events from the solid tumour anyway. Do you have any recommendations for how to admix my samples at varying ratios in silico? It would be especially helpful if I could choose which tumour reads to include, so that I have control over exactly how many events end up in the synthetic dataset, and over their sizes/locations.
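For what it's worth, the bookkeeping for a dilution series is simple. A sketch, assuming you get per-BAM read counts from `samtools flagstat` and then subsample with `samtools view -s` (which takes a keep-fraction); note this treats tumour fraction as a read fraction, which ignores ploidy differences between tumour and normal genomes:

```python
# Compute per-BAM subsampling fractions for an in-silico dilution series.
# n_tumour / n_normal: read counts in the source BAMs.
# The returned fractions can be passed to `samtools view -s SEED.FRACTION`.

def dilution_fractions(n_tumour: int, n_normal: int,
                       tumour_fraction: float, n_total: int):
    """Fractions of each BAM to keep so the mixture has n_total reads,
    of which tumour_fraction come from the tumour sample."""
    frac_t = tumour_fraction * n_total / n_tumour
    frac_n = (1 - tumour_fraction) * n_total / n_normal
    if frac_t > 1 or frac_n > 1:
        raise ValueError("not enough reads in one of the source BAMs")
    return frac_t, frac_n

# 1% tumour fraction at 100M reads total, from two 300M-read source BAMs:
ft, fn = dilution_fractions(300_000_000, 300_000_000, 0.01, 100_000_000)
print(ft, fn)  # ~0.00333 and 0.33
```

To control which events are included, you could first restrict the tumour BAM to reads overlapping your chosen CNA regions (samtools view accepts region arguments on an indexed BAM) and spike in only those, leaving the rest of the genome purely buffy coat.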
On the topic of cfDNA fragment lengths: non-tumour-derived cfDNA shows remarkably consistent fragment lengths (modal ~166 bp), whereas tumour-derived cfDNA has been shown to be shorter on average (with a minor population of much longer fragments, according to some publications). It would be interesting to include this in the synthetic dataset if possible, since many variant callers use fragment length as a feature for ctDNA identification, but I am assuming this will be too difficult to simulate.
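A crude first approximation is just a two-component length model. The parameters below are illustrative placeholders, not fitted values, and real cfDNA size profiles are more structured (10 bp periodicity below the mode, a di-nucleosomal peak), but it shows the idea:

```python
# Toy bimodal cfDNA fragment-length model: healthy-derived fragments
# centred on ~166 bp, tumour-derived fragments shorter on average.
# Means/SDs are illustrative placeholders, not fitted to real data.
import random

def draw_fragment_length(tumour_fraction: float, rng: random.Random) -> int:
    if rng.random() < tumour_fraction:
        return max(50, round(rng.gauss(145, 25)))  # tumour-derived, shorter
    return max(50, round(rng.gauss(166, 20)))      # wild-type cfDNA

rng = random.Random(42)
lengths = [draw_fragment_length(0.05, rng) for _ in range(10_000)]
print(sum(lengths) / len(lengths))  # mean sits a little below 166
```

Such a distribution could then be fed to any read simulator that accepts an empirical insert-size distribution, rather than a single mean/SD.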
I was not aware of this thread of the literature; thank you for sharing. To simulate this difference you could take a VCF-aware read simulator (such as SimuSCoP or NEAT), simulate germline mutations at one insert size and germline+somatic mutations at a separate (shorter) insert size, and admix the reads at the appropriate ratio. To account for sub-clonal tumour fractions you may need to simulate several germline+somatic pairs, one per subclone (i.e., each containing the somatic variants present at X% cellular fraction and above), or use a simulator capable of modelling tumour fractions directly.
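To make the per-subclone scheme concrete, here is a sketch of the read-count allocation across the simulated sources; the clone names and proportions are hypothetical, and "proportion" here means each clone's share of the tumour-derived reads:

```python
# How many reads to simulate from each source when mixing one
# germline-only simulation with one germline+somatic simulation per
# subclone. Clone names/proportions below are hypothetical examples.

def reads_per_source(n_total: int, tumour_fraction: float,
                     clone_proportions: dict) -> dict:
    assert abs(sum(clone_proportions.values()) - 1) < 1e-9
    counts = {"germline_only": round(n_total * (1 - tumour_fraction))}
    for clone, p in clone_proportions.items():
        counts[clone] = round(n_total * tumour_fraction * p)
    return counts

print(reads_per_source(1_000_000, 0.02, {"clonal": 0.7, "subclone_A": 0.3}))
# {'germline_only': 980000, 'clonal': 14000, 'subclone_A': 6000}
```

The point of doing the arithmetic up front is that the per-clone read counts at low tumour fractions get small quickly, which tells you how deep the overall simulation needs to be.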
As for the actual mixing: this can basically be done by randomly downsampling each simulated FASTQ to the appropriate number of reads and then concatenating the files.
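A minimal sketch of the downsampling step, operating on plain FASTQ streams (4 lines per record). For paired-end data you would need to make the same keep/drop decision for R1 and R2, e.g. by deciding on the read name; tools like `seqtk sample` with a fixed `-s` seed handle that for you:

```python
# Downsample a FASTQ stream by keeping each record (4 lines) with
# probability `frac`; concatenating several downsampled outputs then
# gives the mixture. Returns the number of records kept.
import io
import random

def downsample_fastq(infile, outfile, frac: float, seed: int = 0) -> int:
    rng = random.Random(seed)
    kept = 0
    while True:
        record = [infile.readline() for _ in range(4)]
        if not record[0]:  # end of stream
            break
        if rng.random() < frac:
            outfile.writelines(record)
            kept += 1
    return kept

# toy demo: two records, keep everything (frac=1.0)
fq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n")
out = io.StringIO()
print(downsample_fastq(fq, out, 1.0))  # 2
```

Exact read counts (rather than expected counts) would need reservoir sampling or a two-pass approach, but for typical read numbers the binomial variation is negligible.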