Question

Variant calling on single-cell data



7.5 years ago

vinvan ▴ 50

As part of a study on cancer relapse after treatment, we performed single-cell RNA-seq and single-cell DNA-seq on primary tumor tissue before treatment and after relapse. My collaborators would now like to know whether certain variants found in a set of cancer driver genes in the relapsed tumor were already present as subclones before treatment, or whether new variants of these genes have emerged.

The single-cell sequencing libraries were made with one of two protocols: Smart-seq2 or G&T-seq. Coverage in the Smart-seq2 samples is low compared with the G&T-seq samples. So I now have a few hundred BAM files from samples prepared with one protocol or the other, but also quite a few questions about how to tackle this request, since I'm new to variant calling…

Naïvely, my first approach would be to merge and realign or recalibrate all the BAM files from each timepoint separately and run a variant calling pipeline on the resulting datasets. Can this work or am I missing something?

More generally, I haven't found many studies that try this particular approach. Is it even possible to do variant calling on single-cell data, or will the number of artifacts from the amplification process make it impossible to detect any true variants? I would think that any variant found in multiple cells has a high probability of being a true variant. And since G&T-seq returns data on both the genome and the transcriptome of the same cell, could this be used as a sort of extra validation to help detect true positives?

Thanks for your input!

variant single-cell SNP • 3.9k views

updated 7.2 years ago by oneillkza ▴ 110 • written 7.5 years ago by vinvan ▴ 50

score 2 · Answer 1 · 2018-05-09

I have to do something similar fairly soon, and have been considering a variant caller designed for single-cell data, most likely Monovar: https://www.nature.com/articles/nmeth.3835

Also, since this is cancer and the commonly occurring driver genes (and mutations) are well known, I would feed the variant callers only the data from those genes/regions rather than the whole genome. This should greatly reduce the number of false-positive calls you have to deal with. You should then be able to estimate a clonal frequency for each variant.
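To make the clonal frequency idea concrete, here is a minimal Python sketch. It assumes the per-cell calls have already been reduced to a dictionary of 1 (alt allele seen), 0 (reference only) or None (no coverage) per variant; that layout, the variant labels and the function name `clonal_frequencies` are all made up for illustration and are not the output format of any particular caller. Cells without coverage at a site are left out of the denominator, since with low coverage a missing call is not evidence of absence.

```python
from collections import defaultdict

def clonal_frequencies(calls, min_cells=2):
    """Estimate the fraction of covered cells that carry each variant.

    calls: {cell_id: {variant_id: 1 (alt seen), 0 (ref only), None (no coverage)}}
    min_cells: a variant is only reported if at least this many cells carry it
               (hypothetical threshold, to be tuned on real data).
    """
    carriers = defaultdict(int)  # cells in which the variant was called
    covered = defaultdict(int)   # cells with any usable call at the site
    for cell_calls in calls.values():
        for variant, genotype in cell_calls.items():
            if genotype is None:
                continue  # no coverage in this cell: skip, do not count as absent
            covered[variant] += 1
            carriers[variant] += genotype
    return {
        variant: carriers[variant] / covered[variant]
        for variant in covered
        if carriers[variant] >= min_cells
    }

# Toy input with hypothetical variant labels
toy_calls = {
    "cell1": {"geneA:var1": 1, "geneB:var1": 0},
    "cell2": {"geneA:var1": 1, "geneB:var1": None},
    "cell3": {"geneA:var1": 0, "geneB:var1": 1},
    "cell4": {"geneA:var1": None, "geneB:var1": 0},
}
print(clonal_frequencies(toy_calls))
```

Running this separately on the pre-treatment and relapse cells would give one frequency per variant per timepoint, which is roughly what the original question is after. The `min_cells` cutoff reflects the intuition that a variant seen in several cells is less likely to be an amplification artifact.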

I suspect merging all the BAM files (assuming they're from the same patient) and running a traditional variant caller would also be a good cross-check. As you say, having the RNA-seq will help a lot with cross-checking variants (at least for those located within exons).
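The RNA cross-check could look something like the sketch below, assuming the DNA and RNA calls from the same cells have already been reduced to sets of (cell, variant) pairs. The data layout and the function name `cross_check` are hypothetical. Keep in mind that a DNA variant missing from the RNA is not necessarily a false positive: the gene may not be expressed in that cell, or only one allele may have been captured.

```python
def cross_check(dna_calls, rna_calls):
    """Split DNA calls by whether RNA from the same cell supports them.

    dna_calls, rna_calls: sets of (cell_id, variant_id) tuples.
    Returns (supported, dna_only).
    """
    supported = dna_calls & rna_calls  # called in both DNA and RNA of the same cell
    dna_only = dna_calls - rna_calls   # called in DNA only
    return supported, dna_only

# Toy input with hypothetical cell and variant labels
dna = {("cell1", "geneA:var1"), ("cell2", "geneA:var1"), ("cell2", "geneB:var1")}
rna = {("cell1", "geneA:var1"), ("cell3", "geneA:var1")}

supported, dna_only = cross_check(dna, rna)
print(sorted(supported))
print(sorted(dna_only))
```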

However, the advantage of the single-cell caller is that you also get something akin to phasing information, because you know which variants co-occur in which cells. This will be somewhat hampered by the low coverage, but it should at least give you some indication.
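As a rough sketch of what I mean by co-occurrence, the snippet below counts, for every pair of variants, the number of cells in which both were called. The input layout (cell ID mapped to the set of variants called in that cell) and the function name `cooccurrence` are assumptions for illustration only. With low coverage, a pair that never co-occurs may simply reflect dropout, so low counts should be read cautiously.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(calls):
    """Count the cells in which each pair of variants was called together.

    calls: {cell_id: set of variant_ids called in that cell}
    Returns a Counter keyed by (variant_a, variant_b), with names in sorted order.
    """
    pairs = Counter()
    for variants in calls.values():
        # sorted() makes (a, b) and (b, a) land on the same key
        for a, b in combinations(sorted(variants), 2):
            pairs[(a, b)] += 1
    return pairs

# Toy input with hypothetical labels
toy = {
    "cell1": {"var1", "var2"},
    "cell2": {"var1", "var2", "var3"},
    "cell3": {"var3"},
}
for pair, n_cells in sorted(cooccurrence(toy).items()):
    print(pair, n_cells)
```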

For that matter, I'd really like to see someone work on making a tool to stitch together this kind of clonal information from single-cell variant data. I don't believe such a thing exists, but it would be pretty useful!