Question: dNdScv error: Zero coding substitutions found in this dataset
7 days ago by
F3.3k wrote:


I have a set of whole genome sequencing data and want to find driver mutations, so I am trying to use the dndscv tool. However, I keep getting this error:

> dndsout = dndscv(mutation)
[1] Loading the environment...
[2] Annotating the mutations...
Error in dndscv(r) : 
  Zero coding substitutions found in this dataset. Unable to run dndscv.
In addition: Warning messages:
1: In dndscv(r) :
  Mutations observed in contiguous sites within a sample. Please annotate or remove dinucleotide or complex substitutions for best results.
2: In dndscv(r) :
  Same mutations observed in different sampleIDs. Please verify that these are independent events and remove duplicates otherwise.
3: In .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)

This is the header and tail of my data

> head(mutation)
  sampleID  chr      pos ref  mut
1 Sample_1 chr1  7914928   A ATAC
2 Sample_1 chr1 15080062   A   AG
3 Sample_1 chr1 16714609   C  CAT
4 Sample_1 chr1 19078857  TA    T
5 Sample_1 chr1 22621124  TA    T
6 Sample_1 chr1 34556929  TA    T
> tail(mutation)
      sampleID    chr      pos ref mut
85912 Sample_4 hs37d5 33864042   A   G
85913 Sample_4 hs37d5 34253273   C   A
85914 Sample_4 hs37d5 35186366   A   C
85915 Sample_4 hs37d5 35222851   A   G
85916 Sample_4 hs37d5 35232459   A   C
85917 Sample_4 hs37d5 35295511   A   G

Does anyone know how I can solve this error?

wgs dndscv R • 270 views
ADD COMMENTlink modified 7 days ago by Inigo Martincorena100 • written 7 days ago by F3.3k
7 days ago by
Sanger Institute, Cambridge, UK
Inigo Martincorena100 wrote:

Thanks for trying the dNdScv package and the feedback. I suspect that the problem is that your chromosome names are different to those used by dNdScv. Your chromosome names contain the prefix "chr", while the reference object used by dNdScv for GRCh37/hg19 does not use this. As a result, the method does not find any valid coding mutation in your data. Indels are handled by dNdScv, so they should not be a problem.

Try this:

mutation$chr = gsub("chr","",as.vector(mutation$chr))
dndsout = dndscv(mutation)

I will try to improve the error message generated when the chromosome names do not match to avoid this confusion in the future.
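
To confirm that chromosome naming is the cause before rerunning, the names in the mutation table can be compared with the names the reference is expected to use. This is a minimal sketch, not part of dNdScv: `expected_chr` is an assumed list of unprefixed GRCh37/hg19 names (1-22, X, Y), and the toy table only mimics the rows shown in the question.

```r
# Toy table mimicking the rows shown in the question.
mutation <- data.frame(
  sampleID = c("Sample_1", "Sample_1", "Sample_4"),
  chr      = c("chr1", "chr1", "hs37d5"),
  pos      = c(7914928, 15080062, 33864042),
  ref      = c("A", "A", "A"),
  mut      = c("ATAC", "AG", "G"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: names used by the default reference (no "chr" prefix).
expected_chr <- c(as.character(1:22), "X", "Y")

# Names present in the data that the reference would not recognise.
unmatched <- setdiff(unique(mutation$chr), expected_chr)
print(unmatched)

# Fraction of rows whose chromosome name would match as-is.
print(mean(mutation$chr %in% expected_chr))
```

If the matching fraction is zero, no mutation can be annotated, which is consistent with the "no sequence levels in common" warning in the question.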

ADD COMMENTlink written 7 days ago by Inigo Martincorena100

Thank you for taking the time to look at my problem. Without your help I would have spent days searching Google, and I know I would not have found the solution myself.

My data contained not only the chr prefix but also contigs such as hs37d5 and chrM, so following your kind comment I removed them all and the code worked. However, I am not sure whether it is right to ignore these contigs.

ADD REPLYlink modified 7 days ago • written 7 days ago by F3.3k

I think it is fine to dismiss those contigs to study selection and to search for cancer genes. However, if you are very interested in including these contigs in the analysis, you can build your own RefCDS object using the buildref function in the package (see a tutorial below). But I would not worry about this unless you have a particular interest in these contigs.

The mitochondrial genome (chrM) is a special one and cannot be analysed with the rest of the genome. The reasons are that the genetic code is different (which matters when running dN/dS) and that the mitochondrial genome has very different mutation rates and spectrum that need to be modelled separately. For cancer gene discovery it is reasonable to exclude the mitochondrial genome. However, if you are interested in studying selection in the mitochondria, you can again use the buildref function to build an object for it. The mitochondrial genome has an additional challenge in that there is a very strong replication strand asymmetry. So, if you were interested in mitochondria, I would recommend running dNdScv separately for the MT-ND6 gene (see paper below, but note that the dN/dS model that we used in this paper was developed before dNdScv).

But I think that almost all users can rely on the default object from dNdScv and not worry about the additional contigs or about the mitochondrial genome.
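
The clean-up described in this exchange (strip the prefix, then drop the decoy and mitochondrial contigs) can be sketched as follows. The `canonical` vector is an assumption about which names the default reference recognises, and the toy table is illustrative only.

```r
# Toy table with the kinds of contig names mentioned in this thread.
mutation <- data.frame(
  sampleID = paste0("Sample_", c(1, 1, 2, 4, 4)),
  chr      = c("chr1", "chrX", "chrM", "hs37d5", "chr22"),
  pos      = c(7914928, 1500000, 3000, 33864042, 20000000),
  ref      = c("A", "C", "G", "A", "T"),
  mut      = c("G", "T", "A", "G", "C"),
  stringsAsFactors = FALSE
)

# Strip the prefix only at the start of the name.
mutation$chr <- sub("^chr", "", as.vector(mutation$chr))

# ASSUMPTION: chromosomes covered by the default reference.
canonical <- c(as.character(1:22), "X", "Y")

# Keep a record of what is dropped so the choice can be reported.
dropped        <- mutation[!mutation$chr %in% canonical, ]
mutation_clean <- mutation[mutation$chr %in% canonical, ]

print(table(dropped$chr))
print(mutation_clean)
```

Anchoring the pattern with `^` is a small precaution over `gsub("chr", "", ...)`, which would also remove "chr" from the middle of a contig name.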

ADD REPLYlink modified 7 days ago • written 7 days ago by Inigo Martincorena100


To get better results, would you suggest using the original VCF files as input for your software, or should I first filter the VCF files by minor allele frequency and keep only variants with a deleterious impact on the protein?

ADD REPLYlink written 6 days ago by F3.3k

I would need to know more about your project to give you a good answer. It all depends on how clean your table of somatic mutations is. Many false positives and artefacts tend to have low allele frequencies, and filtering them out may improve the results. However, if your dataset is clean, you don't need to do that. You should also never filter your mutations based on their impact to the protein before running dNdScv, because the method expects an unbiased dataset (e.g. dN/dS needs synonymous mutations to model the mutation rates).
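
As a rough illustration of this advice, the sketch below filters on allele frequency only and deliberately keeps every consequence class, synonymous mutations included. The `vaf` and `impact` columns and the 0.05 threshold are hypothetical; they are not dNdScv inputs, and a suitable threshold depends on the data.

```r
# Toy call set; vaf and impact are HYPOTHETICAL annotation columns.
calls <- data.frame(
  sampleID = c("Sample_1", "Sample_1", "Sample_2", "Sample_2"),
  chr      = c("1", "1", "17", "17"),
  pos      = c(7914928, 15080062, 7577120, 7578406),
  ref      = c("A", "C", "C", "C"),
  mut      = c("G", "T", "T", "T"),
  vaf      = c(0.45, 0.03, 0.30, 0.02),
  impact   = c("Missense", "Synonymous", "Synonymous", "Nonsense"),
  stringsAsFactors = FALSE
)

min_vaf <- 0.05  # HYPOTHETICAL threshold for removing likely artefacts

# Filter on allele frequency only.
kept <- calls[calls$vaf >= min_vaf, ]

# Do NOT subset on impact: synonymous mutations must stay in the table.
mutation <- kept[, c("sampleID", "chr", "pos", "ref", "mut")]
print(mutation)
```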

ADD REPLYlink written 6 days ago by Inigo Martincorena100

Thanks a lot, this is very much appreciated, as I was struggling especially with the question of filtering by protein impact.

My project is about response versus non-response to chemotherapy in oesophageal adenocarcinoma. We have whole genome sequencing (tumour vs normal tissue) of patients who responded to chemotherapy and of matched non-responders. Response was assessed by the Mandard tumour regression grade (TRG). However, as different patients carry different and heterogeneous tumours, I am not sure whether finding the driver genes could help to answer what differs between the genomes of responders and non-responders.

The authors of "The landscape of selection in 551 esophageal adenocarcinomas defines genomic biomarkers for the clinic" strongly recommended dNdScv to me as their best-performing tool, so I hope I can find a signature of response and non-response to chemotherapy by finding significant driver genes. Please let me know if I am on the wrong path to answer my question.

ADD REPLYlink written 6 days ago by F3.3k

You can certainly try dNdScv, running it separately on each set of patients and on both sets together. This will give you lists of significant genes and some measure of selection. However, dNdScv is not designed to look for differences between two cohorts, so you may want to consider alternative approaches (e.g. statistically compare the fraction of patients with a driver mutation in a given gene in both sets of patients). I would also advise you to contact the authors of that paper for advice, since they will be more familiar with the type of cancer that you are analysing.
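
One way to make the suggested cohort comparison concrete is a Fisher's exact test on a 2x2 table of patients with and without a driver mutation in a given gene. This is only a sketch of that idea, not a dNdScv feature, and the counts below are made up for illustration.

```r
# HYPOTHETICAL counts of patients with and without a driver mutation
# in one gene, split by response to chemotherapy.
tab <- matrix(
  c( 6, 14,   # responders:     mutated, wild type
    14,  6),  # non-responders: mutated, wild type
  nrow = 2, byrow = TRUE,
  dimnames = list(
    cohort      = c("responder", "non_responder"),
    gene_status = c("mutated", "wild_type")
  )
)

# Fisher's exact test for a difference in mutated fraction.
res <- fisher.test(tab)
print(tab)
print(res$p.value)
```

If the test is repeated over several genes, the resulting p-values would need multiple-testing correction, for example with `p.adjust`.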

ADD REPLYlink written 6 days ago by Inigo Martincorena100


I think I have now analyzed my data with dNdScv successfully. However, I am not sure how I can obtain three outputs:

1- The percentage of driver mutations in each gene and in the different mutation classes. I am not sure which table gives me these. For example:

> print(head(sel_cv), digits = 3)
      gene_name n_syn n_mis n_non n_spl wmis_cv wnon_cv wspl_cv
18057      TP53     0    14     2     1   247.5     622     622
3794      CHST2     0     0     2     0     0.0    1006    1006
16660     SPTA1     0     9     0     0    20.3       0       0
19410   ZDHHC17     0     0     0     2     0.0     268     268
9327   KRTAP2-1     0     0     1     0     0.0   11304   11304
19877     ZNF66     0     3     0     0    36.8       0       0
       pmis_cv ptrunc_cv pallsubs_cv qmis_cv qtrunc_cv qallsubs_cv
18057 0.00e+00  5.04e-08    0.00e+00  0.0000   0.00101      0.0000
3794  7.32e-01  2.40e-06    1.32e-05  0.9353   0.02407      0.0905
16660 2.75e-06  7.46e-01    1.35e-05  0.0276   0.99405      0.0905
19410 7.26e-01  3.78e-05    1.83e-04  0.9353   0.25348      0.9211
9327  9.51e-01  5.48e-05    2.92e-04  0.9616   0.27548      0.9999
19877 2.96e-04  9.18e-01    1.41e-03  0.9353   0.99405      0.9999
> print(dndsout$globaldnds)
     name       mle     cilow   cihigh
wmis wmis 0.9989565 0.9198342 1.084885
wnon wnon 1.0253553 0.8406774 1.250603
wspl wspl 1.1058024 0.8414807 1.453152
wtru wtru 1.0509338 0.8894579 1.241725
wall wall 1.0025578 0.9241182 1.087655

If I am not wrong, the mutation class for TP53 is missense?

2- The mean number of exonic driver mutations per case

3- Passenger mutation rates

I am sorry to bother you so much.

ADD REPLYlink written 2 days ago by F3.3k

Hello. The outputs that you are asking for are not standard outputs from dndscv. They can be obtained with some additional calculations based on the outputs of dNdScv and following the methods of my paper in Cell. However, the risk of these measures being misused or misinterpreted by users without a good understanding of dNdScv was so high that I decided not to make them part of the standard output.

Answering your questions specifically:

  1. Confidence intervals for dN/dS values per gene are not available and they would ideally be needed for the question that you ask. I will consider adding them to future versions of dNdScv. In the meantime, note that the columns wmis_cv and wnon_cv are the maximum likelihood estimates (point estimates) of the dN/dS values per gene. For example, for TP53 the dN/dS ratios for missense and truncating (nonsense + essential splice site) mutations are 247.5 and 622, respectively. dN/dS ratios measure the excess of mutations observed over those expected by chance under neutrality. So, you can calculate the excess of missense mutations in TP53 as: (247.5-1)/247.5*14 = 13.94. That is, the number of passenger missense mutations expected in TP53 under neutrality was very low (<<1) and most likely all 14 missense mutations seen in TP53 are drivers. As I said, ideally one would use confidence intervals for the dN/dS values to add uncertainty to these calculations. Also, note that in the absence of confidence intervals, I would only calculate these values for genes with q-value<0.05 (in this case only TP53).

  2. The global estimate for the number of driver mutations per exome can be obtained from dndsout$globaldnds with the code below, where nmuts is the number of mutations of each class (missense and nonsense + essential splice) and nsamples is the number of samples (or exomes) studied. Note, however, that a number of assumptions are used in these calculations, so I would not use any of this unless you are confident that you understand the methodology and the principles of dNdScv and the suitability of your data.

    dnds = dndsout$globaldnds[c(1,4),2:4] # Global dN/dS estimates with CI95% for missense and truncating subs 
    dnds[dnds<1] = 1 # Bounded by 1 (since we are estimating excesses) 
    drivers_per_sample = (dnds-1)/dnds[,1]*nmuts/nsamples

  3. I am not sure what you mean by "passenger mutation rates", but the tutorial contains information that may be useful in this direction.
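
The per-gene arithmetic from point 1 can be written as a small helper. This is a sketch of the point-estimate calculation only, with no uncertainty attached. `excess_mutations` is a hypothetical name, not a dNdScv function, and bounding the ratio at 1 is borrowed from the global code in point 2.

```r
# Excess of observed mutations over the neutral expectation,
# given a dN/dS point estimate w and the observed count n_obs.
# excess_mutations is a HYPOTHETICAL helper, not part of dNdScv.
excess_mutations <- function(w, n_obs) {
  w <- pmax(w, 1)        # bounded by 1, since an excess is being estimated
  (w - 1) / w * n_obs
}

# TP53 missense values from the sel_cv table in the question:
# wmis_cv = 247.5, n_mis = 14.
tp53_mis <- excess_mutations(w = 247.5, n_obs = 14)
print(round(tp53_mis, 2))  # 13.94

# A ratio below 1 gives no excess.
print(excess_mutations(w = 0.5, n_obs = 10))  # 0
```

As stated in the answer, this is only meaningful for genes with q-value<0.05.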

If you are trying to recapitulate the analyses of a given paper, like the one you mentioned before on esophageal adenocarcinomas, I would advise you to contact the authors directly for advice (and code) to reproduce their analyses.

I hope it helps. -Inigo

ADD REPLYlink modified 1 day ago by RamRS20k • written 1 day ago by Inigo Martincorena100

Thank you very much for being so patient and helpful. Actually, I have my own independent data with a different question from the paper I mentioned initially. The authors just suggested your software as their best-performing tool for finding genes under positive selection. I will need a lot of time to grasp what you kindly described here. Sorry, but one more question for now:

Using the q-value, only a few genes remain (one or two genes, including TP53). Could I instead use the p-value to report a list of possible driver genes?

Thank you again

ADD REPLYlink modified 1 day ago • written 1 day ago by F3.3k

Not really. When calculating p-values for 20,000 genes, you expect many (~1,000) to have p-values<0.05 by chance. q-values are corrected for multiple testing to avoid this, so q-values should always be used. You can use a less conservative q-value cutoff (e.g. 0.10), but I would be careful as you can start to include false positives in your list of significant genes.
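
The ~1,000 figure is simply 20,000 x 0.05. A small simulation makes the point, assuming every gene is neutral so that p-values are uniform. The simulated p-values are not dNdScv output, and Benjamini-Hochberg is used here only as one common FDR method.

```r
set.seed(1)

n_genes <- 20000
# ASSUMPTION: all genes are neutral, so p-values are uniform on [0, 1].
p <- runif(n_genes)

# Genes "significant" at p < 0.05 purely by chance (about 1,000 expected).
n_p <- sum(p < 0.05)

# The same genes after multiple-testing correction.
q   <- p.adjust(p, method = "BH")
n_q <- sum(q < 0.05)

print(c(raw_p = n_p, bh_q = n_q))
```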

There is a better way of increasing power to detect significance in known cancer genes, and that is running dNdScv on a list of known cancer genes. Or else, perform FDR correction (p.adjust function) on a list of cancer genes using the output from dNdScv (which is a better way of doing it as you use all genes to run dNdScv). In any case, this will only increase the power on already known cancer genes and needs to be reported as Restricted Hypothesis Testing (see example)

data("cancergenes_cgc81", package="dndscv") # Genes in the Cancer Gene Census (v81)
dndsout = dndscv(mutations, gene_list=known_cancergenes)
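
The second option mentioned above, FDR correction restricted to a gene list after running dNdScv on all genes, might look like the sketch below. The toy `sel_cv` reuses the gene names and pallsubs_cv values printed earlier in this thread. `genes_of_interest` is a hypothetical stand-in for known_cancergenes, and its members were chosen only for illustration.

```r
# Toy version of dndsout$sel_cv, using values printed earlier in the thread.
sel_cv <- data.frame(
  gene_name   = c("TP53", "CHST2", "SPTA1", "ZDHHC17", "KRTAP2-1", "ZNF66"),
  pallsubs_cv = c(0, 1.32e-05, 1.35e-05, 1.83e-04, 2.92e-04, 1.41e-03),
  stringsAsFactors = FALSE
)

# HYPOTHETICAL stand-in for the known_cancergenes vector.
genes_of_interest <- c("TP53", "SPTA1")

# Restrict to the gene list, then correct only over those genes.
restricted <- sel_cv[sel_cv$gene_name %in% genes_of_interest, ]
restricted$q_restricted <- p.adjust(restricted$pallsubs_cv, method = "BH")

print(restricted)
```

In a real analysis the correction would run over every gene in the list that appears in the full sel_cv table, not only the top rows, and the result would be reported as restricted hypothesis testing.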
ADD REPLYlink modified 1 day ago by RamRS20k • written 1 day ago by Inigo Martincorena100
7 days ago by
Houston, TX
RamRS20k wrote:

Zero coding substitutions found

There you go - it’s a clear error message.

ADD COMMENTlink written 7 days ago by RamRS20k

Sorry, could you please help me understand what this message means?

ADD REPLYlink written 7 days ago by F3.3k

Try and answer the following questions yourself:

  • What does the tool do?
  • What sort of mutations does it absolutely need to perform its job?
  • What do you think the error message says?
ADD REPLYlink modified 7 days ago • written 7 days ago by RamRS20k
7 days ago by
zx87546.5k wrote:

This tool only works with SNVs, meaning that in your input data frame the columns ref and mut can only contain the following letters:

c("A","C","G","T") - see #L116

And the error is triggered at this step:

if (!any(snv)) { stop("Zero coding substitutions found in this dataset. Unable to run dndscv.") } - see #L175

ADD COMMENTlink written 7 days ago by zx87546.5k

Thanks, but it says this tool can also handle indels :-(

ADD REPLYlink written 7 days ago by F3.3k

An indel is a Short Nucleotide Variant; you are mixing it up with a SNP. Look at the code he provided and at Ram's answer.

mutations$strand = sapply(RefCDS,function(x) x$strand)[mutations$geneind]
snv = (mutations$ref %in% nt & mutations$mut %in% nt)
if (!any(snv)) { stop("Zero coding substitutions found in this dataset. Unable to run dndscv.") }

The keyword is RefCDS. You have no SNVs in annotated coding regions. Your data is the problem. Check why your data do not overlap with the annotated coding regions.
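
The `snv` test quoted above can be reproduced on the raw table to see whether the input contains single-nucleotide substitutions at all. This is a sketch on a toy table mimicking the head and tail shown in the question. Judging from the quoted code, the package appears to apply the check only to mutations already mapped to RefCDS, so rows on unrecognised contigs would never reach it.

```r
# Toy rows mimicking the head (indels) and tail (SNVs) of the question's data.
mutation <- data.frame(
  chr = c("chr1", "chr1", "chr1", "hs37d5", "hs37d5"),
  ref = c("A", "A", "TA", "A", "C"),
  mut = c("ATAC", "AG", "T", "G", "A"),
  stringsAsFactors = FALSE
)

# Same test as in the quoted source: both alleles are single nucleotides.
nt  <- c("A", "C", "G", "T")
snv <- mutation$ref %in% nt & mutation$mut %in% nt

# How many rows are SNVs, and on which contigs they sit.
print(table(snv))
print(table(mutation$chr[snv]))
```

If SNVs are present but all sit on names the reference does not use, the chromosome naming is the likelier cause of the error, not the indels.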

ADD REPLYlink written 7 days ago by ATpoint13k