Question: How to score/validate a phylogenetic tree built upon whole-exon mutation data of different cancer samples?
6.1 years ago, ronald.findling wrote:

As described in the title, I would like to score or validate some phylogenetic trees we created. To clarify where we currently stand, here is some background:

We are trying to analyse heterogeneity in samples of non-metastasised colon cancer. We have 5 patients with 6 samples per patient (5 cancer samples, 1 blood sample). Sequencing, variant calling and annotation were done by an external institute using GATK and snpEff. We received .bam, .vcf and .tsv files for every sample, where the .tsv files contain pretty much the same information as the .vcf files.


We decided to go with the .tsv files and have carried out the following steps so far:

  1. Filter
    1. Filter based on read quality: FILTER="PASS" (bash script)
    2. Remove common mutations: db_snp.COMMON != 1 (bash script)
  2. Create a table showing which file has which mutation (bash scripts)
    1. Combine the following columns into an ID string for each mutation: CHROM POS ALT
    2. Create the table/csv: rows = file IDs; columns = mutation IDs.
      The entries in this table are binary (the file has the mutation or not). The table looks like this:
                      mutationID1 mutationID2 mutationID3
        patient1file1 1           0           1
        patient1file2 1           0           1
  3. Create phylogenetic trees
    1. We loaded this .csv file into R and created several phylogenetic trees, some with all samples of all patients, some with all samples of one patient. In short, the R code is something like hclust(dist(dataframe, method="euclidean"), method="average")
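
For readers who want to reproduce step 3, the tree-building call above can be sketched in R roughly as follows. This is only an illustration: the toy matrix `mut`, the sample names and the mutation IDs are made up, and in practice the table would presumably be loaded with something like `read.csv(..., row.names = 1)`.

```r
# Toy binary presence/absence table (hypothetical sample and mutation IDs):
# rows = files/samples, columns = mutation IDs built from CHROM, POS and ALT.
mut <- matrix(
  c(1, 0, 1, 1,
    1, 0, 1, 0,
    0, 1, 1, 0,
    0, 1, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("patient1file1", "patient1file2", "patient1file3", "patient1file4"),
    c("chr1_1000_A", "chr2_2000_T", "chr3_3000_G", "chr4_4000_C")
  )
)

# Average-linkage hierarchical clustering on Euclidean distances,
# mirroring the one-liner quoted in the post.
d  <- dist(mut, method = "euclidean")
hc <- hclust(d, method = "average")

# Leaf order of the resulting dendrogram; plot(hc) would draw it.
print(hc$labels[hc$order])
```

The result is an `hclust` object, i.e. a dendrogram of overall similarity between the samples.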

Now we would like to score these trees to experiment a bit with our filter steps and tree creation methods. Do you have any ideas how to score such trees?

If any further information is needed, I'm happy to provide it. I am a student and this is my first post here as well as my first time working with NGS data, so please bear with me.

6.1 years ago, Brice Sarver (United States) wrote:

1. Your tree is not phylogenetic; it's a dendrogram that merely represents clustering.

2. What do you mean by 'scoring' trees? In true phylogenetics, you often compare trees estimated under different models of nucleotide or amino acid sequence evolution or under different statistical approaches. By passing different distance measures to dist() (or different linkage methods to hclust()), you'll get different groupings, with the caveat that the distances are estimated in different ways.
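
To make the point about different groupings concrete, here is a minimal sketch that clusters one toy matrix under several distance/linkage combinations and reports the cophenetic correlation for each. The cophenetic correlation is not something proposed in this thread; it is used here as an assumption, and it only measures how faithfully a dendrogram preserves the input distances, not whether the tree is evolutionarily correct. The matrix `mut` and the helper `cophenetic_score` are hypothetical.

```r
# Toy presence/absence matrix; sample and mutation names are hypothetical.
mut <- rbind(
  sample1 = c(1, 1, 1, 0, 0, 0),
  sample2 = c(1, 1, 0, 0, 0, 0),
  sample3 = c(0, 0, 1, 1, 1, 0),
  sample4 = c(0, 0, 0, 1, 1, 1),
  sample5 = c(1, 0, 0, 0, 0, 1)
)
colnames(mut) <- paste0("mutationID", 1:6)

# Correlation between the input distances and the distances implied by the
# dendrogram (cophenetic distances). Illustrative helper, not a library call.
cophenetic_score <- function(mut, dist_method, linkage = "average") {
  d  <- dist(mut, method = dist_method)
  hc <- hclust(d, method = linkage)
  cor(as.vector(d), as.vector(cophenetic(hc)))
}

# Try every combination of two distance measures and two linkage methods.
settings <- expand.grid(
  dist_method = c("euclidean", "binary"),
  linkage     = c("average", "complete"),
  stringsAsFactors = FALSE
)
settings$score <- mapply(
  function(dm, lk) cophenetic_score(mut, dm, lk),
  settings$dist_method, settings$linkage,
  USE.NAMES = FALSE
)
print(settings)
```

Because each distance measure defines similarity differently, these scores are best read within one distance measure (for example to compare linkage methods) rather than as a ranking across measures.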

If you want to truly estimate a phylogenetic tree (i.e., a tree that describes the evolutionary pattern of ancestry), you would be best off correcting genetic distances or estimating the tree under an explicit model of evolution. If you have variants in a VCF, a first-pass method to look at would be RAxML: a maximum-likelihood approach (with heuristic tree search) that handles large datasets well and runs in parallel.
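
As a rough sketch of how a binary mutation table might be handed to RAxML, the R snippet below writes a presence/absence matrix as a relaxed PHYLIP alignment of 0/1 characters. The helper `write_phylip`, the toy matrix and the file name are assumptions for illustration, and the command in the final comment (including the BINGAMMA binary model) should be checked against the RAxML manual before use.

```r
# Toy presence/absence matrix; names are hypothetical.
mut <- rbind(
  patient1file1 = c(1, 0, 1, 1),
  patient1file2 = c(1, 0, 1, 0),
  patient1file3 = c(0, 1, 1, 0)
)

# Write a relaxed PHYLIP file: a header line "ntaxa nsites", then one line
# per sample with its name and its 0/1 string. Illustrative helper.
write_phylip <- function(mut, path) {
  stopifnot(all(mut %in% c(0, 1)))
  seqs  <- apply(mut, 1, paste, collapse = "")
  lines <- c(
    paste(nrow(mut), ncol(mut)),
    paste(rownames(mut), seqs)
  )
  writeLines(lines, path)
  invisible(lines)
}

phy_file <- file.path(tempdir(), "mutations.phy")
phy_lines <- write_phylip(mut, phy_file)
cat(phy_lines, sep = "\n")

# A possible invocation (assumption, verify flags in the RAxML manual):
#   raxmlHPC -s mutations.phy -n run1 -m BINGAMMA -p 12345
```

Sample names must not contain whitespace in this format, since the first whitespace separates the name from the sequence.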

Hope this helps.


To my understanding, if the molecular clock assumption is fulfilled, a dendrogram is a simple phylogenetic tree. Is this incorrect? I'm aware that the molecular clock assumption might not always be fulfilled in cancer; this was just the best I could come up with at the moment.

I want to try different filter methods and parameters, as well as different hierarchical clustering algorithms. To compare the results, a way to estimate how likely each individual tree is to be correct, or how well it represents the dataset, would come in handy.
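
One hedged way to approach this, not suggested elsewhere in this thread, is a bootstrap over mutations: resample the columns of the table with replacement, rebuild the dendrogram, and count how often each cluster of the original tree reappears. The sketch below uses base R only; the helpers `clades` and `bootstrap_support` and the toy matrix are hypothetical. The support values describe how stable a clustering is under resampling, not whether it is the true evolutionary history.

```r
# Toy presence/absence matrix; sample and mutation names are hypothetical.
mut <- rbind(
  sample1 = c(1, 1, 1, 0, 0, 0),
  sample2 = c(1, 1, 0, 0, 0, 0),
  sample3 = c(0, 0, 1, 1, 1, 0),
  sample4 = c(0, 0, 0, 1, 1, 1),
  sample5 = c(1, 0, 0, 0, 0, 1)
)

# List the clusters (sets of samples) created at each merge of an hclust tree,
# each encoded as a sorted, "|"-separated string of sample names.
clades <- function(hc) {
  n <- nrow(hc$merge)
  members <- vector("list", n)
  for (i in seq_len(n)) {
    parts <- lapply(hc$merge[i, ], function(k) {
      # Negative entries are single samples, positive ones earlier merges.
      if (k < 0) hc$labels[-k] else members[[k]]
    })
    members[[i]] <- sort(unlist(parts))
  }
  vapply(members, paste, character(1), collapse = "|")
}

# Fraction of bootstrap replicates in which each original cluster reappears.
bootstrap_support <- function(mut, n_boot = 200, seed = 1) {
  set.seed(seed)
  build <- function(m) hclust(dist(m, method = "euclidean"), method = "average")
  ref  <- clades(build(mut))
  hits <- integer(length(ref))
  for (b in seq_len(n_boot)) {
    cols <- sample(ncol(mut), replace = TRUE)  # resample mutations
    hits <- hits + (ref %in% clades(build(mut[, cols, drop = FALSE])))
  }
  setNames(hits / n_boot, ref)
}

support <- bootstrap_support(mut, n_boot = 100)
print(support)
```

The last entry is the cluster containing all samples, so its support is trivially 1; the informative values are those of the internal clusters. The pvclust package offers a more elaborate version of this idea for hierarchical clustering.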

Thanks for the tip about RAxML. I will try it out and let you know.

written 6.1 years ago by ronald.findling

Clustering based on overall similarity says nothing about the evolutionary relationships among the taxa of interest, only how similar they are. Such methods were popular in one subfield of systematics that started from the assumption that we can't know anything about the 'actual' evolutionary history of organisms; the resulting trees were called phenograms. A dendrogram simply represents a tree-like branching structure sensu lato.

written 6.1 years ago by Brice Sarver