As described in the title, I would like to score or validate some phylogenetic trees we created. To explain where we currently stand, here is some background:
We are trying to analyse heterogeneity in samples of non-metastasised colon cancer. We have 5 patients with 6 samples per patient (5 cancer samples, 1 blood sample). Sequencing, variant calling and annotation were done by an external institute using GATK and snpEff. We received .bam, .vcf and .tsv files for every sample; the .tsv files contain pretty much the same information as the .vcf files.
We decided to go with the .tsv files and have done the following steps so far:
- Filter based on read quality: FILTER="PASS" (bash script)
- Remove common mutations: db_snp.COMMON !=1 (bash script)
- Create a table showing which file has which mutation (bash scripts)
  - Combine the following columns into an ID string for each mutation: CHROM POS ALT
  - Create the table/csv: rows=fileIDs; columns=mutationIDs
The entries in this table are binary (the file either has the mutation or not). The table looks like this:

                   mutationID1  mutationID2  mutationID3
    patient1file1  1            0            1
    patient1file2  1            0            1
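
For reference, the filter and table steps above can be sketched roughly as follows. This is not our exact script, only a minimal illustration under assumptions: each .tsv is tab-separated with a single header row naming the columns CHROM, POS, ALT, FILTER and db_snp.COMMON, the "_" separator in the ID string is an arbitrary choice, and the function names are made up for this sketch.

```shell
#!/bin/sh
# Sketch only. ASSUMPTIONS (not confirmed by the real files): each .tsv is
# tab-separated with one header row naming the columns CHROM, POS, ALT,
# FILTER and db_snp.COMMON; "_" as ID separator is an arbitrary choice.
set -eu

# Print the unique "CHROM_POS_ALT" IDs of one file that survive both filters.
mutation_ids() {
  awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # header name -> column number
    $(col["FILTER"]) == "PASS" && $(col["db_snp.COMMON"]) != 1 {
      print $(col["CHROM"]) "_" $(col["POS"]) "_" $(col["ALT"])
    }' "$1" | sort -u
}

# Print the binary table as CSV: one row per input file, one column per ID.
build_matrix() {
  bm_tmp=$(mktemp -d)
  for bm_file in "$@"; do
    mutation_ids "$bm_file" > "$bm_tmp/$(basename "$bm_file" .tsv).ids"
  done
  sort -u "$bm_tmp"/*.ids > "$bm_tmp/all"          # every mutation ID seen in any file
  awk 'BEGIN { printf "fileID" } { printf ",%s", $1 } END { print "" }' "$bm_tmp/all"
  for bm_file in "$@"; do
    bm_name=$(basename "$bm_file" .tsv)
    awk -v name="$bm_name" '
      FILENAME == ARGV[1] { have[$1] = 1; next }   # IDs present in this file
      { row = row "," (($1 in have) ? 1 : 0) }     # 1 if present, else 0
      END { print name row }' "$bm_tmp/$bm_name.ids" "$bm_tmp/all"
  done
  rm -rf "$bm_tmp"
}
```

With file names such as patient1file1.tsv (hypothetical), it would be called as build_matrix patient*.tsv > mutations.csv. Note that a multi-allelic ALT value containing a comma would need a different output separator.
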
- Create phylogenetic trees
  - We loaded this .csv file into R and created several phylogenetic trees, some with all samples of all patients, some with all samples of one patient. In short, the R code is something like hclust(dist(dataframe, method="euclidean"), method="average")
Now we would like to score these trees so that we can experiment with our filter steps and tree-building methods and compare the results. Do you have any ideas on how to score such trees?
If any further information is needed, I'm happy to provide it. I am a student, and this is my first post here as well as my first time working with NGS data, so please bear with me.