Hi all!
I'm attempting a phylogenetic analysis by first sketching my samples with Mash. I'm sure I'm doing something (probably many things) wrong, because I get some odd results. My approach is as follows:
1) I concatenate my forward and reverse reads:
for i in A B C D
do
    cat "${i}"_R1.fastq.gz "${i}"_R2.fastq.gz > "${i}"_cat.fastq.gz
done
2) Then I sketch each concatenated read file:
for i in A B C D
do
    mash sketch -m 2 "${i}"_cat.fastq.gz
done
3) Finally, I calculate the distances:
mash dist *_cat.fastq.gz.msh
This is where my first doubt comes in, since my output looks like this:
A B 0.0593303 0 168/1000
A C 0.0621044 0 157/1000
A D 0.0677629 0 137/1000
There is no A/A comparison (which makes sense, I know), but I also don't see any comparisons among B, C and D (for example B/C or C/D).
4) I intended to use this matrix to generate a dendrogram. Will running hclust in R do the trick?
Thanks a lot!
EDIT: I figured out how to get what I was looking for in step 3. I kept steps 1 and 2 and then merged the sketches with:
mash paste merged_sketches *_cat.fastq.gz.msh
My next step was to compute the distances using the merged sketch as both the query and the reference:
mash dist -t merged_sketches.msh merged_sketches.msh > distances.txt
I'm still struggling to find the correct way to generate a dendrogram. Do you have any suggestions?
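
One possible route, as a rough sketch rather than a definitive answer: read the square table written by `mash dist -t` and cluster it with average linkage (UPGMA), writing the tree out as Newick. The assumptions here are that the first line of distances.txt is a tab-separated header whose first cell is a label (e.g. `#query`) followed by the sample names, that each later row starts with a sample name, and that the matrix is symmetric. The function names `read_mash_table` and `upgma_newick` are made up for this example, and the toy table uses invented samples x, y, z, not my real data.

```python
import io


def read_mash_table(handle):
    """Parse a square distance table: header row of names, then one row per sample."""
    header = handle.readline().rstrip("\n").split("\t")
    names = header[1:]  # assumption: the first header cell is only a label
    dist = {}
    for line in handle:
        if not line.strip():
            continue
        cells = line.rstrip("\n").split("\t")
        row = cells[0]
        for col, value in zip(names, cells[1:]):
            dist[(row, col)] = float(value)
    return names, dist


def upgma_newick(names, dist):
    """Average-linkage (UPGMA) clustering of a symmetric matrix; returns Newick text."""
    # cluster key -> (newick text, number of leaves, height of the cluster root)
    clusters = {n: (n, 1, 0.0) for n in names}
    d = {frozenset((a, b)): dist[(a, b)] for a in names for b in names if a != b}
    while len(clusters) > 1:
        pair = min(d, key=d.get)  # closest pair of clusters
        a, b = sorted(pair)
        height = d[pair] / 2
        (ta, na, ha) = clusters.pop(a)
        (tb, nb, hb) = clusters.pop(b)
        new = f"({ta}:{height - ha:.6f},{tb}:{height - hb:.6f})"
        for c in clusters:
            # size-weighted mean of the distances from the two merged clusters
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            d[frozenset((new, c))] = (na * dac + nb * dbc) / (na + nb)
        del d[pair]
        clusters[new] = (new, na + nb, height)
    ((tree, _, _),) = clusters.values()
    return tree + ";"


# Toy table with invented values, standing in for distances.txt
toy = io.StringIO(
    "#query\tx\ty\tz\n"
    "x\t0\t0.1\t0.4\n"
    "y\t0.1\t0\t0.4\n"
    "z\t0.4\t0.4\t0\n"
)
names, dist = read_mash_table(toy)
tree = upgma_newick(names, dist)
print(tree)
```

For the real data, `open("distances.txt")` would take the place of the toy table. The Newick string can be saved to a file and opened in a tree viewer, or read into R with `ape::read.tree`. As for hclust: it expects a `dist` object, so the table would have to be read into a matrix and passed through `as.dist()` first.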