I have a data set that lists single-base mutations for a set of samples, but I do not have access to the sequencing data. Each mutation in every sample is listed as a separate line in the file, along with genomic coordinates, reference allele, affected gene and a number of other variables. All samples are from the same species. I want to run some phylogenetic analyses, so I've constructed pseudo-sequences for each sample.
The length of each pseudo-sequence is equal to the number of distinct genomic coordinates in my single-base mutation list. If a sample lacks a particular mutation, the reference base occupies that position of its pseudo-sequence. My pseudo-sequences file also contains a reference sequence, which is composed of the reference base at each genomic coordinate seen in the single-base mutations file.
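To make the construction concrete, here is a minimal Python sketch of what I mean. The record layout `(sample, chrom, pos, ref, alt)` and the function name `build_pseudo_sequences` are assumptions made for illustration, not the actual format of my file, and the toy records are invented.

```python
from collections import defaultdict

def build_pseudo_sequences(mutations):
    """Build one pseudo-sequence per sample from single-base mutation records.

    `mutations` is an iterable of (sample, chrom, pos, ref, alt) tuples.
    This layout is an assumption for illustration only.
    """
    ref_at = {}                   # (chrom, pos) -> reference base
    alt_at = defaultdict(dict)    # sample -> {(chrom, pos): mutant base}
    for sample, chrom, pos, ref, alt in mutations:
        site = (chrom, pos)
        ref_at[site] = ref
        alt_at[sample][site] = alt

    # One column per distinct genomic coordinate, in a fixed order.
    sites = sorted(ref_at)

    # The reference pseudo-sequence holds the reference base at every site.
    seqs = {"reference": "".join(ref_at[s] for s in sites)}

    # A sample carries its mutant base where it has a mutation,
    # and the reference base everywhere else.
    for sample, alts in alt_at.items():
        seqs[sample] = "".join(alts.get(s, ref_at[s]) for s in sites)
    return sites, seqs

# Toy records (invented values, not real data).
example = [
    ("S1", "chr1", 100, "A", "G"),
    ("S1", "chr1", 250, "C", "T"),
    ("S2", "chr1", 250, "C", "T"),
    ("S2", "chr2", 40, "G", "A"),
]
sites, seqs = build_pseudo_sequences(example)
for name, seq in seqs.items():
    print(name, seq)
```

Note that in this sketch a sample with no listed mutations would not appear at all, so such samples would have to be added separately as copies of the reference.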
I want to better understand my options for phylogenetic analysis given that my sequences are not biologically real. I understand that some phylogenetic tree-building methods compare sequence motifs of varying lengths, while others treat each position as completely independent of all others. Furthermore, I know that some methods, such as PHYLIP's DNA Maximum Likelihood program (dnaml), require complete sequences even though they treat each base change as an independent event (in dnaml's case, this has to do with weighing the number of changes against the number of bases that have not changed).
It seems to me that my best options for comparing these pseudo-sequences are distance-based methods like neighbor-joining or Fitch-Margoliash, but as I am relatively new to phylogenetic analysis, I would very much appreciate any input on how I can compare these pseudo-sequences.
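For reference, the kind of input I have in mind for a distance-based method is a table of pairwise difference counts between pseudo-sequences. The sketch below is only an illustration under my own assumptions: the function name `pairwise_differences` and the toy sequences are hypothetical, and it counts raw mismatches rather than applying any substitution model.

```python
from itertools import combinations

def pairwise_differences(seqs):
    """Count mismatching positions for every pair of equal-length sequences.

    `seqs` maps a sequence name to its pseudo-sequence string.
    Returns a dict keyed by (name_a, name_b) with name_a < name_b.
    """
    dist = {}
    for a, b in combinations(sorted(seqs), 2):
        if len(seqs[a]) != len(seqs[b]):
            raise ValueError(f"{a} and {b} differ in length")
        dist[(a, b)] = sum(x != y for x, y in zip(seqs[a], seqs[b]))
    return dist

# Toy pseudo-sequences (invented values, not real data).
toy = {"reference": "ACG", "S1": "GTG", "S2": "ATA"}
d = pairwise_differences(toy)
for pair, count in d.items():
    print(pair, count)
```

I keep raw counts here on purpose: because the pseudo-sequences contain only variable sites, dividing by their length would overstate per-site divergence compared with real, full-length sequences.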