Hi, I intend to run a relatively large-scale genotype-to-phenotype assessment (hundreds of genomes, dozens of phenotypes) and would like to use kover to suggest candidate targets for the basis of each phenotype. My problem is that the dataset covers different species rather than strains of the same species, so the sequences are highly divergent, and I don't think that running kover the 'default' way would work well. To overcome that, I was thinking about using the amino-acid sequences of the proteins identified in each species, but I am not sure whether kover would accept those as valid input to the '--from-contigs' option. Alternatively, would it work if I passed a k-mer matrix built from amino-acid sequences, rather than nucleotides, to '--from-tsv'? Thanks in advance for all your help!
The --from-contigs option won't work for amino-acid sequences. In that case, I would recommend precomputing the k-mer matrix yourself and creating the dataset with --from-tsv. The entries of this matrix can represent the presence/absence of anything, e.g., amino-acid k-mers, point mutations, etc. The only thing to watch out for is that the feature identifiers must all have the same length (these are the values under the "kmers" column described here: http://aldro61.github.io/kover/doc_input_formats.html#k-mer-matrix).
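To make this concrete, here is a minimal sketch of how such a matrix could be precomputed in Python. It is not part of kover: the helper names (`protein_kmers`, `build_kmer_matrix`, `write_kmer_matrix`) and the toy genomes are made up for illustration. It assumes the layout described on the page linked above, i.e. a tab-separated file with a "kmers" column followed by one 0/1 column per genome; please check that page for the exact format before relying on it. Using a single fixed k guarantees that every feature identifier has the same length.

```python
import csv
import io


def protein_kmers(sequences, k):
    """Return the set of all length-k substrings in a list of protein sequences."""
    kmers = set()
    for seq in sequences:
        seq = seq.upper()
        # Sequences shorter than k contribute nothing (the range is empty).
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


def build_kmer_matrix(proteomes, k):
    """Build a presence/absence matrix of amino-acid k-mers.

    proteomes: dict mapping a genome identifier to its list of protein sequences.
    Returns (genome_ids, rows), where each row is [kmer, 0/1, 0/1, ...].
    """
    genome_ids = sorted(proteomes)
    per_genome = {g: protein_kmers(proteomes[g], k) for g in genome_ids}
    all_kmers = sorted(set().union(*per_genome.values()))
    rows = [
        [kmer] + [int(kmer in per_genome[g]) for g in genome_ids]
        for kmer in all_kmers
    ]
    return genome_ids, rows


def write_kmer_matrix(proteomes, k, handle):
    """Write the matrix as TSV; the header layout is an assumption (see above)."""
    genome_ids, rows = build_kmer_matrix(proteomes, k)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["kmers"] + genome_ids)
    writer.writerows(rows)


# Toy example with two hypothetical genomes, one short protein each.
toy = {"genomeA": ["MKVLA"], "genomeB": ["KVLAG"]}
buffer = io.StringIO()
write_kmer_matrix(toy, 3, buffer)
print(buffer.getvalue())
```

For a real dataset you would read the protein sequences from your annotation files and write to a file on disk instead of an in-memory buffer. With hundreds of genomes the set of distinct k-mers can get large, so a sparse or streaming approach may be needed; the sketch above keeps everything in memory for simplicity and does not treat ambiguous residues specially.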
I hope that this helps!