Kover - order of metadata
7.5 years ago
hfan22 ▴ 40

Hi Alex,

Should the samples in the metadata be in the same order as in the k-mer matrix header? Suppose I have a dummy dataset of 5 species:

kmer matrix header:

kmers t1 t2 t3 t4 t5

Metadata:

t5 0
t1 1
t3 0
t2 1
t4 0

Will this cause any problem?

7.5 years ago

No, the order of the metadata is not important. The only thing that matters is that the identifiers are the same. Kover will automatically match the data between the k-mer matrix and the metadata based on the identifiers.

If you are interested, this is handled in https://github.com/aldro61/kover/blob/master/core/kover/dataset/create.py (lines 161 to 170).
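To illustrate what matching on identifiers means, here is a minimal sketch using the toy example from the question. It is not Kover's actual code, and the function name `align_labels` is made up for illustration; it simply looks up each label by sample ID in the order of the matrix header.

```python
def align_labels(header_ids, metadata_rows):
    """Return the labels ordered like header_ids, looked up by identifier."""
    labels_by_id = dict(metadata_rows)
    # Fail loudly if a sample in the matrix has no metadata entry.
    missing = [sid for sid in header_ids if sid not in labels_by_id]
    if missing:
        raise ValueError(f"No metadata for: {missing}")
    return [labels_by_id[sid] for sid in header_ids]


# Toy data from the question: matrix header order vs. shuffled metadata.
header = ["t1", "t2", "t3", "t4", "t5"]
metadata = [("t5", 0), ("t1", 1), ("t3", 0), ("t2", 1), ("t4", 0)]

aligned = align_labels(header, metadata)
print(aligned)  # [1, 1, 0, 0, 0]
```

Any permutation of the metadata rows gives the same aligned labels, because the lookup depends only on the identifiers.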

Edit:

After looking at hfan22's data, we concluded that this was normal behaviour and that the order of the metadata is not important.

For computational reasons, Kover reorders the learning examples to group them by class (e.g.: 0 0 0 0 1 1 1 1). When the metadata were randomly shuffled, the order of the examples within a class changed. In other words, the first example with label 0 was not the same after shuffling. Therefore, the order of the examples in the resulting Kover dataset was different, resulting in a different random train/test split and thus, slightly different metrics.
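The effect can be sketched with the toy example from the question. This is only an illustration under stated assumptions, not Kover's implementation: `grouped_order` and `train_positions` are hypothetical helpers, and the split is modelled as a seeded shuffle of positions in the dataset.

```python
import random


def grouped_order(metadata_rows):
    """Group sample IDs by class label, keeping metadata order within a class."""
    # sorted() is stable, so ties on the label keep their input order.
    return [sid for sid, _label in sorted(metadata_rows, key=lambda row: row[1])]


def train_positions(n_examples, train_size, seed):
    """Pick training positions with a seeded shuffle (assumed split scheme)."""
    rng = random.Random(seed)
    positions = list(range(n_examples))
    rng.shuffle(positions)
    n_train = int(round(train_size * n_examples))
    return sorted(positions[:n_train])


# Same samples and labels, two different metadata orderings.
metadata_a = [("t5", 0), ("t1", 1), ("t3", 0), ("t2", 1), ("t4", 0)]
metadata_b = [("t4", 0), ("t2", 1), ("t5", 0), ("t1", 1), ("t3", 0)]

order_a = grouped_order(metadata_a)  # ['t5', 't3', 't4', 't1', 't2']
order_b = grouped_order(metadata_b)  # ['t4', 't5', 't3', 't2', 't1']

# The seed fixes which positions are drawn, not which samples sit there.
positions = train_positions(len(order_a), train_size=0.666, seed=72)
print([order_a[i] for i in positions])
print([order_b[i] for i in positions])
```

With the same seed, the same positions are selected in both cases, but those positions can hold different samples, so the training and testing sets can differ between orderings.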


I thought so. But when I rearrange the order of the metadata (my metadata came in random order, so I reordered them), the results are consistently different: the same ordering always gives the same results, but different orderings give different results. I tried 4-5 versions of the metadata that vary only in order, and they all give different answers. I noticed this because I was running simulations, so I knew which k-mers should be picked up. Kover did very well on a real dataset but failed in all of my simulations. Since the metadata in my real dataset were in order, I thought the failure might somehow be related to the random order of the metadata in my simulations. However, even after reordering them, I still could not recover the k-mers that I defined the phenotype with. Shall I send you my simulated dataset (several MB) to play with?


It would definitely help me look into this if you were able to share your data. Would you be able to upload it to a server (e.g.: https://mega.nz/) and share the link?

Also, can you include the kover commands that you are using to create and split the data? Did you set the random seed parameter in the "kover dataset split" command? If not, varying results are to be expected, since the examples in the training and testing sets are different each time.


https://mega.nz/#F!n053zISY
No key needed.

kover commands used:
kover dataset create from-tsv --genomic-data kmerMatrix.tsv --phenotype-name "rpoBsimulation" --phenotype-metadata metadata.tsv --output temp.kover
kover dataset split --dataset temp.kover --id temp_split --train-size 0.666 --folds 5 --random-seed 72
kover learn --dataset temp.kover --split temp_split --model-type conjunction disjunction --p 0.1 1.0 10.0 --max-rules 5 --hp-choice cv --n-cpu 10

Yes, I did set the random seed and used the same one in all trials.

Please let me know if you have any problems accessing the data.

Thank you.
