Description
Hi, I'm the author of the issue: Is the average ranking meaningful since each algorithm is tested on a different number of datasets?
First, thanks for the reply, and sorry for not mentioning that the question concerns the paper.
I'm now trying to reproduce the Table 5 results from the paper, using the metadataset_clean and metafeature_clean results downloaded from Google Drive and the provided scripts 1-aggregate-results and 2-performance-rankings.
Since Table 5 focuses on only the 36 Tabular Benchmark Suite datasets, I subset agg_df_with_default and agg_df to the datasets listed in /scripts/HARD_DATASETS_BENCHMARK.sh (roughly as in the sketch below) before calculating ranks and saving the results.
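A minimal sketch of that subsetting step, assuming the aggregated dataframes have a dataset_name column and that the dataset names were copied out of the shell script by hand (the two entries shown are placeholders, not the real names):

```python
import pandas as pd

# The 36 "hard" dataset names from /scripts/HARD_DATASETS_BENCHMARK.sh,
# copied into a Python list (only two placeholders shown here).
hard_datasets = [
    "openml__placeholder_dataset_a__1",  # replace with the exact names from the script
    "openml__placeholder_dataset_b__2",
]

def subset_to_hard(df: pd.DataFrame, dataset_col: str = "dataset_name") -> pd.DataFrame:
    """Keep only the rows whose dataset is in the hard-datasets benchmark."""
    return df[df[dataset_col].isin(hard_datasets)].copy()

# agg_df and agg_df_with_default come from the 1-aggregate-results step:
# agg_df = subset_to_hard(agg_df)
# agg_df_with_default = subset_to_hard(agg_df_with_default)
```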
I added a column called dataset_count to see how many datasets were used when computing each algorithm's statistics across all results; below is what I got. Some of the numbers match the paper and some do not. More importantly, CatBoost, SAINT, and NODE have exactly the same time/1000 inst. and nearly the same log-loss mean and log-loss std as the paper, yet the results for these three algorithms seem to be computed over different numbers of datasets.
I'm wondering whether I am using the code incorrectly. Could you give some advice on how to fully reproduce the Table 5 results? Thank you!
================================================================================
I first added a dataset_count column and modified the get_rank_table function to compute the total dataset_count by adding a simple line:
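The exact line from that change isn't shown above; the following is a minimal sketch of the kind of modification, assuming get_rank_table groups the aggregated results by an alg_name column and that dataset_name identifies datasets (both column names are assumptions, not necessarily the repo's actual identifiers):

```python
import pandas as pd

def get_rank_table(agg_df: pd.DataFrame) -> pd.DataFrame:
    """Per-algorithm summary table; column names here are illustrative."""
    grouped = agg_df.groupby("alg_name")
    # Mean/std of the per-dataset rank (hypothetical metric column name).
    rank_table = grouped["rank"].agg(["mean", "std"])
    # Added line: count the distinct datasets contributing to each algorithm's statistics.
    rank_table["dataset_count"] = grouped["dataset_name"].nunique()
    return rank_table
```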