
Reproducing the Table 5 results of the paper #96

@abvesa

Description

Hi, I'm the author of the issue "Is the average ranking meaningful since each algorithm is tested on a different number of datasets?"

First, thanks for the reply, and sorry for not mentioning that the question was about the paper.
I'm now trying to reproduce the Table 5 results in the paper, using the metadataset_clean and metafeature_clean results downloaded from Google Drive and the provided scripts 1-aggregate-results and 2-performance-rankings.

Since Table 5 focuses on only 36 Tabular Benchmark Suite datasets, I subset agg_df_with_default and agg_df to the datasets listed in /scripts/HARD_DATASETS_BENCHMARK.sh before calculating ranks and saving the results.
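For reference, this is roughly how I do the subsetting. It is only a sketch: it assumes the shell script lists the dataset names as double-quoted strings, and that the aggregated frames have a dataset_name column (both are assumptions about the actual schema):

```python
import re
import pandas as pd

def load_benchmark_datasets(path: str = "scripts/HARD_DATASETS_BENCHMARK.sh") -> set:
    """Pull dataset names out of the benchmark shell script.
    Assumes the names appear as double-quoted strings in the script."""
    with open(path) as f:
        text = f.read()
    return set(re.findall(r'"([^"]+)"', text))

def subset_to_benchmark(df: pd.DataFrame, datasets: set,
                        dataset_col: str = "dataset_name") -> pd.DataFrame:
    """Keep only rows for the 36 benchmark datasets.
    dataset_col is an assumed column name; adjust to the real schema."""
    return df[df[dataset_col].isin(datasets)].copy()

# Usage (agg_df / agg_df_with_default come from 1-aggregate-results):
# hard = load_benchmark_datasets()
# agg_df = subset_to_benchmark(agg_df, hard)
# agg_df_with_default = subset_to_benchmark(agg_df_with_default, hard)
```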

I added a column called dataset_count to see how many datasets were used to compute each algorithm's statistics across all results; below are the results I got. Some of the numbers differ from the paper and some match. More importantly, CatBoost, SAINT, and NODE have exactly the same time/1000 inst. and nearly the same logloss mean and logloss std as in the paper, yet the results for these three algorithms appear to be calculated over different numbers of datasets.

I'm wondering whether I am using the code incorrectly. Could you provide some advice on how to fully reproduce the Table 5 results? Thank you!

[Screenshot 2023-12-23 103559]
[Screenshot 2023-12-23 103700]

================================================================================
I first added a column called dataset_count and modified the get_rank_table function to calculate the total dataset_count by adding a simple line (a sketch follows the screenshots below):
[Screenshot 2023-12-23 110030]
[Screenshot 2023-12-23 105136]
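A minimal sketch of what that change amounts to, assuming the aggregated frame has alg_name and dataset_name columns (both column names are assumptions about the actual schema, and this stands in for the one-line change I made inside get_rank_table):

```python
import pandas as pd

def add_dataset_count(agg_df: pd.DataFrame,
                      alg_col: str = "alg_name",
                      dataset_col: str = "dataset_name") -> pd.DataFrame:
    """Attach, per algorithm, the number of distinct datasets that
    contribute to its aggregated statistics.
    alg_col / dataset_col are assumed column names."""
    counts = (agg_df.groupby(alg_col)[dataset_col]
                    .nunique()
                    .rename("dataset_count")
                    .reset_index())
    return agg_df.merge(counts, on=alg_col, how="left")
```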
