
Commit

Matthias Feurer: Merge pull request #1232 from openml/develop
GitHub Actions committed Mar 22, 2023
1 parent ace66bb commit f819812
Showing 280 changed files with 9,022 additions and 19,378 deletions.
2 changes: 1 addition & 1 deletion main/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 884c0728f1dea38019eaffe6df15f82c
+config: 977121ba2ad02efffcbb2ee6874bcd8d
tags: 645f666f9bcd5a90fca523b33c5a78b7
(next file)
@@ -51,7 +51,9 @@
# And we can use the evaluation listing functionality to learn more about
# the evaluations available for the conducted runs:
evaluations = openml.evaluations.list_evaluations(
-    function="predictive_accuracy", output_format="dataframe", study=study.study_id,
+    function="predictive_accuracy",
+    output_format="dataframe",
+    study=study.study_id,
)
print(evaluations.head())
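The hunk above only reflows the call into black's one-argument-per-line style with a trailing comma; behaviour is unchanged. A small stdlib sketch (the call text below is hypothetical, not the real API) illustrates why that style keeps future diffs minimal when an argument is added:

```python
import difflib

# Hypothetical before/after call sites: single-line vs. one-argument-per-line.
one_line_v1 = ['evaluations = list_evaluations(function="predictive_accuracy", study=99)\n']
one_line_v2 = ['evaluations = list_evaluations(function="predictive_accuracy", study=99, size=10)\n']

multi_v1 = [
    "evaluations = list_evaluations(\n",
    '    function="predictive_accuracy",\n',
    "    study=99,\n",
    ")\n",
]
multi_v2 = [
    "evaluations = list_evaluations(\n",
    '    function="predictive_accuracy",\n',
    "    study=99,\n",
    "    size=10,\n",
    ")\n",
]

# In the single-line style the whole statement changes (one removal plus one
# addition); in the exploded style only a single line is added.
single_changes = [d for d in difflib.ndiff(one_line_v1, one_line_v2) if d[0] in "+-"]
multi_changes = [d for d in difflib.ndiff(multi_v1, multi_v2) if d[0] in "+-"]
print(len(single_changes), len(multi_changes))  # 2 1
```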

(next file)
@@ -51,14 +51,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"The dataset IDs could be used directly to load the dataset and split the data into a training set\nand a test set. However, to be reproducible, we will first obtain the respective tasks from\nOpenML, which define both the target feature and the train/test split.\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>It is discouraged to work directly on datasets and only provide dataset IDs in a paper as\n    this does not allow reproducibility (unclear splitting). Please do not use datasets but the\n    respective tasks as basis for a paper and publish task IDS. This example is only given to\n    showcase the use of OpenML-Python for a published paper and as a warning on how not to do it.\n    Please check the `OpenML documentation of tasks <https://docs.openml.org/#tasks>`_ if you\n    want to learn more about them.</p></div>\n\n"
+"The dataset IDs could be used directly to load the dataset and split the data into a training set\nand a test set. However, to be reproducible, we will first obtain the respective tasks from\nOpenML, which define both the target feature and the train/test split.\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>It is discouraged to work directly on datasets and only provide dataset IDs in a paper as\n    this does not allow reproducibility (unclear splitting). Please do not use datasets but the\n    respective tasks as basis for a paper and publish task IDS. This example is only given to\n    showcase the use of OpenML-Python for a published paper and as a warning on how not to do it.\n    Please check the [OpenML documentation of tasks](https://docs.openml.org/#tasks) if you\n    want to learn more about them.</p></div>\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"This lists both active and inactive tasks (because of ``status='all'``). Unfortunately,\nthis is necessary as some of the datasets contain issues found after the publication and became\ndeactivated, which also deactivated the tasks on them. More information on active or inactive\ndatasets can be found in the `online docs <https://docs.openml.org/#dataset-status>`_.\n\n"
+"This lists both active and inactive tasks (because of ``status='all'``). Unfortunately,\nthis is necessary as some of the datasets contain issues found after the publication and became\ndeactivated, which also deactivated the tasks on them. More information on active or inactive\ndatasets can be found in the [online docs](https://docs.openml.org/#dataset-status).\n\n"
]
},
{
@@ -89,7 +89,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.10"
+"version": "3.8.16"
}
},
"nbformat": 4,
(next file)
@@ -22,7 +22,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Openml-python uses the `Python logging module <https://docs.python.org/3/library/logging.html>`_\nto provide users with log messages. Each log message is assigned a level of importance, see\nthe table in Python's logging tutorial\n`here <https://docs.python.org/3/howto/logging.html#when-to-use-logging>`_.\n\nBy default, openml-python will print log messages of level `WARNING` and above to console.\nAll log messages (including `DEBUG` and `INFO`) are also saved in a file, which can be\nfound in your cache directory (see also the\n`sphx_glr_examples_20_basic_introduction_tutorial.py`).\nThese file logs are automatically deleted if needed, and use at most 2MB of space.\n\nIt is possible to configure what log levels to send to console and file.\nWhen downloading a dataset from OpenML, a `DEBUG`-level message is written:\n\n"
+"Openml-python uses the [Python logging module](https://docs.python.org/3/library/logging.html)\nto provide users with log messages. Each log message is assigned a level of importance, see\nthe table in Python's logging tutorial\n[here](https://docs.python.org/3/howto/logging.html#when-to-use-logging).\n\nBy default, openml-python will print log messages of level `WARNING` and above to console.\nAll log messages (including `DEBUG` and `INFO`) are also saved in a file, which can be\nfound in your cache directory (see also the\n`sphx_glr_examples_20_basic_introduction_tutorial.py`).\nThese file logs are automatically deleted if needed, and use at most 2MB of space.\n\nIt is possible to configure what log levels to send to console and file.\nWhen downloading a dataset from OpenML, a `DEBUG`-level message is written:\n\n"
]
},
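The cell above describes the default level split: `WARNING` and above to console, everything to a file. The mechanics behind that are the standard `logging` module's logger-versus-handler levels; the logger name below is illustrative, not openml-python's actual internal logger:

```python
import logging

# Illustrative logger name; openml-python wires up its own loggers internally.
logger = logging.getLogger("openml_demo")
logger.setLevel(logging.DEBUG)      # the logger itself accepts every level

console = logging.StreamHandler()
console.setLevel(logging.WARNING)   # console handler: WARNING and above only
logger.addHandler(console)

print(logger.isEnabledFor(logging.DEBUG))  # True: DEBUG reaches the logger...
print(console.level > logging.DEBUG)       # True: ...but the console handler filters it
```

A second handler with a lower level (e.g. a `FileHandler` at `DEBUG`) is what lets the same messages still land in the cache-directory log file.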
{
@@ -53,7 +53,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.10"
+"version": "3.8.16"
}
},
"nbformat": 4,
(next file)
@@ -82,7 +82,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.10"
+"version": "3.8.16"
}
},
"nbformat": 4,
(next file)
@@ -147,7 +147,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.10"
+"version": "3.8.16"
}
},
"nbformat": 4,
(next file)
@@ -136,7 +136,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.10"
+"version": "3.8.16"
}
},
"nbformat": 4,
(next file)
@@ -15,7 +15,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"\n# Benchmark studies\nHow to list, download and upload benchmark studies.\nIn contrast to `benchmark suites <https://docs.openml.org/benchmark/#benchmarking-suites>`_ which\nhold a list of tasks, studies hold a list of runs. As runs contain all information on flows and\ntasks, all required information about a study can be retrieved.\n"
+"\n# Benchmark studies\nHow to list, download and upload benchmark studies.\nIn contrast to [benchmark suites](https://docs.openml.org/benchmark/#benchmarking-suites) which\nhold a list of tasks, studies hold a list of runs. As runs contain all information on flows and\ntasks, all required information about a study can be retrieved.\n"
]
},
{
@@ -123,7 +123,7 @@
},
"outputs": [],
"source": [
-"evaluations = openml.evaluations.list_evaluations(\n    function=\"predictive_accuracy\", output_format=\"dataframe\", study=study.study_id,\n)\nprint(evaluations.head())"
+"evaluations = openml.evaluations.list_evaluations(\n    function=\"predictive_accuracy\",\n    output_format=\"dataframe\",\n    study=study.study_id,\n)\nprint(evaluations.head())"
]
},
{
@@ -190,7 +190,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.10"
+"version": "3.8.16"
}
},
"nbformat": 4,
(next file)
@@ -118,7 +118,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.10"
+"version": "3.8.16"
}
},
"nbformat": 4,
(next file)
@@ -98,7 +98,7 @@
},
"outputs": [],
"source": [
-"from matplotlib import pyplot as plt\n\n\ndef plot_cdf(values, metric=\"predictive_accuracy\"):\n    max_val = max(values)\n    n, bins, patches = plt.hist(values, density=True, histtype=\"step\", cumulative=True, linewidth=3)\n    patches[0].set_xy(patches[0].get_xy()[:-1])\n    plt.xlim(max(0, min(values) - 0.1), 1)\n    plt.title(\"CDF\")\n    plt.xlabel(metric)\n    plt.ylabel(\"Likelihood\")\n    plt.grid(b=True, which=\"major\", linestyle=\"-\")\n    plt.minorticks_on()\n    plt.grid(b=True, which=\"minor\", linestyle=\"--\")\n    plt.axvline(max_val, linestyle=\"--\", color=\"gray\")\n    plt.text(max_val, 0, \"%.3f\" % max_val, fontsize=9)\n    plt.show()\n\n\nplot_cdf(evals.value, metric)\n# This CDF plot shows that for the given task, based on the results of the\n# runs uploaded, it is almost certain to achieve an accuracy above 52%, i.e.,\n# with non-zero probability. While the maximum accuracy seen till now is 96.5%."
+"from matplotlib import pyplot as plt\n\n\ndef plot_cdf(values, metric=\"predictive_accuracy\"):\n    max_val = max(values)\n    n, bins, patches = plt.hist(values, density=True, histtype=\"step\", cumulative=True, linewidth=3)\n    patches[0].set_xy(patches[0].get_xy()[:-1])\n    plt.xlim(max(0, min(values) - 0.1), 1)\n    plt.title(\"CDF\")\n    plt.xlabel(metric)\n    plt.ylabel(\"Likelihood\")\n    plt.grid(visible=True, which=\"major\", linestyle=\"-\")\n    plt.minorticks_on()\n    plt.grid(visible=True, which=\"minor\", linestyle=\"--\")\n    plt.axvline(max_val, linestyle=\"--\", color=\"gray\")\n    plt.text(max_val, 0, \"%.3f\" % max_val, fontsize=9)\n    plt.show()\n\n\nplot_cdf(evals.value, metric)\n# This CDF plot shows that for the given task, based on the results of the\n# runs uploaded, it is almost certain to achieve an accuracy above 52%, i.e.,\n# with non-zero probability. While the maximum accuracy seen till now is 96.5%."
]
},
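The substantive change in this cell is `plt.grid(b=...)` becoming `plt.grid(visible=...)`: Matplotlib 3.5 deprecated the `b` keyword of `grid()` in favour of `visible`, and later releases reject `b` entirely. A minimal headless sketch of the new spelling (this is a reduced stand-in, not the notebook's full `plot_cdf`):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
# `visible=` is the forward-compatible replacement for the deprecated `b=`.
plt.grid(visible=True, which="major", linestyle="-")
plt.minorticks_on()
plt.grid(visible=True, which="minor", linestyle="--")
print(ax.xaxis.get_gridlines()[0].get_visible())  # True
```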
{
@@ -154,7 +154,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.10"
+"version": "3.8.16"
}
},
"nbformat": 4,
(next file)
@@ -57,10 +57,18 @@
# easy as you want it to be


-cat_imp = make_pipeline(OneHotEncoder(handle_unknown="ignore", sparse=False), TruncatedSVD(),)
+cat_imp = make_pipeline(
+    OneHotEncoder(handle_unknown="ignore", sparse=False),
+    TruncatedSVD(),
+)
cont_imp = SimpleImputer(strategy="median")
ct = ColumnTransformer([("cat", cat_imp, cat), ("cont", cont_imp, cont)])
-model_original = Pipeline(steps=[("transform", ct), ("estimator", RandomForestClassifier()),])
+model_original = Pipeline(
+    steps=[
+        ("transform", ct),
+        ("estimator", RandomForestClassifier()),
+    ]
+)

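The two reformatted constructors above are behaviour-preserving black-style reflows. A stripped-down, self-contained variant of the same `Pipeline` shape (simplified steps, hypothetical `n_estimators`, no `ColumnTransformer` branch, so it runs without the notebook's data):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Simplified stand-in for the diff's pipeline: same black-formatted structure,
# without the OneHotEncoder/TruncatedSVD categorical branch.
model = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="median")),
        ("estimator", RandomForestClassifier(n_estimators=10)),
    ]
)
print([name for name, _ in model.steps])  # ['impute', 'estimator']
```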
# Let's change some hyperparameters. Of course, in any good application we
# would tune them using, e.g., Random Search or Bayesian Optimization, but for
(next file)
@@ -281,7 +281,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.10"
+"version": "3.8.16"
}
},
"nbformat": 4,