8 changes: 8 additions & 0 deletions openml/_api/__init__.py
@@ -0,0 +1,8 @@
from openml._api.runtime.core import APIContext


def set_api_version(version: str, *, strict: bool = False) -> None:
    api_context.set_version(version=version, strict=strict)


api_context = APIContext()
Comment on lines +1 to +8
PGijsbers (Collaborator), Jan 15, 2026:
It's not clear what the function of the APIContext is here. Why do we need it; why can't we just use the backend directly? E.g.:

Suggested change:
- from openml._api.runtime.core import APIContext
- def set_api_version(version: str, *, strict: bool = False) -> None:
-     api_context.set_version(version=version, strict=strict)
- api_context = APIContext()
+ from openml._api.runtime.core import build_backend
+ _backend = build_backend("v1", strict=False)
+ def set_api_version(version: str, *, strict: bool = False) -> None:
+     global _backend
+     _backend = build_backend(version=version, strict=strict)
+ def backend() -> APIBackend:
+     return _backend

If it is just to avoid the pitfall where users assign the returned value to a local variable whose scope is too long-lived, then the same pitfall applies if users assign api_context.backend to a variable. We could instead extend the APIBackend class to allow updates to its attributes?
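A minimal sketch of that last alternative, with assumed method and attribute names (not part of the PR):

class APIBackend:
    """Backend whose attributes are updated in place, so long-lived
    references always observe the currently selected version."""

    def __init__(self, version: str = "v1", *, strict: bool = False) -> None:
        self.set_version(version, strict=strict)

    def set_version(self, version: str, *, strict: bool = False) -> None:
        # Mutating the existing instance avoids stale references to an old backend.
        self.version = version
        self.strict = strict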

62 changes: 62 additions & 0 deletions openml/_api/config.py
@@ -0,0 +1,62 @@
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal

DelayMethod = Literal["human", "robot"]
Collaborator:
nit: Should be an enum.
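If the Literal alias were replaced by an enum, a minimal sketch (assuming the two existing values are kept):

from enum import Enum


class DelayMethod(str, Enum):
    # Mixing in str keeps comparisons against plain strings like "human" working.
    HUMAN = "human"
    ROBOT = "robot"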



@dataclass
class APIConfig:
    server: str
    base_url: str
    key: str
    timeout: int = 10  # seconds
Collaborator:
nit: Add a unit suffix (timeout_seconds) so the unit is clear without navigating to the source.

ps. I also considered typing it as datetime.timedelta but considering you probably only use it in seconds and there is a real risk of developers erroneously using datetime.timedelta.seconds instead of datetime.timedelta.total_seconds(), I think keeping it an integer is better.
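For reference, the pitfall described above:

from datetime import timedelta

one_week = timedelta(weeks=1)
one_week.seconds          # 0: only the seconds component of the final partial day
one_week.total_seconds()  # 604800.0: the full duration in seconds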



@dataclass
class APISettings:
    v1: APIConfig
    v2: APIConfig


@dataclass
class ConnectionConfig:
    retries: int = 3
    delay_method: DelayMethod = "human"
    delay_time: int = 1  # seconds
Collaborator:
nit: here too, including the unit makes sense (delay_time_seconds)


    def __post_init__(self) -> None:
        if self.delay_method not in ("human", "robot"):
            raise ValueError(f"delay_method must be 'human' or 'robot', got {self.delay_method}")


@dataclass
class CacheConfig:
    dir: str = "~/.openml/cache"
PGijsbers (Collaborator), Jan 15, 2026:
Default should continue to respect XDG_CACHE_HOME.
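A sketch of a default that respects XDG_CACHE_HOME (the openml subdirectory name and helper are assumptions):

import os
from pathlib import Path


def _default_cache_dir() -> str:
    # Per the XDG Base Directory spec, fall back to ~/.cache when the variable is unset.
    xdg_cache = os.environ.get("XDG_CACHE_HOME", "~/.cache")
    return str(Path(xdg_cache).expanduser() / "openml")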

    ttl: int = 60 * 60 * 24 * 7  # one week
Collaborator:
nit: Considering the TTL of the HTTP standard is already defined in seconds, maybe it is fine to leave it out of the variable name? Though as noted above, there is a discussion to be had about having this as a cache-level property in the first place.
For future reference, setting the value to timedelta(weeks=1).total_seconds() is preferred over the arithmetic+comment.
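A sketch of that suggestion; total_seconds() returns a float, so an explicit cast keeps the int annotation honest:

from datetime import timedelta

ttl: int = int(timedelta(weeks=1).total_seconds())  # 604800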



@dataclass
class Settings:
    api: APISettings
    connection: ConnectionConfig
    cache: CacheConfig


settings = Settings(
Collaborator:
I would move the settings to the individual classes. I think this design couples the classes too tightly to this file. You cannot move the classes around, or add a new API version, without making non-extensible changes here, because APISettings requires a constructor change for every new class it accepts.

Instead, a better design is to apply the strategy pattern cleanly to the different API definitions - v1 and v2 - and move the config either to their __init__, or a set_config (or similar) method.
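A minimal sketch of that strategy-pattern shape, reusing the APIConfig dataclass defined above (the backend class names and the default_config hook are assumptions, not part of this PR):

class APIBackend:
    """Base strategy: each API version owns its default configuration."""

    def __init__(self, config: APIConfig | None = None) -> None:
        self.config = config if config is not None else self.default_config()

    @classmethod
    def default_config(cls) -> APIConfig:
        raise NotImplementedError


class V1Backend(APIBackend):
    @classmethod
    def default_config(cls) -> APIConfig:
        return APIConfig(server="https://www.openml.org/", base_url="api/v1/xml/", key="...")


class V2Backend(APIBackend):
    @classmethod
    def default_config(cls) -> APIConfig:
        return APIConfig(server="http://127.0.0.1:8001/", base_url="", key="...")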

    api=APISettings(
        v1=APIConfig(
            server="https://www.openml.org/",
            base_url="api/v1/xml/",
            key="...",
        ),
        v2=APIConfig(
            server="http://127.0.0.1:8001/",
Collaborator:
Should this be hardcoded? I guess this is just for your local development.

Contributor (author):
This is hard-coded; these are the default values for now. The local endpoints will be replaced by the remote server once it is deployed, hopefully before this is merged into main.

            base_url="",
            key="...",
        ),
    ),
    connection=ConnectionConfig(),
    cache=CacheConfig(),
)
3 changes: 3 additions & 0 deletions openml/_api/http/__init__.py
@@ -0,0 +1,3 @@
from openml._api.http.client import HTTPClient

__all__ = ["HTTPClient"]
151 changes: 151 additions & 0 deletions openml/_api/http/client.py
@@ -0,0 +1,151 @@
from __future__ import annotations

from pathlib import Path
from typing import TYPE_CHECKING, Any
from urllib.parse import urlencode, urljoin, urlparse

import requests
from requests import Response

from openml.__version__ import __version__
from openml._api.config import settings

if TYPE_CHECKING:
from openml._api.config import APIConfig


class CacheMixin:
Collaborator:
The ttl should probably heavily depend on the path. If we do end up using caching at this level, we should use the Cache-Control HTTP response header so the server can inform us how long to keep a response in cache (something that, I believe, neither server does right now); a sketch of parsing that header follows the list below. A dataset search query can change if any dataset description changes (to either be now included or excluded), so caching probably shouldn't even be on by default for such types of queries. Dataset descriptions might change, but likely not very frequently. Dataset data files or computed qualities should (almost?) never change. This is the reason that the current implementation only caches the description, features, qualities, and the dataset itself.

With this implementation, you also introduce some new issues:

  • What if the paths change, or even the query parameters? There would then be dead cache entries. Do we add cache cleanup routines? How does openml-python know which responses with a high TTL are no longer valid?
  • URLs may be (much) longer than the default max path of Windows (260 characters). If I'm not mistaken, this will lead to an issue unless you specifically work around it.
  • More of an implementation detail, but authenticated and unauthenticated requests are not differentiated. If a user accidentally makes an unauthenticated request, gets an error, and then authenticates, they would still get an error.
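For illustration, a sketch of deriving a TTL from the Cache-Control response header (not part of this PR; as noted above, neither server currently sends it):

import requests


def ttl_from_cache_control(response: requests.Response, default_ttl: int = 0) -> int:
    """Extract max-age (in seconds) from a Cache-Control header, if present."""
    for directive in response.headers.get("Cache-Control", "").split(","):
        directive = directive.strip()
        if directive == "no-store":
            return 0  # the server asked us not to store this response
        if directive.startswith("max-age="):
            try:
                return int(directive.split("=", 1)[1])
            except ValueError:
                break
    return default_ttl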

    @property
    def dir(self) -> str:
        return settings.cache.dir

    @property
    def ttl(self) -> int:
        return settings.cache.ttl

    def _get_cache_dir(self, url: str, params: dict[str, Any]) -> Path:
        parsed_url = urlparse(url)
        netloc_parts = parsed_url.netloc.split(".")[::-1]  # reverse domain
        path_parts = parsed_url.path.strip("/").split("/")

        # remove api_key and serialize params if any
        filtered_params = {k: v for k, v in params.items() if k != "api_key"}
        params_part = [urlencode(filtered_params)] if filtered_params else []

        return Path(self.dir).joinpath(*netloc_parts, *path_parts, *params_part)
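One way to sidestep the Windows MAX_PATH concern raised in the comment above is to key the cache on a digest rather than the raw URL; a standalone sketch (hypothetical helper, not part of the PR):

import hashlib
from pathlib import Path
from typing import Any
from urllib.parse import urlencode


def hashed_cache_dir(root: str, url: str, params: dict[str, Any]) -> Path:
    # Hashing bounds the path length regardless of how long the URL or query gets.
    filtered = {k: v for k, v in params.items() if k != "api_key"}
    digest = hashlib.sha256(f"{url}?{urlencode(filtered)}".encode()).hexdigest()
    # A two-character fan-out directory keeps any single directory from growing huge.
    return Path(root).expanduser() / digest[:2] / digest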

    def _get_cache_response(self, cache_dir: Path) -> Response:  # noqa: ARG002
        return Response()

    def _set_cache_response(self, cache_dir: Path, response: Response) -> None:  # noqa: ARG002
        return None


class HTTPClient(CacheMixin):
    def __init__(self, config: APIConfig) -> None:
        self.config = config
        self.headers: dict[str, str] = {"user-agent": f"openml-python/{__version__}"}

    @property
    def server(self) -> str:
        return self.config.server

    @property
    def base_url(self) -> str:
        return self.config.base_url

    @property
    def key(self) -> str:
        return self.config.key

    @property
    def timeout(self) -> int:
        return self.config.timeout

    def request(
        self,
        method: str,
        path: str,
        *,
        use_cache: bool = False,
        use_api_key: bool = False,
        **request_kwargs: Any,
    ) -> Response:
        url = urljoin(self.server, urljoin(self.base_url, path))

        params = request_kwargs.pop("params", {})
        params = params.copy()
        if use_api_key:
            params["api_key"] = self.key

        headers = request_kwargs.pop("headers", {})
        headers = headers.copy()
        headers.update(self.headers)

        timeout = request_kwargs.pop("timeout", self.timeout)
        cache_dir = self._get_cache_dir(url, params)

        if use_cache:
            try:
                return self._get_cache_response(cache_dir)
                # TODO: handle ttl expired error
SimonBlanke (Collaborator), Jan 9, 2026:
The PR is out of draft, but this caching is not implemented. I guess this is out of scope for this PR.

geetu040 (Contributor, author), Jan 9, 2026:
Yes, the PR is currently a draft; should I mark it as a draft as well? There are a bunch of work items that I'll separate out if they can be worked on without affecting derived PRs; otherwise I'll implement them myself. For caching specifically, I plan to implement it myself, since otherwise stacking is going to be challenging.

Collaborator:
In general, a draft marks a PR whose changes are not finalized. I do it the following way: if a PR is not finished and needs developer input on the implementation, I open a draft. If I think the PR is finished and can be merged, then I change it to a normal PR. But I cannot find an explanation that matches my procedure, so I guess this is just my practice.

Contributor (author):
I agree. I'll mark this as a draft, since that doesn't keep it from getting reviewed or stop other people from working on top of it.

            except Exception:
                raise

        response = requests.request(
            method=method,
            url=url,
            params=params,
            headers=headers,
            timeout=timeout,
            **request_kwargs,
        )

        if use_cache:
            self._set_cache_response(cache_dir, response)

        return response

    def get(
        self,
        path: str,
        *,
        use_cache: bool = False,
        use_api_key: bool = False,
        **request_kwargs: Any,
    ) -> Response:
        # TODO: remove override when cache is implemented
        use_cache = False
        return self.request(
            method="GET",
            path=path,
            use_cache=use_cache,
            use_api_key=use_api_key,
            **request_kwargs,
        )

    def post(
        self,
        path: str,
        **request_kwargs: Any,
    ) -> Response:
        return self.request(
            method="POST",
            path=path,
            use_cache=False,
            use_api_key=True,
            **request_kwargs,
        )

    def delete(
        self,
        path: str,
        **request_kwargs: Any,
    ) -> Response:
        return self.request(
            method="DELETE",
            path=path,
            use_cache=False,
            use_api_key=True,
            **request_kwargs,
        )
Empty file added openml/_api/http/utils.py
Empty file.
4 changes: 4 additions & 0 deletions openml/_api/resources/__init__.py
@@ -0,0 +1,4 @@
from openml._api.resources.datasets import DatasetsV1, DatasetsV2
from openml._api.resources.tasks import TasksV1, TasksV2

__all__ = ["DatasetsV1", "DatasetsV2", "TasksV1", "TasksV2"]
31 changes: 31 additions & 0 deletions openml/_api/resources/base.py
@@ -0,0 +1,31 @@
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import TYPE_CHECKING

if TYPE_CHECKING:
from requests import Response

from openml._api.http import HTTPClient
from openml.datasets.dataset import OpenMLDataset
from openml.tasks.task import OpenMLTask


class ResourceAPI:
    def __init__(self, http: HTTPClient):
        self._http = http


class DatasetsAPI(ResourceAPI, ABC):
    @abstractmethod
    def get(self, dataset_id: int) -> OpenMLDataset | tuple[OpenMLDataset, Response]: ...
Collaborator:
From an API design perspective, I am not sure what the use case is for a user to want access to the requests.Response. The only case I can think of is to parse the data themselves, but if the user wants to do that, I reckon we failed in our API design? I will be the first to admit that error handling needs to be improved (both on the server and the client side), but I don't think this makes sense. Am I missing something?

Collaborator:
This comment of course applies not just to the DatasetsAPI, but to all of the resource APIs.



class TasksAPI(ResourceAPI, ABC):
    @abstractmethod
    def get(
        self,
        task_id: int,
        *,
        return_response: bool = False,
    ) -> OpenMLTask | tuple[OpenMLTask, Response]: ...
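If the tuple-returning variant is kept, typing.overload could give callers a precise return type depending on the flag; a sketch of how the abstract method could be declared (not part of the PR):

from typing import Literal, overload


class TasksAPI(ResourceAPI, ABC):
    @overload
    def get(self, task_id: int, *, return_response: Literal[False] = ...) -> OpenMLTask: ...
    @overload
    def get(self, task_id: int, *, return_response: Literal[True]) -> tuple[OpenMLTask, Response]: ...
    @abstractmethod
    def get(
        self, task_id: int, *, return_response: bool = False
    ) -> OpenMLTask | tuple[OpenMLTask, Response]: ...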
20 changes: 20 additions & 0 deletions openml/_api/resources/datasets.py
@@ -0,0 +1,20 @@
from __future__ import annotations

from typing import TYPE_CHECKING

from openml._api.resources.base import DatasetsAPI

if TYPE_CHECKING:
from responses import Response
Collaborator:
In production this would be requests, right? You used responses for the mocking here during development.

Contributor (author):
Yes, this should be requests; I'll fix it.


from openml.datasets.dataset import OpenMLDataset


class DatasetsV1(DatasetsAPI):
    def get(self, dataset_id: int) -> OpenMLDataset | tuple[OpenMLDataset, Response]:
        raise NotImplementedError


class DatasetsV2(DatasetsAPI):
    def get(self, dataset_id: int) -> OpenMLDataset | tuple[OpenMLDataset, Response]:
        raise NotImplementedError