diff --git a/website/docs/Use-Cases/Production-Deployment.md b/website/docs/Use-Cases/Production-Deployment.md
new file mode 100644
index 0000000000..a02bbf3c55
--- /dev/null
+++ b/website/docs/Use-Cases/Production-Deployment.md
@@ -0,0 +1,283 @@
+# Production Deployment
+
+This page walks through the **train → save → reload → predict on new data** lifecycle for FLAML models, with a focus on the gotchas that surface in production but not in the quick-start tutorials. Each section is a self-contained pattern with runnable code and a pointer to the issue or PR that motivated it.
+
+## Scope
+
+You have called `AutoML.fit(...)` once on training data and now need to:
+
+- Serialize the trained model so that a separate process can load and use it.
+- Score new (unseen) input rows that may contain categorical features, new categorical values, or a slightly different class distribution.
+- Reach into individual ensemble component models (`automl.model.estimators_[i]`).
+- Pass sample weights at training time, and understand what `predict()` does (and does not) accept at inference time.
+- Avoid the common silent-correctness bugs reported in #1101 (categorical encoding drift) and #1136 (ensemble component prediction).
+
+What this page does **not** cover: training-time configuration (see [Task-Oriented AutoML](Task-Oriented-AutoML)), zero-shot estimators (see [Zero-Shot AutoML](Zero-Shot-AutoML)), or distributed/Spark deployment.
+
+## 1. Save and reload the trained model
+
+### 1.1 `automl.pickle()` — recommended default
+
+`automl.pickle()` writes the entire `AutoML` instance, including the data transformer, the best estimator, and the search history. `AutoML.load_pickle()` restores it in another process. This is the simplest reliable path for FLAML.
+
+```python
+import numpy as np
+import pandas as pd
+from flaml import AutoML
+
+X = pd.DataFrame(
+    {
+        "age": np.random.randint(20, 70, 400),
+        "income": np.random.normal(50000, 15000, 400),
+        "gender": np.random.choice(["M", "F"], 400),
+        "education": np.random.choice(["HS", "BS", "MS", "PhD"], 400),
+    }
+)
+y = (X["age"] > 40).astype(int)
+
+automl = AutoML()
+automl.fit(X, y, task="classification", time_budget=5, estimator_list=["lgbm"])
+automl.pickle("automl.pkl")
+
+# In a different process:
+loaded = AutoML.load_pickle("automl.pkl")
+assert np.array_equal(automl.predict(X), loaded.predict(X))
+```
+
+Use `automl.pickle()` whenever possible. It is the only path that preserves *everything* needed at inference time (data transformer included), so the categorical-encoding behavior described in section 3 is reproduced correctly.
+
+### 1.2 MLflow logging — for MLflow-managed deployments
+
+If your serving stack is built around MLflow, log the trained `AutoML` instance explicitly via the sklearn flavor. This works because the `AutoML` object exposes a sklearn-compatible `predict`/`predict_proba` API.
+
+```python
+import mlflow
+import numpy as np
+from sklearn.datasets import load_iris
+from sklearn.model_selection import train_test_split
+from flaml import AutoML
+
+X, y = load_iris(return_X_y=True, as_frame=True)
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=0.2, random_state=42
+)
+
+mlflow.set_experiment("flaml_prod")
+automl = AutoML()
+with mlflow.start_run() as run:
+    automl.fit(
+        X_train, y_train, task="classification", time_budget=5, mlflow_logging=False
+    )
+    mlflow.sklearn.log_model(automl, artifact_path="flaml_model")
+    run_id = run.info.run_id
+
+loaded = mlflow.sklearn.load_model(f"runs:/{run_id}/flaml_model")
+assert np.array_equal(automl.predict(X_test), loaded.predict(X_test))
+```
+
+Two practical notes:
+
+- `mlflow_logging=False` disables FLAML's built-in MLflow autologging path inside `fit`.
+  With it enabled, MLflow auto-saves an artifact under `runs:/{run_id}/model`, but on recent MLflow versions reloading that artifact via `mlflow.sklearn.load_model` can return an unfitted `Pipeline`. The explicit `mlflow.sklearn.log_model(automl, ...)` call above sidesteps that issue.
+- The argument is `artifact_path=` (not `name=`) in MLflow 2.x.
+
+### 1.3 Pickling just the best estimator — lean serving
+
+If you do not need the data transformer (because your serving pipeline preprocesses upstream and only needs to call the bare ML model), you can pickle `automl.model` instead of the whole `AutoML`. **Use this only if you can guarantee** that inference-time inputs match what FLAML produced *after* its data transformer ran — otherwise you will hit the categorical and ensemble issues in sections 3 and 4.
+
+## 2. The public `automl.preprocess(X)` API
+
+FLAML applies two layers of preprocessing inside `automl.predict(X)`:
+
+1. **Task-level preprocessing** — handled by the internal `DataTransformer`: type coercions, NaN handling, categorical encoding, datetime expansion.
+1. **Estimator-level preprocessing** — handled by the estimator wrapper itself (e.g., `Normalizer` for the `SGDEstimator`, sparse-input conversion for XGBoost).
+
+Calling `automl.predict(X)` chains both layers automatically. When you need to reach a single ensemble component or write a custom inference pipeline, call them explicitly:
+
+```python
+# Task-level preprocessing, accessible since #1497
+X_pre = automl.preprocess(X_test)
+
+# Estimator-level preprocessing on top of the task-level output
+X_full = automl.model.preprocess(X_pre)
+```
+
+For most consumers, `automl.preprocess(X_test)` is all you need before delegating to a single estimator. Section 4 walks through the canonical use case.
+
+## 3. Categorical features at inference time
+
+This section is the answer to issue #1101 and the silent-correctness bug fixed in PR #1561.
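
For readers unfamiliar with this class of bug, the following pandas-only sketch shows how integer codes can drift when categories are re-inferred from each batch, and how pinning the fit-time category list avoids it. It illustrates the general mechanism only; it is not FLAML's implementation, and the variable names are hypothetical.

```python
import pandas as pd

# Categories observed at training time.
fit_col = pd.Series(["F", "M", "F", "M"])
# An inference batch that happens to contain only one of them.
predict_col = pd.Series(["M", "M"])

# Naive encoding: categories are re-inferred from each batch, so "M" gets
# code 1 at fit time but code 0 at predict time.
naive_fit_codes = fit_col.astype("category").cat.codes
naive_predict_codes = predict_col.astype("category").cat.codes

# Pinned encoding: reuse the category list recorded at fit time.
fit_categories = fit_col.astype("category").cat.categories
pinned_dtype = pd.CategoricalDtype(categories=fit_categories)
pinned_predict_codes = predict_col.astype(pinned_dtype).cat.codes

print(naive_fit_codes.tolist())
print(naive_predict_codes.tolist())
print(pinned_predict_codes.tolist())
```

With the naive encoding, a model trained on the fit-time codes would silently read every `"M"` row in the inference batch as `"F"`. Pinning the category list removes that ambiguity.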
+
+### 3.1 What FLAML does at fit time
+
+When `X` is a pandas DataFrame containing `object`, `string`, or `category` columns, `DataTransformer.fit_transform` records the per-column category list seen at fit time and pins it on the transformer instance. Each known category gets a stable integer code; an extra reserved slot is held for the `"__NAN__"` sentinel that future inference batches may need.
+
+### 3.2 What `transform` does at predict time
+
+`DataTransformer.transform` re-uses the pinned category list, so the integer code assigned to each known category at predict time is identical to the one assigned at fit time — regardless of which values happen to appear in the predict-time DataFrame.
+
+```python
+import pandas as pd
+import numpy as np
+from flaml.automl.data import DataTransformer
+from flaml.automl.task.factory import task_factory
+
+rng = np.random.RandomState(0)
+fit_df = pd.DataFrame(
+    {
+        "a": rng.randn(120),
+        "gender": rng.choice(["M", "F"], 120),
+    }
+)
+fit_y = pd.Series(rng.randn(120))
+
+transformer = DataTransformer()
+transformer.fit_transform(
+    fit_df.copy(), fit_y, task_factory("regression", fit_df, fit_y)
+)
+
+# Predict-time DataFrame contains only the "M" category
+predict_df = pd.DataFrame({"a": np.zeros(20), "gender": ["M"] * 20})
+X_pred = transformer.transform(predict_df.copy())
+
+# The integer code assigned to "M" is the same as at fit time — no drift.
+```
+
+### 3.3 Unseen categories
+
+If predict-time data contains values that were not seen at fit time, FLAML now emits a `UserWarning` and encodes those rows as the `"__NAN__"` sentinel. Consume the warning category in your serving code and decide how to react (log, alert, reject the batch, etc.).
+
+```python
+import warnings
+
+with warnings.catch_warnings(record=True) as caught:
+    warnings.simplefilter("always")
+    predict_df = pd.DataFrame({"a": np.zeros(5), "gender": ["M", "F", "X", "Y", "M"]})
+    X_pred = transformer.transform(predict_df.copy())
+
+unseen = [
+    w
+    for w in caught
+    if issubclass(w.category, UserWarning) and "unseen at fit time" in str(w.message)
+]
+if unseen:
+    # In production this is where you raise an alert / reject the batch /
+    # fall back to a default category.
+    print(unseen[0].message)
+```
+
+The model still produces a prediction for rows mapped to `"__NAN__"`, but those predictions are unreliable: the model was not trained on that category. Treat unseen-category warnings as a deployment health signal, not background noise.
+
+### 3.4 Recommended workflow
+
+If your production data may legitimately introduce new categorical values over time (a new product code, a new geography), pin the category list upstream of FLAML using sklearn's `OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)` or an equivalent component, and pass the encoded DataFrame into `AutoML.fit`. This makes encoding consistency an explicit part of your pipeline rather than relying on FLAML's defensive sentinel.
+
+## 4. Ensemble component access
+
+This is the canonical pattern for issue #1136 (closed by PR #1558).
+
+When `AutoML.fit(..., ensemble=True)` is used, `automl.model` is a sklearn `StackingClassifier`/`StackingRegressor` whose `estimators_` were trained on data that has already passed through FLAML's task-level preprocessing. As a result, calling `automl.model.estimators_[i].predict(X_raw)` directly raises a confusing error from the underlying estimator (`LightGBM: train and valid dataset categorical_feature do not match`, `XGBoost: DataFrame.dtypes must be int/float/bool/category`, etc.).
+
+The fix is to preprocess raw input via `automl.preprocess(X)` first:
+
+```python
+automl = AutoML()
+automl.fit(
+    X,
+    y,
+    task="classification",
+    ensemble=True,
+    estimator_list=["lgbm", "xgboost", "rf"],
+    time_budget=10,
+)
+
+# Direct call on raw input — DOES NOT WORK:
+# automl.model.estimators_[0].predict(X)  # raises ValueError on categorical input
+
+# Correct pattern — preprocess first:
+X_pre = automl.preprocess(X)
+component_preds = [est.predict(X_pre) for est in automl.model.estimators_]
+```
+
+This is intentionally a two-step process. `automl.predict(X)` does both steps for you; component-level access is for cases where you need per-component scores, predictions, or feature attributions, in which case you handle the preprocessing call site explicitly.
+
+## 5. Sample weights and cost-sensitive learning
+
+Pass `sample_weight` at fit time to perform cost-sensitive training. FLAML honors the weight inside both the holdout and CV evaluation paths.
+
+```python
+import numpy as np
+from flaml import AutoML
+
+# 5x weight on the minority (positive) class
+sample_weight = np.where(y_train == 1, 5.0, 1.0)
+automl = AutoML()
+automl.fit(
+    X_train,
+    y_train,
+    sample_weight=sample_weight,
+    task="classification",
+    time_budget=5,
+)
+```
+
+Compatibility notes:
+
+- `split_type="time"` + `sample_weight` works correctly after PR #1554 (closes #887).
+- `predict()` does not take a `sample_weight` argument — weights apply only during training. For weighted evaluation on new data, compute the metric outside FLAML (e.g., `sklearn.metrics.f1_score(y_test, automl.predict(X_test), sample_weight=test_weight)`).
+- `class_weight` is passed through to the underlying estimator unchanged if your chosen estimator accepts it (e.g., LightGBM's sklearn API; XGBoost's sklearn API has no `class_weight` parameter and uses `scale_pos_weight` for binary imbalance instead).
+
+For severe class imbalance, see also [issue #1200](https://github.com/microsoft/FLAML/issues/1200) on adding a `resampler=` integration.
+The current recommendation is to apply SMOTE (or your resampler of choice) upstream of `AutoML.fit`; see the imbalanced-learn documentation for the canonical pattern.
+
+## 6. Multi-output regression
+
+For multi-target regression today, wrap an `AutoML(task="regression", ...)` instance in sklearn's `MultiOutputRegressor` or `RegressorChain`, which fit one clone of it per target:
+
+```python
+from sklearn.datasets import make_regression
+from sklearn.multioutput import MultiOutputRegressor
+from flaml import AutoML
+
+X, y = make_regression(n_samples=200, n_targets=3, random_state=42)
+model = MultiOutputRegressor(
+    AutoML(task="regression", time_budget=1, estimator_list=["lgbm"])
+)
+model.fit(X[:150], y[:150])
+preds = model.predict(X[150:])
+```
+
+Known limitation: passing `X_val` and `y_val` through `MultiOutputRegressor` does not flow into each inner `AutoML.fit` ([#1115](https://github.com/microsoft/FLAML/issues/1115)). Workaround: concatenate train + val into a single dataset and use a custom splitter, or call `AutoML` per target manually.
+
+Native multi-target support is being tracked in [#1301](https://github.com/microsoft/FLAML/issues/1301); when it lands, prefer the native path.
+
+## 7. Versioning and reproducibility
+
+Two pieces matter for reproducible predictions in production:
+
+1. **The FLAML search seed** — pass it via `automl.fit(..., seed=N)` to make the search deterministic. The 2026-05 reproducibility audit (closes #1540) standardized how every audited estimator honors this seed; see #1541 (SGD), #1546 (LRL1), #1547 (RandomForest/ExtraTrees), #1549 (XGBoost sklearn), #1551 (XGBoost native), #1552 (LRL2), #1556 (LRL `penalty`/`n_jobs` deprecations).
+1. **Pinned library versions** — `flaml`, `scikit-learn`, `lightgbm`, `xgboost`, `catboost`, `pandas`, and `numpy` should all be pinned in your serving environment. Mismatches between training-time and serving-time versions of any of these can produce silently divergent predictions even with the same `seed`.
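
One way to catch version drift early is to compare installed package versions against the versions recorded at training time when the serving process starts. The sketch below uses only the standard library; `find_version_mismatches` and its `get_version` parameter are hypothetical names introduced for illustration and are not part of FLAML.

```python
from importlib import metadata
from typing import Callable, Dict


def find_version_mismatches(
    pinned: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, str]:
    """Return {package: "expected X, found Y"} for every mismatch."""
    mismatches = {}
    for package, expected in pinned.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = "not installed"
        if found != expected:
            mismatches[package] = f"expected {expected}, found {found}"
    return mismatches


# At training time, record what was actually installed (skipping absent packages).
tracked = ["flaml", "scikit-learn", "lightgbm", "xgboost", "pandas", "numpy"]
recorded = {}
for name in tracked:
    try:
        recorded[name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        pass

# At serving time, an empty dict means the environment matches the record.
print(find_version_mismatches(recorded))
```

In practice `recorded` would be written to a file stored next to the model artifact, and the serving process would refuse to start (or log loudly) when the returned dict is non-empty.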
+
+A minimal training-environment `requirements.txt` snippet:
+
+```text
+flaml==2.6.0
+scikit-learn==1.8.0
+lightgbm>=4.0,<5.0
+xgboost>=2.0,<3.0
+pandas>=2.0,<3.0
+numpy>=1.26,<3.0
+```
+
+When you ship a model, ship the corresponding `requirements.txt` (or `conda-lock.yml`) alongside the pickle/MLflow artifact and use the same versions to instantiate the serving environment.
+
+## 8. Common gotchas — quick reference
+
+| Symptom | Cause | Fix |
+| --- | --- | --- |
+| `predict()` on a DataFrame returns different codes than at fit time | Predict-time DataFrame had a different subset of categorical values | Use a FLAML version that includes #1561; or pin categories upstream via `OrdinalEncoder` |
+| `UserWarning: Column '...' contains values unseen at fit time` | New category at inference time | Decide policy: alert, retrain, or fall back to default |
+| `automl.model.estimators_[i].predict(X)` raises on categorical input | Component model expects preprocessed input | Call `automl.preprocess(X)` first |
+| `MultiOutputRegressor(AutoML(...))` ignores `X_val` | Per-target inner `AutoML.fit` doesn't see validation kwargs | Use a custom splitter on the concatenated dataset |
+| `AttributeError: 'AutoMLState' has no attribute 'sample_weight_all'` on `retrain_full=True` | Pre-#1554 bug | Upgrade to a FLAML version that includes #1554 |
+| MLflow autolog'd model loads as an unfitted `Pipeline` | Older example assumed an autolog artifact path that no longer reliably reloads | Use the explicit `mlflow.sklearn.log_model(automl, artifact_path=...)` pattern in §1.2 |
+
+See also: [Best-Practices](../Best-Practices), [Task-Oriented AutoML](Task-Oriented-AutoML), [FAQ](../FAQ).
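
As a supplement to the upstream-encoding workflow recommended in section 3.4, here is a minimal sketch of what pinning categories with sklearn's `OrdinalEncoder` could look like. The column name and values are made up for illustration, and the step of passing the encoded frame into `AutoML.fit` is left out so the example runs with scikit-learn and pandas alone.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and serving batches; "LATAM" never appears in training.
train = pd.DataFrame({"region": ["EU", "US", "EU", "APAC"]})
serve = pd.DataFrame({"region": ["US", "LATAM"]})

# Unknown values are mapped to -1 instead of raising at transform time.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = encoder.fit_transform(train[["region"]])
serve_encoded = encoder.transform(serve[["region"]])

print(encoder.categories_[0].tolist())
print(serve_encoded.ravel().tolist())
```

The fitted `encoder` would need to be persisted alongside the model artifact so that serving applies exactly the same mapping. Rows encoded as `-1` can then be counted or rejected before they reach the model.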