GradientBoostedTreesModel
- GradientBoostedTreesModel
- activation
- add_tree
- analyze
- analyze_prediction
- benchmark
- data_spec
- describe
- distance
- evaluate
- force_engine
- get_all_trees
- get_tree
- hyperparameter_optimizer_logs
- initial_predictions
- input_feature_names
- input_features
- iter_trees
- label
- label_classes
- list_compatible_engines
- metadata
- name
- num_trees
- num_trees_per_iteration
- plot_tree
- predict
- predict_leaves
- print_tree
- remove_tree
- save
- self_evaluation
- serialize
- set_initial_predictions
- set_metadata
- set_node_format
- set_tree
- task
- to_cpp
- to_docker
- to_jax_function
- to_tensorflow_function
- to_tensorflow_saved_model
- update_with_jax_params
- validation_evaluation
- validation_loss
- variable_importances
GradientBoostedTreesModel
Bases: DecisionForestModel
A Gradient Boosted Trees model for prediction and inspection.
add_tree
add_tree(tree: Tree) -> None
Adds a single tree to the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree | Tree | New tree. | required |
analyze
analyze(data: InputDataset, sampling: float = 1.0, num_bins: int = 50, partial_depepence_plot: bool = True, conditional_expectation_plot: bool = True, permutation_variable_importance_rounds: int = 1, num_threads: Optional[int] = None, maximum_duration: Optional[float] = 20) -> Analysis
Analyzes a model on a test dataset.
An analysis contains structural information about the model (e.g., variable importances) and information about the application of the model to the given dataset (e.g., partial dependence plots).
For a large dataset (many examples and/or features), computing the analysis can take significant time.
While some information might be valid, it is generally not recommended to analyze a model on its training dataset.
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
analysis = model.analyze(test_ds)
# Display the analysis in a notebook.
analysis
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | InputDataset | Dataset. Supported formats: VerticalDataset, (typed) path, list of (typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset, PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of string to NumPy array or lists. | required |
sampling | float | Ratio of examples to use for the analysis. The analysis can be expensive to compute. On large datasets, use a small sampling value, e.g., 0.01. | 1.0 |
num_bins | int | Number of bins used to accumulate statistics. A large value increases the resolution of the plots but takes more time to compute. | 50 |
partial_depepence_plot | bool | Compute partial dependence plots, a.k.a. PDPs. Expensive to compute. | True |
conditional_expectation_plot | bool | Compute conditional expectation plots, a.k.a. CEPs. Cheap to compute. | True |
permutation_variable_importance_rounds | int | If >1, computes permutation variable importances using "permutation_variable_importance_rounds" rounds. The more rounds, the more accurate the results. A single round is often acceptable, i.e., permutation_variable_importance_rounds=1. If set to 0, disables the computation of permutation variable importances. | 1 |
num_threads | Optional[int] | Number of threads to use to compute the analysis. | None |
maximum_duration | Optional[float] | Maximum duration of the analysis in seconds. Note that the analysis can last a little longer than this value. | 20 |
Returns:
Type | Description |
---|---|
Analysis | Model analysis. |
analyze_prediction
analyze_prediction(single_example: InputDataset) -> PredictionAnalysis
Understands a single prediction of the model.
Note: To explain the model as a whole, use model.analyze instead.
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
# We want to explain the model prediction on the first test example.
selected_example = test_ds.iloc[:1]
analysis = model.analyze_prediction(selected_example)
# Display the analysis in a notebook.
analysis
Parameters:
Name | Type | Description | Default |
---|---|---|---|
single_example | InputDataset | Example to explain. Supported formats: VerticalDataset, (typed) path, list of (typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset, PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of string to NumPy array or lists. | required |
Returns:
Type | Description |
---|---|
PredictionAnalysis | Prediction explanation. |
benchmark
benchmark(ds: InputDataset, benchmark_duration: float = 3, warmup_duration: float = 1, batch_size: int = 100, num_threads: Optional[int] = None) -> BenchmarkInferenceCCResult
Benchmarks the inference speed of the model on the given dataset.
This benchmark creates batched predictions on the given dataset using the C++ API of Yggdrasil Decision Forests. Note that inference times using other APIs or on different machines will differ. A serving template for the C++ API can be generated with model.to_cpp().
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ds | InputDataset | Dataset to perform the benchmark on. | required |
benchmark_duration | float | Total duration of the benchmark in seconds. Note that this number is only indicative; the actual duration of the benchmark may be shorter or longer. This parameter must be > 0. | 3 |
warmup_duration | float | Total duration of the warmup runs before the benchmark in seconds. During the warmup phase, the benchmark is run without being timed. This allows warming up caches. The benchmark will always run at least one batch for warmup. This parameter must be > 0. | 1 |
batch_size | int | Size of batches when feeding examples to the inference engine. The impact of this parameter on the results depends on the architecture running the benchmark (notably, cache sizes). | 100 |
num_threads | Optional[int] | Number of threads used for the multi-threaded benchmark. If not specified, the number of threads is set to the number of CPU cores. | None |
Returns:
Type | Description |
---|---|
BenchmarkInferenceCCResult | Benchmark results. |
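A minimal usage sketch (assuming a trained model and a "test.csv" dataset, as in the other examples on this page):
import pandas as pd
import ydf
model = ydf.GradientBoostedTreesLearner(label="label").train(pd.read_csv("train.csv"))
test_ds = pd.read_csv("test.csv")
# Run the benchmark and print the measured inference speed.
benchmark_result = model.benchmark(test_ds)
print(benchmark_result)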
describe
describe(output_format: Literal['auto', 'text', 'notebook', 'html'] = 'auto', full_details: bool = False) -> Union[str, HtmlNotebookDisplay]
Description of the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_format | Literal['auto', 'text', 'notebook', 'html'] | Format of the display: "auto" uses the "notebook" format if executed in an IPython notebook / Colab and the "text" format otherwise; "text" is a text description of the model; "html" is an HTML description of the model; "notebook" is an HTML description of the model displayed in a notebook cell. | 'auto' |
full_details | bool | Whether to print the full model. The output can be large. | False |
Returns:
Type | Description |
---|---|
Union[str, HtmlNotebookDisplay] | The model description. |
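A minimal usage sketch:
# In a notebook, display a rich HTML description of the model.
model.describe()
# In a script or terminal, request the text format explicitly.
print(model.describe(output_format="text"))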
distance
distance(data1: InputDataset, data2: Optional[InputDataset] = None) -> ndarray
Computes the pairwise distance between examples in "data1" and "data2".
If "data2" is not provided, computes the pairwise distance between examples in "data1".
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
distances = model.distance(test_ds, train_ds)
# "distances[i,j]" is the distance between the i-th test example and the
# j-th train example.
Different models are free to implement distances with different definitions. For this reason, unless indicated by the model, distances from different models cannot be compared.
The distance is not guaranteed to satisfy the triangle inequality of metric distances.
Not all models can compute distances. In that case, this function raises an exception.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data1 | InputDataset | Dataset. Can be a dictionary of lists or NumPy arrays of values, a Pandas DataFrame, or a VerticalDataset. | required |
data2 | Optional[InputDataset] | Dataset. Can be a dictionary of lists or NumPy arrays of values, a Pandas DataFrame, or a VerticalDataset. | None |
Returns:
Type | Description |
---|---|
ndarray | Pairwise distances. |
evaluate
evaluate(data: InputDataset, *, weighted: Optional[bool] = None, task: Optional[Task] = None, label: Optional[str] = None, group: Optional[str] = None, bootstrapping: Union[bool, int] = False, ndcg_truncation: int = 5, evaluation_task: Optional[Task] = None, use_slow_engine: bool = False, num_threads: Optional[int] = None) -> Evaluation
Evaluates the quality of a model on a dataset.
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
evaluation = model.evaluate(test_ds)
In a notebook, if a cell returns an evaluation object, the evaluation is displayed as rich HTML with plots:
evaluation = model.evaluate(test_ds)
# If the model is an anomaly detection model:
# evaluation = model.evaluate(test_ds,
#                             evaluation_task=ydf.Task.CLASSIFICATION)
evaluation
It is possible to evaluate the model differently than it was trained. For example, you can change the label, task and group.
...
# Train a regression model
model = ydf.RandomForestLearner(label="label",
task=ydf.Task.REGRESSION).train(train_ds)
# Evaluate the model as a regression model
regression_evaluation = model.evaluate(test_ds)
# Evaluate the model as a ranking model
ranking_evaluation = model.evaluate(test_ds,
task=ydf.Task.RANKING, group="group_column")
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | InputDataset | Dataset. Supported formats: VerticalDataset, (typed) path, list of (typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset, PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of string to NumPy array or lists. | required |
weighted | Optional[bool] | If true, the evaluation is weighted according to the training weights. If false, the evaluation is non-weighted. | None |
task | Optional[Task] | Override the task of the model during the evaluation. If None (default), the model is evaluated according to its training task. | None |
label | Optional[str] | Override the label used to evaluate the model. If None (default), use the model's label. | None |
group | Optional[str] | Override the group used to evaluate the model. If None (default), use the model's group. Only used for ranking models. | None |
bootstrapping | Union[bool, int] | Controls whether bootstrapping is used to evaluate the confidence intervals and statistical tests (i.e., all the metrics ending with "[B]"). If set to false, bootstrapping is disabled. If set to true, bootstrapping is enabled and 2000 bootstrapping samples are used. If set to an integer, it specifies the number of bootstrapping samples to use. In this case, if the number is less than 100, an error is raised as bootstrapping will not yield useful results. | False |
ndcg_truncation | int | Controls at which ranking position the NDCG loss should be truncated. Defaults to 5. Ignored for non-ranking models. | 5 |
evaluation_task | Optional[Task] | Deprecated. Use task instead. | None |
use_slow_engine | bool | If true, uses the slow engine for making predictions. The slow engine of YDF is an order of magnitude slower than the other prediction engines. There exist very rare edge cases where predictions with the regular engines fail, e.g., models with a very large number of categorical conditions. Only in these cases should users use the slow engine and report the issue to the YDF developers. | False |
num_threads | Optional[int] | Number of threads used to run the model. | None |
Returns:
Type | Description |
---|---|
Evaluation | Model evaluation. |
force_engine
force_engine(engine_name: Optional[str]) -> None
Forces the engine used by the model.
If not specified (i.e., None; default value), the fastest compatible engine (i.e., the first value returned by "list_compatible_engines") is used for all model inferences (e.g., model.predict, model.evaluate).
If a non-existing or non-compatible engine is passed, the next model inference (e.g., model.predict, model.evaluate) will fail.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine_name | Optional[str] | Name of a compatible engine or None to automatically select the fastest engine. | required |
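A minimal sketch pairing list_compatible_engines and force_engine (the engine names shown are illustrative; use the values returned by your own model):
# List the inference engines compatible with the model, fastest first.
print(model.list_compatible_engines())
# e.g., ['GradientBoostedTreesQuickScorerExtended', 'GradientBoostedTreesGeneric']
# Force a specific engine for all subsequent inferences.
model.force_engine('GradientBoostedTreesGeneric')
# Revert to automatic selection of the fastest engine.
model.force_engine(None)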
get_tree
hyperparameter_optimizer_logs
hyperparameter_optimizer_logs() -> Optional[OptimizerLogs]
Returns the logs of the hyper-parameter tuning.
If the model is not trained with hyper-parameter tuning, returns None.
initial_predictions
initial_predictions() -> NDArray[float]
Returns the model's initial predictions (i.e. the model bias).
input_feature_names
input_feature_names() -> List[str]
Returns the names of the input features.
The features are sorted in increasing order of column_idx.
input_features
input_features() -> Sequence[InputFeature]
Returns the input features of the model.
The features are sorted in increasing order of column_idx.
label_classes
Returns the label classes for a classification model; fails otherwise.
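A minimal sketch (the class names are illustrative):
# For a binary classification model trained on a label with classes "no"/"yes":
print(model.label_classes())  # e.g., ['no', 'yes']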
list_compatible_engines
metadata
metadata() -> ModelMetadata
Metadata associated with the model.
A model's metadata contains information stored with the model that does not influence the model's predictions (e.g., the creation date). When distributing a model for wide release, it may be useful to clear or modify the model metadata with model.set_metadata(ydf.ModelMetadata()).
Returns:
Type | Description |
---|---|
ModelMetadata | The model's metadata. |
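A minimal sketch of reading and clearing the metadata:
# Inspect the metadata stored with the model.
print(model.metadata())
# Strip the metadata before distributing the model.
model.set_metadata(ydf.ModelMetadata())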
num_trees_per_iteration
num_trees_per_iteration() -> int
The number of trees trained per gradient boosting iteration.
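For example (a sketch; a three-class label is assumed):
# For binary classification and regression, one tree is trained per
# iteration. For multi-class classification with K classes, K trees are
# trained per iteration (one per class).
print(model.num_trees_per_iteration())  # e.g., 3 for a 3-class label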
plot_tree
plot_tree(tree_idx: int = 0, max_depth: Optional[int] = None, options: Optional[PlotOptions] = None, d3js_url: str = 'https://d3js.org/d3.v6.min.js') -> TreePlot
Plots an interactive HTML rendering of the tree.
Usage example:
import pandas as pd
import ydf
# Create a dataset
train_ds = pd.DataFrame({
"c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
"label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Plot the tree in Colab
model.plot_tree()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree_idx | int | Index of the tree. Should be in [0, self.num_trees()). | 0 |
max_depth | Optional[int] | Maximum tree depth of the plot. Set to None for full depth. | None |
options | Optional[PlotOptions] | Advanced options for plotting. Set to None for the default style. | None |
d3js_url | str | URL to load the d3.js library from. | 'https://d3js.org/d3.v6.min.js' |
Returns:
Type | Description |
---|---|
TreePlot | In interactive environments, an interactive plot. The HTML source can also be exported to a file. |
predict
predict(data: InputDataset, *, use_slow_engine: bool = False, num_threads: Optional[int] = None) -> ndarray
Returns the predictions of the model on the given dataset.
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
predictions = model.predict(test_ds)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | InputDataset | Dataset. Supported formats: VerticalDataset, (typed) path, list of (typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset, PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of string to NumPy array or lists. If the dataset contains the label column, that column is ignored. | required |
use_slow_engine | bool | If true, uses the slow engine for making predictions. The slow engine of YDF is an order of magnitude slower than the other prediction engines. There exist very rare edge cases where predictions with the regular engines fail, e.g., models with a very large number of categorical conditions. Only in these cases should users use the slow engine and report the issue to the YDF developers. | False |
num_threads | Optional[int] | Number of threads used to run the model. | None |
predict_leaves
predict_leaves(data: InputDataset) -> ndarray
Gets the index of the active leaf in each tree.
The active leaf is the leaf that receives the example during inference.
The returned value "leaves[i, j]" is the index of the active leaf for the i-th example and the j-th tree. Leaves are indexed by depth-first exploration, with the negative child visited before the positive one.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | InputDataset | Dataset. | required |
Returns:
Type | Description |
---|---|
ndarray | Index of the active leaf for each tree in the model. |
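A minimal sketch showing the shape of the output (assuming a test dataset as in the other examples):
leaves = model.predict_leaves(test_ds)
# "leaves[i, j]" is the index of the active leaf for the i-th example in the
# j-th tree, so the array has shape [num examples, num trees].
print(leaves.shape)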
print_tree
print_tree(tree_idx: int = 0, max_depth: Optional[int] = 6, file=sys.stdout) -> None
Prints a tree in the terminal.
Usage example:
import pandas as pd
import ydf
# Create a dataset
train_ds = pd.DataFrame({
"c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
"label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Print the tree
model.print_tree()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree_idx | int | Index of the tree. Should be in [0, self.num_trees()). | 0 |
max_depth | Optional[int] | Maximum tree depth to print. Set to None for full depth. | 6 |
file | | Where to print the tree. By default, prints to the terminal's standard output. | stdout |
remove_tree
remove_tree(tree_idx: int) -> None
Removes a single tree from the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree_idx | int | Index of the tree. Should be in [0, num_trees()). | required |
save
save(path, advanced_options=ModelIOOptions()) -> None
Save the model to disk.
YDF uses a proprietary model format for saving models. A model consists of multiple files located in the same directory. A directory should only contain a single YDF model. See advanced_options for more information.
YDF models can also be exported to other formats; see to_tensorflow_saved_model() and to_cpp() for details.
YDF saves some metadata inside the model; see model.metadata() for details. Before distributing a model to the world, consider removing metadata with model.set_metadata(ydf.ModelMetadata()).
Usage example:
import pandas as pd
import ydf
# Train a Random Forest model
df = pd.read_csv("my_dataset.csv")
model = ydf.RandomForestLearner(label="label").train(df)
# Save the model to disk
model.save("/models/my_model")
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | | Path to directory to store the model in. | required |
advanced_options | | Advanced options for saving models. | ModelIOOptions() |
self_evaluation
self_evaluation() -> Optional[Evaluation]
Returns the model's self-evaluation.
For Gradient Boosted Trees models, the self-evaluation is the evaluation on the validation dataset. Note that the validation dataset is extracted automatically if not explicitly given. If the validation dataset is deactivated, no self-evaluation is computed.
Different models use different methods for self-evaluation. Notably, Random Forests use the last Out-Of-Bag evaluation. Therefore, self-evaluations are not comparable between different model types.
Returns None if no self-evaluation has been computed.
Usage example:
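(A minimal sketch; "train.csv" and its "label" column are assumed, as in the other examples on this page.)
import pandas as pd
import ydf
model = ydf.GradientBoostedTreesLearner(label="label").train(pd.read_csv("train.csv"))
self_evaluation = model.self_evaluation()
# In a notebook, display the self-evaluation as rich HTML.
self_evaluation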
serialize
serialize() -> bytes
Serializes the model to a sequence of bytes (i.e., bytes).
A serialized model is equivalent to a model saved with model.save. It can possibly contain meta-data related to model training and interpretation. To minimize the size of a serialized model, remove this meta-data by passing the argument pure_serving_model=True to the train method.
Usage example:
import pandas as pd
import ydf
# Create a model
dataset = pd.DataFrame({"feature": [0, 1], "label": [0, 1]})
learner = ydf.RandomForestLearner(label="label")
model = learner.train(dataset)
# Serialize model
# Note: serialized_model is a bytes object.
serialized_model = model.serialize()
# Deserialize model
deserialized_model = ydf.deserialize_model(serialized_model)
# Make predictions
model.predict(dataset)
deserialized_model.predict(dataset)
Returns:
Type | Description |
---|---|
bytes | The serialized model. |
set_initial_predictions
set_initial_predictions(initial_predictions: Sequence[float]) -> None
Sets the model's initial predictions (i.e. the model bias).
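A minimal sketch pairing initial_predictions and set_initial_predictions (the scaling is purely illustrative):
# Read the current initial predictions (the model bias).
bias = model.initial_predictions()
# Overwrite them, e.g., after external tuning.
model.set_initial_predictions(bias * 0.5)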
set_node_format
set_node_format(node_format: NodeFormat) -> None
Sets the serialization format for the nodes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
node_format | NodeFormat | Node format to use when saving the model. | required |
set_tree
to_cpp
to_cpp(key: str = 'my_model') -> str
Generates the code of a .h file to run the model in C++.
How to use this function:
- Copy the output of this function in a new .h file, e.g., open("model.h", "w").write(model.to_cpp()).
- If you use Bazel/Blaze, create a rule with the dependencies: //third_party/absl/status:statusor, //third_party/absl/strings, //external/ydf_cc/yggdrasil_decision_forests/api:serving.
- In your C++ code, include the .h file and call the model with:
// Load the model (to do only once).
namespace ydf = yggdrasil_decision_forests;
const auto model = ydf::exported_model_123::Load(<path to model>);
// Run the model.
predictions = model.Predict();
- The generated "Predict" function takes no inputs. Instead, it fills the input features with placeholder values. Therefore, you will want to add your inputs as arguments to the "Predict" function and use them to populate the "examples->Set..." section accordingly.
- (Bonus) You can further optimize the inference speed by pre-allocating and re-using the examples and predictions for each thread running the model.
This documentation is also available in the header of the generated content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | Name of the model. Used to define the C++ namespace of the model. | 'my_model' |
Returns:
Type | Description |
---|---|
str | String containing an example header for running the model in C++. |
to_docker
to_docker(path: str, exist_ok: bool = False) -> None
Exports the model to a Docker endpoint deployable on Cloud.
This function creates a directory containing a Dockerfile, the model and support files.
Usage example:
import numpy as np
import ydf
# Train a model.
model = ydf.RandomForestLearner(label="l").train({
"f1": np.random.random(size=100),
"f2": np.random.random(size=100),
"l": np.random.randint(2, size=100),
})
# Export the model to a Docker endpoint.
model.to_docker(path="/tmp/my_model")
# Print instructions on how to use the model
!cat /tmp/my_model/readme.md
# Test the end-point locally
docker build --platform linux/amd64 -t ydf_predict_image /tmp/my_model
docker run --rm -p 8080:8080 -d ydf_predict_image
# Deploy the model on Google Cloud
gcloud run deploy ydf-predict --source /tmp/my_model
# Check the automatically created utility scripts "test_locally.sh" and
# "deploy_in_google_cloud.sh" for more examples.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Directory where to create the Docker endpoint. | required |
exist_ok | bool | If false (default), fails if the directory already exists. If true, overrides the directory content, if any. | False |
to_jax_function
to_jax_function(jit: bool = True, apply_activation: bool = True, leaves_as_params: bool = False, compatibility: Union[str, Compatibility] = 'XLA') -> JaxModel
Converts the YDF model into a JAX function.
Usage example:
import ydf
import numpy as np
import jax.numpy as jnp
# Train a model.
model = ydf.GradientBoostedTreesLearner(label="l").train({
"f1": np.random.random(size=100),
"f2": np.random.random(size=100),
"l": np.random.randint(2, size=100),
})
# Convert model to a JAX function.
jax_model = model.to_jax_function()
# Make predictions with the JAX function.
jax_predictions = jax_model.predict({
"f1": jnp.array([0, 0.5, 1]),
"f2": jnp.array([1, 0, 0.5]),
})
Parameters:
Name | Type | Description | Default |
---|---|---|---|
jit | bool | If true, compiles the function with @jax.jit. | True |
apply_activation | bool | Should the activation function, if any, be applied on the model output. | True |
leaves_as_params | bool | If true, exports the leaf values as learnable parameters. In this case, the learnable parameters are available in JaxModel.params (see update_with_jax_params). | False |
compatibility | Union[str, Compatibility] | Constraint on the YDF-to-JAX conversion's runtime compatibility. Can be "XLA" (default) or "TFL" (for TensorFlow Lite). | 'XLA' |
Returns:
Type | Description |
---|---|
JaxModel | A dataclass containing the JAX prediction function (predict) and optionally the model parameters (params) and feature encoder (encoder). |
to_tensorflow_function
to_tensorflow_function(temp_dir: Optional[str] = None, can_be_saved: bool = True, squeeze_binary_classification: bool = True, force: bool = False) -> Module
Converts the YDF model into a @tf.function callable TensorFlow Module.
The output module can be composed with other TensorFlow operations, including other models serialized with to_tensorflow_function.
This function requires TensorFlow and TensorFlow Decision Forests to be installed. You can install them with the command pip install tensorflow_decision_forests. The generated module relies on the TensorFlow Decision Forests Custom Inference Op. This Op is available by default in various platforms such as Servomatic, TensorFlow Serving, Vertex AI, and TensorFlow.js.
Usage example:
!pip install tensorflow_decision_forests
import ydf
import numpy as np
import tensorflow as tf
# Train a model.
model = ydf.RandomForestLearner(label="l").train({
"f1": np.random.random(size=100),
"f2": np.random.random(size=100),
"l": np.random.randint(2, size=100),
})
# Convert model to a TF module.
tf_model = model.to_tensorflow_function()
# Make predictions with the TF module.
tf_predictions = tf_model({
"f1": tf.constant([0, 0.5, 1]),
"f2": tf.constant([1, 0, 0.5]),
})
Parameters:
Name | Type | Description | Default |
---|---|---|---|
temp_dir | Optional[str] | Temporary directory used during the conversion. If None (default), uses the system's default temporary directory. | None |
can_be_saved | bool | If can_be_saved = True (default), the returned module can be saved with tf.saved_model.save. | True |
squeeze_binary_classification | bool | If true (default), in the case of binary classification, outputs a tensor of shape [num examples] containing the probability of the positive class. If false, outputs a tensor of shape [num examples, 2] containing the probabilities of both the negative and positive classes. Has no effect on non-binary-classification models. | True |
force | bool | Try to export even in currently unsupported environments. | False |
Returns:
Type | Description |
---|---|
Module | A TensorFlow @tf.function. |
to_tensorflow_saved_model
to_tensorflow_saved_model(path: str, input_model_signature_fn: Any = None, *, mode: Literal['keras', 'tf'] = 'keras', feature_dtypes: Dict[str, TFDType] = {}, servo_api: bool = False, feed_example_proto: bool = False, pre_processing: Optional[Callable] = None, post_processing: Optional[Callable] = None, temp_dir: Optional[str] = None, tensor_specs: Optional[Dict[str, Any]] = None, feature_specs: Optional[Dict[str, Any]] = None, force: bool = False) -> None
Exports the model as a TensorFlow Saved model.
This function requires TensorFlow and TensorFlow Decision Forests to be installed. Install them by running the command pip install tensorflow_decision_forests. The generated SavedModel relies on the TensorFlow Decision Forests Custom Inference Op. This Op is available by default in various platforms such as Servomatic, TensorFlow Serving, Vertex AI, and TensorFlow.js.
Usage example:
!pip install tensorflow_decision_forests
import ydf
import numpy as np
import tensorflow as tf
# Train a model.
model = ydf.RandomForestLearner(label="l").train({
"f1": np.random.random(size=100),
"f2": np.random.random(size=100).astype(dtype=np.float32),
"l": np.random.randint(2, size=100),
})
# Export the model to the TensorFlow SavedModel format.
# The model can be executed with Servomatic, TensorFlow Serving and
# Vertex AI.
model.to_tensorflow_saved_model(path="/tmp/my_model", mode="tf")
# The model can also be loaded in TensorFlow and executed locally.
# Load the TensorFlow Saved model.
tf_model = tf.saved_model.load("/tmp/my_model")
# Make predictions
tf_predictions = tf_model({
"f1": tf.constant(np.random.random(size=10)),
"f2": tf.constant(np.random.random(size=10), dtype=tf.float32),
})
TensorFlow SavedModels do not automatically cast feature values. For instance, a model trained with a dtype=float32 semantic=numerical feature will require this feature to be fed as float32 numbers during inference. You can override the dtype of a feature with the feature_dtypes argument:
model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
# "f1" is fed as an tf.int64 instead of tf.float64
feature_dtypes={"f1": tf.int64},
)
Some TensorFlow Serving or Servomatic pipelines rely on feeding examples as serialized TensorFlow Example protos (instead of raw tensor values) and/or wrap the model's raw output (e.g., probability predictions) into a special structure (the Serving API). You can create models compatible with those two conventions with feed_example_proto=True and servo_api=True respectively:
model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
feed_example_proto=True,
servo_api=True
)
If your model requires some data preprocessing or post-processing, you can express them as a @tf.function or a TF module and pass them to the pre_processing and post_processing arguments respectively.
Warning: When exporting a SavedModel, YDF infers the model signature using the dtype of the features observed during training. If the signature of the pre_processing function is different from the signature of the model (e.g., the processing creates a new feature), you need to specify the tensor specs (tensor_specs; if feed_example_proto=False) or feature specs (feature_specs; if feed_example_proto=True) argument:
# Define a pre-processing function
@tf.function
def pre_processing(raw_features):
features = {**raw_features}
# Create a new feature.
features["sin_f1"] = tf.sin(features["f1"])
# Remove a feature
del features["f1"]
return features
# Create Numpy dataset
raw_dataset = {
"f1": np.random.random(size=100),
"f2": np.random.random(size=100),
"l": np.random.randint(2, size=100),
}
# Apply the preprocessing on the training dataset.
processed_dataset = (
tf.data.Dataset.from_tensor_slices(raw_dataset)
.batch(128) # The batch size has no impact on the model.
.map(pre_processing)
.prefetch(tf.data.AUTOTUNE)
)
# Train a model on the pre-processed dataset.
ydf_model = ydf.RandomForestLearner(
label="l",
task=ydf.Task.CLASSIFICATION,
).train(processed_dataset)
# Export the model to a raw SavedModel model with the pre-processing
ydf_model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
feed_example_proto=False,
pre_processing=pre_processing,
tensor_specs={
"f1": tf.TensorSpec(shape=[None], name="f1", dtype=tf.float64),
"f2": tf.TensorSpec(shape=[None], name="f2", dtype=tf.float64),
}
)
# Export the model to a SavedModel consuming serialized tf examples with the
# pre-processing
ydf_model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
feed_example_proto=True,
pre_processing=pre_processing,
feature_specs={
"f1": tf.io.FixedLenFeature(
shape=[], dtype=tf.float32, default_value=math.nan
),
"f2": tf.io.FixedLenFeature(
shape=[], dtype=tf.float32, default_value=math.nan
),
}
)
For more flexibility, use the method to_tensorflow_function instead of to_tensorflow_saved_model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to store the TensorFlow Decision Forests model. | required |
input_model_signature_fn | Any | A lambda that returns the (Dense,Sparse,Ragged)TensorSpec (or structure of TensorSpecs, e.g., dictionary, list) corresponding to the input signature of the model. If not specified, the input model signature is created automatically from the model's input features. | None |
mode | Literal['keras', 'tf'] | How the YDF model is converted into a TensorFlow SavedModel. 1) mode = "keras" (default): Turn the model into a Keras 2 model using TensorFlow Decision Forests, then save it with the Keras save_model function. 2) mode = "tf": Convert the model into a TensorFlow module and save it with tf.saved_model.save. | 'keras' |
feature_dtypes | Dict[str, TFDType] | Mapping from feature name to TensorFlow dtype. Use this mapping to override a feature's dtype. For instance, numerical features are encoded with tf.float32 by default; if you plan on feeding tf.float64 or tf.int32, use this mapping to specify it. | {} |
servo_api | bool | If true, adds a SavedModel signature to make the model compatible with the Servo API. | False |
feed_example_proto | bool | If false, the model expects the input features to be provided as TensorFlow values. This is the most efficient way to make predictions. If true, the model expects the input features to be provided as binary serialized TensorFlow Example protos. This is the format expected by Vertex AI and most TensorFlow Serving pipelines. | False |
pre_processing | Optional[Callable] | Optional TensorFlow function or module to apply on the input features before applying the model. | None |
post_processing | Optional[Callable] | Optional TensorFlow function or module to apply on the model predictions. Only compatible with mode="tf". | None |
temp_dir | Optional[str] | Temporary directory used during the conversion. If None (default), uses the system's default temporary directory. | None |
tensor_specs | Optional[Dict[str, Any]] | Optional dictionary of tf.TensorSpec defining the input features of the model. Only used when feed_example_proto=False (see the example above). | None |
feature_specs | Optional[Dict[str, Any]] | Optional dictionary of feature specs (e.g., tf.io.FixedLenFeature) defining the input features of the model. Only used when feed_example_proto=True (see the example above). | None |
force | bool | Try to export even in currently unsupported environments. WARNING: Setting this to true may crash the Python runtime. | False |
update_with_jax_params
update_with_jax_params(params: Dict[str, Any]) -> None
Updates the model with JAX params as created by to_jax_function.
Usage example:
import ydf
import numpy as np
import jax.numpy as jnp
# Train a model with YDF
dataset = {
"f1": np.random.random(size=100),
"f2": np.random.random(size=100),
"l": np.random.randint(2, size=100),
}
model = ydf.GradientBoostedTreesLearner(label="l").train(dataset)
# Convert model to a JAX function with leaf values as parameters.
jax_model = model.to_jax_function(
leaves_as_params=True,
apply_activation=True)
# Note: The learnable model parameters are in `jax_model.params`.
# Finetune the model parameters with your own logic.
jax_model.params = fine_tune_model(jax_model.params, ...)
# Update the YDF model with the finetuned parameters
model.update_with_jax_params(jax_model.params)
# Make predictions with the finetuned YDF model
predictions = model.predict(dataset)
# Save the YDF model
model.save("/tmp/my_ydf_model")
Parameters:
Name | Type | Description | Default |
---|---|---|---|
params | Dict[str, Any] | Learnable parameters of the model generated with to_jax_function(leaves_as_params=True). | required |
validation_evaluation
validation_evaluation() -> Optional[Evaluation]
Returns the validation evaluation of the model, if available.
Gradient Boosted Trees use a validation dataset for early stopping.
Returns None if no validation evaluation has been computed or if it has been removed from the model.
Usage example:
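(A minimal sketch; the model is assumed to be a trained Gradient Boosted Trees model.)
validation_evaluation = model.validation_evaluation()
# In a notebook, display the evaluation as rich HTML.
validation_evaluation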
validation_loss
validation_loss() -> Optional[float]
Returns the loss on the validation dataset, if available.
variable_importances
Variable importances to measure the impact of features on the model.
Variable importances generally indicate how much a variable (feature) contributes to the model's predictions or quality. Different variable importances have different semantics and are generally not comparable.
The variable importances returned by variable_importances() depend on the learning algorithm and its hyper-parameters. For example, the hyperparameter compute_oob_variable_importances=True of the Random Forest learner enables the computation of permutation out-of-bag variable importances.
Features are sorted by decreasing importance.
Usage example:
# Train a Random Forest. Enable the computation of OOB (out-of-bag) variable
# importances.
model = ydf.RandomForestLearner(compute_oob_variable_importances=True,
label=...).train(ds)
# List the available variable importances.
print(model.variable_importances().keys())
# Show a specific variable importance.
model.variable_importances()["MEAN_DECREASE_IN_ACCURACY"]
>> [("bill_length_mm", 0.0713061951754389),
("island", 0.007298519736842035),
("flipper_length_mm", 0.004505893640351366),
...
Returns:
Type | Description |
---|---|
Dict[str, List[Tuple[float, str]]] | Variable importances. |