TensorFlow 数据集¶

设置¶

In [ ]:

Copied!

pip install ydf -U
pip install ydf -U

什么是 tf.data.Dataset？¶

tf.data.Dataset 是 TensorFlow 和 JAX 机器学习库的运行时数据集格式。它使得从许多不同格式加载数据集并对其应用转换变得简单。Yggdrasil 决策森林（YDF）可以本地使用 tf.data.Dataset。

tf.data.Dataset 不应与 tf.Dataset 混淆，后者是 ML 从业人员数据集的集合。请注意，tf.Dataset 中的一些数据集也可以作为 tf.data.Dataset 使用。

在将 tf.data.Dataset 与 YDF 一起使用时：

确保数据集是有限的，即它不会无限重复。
不要对数据集进行洗牌。
与神经网络不同，数据集的批量大小不会影响 YDF 模型。然而，小批量大小可能会导致 TensorFlow 变得缓慢。因此，建议使用较大的批量大小。例如，1000 是一个好的经验值。

创建 tf.data.Dataset¶

创建 tf.data.Datasets 有多种方法。在这里，我们使用 tf.data.Dataset.from_tensor_slices 将一个 Python 列表数组转换为 tf.data.Dataset。这仅仅是为了示例，因为直接将 NumPy 数组传递给 YDF 更有效。

In [1]:

Copied!

import ydf
import numpy as np
import tensorflow as tf
import ydf
import numpy as np
import tensorflow as tf

2023-11-19 18:08:44.092683: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-19 18:08:44.143396: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-19 18:08:44.144583: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-19 18:08:45.101126: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

让我们下载一个存储在TFRecord格式的数据集。TFRecord是一种容器格式，通常用于存储序列化的TensorFlow示例原型。TFRecord文件通常使用gzip压缩。当打开一个压缩的TFRecord文件时，您必须指定compression_type以避免出现无效文件错误。

In [3]:

Copied!

!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_train.recordio.gz -q
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_test.recordio.gz -q
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_train.recordio.gz -q
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_test.recordio.gz -q

与 pandas.read_csv 不同，当使用 tf.data.Dataset 读取 TFRecord 时，您必须指定要加载的特征。

In [18]:

Copied!





def create_tf_data_dataset(path):
    serialized_examples = tf.data.TFRecordDataset(filenames=[path], compression_type="GZIP")

    def parse_tf_example(serialized_example):
        """解析二进制序列化的tf.Example。"""
        return tf.io.parse_single_example(
            serialized_example,
            {
                "age": tf.io.FixedLenFeature([], dtype=tf.int64),
                "capital_gain": tf.io.FixedLenFeature([], dtype=tf.int64),
                "hours_per_week": tf.io.FixedLenFeature([], dtype=tf.int64),
                "workclass": tf.io.FixedLenFeature([], dtype=tf.string, default_value=""),
                "education": tf.io.FixedLenFeature([], dtype=tf.string),
                "income": tf.io.FixedLenFeature([], dtype=tf.string),
                # 这些只是数据集中可用的一些功能。
            }
        )

    return serialized_examples.map(parse_tf_example)

non_batched_train_ds = create_tf_data_dataset("adult_train.recordio.gz")
non_batched_test_ds = create_tf_data_dataset("adult_train.recordio.gz")
def create_tf_data_dataset(path):
    serialized_examples = tf.data.TFRecordDataset(filenames=[path], compression_type="GZIP")

    def parse_tf_example(serialized_example):
        """解析二进制序列化的tf.Example。"""
        return tf.io.parse_single_example(
            serialized_example,
            {
                "age": tf.io.FixedLenFeature([], dtype=tf.int64),
                "capital_gain": tf.io.FixedLenFeature([], dtype=tf.int64),
                "hours_per_week": tf.io.FixedLenFeature([], dtype=tf.int64),
                "workclass": tf.io.FixedLenFeature([], dtype=tf.string, default_value=""),
                "education": tf.io.FixedLenFeature([], dtype=tf.string),
                "income": tf.io.FixedLenFeature([], dtype=tf.string),
                # 这些只是数据集中可用的一些功能。
            }
        )

    return serialized_examples.map(parse_tf_example)

non_batched_train_ds = create_tf_data_dataset("adult_train.recordio.gz")
non_batched_test_ds = create_tf_data_dataset("adult_train.recordio.gz")

在应用 batch 操作符之前，检查加载的示例会更容易。

In [13]:

Copied!

for example in non_batched_train_ds.take(5):
    print(example)
for example in non_batched_train_ds.take(5):
    print(example)

{'age': <tf.Tensor: shape=(), dtype=int64, numpy=44>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'7th-8th'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=40>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=20>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'Some-college'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=20>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=40>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'HS-grad'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=37>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=30>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'Some-college'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=50>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=67>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=20051>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'HS-grad'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=30>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'>50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Self-emp-inc'>}

正如之前提到的，批量大小不会影响模型。1000是一个很好的默认值。

In [19]:

Copied!

train_ds = non_batched_train_ds.batch(1000)
test_ds = non_batched_test_ds.batch(1000)
train_ds = non_batched_train_ds.batch(1000)
test_ds = non_batched_test_ds.batch(1000)

训练模型¶

所有YDF方法（例如，训练、评估、分析）本质上都可以使用tf.data.Dataset。

In [20]:

Copied!

learner = ydf.GradientBoostedTreesLearner(label="income")
model = learner.train(train_ds)
learner = ydf.GradientBoostedTreesLearner(label="income")
model = learner.train(train_ds)

Warning: Column 'age' with NUMERICAL semantic has dtype int64. Casting value to float32.

WARNING:absl:Column 'age' with NUMERICAL semantic has dtype int64. Casting value to float32.

Train model on 22792 examples
Model trained in 0:00:05.323891

我们可以接着评估模型。

In [21]:

Copied!

evaluation = model.evaluate(test_ds)
evaluation
evaluation = model.evaluate(test_ds)
evaluation

Out[21]:

accuracy:

0.839286

AUC: '>50K' vs others:

0.878923

PR-AUC: '>50K' vs others:

0.744216

loss:

0.34535

num examples:

22792

num examples (weighted):

22792

Confusion matrix

Label \ Pred	<=50K	>50K
<=50K	16526	782
>50K	2881	2603