In [ ]:
Copied!
pip install ydf -U
pip install ydf -U
什么是 tf.data.Dataset?¶
tf.data.Dataset 是 TensorFlow 和 JAX 机器学习库的运行时数据集格式。它使得从许多不同格式加载数据集并对其应用转换变得简单。Yggdrasil 决策森林(YDF)可以本地使用 tf.data.Dataset。
tf.data.Dataset 不应与 tf.Dataset 混淆,后者是 ML 从业人员数据集的集合。请注意,tf.Dataset 中的一些数据集也可以作为 tf.data.Dataset 使用。
在将 tf.data.Dataset 与 YDF 一起使用时:
- 确保数据集是有限的,即它不会无限重复。
- 不要对数据集进行洗牌。
- 与神经网络不同,数据集的批量大小不会影响 YDF 模型。然而,小批量大小可能会导致 TensorFlow 变得缓慢。因此,建议使用较大的批量大小。例如,1000 是一个好的经验值。
创建 tf.data.Dataset¶
创建 tf.data.Datasets 有多种方法。在这里,我们使用 tf.data.Dataset.from_tensor_slices
将一个 Python 列表数组转换为 tf.data.Dataset。这仅仅是为了示例,因为直接将 NumPy 数组传递给 YDF 更有效。
In [1]:
Copied!
import ydf
import numpy as np
import tensorflow as tf
import ydf
import numpy as np
import tensorflow as tf
2023-11-19 18:08:44.092683: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used. 2023-11-19 18:08:44.143396: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used. 2023-11-19 18:08:44.144583: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-11-19 18:08:45.101126: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
让我们下载一个存储在TFRecord格式的数据集。TFRecord是一种容器格式,通常用于存储序列化的TensorFlow示例原型。TFRecord文件通常使用gzip压缩。当打开一个压缩的TFRecord文件时,您必须指定compression_type
以避免出现无效文件错误。
In [3]:
Copied!
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_train.recordio.gz -q
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_test.recordio.gz -q
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_train.recordio.gz -q
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_test.recordio.gz -q
与 pandas.read_csv
不同,当使用 tf.data.Dataset 读取 TFRecord 时,您必须指定要加载的特征。
In [18]:
Copied!
def create_tf_data_dataset(path):
serialized_examples = tf.data.TFRecordDataset(filenames=[path], compression_type="GZIP")
def parse_tf_example(serialized_example):
"""解析二进制序列化的tf.Example。"""
return tf.io.parse_single_example(
serialized_example,
{
"age": tf.io.FixedLenFeature([], dtype=tf.int64),
"capital_gain": tf.io.FixedLenFeature([], dtype=tf.int64),
"hours_per_week": tf.io.FixedLenFeature([], dtype=tf.int64),
"workclass": tf.io.FixedLenFeature([], dtype=tf.string, default_value=""),
"education": tf.io.FixedLenFeature([], dtype=tf.string),
"income": tf.io.FixedLenFeature([], dtype=tf.string),
# 这些只是数据集中可用的一些功能。
}
)
return serialized_examples.map(parse_tf_example)
non_batched_train_ds = create_tf_data_dataset("adult_train.recordio.gz")
non_batched_test_ds = create_tf_data_dataset("adult_train.recordio.gz")
def create_tf_data_dataset(path):
serialized_examples = tf.data.TFRecordDataset(filenames=[path], compression_type="GZIP")
def parse_tf_example(serialized_example):
"""解析二进制序列化的tf.Example。"""
return tf.io.parse_single_example(
serialized_example,
{
"age": tf.io.FixedLenFeature([], dtype=tf.int64),
"capital_gain": tf.io.FixedLenFeature([], dtype=tf.int64),
"hours_per_week": tf.io.FixedLenFeature([], dtype=tf.int64),
"workclass": tf.io.FixedLenFeature([], dtype=tf.string, default_value=""),
"education": tf.io.FixedLenFeature([], dtype=tf.string),
"income": tf.io.FixedLenFeature([], dtype=tf.string),
# 这些只是数据集中可用的一些功能。
}
)
return serialized_examples.map(parse_tf_example)
non_batched_train_ds = create_tf_data_dataset("adult_train.recordio.gz")
non_batched_test_ds = create_tf_data_dataset("adult_train.recordio.gz")
在应用 batch
操作符之前,检查加载的示例会更容易。
In [13]:
Copied!
for example in non_batched_train_ds.take(5):
print(example)
for example in non_batched_train_ds.take(5):
print(example)
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=44>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'7th-8th'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=40>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>} {'age': <tf.Tensor: shape=(), dtype=int64, numpy=20>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'Some-college'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=20>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>} {'age': <tf.Tensor: shape=(), dtype=int64, numpy=40>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'HS-grad'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=37>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>} {'age': <tf.Tensor: shape=(), dtype=int64, numpy=30>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'Some-college'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=50>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>} {'age': <tf.Tensor: shape=(), dtype=int64, numpy=67>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=20051>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'HS-grad'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=30>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'>50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Self-emp-inc'>}
正如之前提到的,批量大小不会影响模型。1000是一个很好的默认值。
In [19]:
Copied!
train_ds = non_batched_train_ds.batch(1000)
test_ds = non_batched_test_ds.batch(1000)
train_ds = non_batched_train_ds.batch(1000)
test_ds = non_batched_test_ds.batch(1000)
训练模型¶
所有YDF方法(例如,训练、评估、分析)本质上都可以使用tf.data.Dataset。
In [20]:
Copied!
learner = ydf.GradientBoostedTreesLearner(label="income")
model = learner.train(train_ds)
learner = ydf.GradientBoostedTreesLearner(label="income")
model = learner.train(train_ds)
Warning: Column 'age' with NUMERICAL semantic has dtype int64. Casting value to float32.
WARNING:absl:Column 'age' with NUMERICAL semantic has dtype int64. Casting value to float32.
Train model on 22792 examples Model trained in 0:00:05.323891
我们可以接着评估模型。
In [21]:
Copied!
evaluation = model.evaluate(test_ds)
evaluation
evaluation = model.evaluate(test_ds)
evaluation
Out[21]:
accuracy:
0.839286
AUC: '>50K' vs others:
0.878923
PR-AUC: '>50K' vs others:
0.744216
loss:
0.34535
22792
22792
Label \ Pred | <=50K | >50K |
---|---|---|
<=50K | 16526 | 782 |
>50K | 2881 | 2603 |