
Video Classification with a CNN-RNN Architecture

Author: Sayak Paul
Date created: 2021/05/28
Last modified: 2023/12/08
Description: Training a video classifier with transfer learning and a recurrent model on the UCF101 dataset.

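This example demonstrates video classification, an important use case with applications in recommendations, security, and so on. We will use the UCF101 dataset to build our video classifier. The dataset consists of videos categorized into different actions, such as cricket shot, punching, and biking. It is commonly used to build action recognizers, which are one application of video classification.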

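A video consists of an ordered sequence of frames. Each frame contains spatial information, and the sequence of those frames contains temporal information. To model both of these aspects, we use a hybrid architecture that consists of convolutions (for spatial processing) and recurrent layers (for temporal processing). Specifically, we will use a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) consisting of GRU layers. This kind of hybrid architecture is popularly known as a CNN-RNN.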
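To make the data flow concrete, here is a minimal NumPy sketch of the two stages. It is only a shape walkthrough: the random projection and the plain tanh recurrence are stand-ins for the real CNN and GRU layers, and all sizes are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
NUM_FRAMES, HEIGHT, WIDTH, FEATURE_DIM, STATE_DIM = 6, 8, 8, 4, 3

video = rng.random((NUM_FRAMES, HEIGHT, WIDTH, 3))  # (frames, height, width, channels)

# Stand-in for the CNN: average each frame over space, then project it
# to a feature vector. Every frame is processed independently.
spatial_proj = rng.normal(size=(3, FEATURE_DIM))
frame_features = video.mean(axis=(1, 2)) @ spatial_proj  # (frames, FEATURE_DIM)

# Stand-in for the RNN: fold the per-frame features into a single state,
# visiting the frames in order.
w_in = rng.normal(size=(FEATURE_DIM, STATE_DIM))
w_state = rng.normal(size=(STATE_DIM, STATE_DIM))
state = np.zeros(STATE_DIM)
for features in frame_features:
    state = np.tanh(features @ w_in + state @ w_state)

print(frame_features.shape)  # (6, 4)
print(state.shape)  # (3,)
```

The spatial stage turns each frame into one feature vector, and the temporal stage turns the ordered list of vectors into one summary that a classifier head can consume.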

This example requires TensorFlow 2.5 or higher, as well as TensorFlow Docs, which can be installed using the following command:

!pip install -q git+https://github.com/tensorflow/docs

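Data collection

To keep the runtime of this example relatively short, we will use a subsampled version of the original UCF101 dataset. You can refer to this notebook to see how the subsampling was done.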

!wget -q https://github.com/sayakpaul/Action-Recognition-in-TensorFlow/releases/download/v1.0.0/ucf101_top5.tar.gz
!tar xf ucf101_top5.tar.gz

Setup

import os

import keras
from imutils import paths

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import imageio
import cv2
from IPython.display import Image

Define hyperparameters

IMG_SIZE = 224
BATCH_SIZE = 64
EPOCHS = 10

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048

Data preparation

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(f"Total videos for training: {len(train_df)}")
print(f"Total videos for testing: {len(test_df)}")

train_df.sample(10)

Total videos for training: 594
Total videos for testing: 224

	video_name	tag
492	v_TennisSwing_g10_c03.avi	TennisSwing
536	v_TennisSwing_g16_c05.avi	TennisSwing
413	v_ShavingBeard_g16_c05.avi	ShavingBeard
268	v_Punch_g12_c04.avi	Punch
288	v_Punch_g15_c03.avi	Punch
30	v_CricketShot_g12_c03.avi	CricketShot
449	v_ShavingBeard_g21_c07.avi	ShavingBeard
524	v_TennisSwing_g14_c07.avi	TennisSwing
145	v_PlayingCello_g12_c01.avi	PlayingCello
566	v_TennisSwing_g21_c03.avi	TennisSwing

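One of the many challenges of training video classifiers is figuring out a way to feed the videos to a network. This blog post discusses five such methods. Since a video is an ordered sequence of frames, we could just extract the frames and put them in a 3D tensor. But the number of frames may differ from video to video, which would prevent us from stacking them into batches (unless we use padding). As an alternative, we can save video frames at a fixed interval until a maximum frame count is reached. In this example we will do the following: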

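Capture the frames of a video.
Extract frames from the video until a maximum frame count is reached.
If a video has fewer frames than the maximum frame count, pad the video with zeros.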
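The truncate-or-pad logic in the steps above can be sketched in NumPy as follows. `pad_or_truncate` is a hypothetical helper written for illustration and is not part of the tutorial code. For simplicity it pads raw frames, whereas the tutorial applies the same idea to extracted features further down.

```python
import numpy as np

MAX_SEQ_LENGTH = 20  # same value as in the hyperparameters above


def pad_or_truncate(frames, max_len=MAX_SEQ_LENGTH):
    """Return (fixed_length_frames, mask) for an array shaped (num_frames, ...)."""
    length = min(len(frames), max_len)
    padded = np.zeros((max_len,) + frames.shape[1:], dtype=frames.dtype)
    padded[:length] = frames[:length]
    mask = np.zeros(max_len, dtype=bool)
    mask[:length] = True  # True = real frame, False = zero padding
    return padded, mask


short_video = np.ones((12, 4, 4, 3), dtype="float32")  # 12 tiny dummy frames
long_video = np.ones((35, 4, 4, 3), dtype="float32")  # 35 tiny dummy frames

short_padded, short_mask = pad_or_truncate(short_video)
long_padded, long_mask = pad_or_truncate(long_video)
print(short_padded.shape, int(short_mask.sum()))  # (20, 4, 4, 3) 12
print(long_padded.shape, int(long_mask.sum()))  # (20, 4, 4, 3) 20
```

The mask records which positions hold real frames, so a downstream sequence model can ignore the padding.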

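Note that this workflow is identical to the one used for problems involving text sequences. Videos in the UCF101 dataset are known not to contain extreme variations in objects and actions across frames. Because of this, it may be acceptable to consider only a few frames for the learning task. However, this approach may not generalize well to other video classification problems. We will use OpenCV's VideoCapture() method to read frames from videos.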

# The following two methods are taken from this tutorial:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub


def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]  # OpenCV reads BGR; reorder to RGB.
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)
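`load_video` above stops as soon as `max_frames` frames have been read, so it keeps the first frames of a video. The fixed-interval alternative mentioned earlier could be sketched as below. `sample_frame_indices` is a hypothetical helper, not part of the tutorial; it only chooses which frame indices to keep.

```python
import numpy as np


def sample_frame_indices(num_frames, max_frames):
    """Pick up to `max_frames` evenly spaced frame indices from a video."""
    if num_frames <= max_frames:
        # Short video: keep every frame.
        return np.arange(num_frames)
    # Long video: spread the indices evenly between the first and last frame.
    return np.linspace(0, num_frames - 1, num=max_frames).round().astype(int)


short_indices = sample_frame_indices(5, 20)  # keeps indices 0, 1, 2, 3, 4
long_indices = sample_frame_indices(101, 5)  # keeps indices 0, 25, 50, 75, 100
print(short_indices.tolist())
print(long_indices.tolist())
```

Sampling this way covers the whole clip instead of only its beginning, which may matter for datasets where the action changes over time.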

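We can use a pre-trained network to extract meaningful features from the extracted frames. The Keras Applications module provides a number of state-of-the-art models pre-trained on the ImageNet-1k dataset. We will use the InceptionV3 model for this purpose.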

def build_feature_extractor():
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()
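For InceptionV3, `preprocess_input` rescales pixel values from the range [0, 255] to [-1, 1]. A NumPy equivalent of that scaling, given only for illustration (`scale_to_unit_range` is a hypothetical name, not a Keras function), might look like this:

```python
import numpy as np


def scale_to_unit_range(pixels):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return pixels.astype("float32") / 127.5 - 1.0


pixels = np.array([0.0, 127.5, 255.0])
scaled = scale_to_unit_range(pixels)
print(scaled)  # [-1.  0.  1.]
```

Because this step lives inside `build_feature_extractor`, raw frames in the [0, 255] range can be passed to the extractor directly.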

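The labels of the videos are strings. Neural networks do not understand string values, so the labels must be converted to some numerical form before they are fed to the model. Here we will use the StringLookup layer to encode the class labels as integers.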

label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train_df["tag"])
)
print(label_processor.get_vocabulary())

['CricketShot', 'PlayingCello', 'Punch', 'ShavingBeard', 'TennisSwing']
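With `num_oov_indices=0` and an explicit vocabulary, each label is mapped to its position in that vocabulary. The plain NumPy sketch below mimics this mapping on a few made-up tags; it is an illustration of the idea, not a replacement for the layer.

```python
import numpy as np

# A handful of made-up tags in arbitrary order.
tags = np.array(["Punch", "CricketShot", "TennisSwing", "Punch", "PlayingCello"])

# `np.unique` returns the sorted unique labels, which serve as the vocabulary.
vocabulary = np.unique(tags)
label_to_index = {label: index for index, label in enumerate(vocabulary)}
encoded = np.array([label_to_index[tag] for tag in tags])

print(vocabulary.tolist())  # ['CricketShot', 'PlayingCello', 'Punch', 'TennisSwing']
print(encoded.tolist())  # [2, 0, 3, 2, 1]
```

The integer indices produced this way are what `sparse_categorical_crossentropy` expects as targets.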

Finally, we can put all the pieces together to create our data processing utility.

def prepare_all_videos(df, root_dir):
    num_samples = len(df)
    video_paths = df["video_name"].values.tolist()
    labels = df["tag"].values
    labels = keras.ops.convert_to_numpy(label_processor(labels[..., None]))

    # `frame_masks` and `frame_features` are what we will feed to our sequence model.
    # `frame_masks` will contain booleans denoting whether a timestep is real or padding.
    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )

    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(os.path.join(root_dir, path))
        frames = frames[None, ...]

        # Initialize placeholders to store the masks and features of the current video.
        temp_frame_mask = np.zeros(
            shape=(
                1,
                MAX_SEQ_LENGTH,
            ),
            dtype="bool",
        )
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                temp_frame_features[i, j, :] = feature_extractor.predict(
                    batch[None, j, :], verbose=0,
                )
            temp_frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

        frame_features[idx,] = temp_frame_features.squeeze()
        frame_masks[idx,] = temp_frame_mask.squeeze()

    return (frame_features, frame_masks), labels


train_data, train_labels = prepare_all_videos(train_df, "train")
test_data, test_labels = prepare_all_videos(test_df, "test")

print(f"Frame features in train set: {train_data[0].shape}")
print(f"Frame masks in train set: {train_data[1].shape}")

Frame features in train set: (594, 20, 2048)
Frame masks in train set: (594, 20)

The above code block will take about 20 minutes to execute, depending on the machine it is being executed on.
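Since this step is slow, one option is to cache the extracted arrays on disk so that later runs can skip the extraction. The sketch below is an assumption about how that could be done with NumPy: `save_features`, `load_features`, and the file name are hypothetical, and dummy arrays stand in for the real output of `prepare_all_videos`.

```python
import os
import tempfile

import numpy as np


def save_features(path, features, masks, labels):
    """Store the arrays produced by feature extraction in one compressed file."""
    np.savez_compressed(path, features=features, masks=masks, labels=labels)


def load_features(path):
    """Load arrays written by `save_features`, in the same nested layout."""
    with np.load(path) as data:
        return (data["features"], data["masks"]), data["labels"]


# Dummy arrays standing in for the real output of `prepare_all_videos`.
features = np.zeros((4, 20, 2048), dtype="float32")
masks = np.ones((4, 20), dtype="bool")
labels = np.array([[0], [1], [2], [3]])

cache_path = os.path.join(tempfile.mkdtemp(), "train_features.npz")
save_features(cache_path, features, masks, labels)
(cached_features, cached_masks), cached_labels = load_features(cache_path)
print(cached_features.shape, cached_masks.shape, cached_labels.shape)
```

In a real run, the extraction would be executed only when the cache file does not exist yet.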

The sequence model

Now, we can feed this data to a sequence model consisting of recurrent layers such as GRU.
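Before looking at the model, it may help to see what a mask does inside a recurrent layer. The sketch below uses a plain tanh RNN written in NumPy rather than a GRU, and `masked_rnn` is a hypothetical helper. It illustrates the general idea that masked steps leave the state untouched, so zero padding does not change the result.

```python
import numpy as np


def masked_rnn(inputs, mask, w_in, w_state):
    """Run a plain tanh RNN over `inputs`, skipping steps where `mask` is False."""
    state = np.zeros(w_state.shape[0])
    for step_input, keep in zip(inputs, mask):
        if keep:
            state = np.tanh(step_input @ w_in + state @ w_state)
        # Masked step: the previous state is carried forward unchanged.
    return state


rng = np.random.default_rng(0)
w_in = rng.normal(size=(4, 3))
w_state = rng.normal(size=(3, 3))

real_steps = rng.normal(size=(5, 4))  # 5 real time steps
padded = np.concatenate([real_steps, np.zeros((3, 4))])  # plus 3 zero-padded steps
mask = np.array([True] * 5 + [False] * 3)

with_mask = masked_rnn(padded, mask, w_in, w_state)
unpadded = masked_rnn(real_steps, np.ones(5, dtype=bool), w_in, w_state)
without_mask = masked_rnn(padded, np.ones(8, dtype=bool), w_in, w_state)

print(np.allclose(with_mask, unpadded))  # True
print(np.allclose(without_mask, unpadded))  # False
```

Without the mask, the padded steps keep updating the state even though their inputs are all zeros, which is why the mask is passed to the first GRU layer.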

# Utility for our sequence model.
def get_sequence_model():
    class_vocab = label_processor.get_vocabulary()

    frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
    mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")

    # Refer to the following tutorial to understand the significance of using `mask`:
    # https://keras.io/api/layers/recurrent_layers/gru/
    x = keras.layers.GRU(16, return_sequences=True)(
        frame_features_input, mask=mask_input
    )
    x = keras.layers.GRU(8)(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Dense(8, activation="relu")(x)
    output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

    rnn_model = keras.Model([frame_features_input, mask_input], output)

    rnn_model.compile(
        loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    return rnn_model


# Utility for running experiments.
def run_experiment():
    filepath = "/tmp/video_classifier/ckpt.weights.h5"
    checkpoint = keras.callbacks.ModelCheckpoint(
        filepath, save_weights_only=True, save_best_only=True, verbose=1
    )

    seq_model = get_sequence_model()
    history = seq_model.fit(
        [train_data[0], train_data[1]],
        train_labels,
        validation_split=0.3,
        epochs=EPOCHS,
        callbacks=[checkpoint],
    )

    seq_model.load_weights(filepath)
    _, accuracy = seq_model.evaluate([test_data[0], test_data[1]], test_labels)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return history, seq_model


_, sequence_model = run_experiment()

Epoch 1/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.3058 - loss: 1.5597
Epoch 1: val_loss improved from inf to 1.78077, saving model to /tmp/video_classifier/ckpt.weights.h5
 13/13 ━━━━━━━━━━━━━━━━━━━━ 2s 36ms/step - accuracy: 0.3127 - loss: 1.5531 - val_accuracy: 0.1397 - val_loss: 1.7808
Epoch 2/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.5216 - loss: 1.2704
Epoch 2: val_loss improved from 1.78077 to 1.78026, saving model to /tmp/video_classifier/ckpt.weights.h5
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.5226 - loss: 1.2684 - val_accuracy: 0.1788 - val_loss: 1.7803
Epoch 3/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6189 - loss: 1.1656
Epoch 3: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.6174 - loss: 1.1651 - val_accuracy: 0.2849 - val_loss: 1.8322
Epoch 4/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6518 - loss: 1.0645
Epoch 4: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.6515 - loss: 1.0647 - val_accuracy: 0.2793 - val_loss: 2.0419
Epoch 5/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6833 - loss: 0.9976
Epoch 5: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.6843 - loss: 0.9965 - val_accuracy: 0.3073 - val_loss: 1.9077
Epoch 6/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.7229 - loss: 0.9312
Epoch 6: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.7241 - loss: 0.9305 - val_accuracy: 0.3017 - val_loss: 2.1513
Epoch 7/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8023 - loss: 0.9132
Epoch 7: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8035 - loss: 0.9093 - val_accuracy: 0.3184 - val_loss: 2.1705
Epoch 8/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8127 - loss: 0.8380
Epoch 8: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8128 - loss: 0.8356 - val_accuracy: 0.3296 - val_loss: 2.2043
Epoch 9/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8494 - loss: 0.7641
Epoch 9: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8494 - loss: 0.7622 - val_accuracy: 0.3017 - val_loss: 2.3734
Epoch 10/10
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8634 - loss: 0.6883
Epoch 10: val_loss did not improve from 1.78026
 13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8649 - loss: 0.6882 - val_accuracy: 0.3240 - val_loss: 2.4410
 7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7816 - loss: 1.0624
Test accuracy: 56.7%

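Note: To keep the runtime of this example relatively short, we used only a few training examples. This number of training examples is low relative to the sequence model being used, which has 99,909 trainable parameters. You are encouraged to sample more data from the UCF101 dataset using the notebook mentioned above and train the same model.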
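The figure of 99,909 trainable parameters can be checked by hand from the layer sizes in `get_sequence_model`. The arithmetic below is a sketch that assumes the Keras default GRU configuration (`reset_after=True`, which gives each gate two bias vectors); `gru_params` and `dense_params` are hypothetical helpers.

```python
def gru_params(input_dim, units):
    # 3 gates, each with an input kernel, a recurrent kernel, and two bias
    # vectors (assuming the Keras default `reset_after=True`).
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, units):
    return input_dim * units + units  # kernel + bias


NUM_FEATURES = 2048
NUM_CLASSES = 5

total = (
    gru_params(NUM_FEATURES, 16)  # first GRU layer: 99,168
    + gru_params(16, 8)  # second GRU layer: 624
    + dense_params(8, 8)  # hidden Dense layer: 72
    + dense_params(8, NUM_CLASSES)  # softmax output layer: 45
)
print(total)  # 99909
```

Almost all of the parameters sit in the first GRU layer, because its input kernel scales with the 2048-dimensional frame features.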

Inference

def prepare_single_video(frames):
    frames = frames[None, ...]
    frame_mask = np.zeros(
        shape=(
            1,
            MAX_SEQ_LENGTH,
        ),
        dtype="bool",
    )
    frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")

    for i, batch in enumerate(frames):
        video_length = batch.shape[0]
        length = min(MAX_SEQ_LENGTH, video_length)
        for j in range(length):
            frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])
        frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

    return frame_features, frame_mask


def sequence_prediction(path):
    class_vocab = label_processor.get_vocabulary()

    frames = load_video(os.path.join("test", path))
    frame_features, frame_mask = prepare_single_video(frames)
    probabilities = sequence_model.predict([frame_features, frame_mask])[0]

    for i in np.argsort(probabilities)[::-1]:
        print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
    return frames


# This utility is for visualization.
# Referenced from:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
def to_gif(images):
    converted_images = images.astype(np.uint8)
    imageio.mimsave("animation.gif", converted_images, duration=100)
    return Image("animation.gif")


test_video = np.random.choice(test_df["video_name"].values.tolist())
print(f"Test video path: {test_video}")
test_frames = sequence_prediction(test_video)
to_gif(test_frames[:MAX_SEQ_LENGTH])

Test video path: v_TennisSwing_g03_c01.avi
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
 1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 166ms/step
  CricketShot: 46.99%
  ShavingBeard: 18.83%
  TennisSwing: 14.65%
  Punch: 12.41%
  PlayingCello:  7.12%

<IPython.core.display.Image object>

Next steps

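In this example, we used transfer learning to extract meaningful features from video frames. You could also fine-tune the pre-trained network to see how that affects the end results.
For speed-accuracy trade-offs, you can try out other models available in keras.applications.
Try different values of MAX_SEQ_LENGTH to observe how that affects the performance.
Train on a higher number of classes and see if you are able to get good performance.
Following this tutorial, try a pre-trained action recognition model from DeepMind.
Rolling averaging can be a useful technique for video classification, and it can be combined with a standard image classification model to run inference on videos. This tutorial will help you understand how to use rolling averaging with an image classifier.
When there are variations between the frames of a video, not all frames are equally important in deciding its category. In those situations, adding a self-attention layer to the sequence model is likely to yield better results.
Following this book chapter, you can implement Transformers-based models for processing videos.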
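As a rough illustration of the rolling-averaging idea mentioned above, the sketch below smooths per-frame class probabilities over a sliding window before taking the argmax. The probabilities are made up, and `rolling_average_predictions` and the window size are hypothetical choices, not taken from the referenced tutorial.

```python
from collections import deque

import numpy as np


def rolling_average_predictions(frame_probabilities, window=3):
    """Average the last `window` per-frame probability vectors at every step."""
    recent = deque(maxlen=window)  # old entries drop out automatically
    smoothed_labels = []
    for probabilities in frame_probabilities:
        recent.append(probabilities)
        smoothed_labels.append(int(np.mean(list(recent), axis=0).argmax()))
    return smoothed_labels


# Made-up per-frame outputs of an image classifier over 2 classes.
# The third frame is a one-off flicker towards class 1.
frame_probabilities = np.array(
    [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1], [0.7, 0.3]]
)
raw_labels = frame_probabilities.argmax(axis=1).tolist()
smoothed = rolling_average_predictions(frame_probabilities)
print(raw_labels)  # [0, 0, 1, 0, 0]
print(smoothed)  # [0, 0, 0, 0, 0]
```

Averaging over neighbouring frames suppresses the isolated misprediction, at the cost of reacting more slowly when the action really changes.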
