CLIP feature extractor

`CLIP` class

keras_cv.models.CLIP(
    embed_dim=512,
    image_resolution=224,
    vision_layers=12,
    vision_width=768,
    vision_patch_size=32,
    context_length=77,
    vocab_size=49408,
    transformer_width=512,
    transformer_heads=8,
    transformer_layers=12,
    **kwargs
)

CLIP implements the Contrastive Language-Image Pretraining (CLIP) architecture, which enables joint learning of visual and textual representations for various downstream tasks. The default base model architecture is clip-vit-base-patch32.
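To make the joint embedding space more concrete, here is a minimal pure-Python sketch of how CLIP-style logits are commonly computed: each embedding is L2-normalized, every image is compared with every text by dot product, and the result is multiplied by a temperature. The function names, the toy 3-dimensional vectors, and the `logit_scale` value of 100.0 are illustrative assumptions, not values read from keras_cv.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def clip_logits(image_embeds, text_embeds, logit_scale=100.0):
    """Return (image_logits, text_logits) as nested lists.

    image_logits[i][j] is the scaled cosine similarity between image i
    and text j; text_logits is its transpose.
    """
    images = [l2_normalize(v) for v in image_embeds]
    texts = [l2_normalize(v) for v in text_embeds]
    image_logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]
    text_logits = [list(col) for col in zip(*image_logits)]
    return image_logits, text_logits


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (assumed values): one image, three candidate captions.
image_embeds = [[1.0, 0.0, 0.0]]
text_embeds = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

image_logits, text_logits = clip_logits(image_embeds, text_embeds)
probs = softmax(image_logits[0])
best_caption = probs.index(max(probs))
```

Applying a softmax over one row of `image_logits` turns the similarities into a ranking of captions for that image, which is the usual zero-shot classification recipe.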

Arguments

embed_dim (int): The dimensionality of the joint embedding space for images and texts.
image_resolution (int): The resolution of the input images (both height and width).
vision_layers (int): The number of layers in the vision (image) encoder.
vision_width (int): The width of the hidden layers in the vision encoder.
vision_patch_size (int): The size of each square patch in the input images.
context_length (int): The maximum length of the contextualized text sequences.
vocab_size (int): The size of the vocabulary for tokenization.
transformer_width (int): The width of the hidden layers in the transformer-based text encoder.
transformer_heads (int): The number of attention heads in the transformer-based text encoder.
transformer_layers (int): The number of layers in the transformer-based text encoder.
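The relationship between image_resolution and vision_patch_size can be checked with a little arithmetic. The sketch below assumes a ViT-style image encoder that cuts the image into non-overlapping square patches and prepends one class token; the helper name is hypothetical, and rejecting sizes that do not tile evenly is a choice of this sketch rather than documented library behavior.

```python
def vision_sequence_length(image_resolution=224, vision_patch_size=32):
    """Number of tokens the image encoder sees for one image.

    Assumes non-overlapping square patches plus one class token.
    """
    if image_resolution % vision_patch_size != 0:
        # Assumption of this sketch: only accept sizes that tile evenly.
        raise ValueError("image_resolution must be divisible by vision_patch_size")
    patches_per_side = image_resolution // vision_patch_size
    return patches_per_side * patches_per_side + 1


default_length = vision_sequence_length()         # 7 * 7 patches + 1 class token
patch16_length = vision_sequence_length(224, 16)  # 14 * 14 patches + 1 class token
```

A smaller patch size therefore means a longer token sequence and more compute per image at the same resolution.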

Example

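from keras_cv.models import CLIP
from keras_cv.models.feature_extractor import CLIPProcessor

# Positional arguments (vocabulary, merges) must come before keyword arguments.
processor = CLIPProcessor(
    "path_to_vocab.json",
    "path_to_merges.txt",
    input_resolution=224,
)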
processed_image = processor.process_images(["cat.jpg"])
tokens = processor(
    ["mountains", "cat on tortoise", "two cats"]
)
model = CLIP.from_preset("clip-vit-base-patch16")
image_logits, text_logits = model(
    {
        "images": processed_image,
        "token_ids": tokens["token_ids"],
        "padding_mask": tokens["padding_mask"],
    }
)
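The model input above carries both "token_ids" and "padding_mask". As a rough illustration of what such a pair looks like, the sketch below pads one token sequence to context_length and builds a matching mask. The helper name, the pad id of 0, the mask convention (1 for real tokens, 0 for padding), and the sample ids are all assumptions made for illustration; the real values come from CLIPProcessor.

```python
def pad_to_context(token_ids, context_length=77, pad_id=0):
    """Pad (or truncate) one token sequence and build its padding mask.

    pad_id=0 and the mask convention (1 = real token, 0 = padding) are
    assumptions of this sketch.
    """
    kept = list(token_ids[:context_length])
    n_pad = context_length - len(kept)
    padded = kept + [pad_id] * n_pad
    mask = [1] * len(kept) + [0] * n_pad
    return {"token_ids": padded, "padding_mask": mask}


# Hypothetical ids for a short prompt; real ids come from the tokenizer.
example = pad_to_context([5, 17, 42])
```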


`from_preset` method

CLIP.from_preset()

Instantiate CLIP model from preset config and weights.

Arguments

preset: string. Must be one of "clip-vit-base-patch16", "clip-vit-base-patch32", "clip-vit-large-patch14", "clip-vit-large-patch14-336". All four of these presets come with pretrained weights.
load_weights: Whether to load pre-trained weights into model. Defaults to None, which follows whether the preset has pretrained weights available.
input_shape: Input shape that will be passed to backbone initialization. Defaults to None. If None, the preset value will be used.
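The three-state behavior of load_weights (None, True, False) can be sketched as a small decision function. This is a hypothetical helper written for illustration, not the library's code; in particular, raising an error when weights are requested from a preset that has none is an assumption of this sketch.

```python
def resolve_load_weights(load_weights, preset_has_weights):
    """Decide whether pretrained weights should be loaded.

    load_weights=None follows the preset; an explicit True on a preset
    without weights is treated as an error in this sketch.
    """
    if load_weights is None:
        return preset_has_weights
    if load_weights and not preset_has_weights:
        raise ValueError("Preset has no pretrained weights to load.")
    return bool(load_weights)


follows_preset = resolve_load_weights(None, preset_has_weights=True)
random_init = resolve_load_weights(False, preset_has_weights=True)
```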

Example

import keras_cv

# Load architecture and weights from preset
model = keras_cv.models.CLIP.from_preset(
    "clip-vit-base-patch16",
)

# Load randomly initialized model from preset architecture, without pretrained weights
model = keras_cv.models.CLIP.from_preset(
    "clip-vit-base-patch16",
    load_weights=False,
)

Preset name	Parameters	Description
clip-vit-base-patch16	149.62M	The model uses a ViT-B/16 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 16 and input images of size (224, 224).
clip-vit-base-patch32	151.28M	The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224).
clip-vit-large-patch14	427.62M	The model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224).
clip-vit-large-patch14-336	427.94M	The model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336).
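The preset descriptions mention a masked self-attention Transformer as the text encoder. In CLIP this masking is usually causal, meaning each token may attend only to itself and to earlier tokens. The sketch below builds such a mask as nested lists; the helper name and the 1/0 encoding are illustrative assumptions, and the library may represent the mask differently.

```python
def causal_mask(length):
    """Lower-triangular mask: row i may attend to columns 0..i only."""
    return [
        [1 if col <= row else 0 for col in range(length)]
        for row in range(length)
    ]


# A tiny 4-token mask; the text encoder would use context_length instead.
mask = causal_mask(4)
```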


`CLIPAttention` class

keras_cv.models.feature_extractor.CLIPAttention(
    proj_dim, num_heads, num_hidden_layers, dropout=0.0, **kwargs
)

Adapted from https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/modeling_clip.py
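Since this class has no further description, a short sketch may help relate proj_dim and num_heads. In standard multi-head attention the projection width is split evenly across heads and the attention scores are scaled by the inverse square root of the per-head size. The pure-Python code below illustrates that for a single head; the function names are hypothetical, and the 512/8 values are borrowed from the CLIP text encoder defaults listed earlier on this page.

```python
import math


def head_dim_and_scale(proj_dim, num_heads):
    """Split proj_dim evenly across heads and return (head_dim, scale)."""
    if proj_dim % num_heads != 0:
        raise ValueError("proj_dim must be divisible by num_heads")
    head_dim = proj_dim // num_heads
    return head_dim, head_dim ** -0.5


def attend(queries, keys, values, scale):
    """Scaled dot-product attention for one head, on nested lists."""
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the value vectors, one output dimension at a time.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return outputs


head_dim, scale = head_dim_and_scale(proj_dim=512, num_heads=8)
```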


`CLIPEncoder` class

keras_cv.models.feature_extractor.CLIPEncoder(width, num_layers, heads, **kwargs)

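This is the class from which all layers inherit.

A layer is a callable object that takes one or more tensors as input and outputs one or more tensors. It involves computation, defined in the call() method, and a state (weight variables). State can be created:

in __init__(), for instance via self.add_weight();
in the optional build() method, which is invoked by the first __call__() to the layer and supplies the shape(s) of the input(s), which may not have been known at initialization time.

Layers are recursively composable: if you assign a layer instance as an attribute of another layer, the outer layer will start tracking the weights created by the inner layer. Nested layers should be instantiated in the __init__() method or the build() method.

Users just instantiate a layer and then treat it as a callable.

Arguments

trainable: Boolean, whether the layer's variables should be trainable.
name: String name of the layer.
dtype: The dtype of the layer's computations and weights. Can also be a keras.DTypePolicy, which allows the computation and weight dtypes to differ. Defaults to None. None means to use keras.config.dtype_policy(), which is a float32 policy unless set to a different value (via keras.config.set_dtype_policy()).

Attributes

name: The name of the layer (string).
dtype: Dtype of the layer's weights. Alias of layer.variable_dtype.
variable_dtype: Dtype of the layer's weights.
compute_dtype: The dtype of the layer's computations. Layers automatically cast inputs to this dtype, which causes the computations and output to also be in this dtype. When mixed precision is used with a keras.DTypePolicy, this will differ from variable_dtype.
trainable_weights: List of variables to be included in backprop.
non_trainable_weights: List of variables that should not be included in backprop.
weights: The concatenation of the lists trainable_weights and non_trainable_weights (in this order).
trainable: Whether the layer should be trained (boolean), i.e. whether its potentially trainable weights should be returned as part of layer.trainable_weights.
input_spec: Optional (list of) InputSpec object(s) specifying the constraints on inputs that can be accepted by the layer.

We recommend that descendants of Layer implement the following methods:

__init__(): Defines custom layer attributes, and creates layer weights that do not depend on input shapes, using add_weight() or other state.
build(self, input_shape): This method can be used to create weights that depend on the shape(s) of the input(s), using add_weight() or other state. __call__() will automatically build the layer (if it has not been built yet) by calling build().
call(self, *args, **kwargs): Called in __call__() after making sure build() has been called. call() performs the logic of applying the layer to the input arguments. Two reserved keyword arguments you can optionally use in call() are: 1. training (boolean, whether the call is in inference mode or training mode). 2. mask (boolean tensor encoding masked timesteps in the input, used e.g. in RNN layers). A typical signature for this method is call(self, inputs), and users can add training and mask if the layer needs them.
get_config(self): Returns a dictionary containing the configuration used to initialize this layer. If the keys differ from the arguments in __init__(), then override from_config(self) as well. This method is used when saving the layer or a model that contains this layer.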
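The get_config() contract in the list above can be illustrated without Keras. The class below is a plain-Python stand-in (explicitly not a keras.layers.Layer; its name and fields are hypothetical) showing why matching config keys to the __init__() argument names lets a simple from_config() rebuild the object.

```python
class ToyLayer:
    """Stand-in for a layer; it is not a keras.layers.Layer."""

    def __init__(self, units=32, name="toy"):
        self.units = units
        self.name = name

    def get_config(self):
        # Keys match the __init__() argument names, so the simple
        # from_config() below is enough to rebuild the object.
        return {"units": self.units, "name": self.name}

    @classmethod
    def from_config(cls, config):
        return cls(**config)


original = ToyLayer(units=4, name="dense_like")
clone = ToyLayer.from_config(original.get_config())
```

If get_config() returned keys that __init__() does not accept, from_config() would need to translate them first, which is the case the method description warns about.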

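Examples:

Here is a basic example: a layer with two variables, w and b, that returns y = w . x + b. It shows how to implement build() and call(). Variables set as attributes of a layer are tracked as weights of the layer (in layer.weights).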

from keras import ops
from keras.layers import Layer


class SimpleDense(Layer):
    def __init__(self, units=32):
        super().__init__()
        self.units = units

    # Create the state of the layer (weights)
    def build(self, input_shape):
        self.kernel = self.add_weight(
            shape=(input_shape[-1], self.units),
            initializer="glorot_uniform",
            trainable=True,
            name="kernel",
        )
        self.bias = self.add_weight(
            shape=(self.units,),
            initializer="zeros",
            trainable=True,
            name="bias",
        )

    # Define the computation
    def call(self, inputs):
        return ops.matmul(inputs, self.kernel) + self.bias

# Instantiate the layer
linear_layer = SimpleDense(4)

# This will call `build(input_shape)` and create the weights
y = linear_layer(ops.ones((2, 2)))
assert len(linear_layer.weights) == 2

# These weights are trainable, so they are listed in `trainable_weights`
assert len(linear_layer.trainable_weights) == 2

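Besides trainable weights, which are updated via backpropagation during training, layers can also have non-trainable weights. These weights are meant to be updated manually during call(). Here is an example layer that computes the running sum of its inputs: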

from keras import ops
from keras.layers import Layer


class ComputeSum(Layer):
    def __init__(self, input_dim):
        super().__init__()
        # Create a non-trainable weight
        self.total = self.add_weight(
            shape=(),
            initializer="zeros",
            trainable=False,
            name="total",
        )

    def call(self, inputs):
        self.total.assign(self.total + ops.sum(inputs))
        return self.total

my_sum = ComputeSum(2)
x = ops.ones((2, 2))
y = my_sum(x)

assert my_sum.weights == [my_sum.total]
assert my_sum.non_trainable_weights == [my_sum.total]
assert my_sum.trainable_weights == []


`CLIPImageEncoder` class

keras_cv.models.feature_extractor.CLIPImageEncoder(
    input_resolution, patch_size, width, num_layers, heads, output_dim, **kwargs
)

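A model grouping layers into an object with training/inference features.

There are three ways to instantiate a Model:

With the "Functional API"

You start from Input, chain layer calls to specify the model's forward pass, and finally create your model from inputs and outputs:

import keras

inputs = keras.Input(shape=(37,))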
x = keras.layers.Dense(32, activation="relu")(inputs)
outputs = keras.layers.Dense(5, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)

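Note: Only dicts, lists, and tuples of input tensors are supported. Nested inputs are not supported (e.g. lists of lists or dicts of dicts).

A new Functional API model can also be created by using the intermediate tensors. This enables you to quickly extract sub-components of the model.

Example: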

inputs = keras.Input(shape=(None, None, 3))
processed = keras.layers.RandomCrop(width=128, height=128)(inputs)
conv = keras.layers.Conv2D(filters=32, kernel_size=3)(processed)
pooling = keras.layers.GlobalAveragePooling2D()(conv)
feature = keras.layers.Dense(10)(pooling)

full_model = keras.Model(inputs, feature)
backbone = keras.Model(processed, conv)
activations = keras.Model(conv, feature)

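Note that the backbone and activations models are not created with keras.Input objects, but with tensors that originate from keras.Input objects. Under the hood, the layers and weights are shared across these models, so you can train full_model and use backbone or activations for feature extraction. The inputs and outputs of the model can be nested structures of tensors as well, and the created models are standard Functional API models that support all the existing APIs.

By subclassing the `Model` class

In that case, you should define your layers in __init__() and implement the model's forward pass in call().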

class MyModel(keras.Model):
    def __init__(self):
        super().__init__()
        self.dense1 = keras.layers.Dense(32, activation="relu")
        self.dense2 = keras.layers.Dense(5, activation="softmax")

    def call(self, inputs):
        x = self.dense1(inputs)
        return self.dense2(x)

model = MyModel()

If you subclass Model, you can optionally have a training argument (boolean) in call(), which you can use to specify different behavior in training and inference:

class MyModel(keras.Model):
    def __init__(self):
        super().__init__()
        self.dense1 = keras.layers.Dense(32, activation="relu")
        self.dense2 = keras.layers.Dense(5, activation="softmax")
        self.dropout = keras.layers.Dropout(0.5)

    def call(self, inputs, training=False):
        x = self.dense1(inputs)
        x = self.dropout(x, training=training)
        return self.dense2(x)

model = MyModel()

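Once the model is created, you can configure it with losses and metrics using model.compile(), train it with model.fit(), or use it for prediction with model.predict().

With the `Sequential` class

In addition, keras.Sequential is a special case of model where the model is purely a stack of single-input, single-output layers.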

model = keras.Sequential([
    keras.Input(shape=(None, None, 3)),
    keras.layers.Conv2D(filters=32, kernel_size=3),
])


`CLIPProcessor` class

keras_cv.models.feature_extractor.CLIPProcessor(vocabulary, merges, **kwargs)

CLIPProcessor is a utility class that provides functionality for processing texts in the context of the CLIP (Contrastive Language-Image Pretraining) model.

Arguments

input_resolution (int): The resolution of input images.
vocabulary (str): string or dict, maps token to integer ids. If it is a string, it should be the file path to a json file.
merges: string or list, contains the merge rule. If it is a string, it should be the file path to merge rules. The merge rule file should have one merge rule per line.
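To show how a vocabulary and a list of merge rules work together, here is a minimal byte-pair-encoding sketch in pure Python. It assumes each merge rule is a pair of symbols separated by a space, that rules earlier in the file take priority, and that lines starting with "#" are headers to skip. The three rules and the tiny vocabulary are made up for illustration, and the sketch leaves out details a real CLIP tokenizer handles, such as lowercasing and end-of-word markers.

```python
import json


def parse_merges(lines):
    """Turn merge-rule lines ("a b") into a {(a, b): rank} table."""
    rules = [
        tuple(line.split())
        for line in lines
        if line.strip() and not line.startswith("#")
    ]
    return {pair: rank for rank, pair in enumerate(rules)}


def apply_merges(symbols, ranks):
    """Repeatedly merge the adjacent pair with the lowest rank."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge rule applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical three-rule merge list and vocabulary.
ranks = parse_merges(["c a", "ca t", "d o"])
vocabulary = json.loads('{"cat": 0, "do": 1, "g": 2}')

cat_ids = [vocabulary[s] for s in apply_merges("cat", ranks)]
dog_ids = [vocabulary[s] for s in apply_merges("dog", ranks)]
```

Here "cat" collapses to a single token because both merges are known, while "dog" stays split into "do" and "g".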


`CLIPTextEncoder` class

keras_cv.models.feature_extractor.CLIPTextEncoder(
    transformer_width,
    transformer_layers,
    transformer_heads,
    vocab_size,
    embed_dim,
    context_length,
    **kwargs
)

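`CLIPTextEncoder` has no description of its own on this page. It subclasses keras.Model, so the generic Model documentation given under `CLIPImageEncoder` above applies to it as well.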


`QuickGELU` class

keras_cv.models.feature_extractor.QuickGELU(**kwargs)

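`QuickGELU` has no description of its own on this page. It is a keras.layers.Layer subclass, so the generic Layer documentation given under `CLIPEncoder` above applies to it as well.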
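For context, the activation that CLIP implementations usually call "quick GELU" is x * sigmoid(1.702 * x), a cheaper approximation of GELU. The sketch below assumes that definition (this page does not state it) and compares it with the exact GELU on a few sample points; the function names and sample values are illustrative.

```python
import math


def quick_gelu(x):
    """Sigmoid approximation of GELU: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


def exact_gelu(x):
    """Reference GELU using the Gaussian CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
# Largest absolute difference between the two activations on the samples.
max_gap = max(abs(quick_gelu(x) - exact_gelu(x)) for x in samples)
```

On these points the two functions stay close, which is why the approximation is often used where speed matters more than exactness.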


`ResidualAttention` class

keras_cv.models.feature_extractor.ResidualAttention(
    proj_dim, num_heads, num_hidden_layers, **kwargs
)

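`ResidualAttention` has no description of its own on this page. It is a keras.layers.Layer subclass, so the generic Layer documentation given under `CLIPEncoder` above applies to it as well.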
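As a rough guide to what a residual attention block does, the sketch below shows the pre-norm data flow commonly used in CLIP-style Transformers: normalize, apply attention, add the result back to the input, then repeat the same pattern with an MLP. This is an assumption about the usual design, not a transcription of the keras_cv layer, and the sublayers are stand-in functions so the arithmetic can be checked by hand.

```python
def residual_block(x, attention, mlp, norm_1, norm_2):
    """Pre-norm residual block on a flat list of floats."""
    attended = attention(norm_1(x))
    x = [a + b for a, b in zip(x, attended)]      # first residual connection
    transformed = mlp(norm_2(x))
    return [a + b for a, b in zip(x, transformed)]  # second residual connection


# Stand-in sublayers (hypothetical) that make the data flow easy to follow.
def identity(v):
    return list(v)


def double(v):
    return [2.0 * e for e in v]


def zeros(v):
    return [0.0 for _ in v]


out = residual_block(
    [1.0, 2.0], attention=double, mlp=zeros, norm_1=identity, norm_2=identity
)
```

With these stand-ins the attention branch contributes twice the input and the MLP branch contributes nothing, so the output is three times the input.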

