KerasCV contains end-to-end implementations of popular model architectures. These models can be created in two ways:

- Through the `from_preset()` constructor, which instantiates an object with a pre-trained configuration and, optionally, weights. Available preset names are listed on this page.

```python
model = keras_cv.models.RetinaNet.from_preset(
    "resnet50_v2_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)
```
- Through custom configuration controlled by the user:

```python
backbone = keras_cv.models.ResNetBackbone(
    stackwise_filters=[64, 128, 256, 512],
    stackwise_blocks=[2, 2, 2, 2],
    stackwise_strides=[1, 2, 2, 2],
    include_rescaling=False,
)
model = keras_cv.models.RetinaNet(
    backbone=backbone,
    num_classes=20,
    bounding_box_format="xywh",
)
```
Each of the following preset names corresponds to a configuration and weights for a backbone model. The names below can be used with the `from_preset()` constructor to get the corresponding backbone model.

```python
backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")
```

For brevity, presets without pretrained weights are not included in the table below.
Note: all pretrained weights should be used with unnormalized pixel intensities in the range [0, 255] if `include_rescaling=True`, or in the range [0, 1] if `include_rescaling=False`.
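To make the note concrete, here is a minimal sketch that loads a backbone preset and feeds it raw pixel values. It assumes the preset keeps the default `include_rescaling=True`; the input size and random batch are purely illustrative.

```python
import numpy as np
import keras_cv

# Assumes the preset defaults to include_rescaling=True, so raw
# [0, 255] pixel intensities can be passed without manual scaling.
backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")

images = np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype("float32")
features = backbone(images)  # feature maps from the backbone's final stage
```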
Preset name | Model | Parameters | Description |
---|---|---|---|
csp_darknet_l_imagenet | CSPDarkNet | 27.11M | CSPDarkNet model with [128, 256, 512, 1024] channels and [3, 9, 9, 3] depths where the batch normalization and SiLU activation are applied after the convolution layers. Trained on Imagenet 2012 classification task. |
csp_darknet_tiny_imagenet | CSPDarkNet | 2.38M | CSPDarkNet model with [48, 96, 192, 384] channels and [1, 3, 3, 1] depths where the batch normalization and SiLU activation are applied after the convolution layers. Trained on Imagenet 2012 classification task. |
densenet121_imagenet | Unknown | Unknown | DenseNet model with 121 layers. Trained on Imagenet 2012 classification task. |
densenet169_imagenet | Unknown | Unknown | DenseNet model with 169 layers. Trained on Imagenet 2012 classification task. |
densenet201_imagenet | Unknown | Unknown | DenseNet model with 201 layers. Trained on Imagenet 2012 classification task. |
efficientnetv2_b0_imagenet | EfficientNetV2 | 5.92M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 77.1% top 1 accuracy and 93.3% top 5 accuracy on imagenet. |
efficientnetv2_b1_imagenet | EfficientNetV2 | 6.93M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 79.1% top 1 accuracy and 94.4% top 5 accuracy on imagenet. |
efficientnetv2_b2_imagenet | EfficientNetV2 | 8.77M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 80.1% top 1 accuracy and 94.9% top 5 accuracy on imagenet. |
efficientnetv2_s_imagenet | EfficientNetV2 | 20.33M | EfficientNet architecture with 6 convolutional blocks. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 83.9% top 1 accuracy and 96.7% top 5 accuracy on imagenet. |
mit_b0_imagenet | MiT | 3.32M | MiT (MixTransformer) model with 8 transformer blocks. Pre-trained on ImageNet-1K and scores 69% top-1 accuracy on the validation set. |
mobilenet_v3_large_imagenet | MobileNetV3 | 2.99M | MobileNetV3 model with 28 layers where the batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task. |
mobilenet_v3_small_imagenet | MobileNetV3 | 933.50K | MobileNetV3 model with 14 layers where the batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task. |
resnet50_imagenet | ResNetV1 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). Trained on Imagenet 2012 classification task. |
resnet50_v2_imagenet | ResNetV2 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). Trained on Imagenet 2012 classification task. |
videoswin_base_kinetics400 | VideoSwinB | 87.64M | A base Video Swin backbone architecture. It is pretrained on the ImageNet 1K dataset and then trained on the Kinetics 400 dataset. The published weights score 80.6% top-1 and 94.6% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_small_kinetics400 | VideoSwinS | 49.51M | A small Video Swin backbone architecture. It is pretrained on the ImageNet 1K dataset and then trained on the Kinetics 400 dataset. The published weights score 80.6% top-1 and 94.5% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_tiny_kinetics400 | VideoSwinT | 27.85M | A tiny Video Swin backbone architecture. It is pretrained on the ImageNet 1K dataset and then trained on the Kinetics 400 dataset. |
videoswin_base_kinetics400_imagenet22k | VideoSwinB | 87.64M | A base Video Swin backbone architecture. It is pretrained on the ImageNet 22K dataset and then trained on the Kinetics 400 dataset. The published weights score 82.7% top-1 and 95.5% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_base_kinetics600_imagenet22k | VideoSwinB | 87.64M | A base Video Swin backbone architecture. It is pretrained on the ImageNet 22K dataset and then trained on the Kinetics 600 dataset. The published weights score 84.0% top-1 and 96.5% top-5 accuracy on the Kinetics 600 dataset. |
videoswin_base_something_something_v2 | VideoSwinB | 87.64M | A base Video Swin backbone architecture. It is pretrained on the Kinetics 400 dataset and then trained on the Something Something V2 dataset. The published weights score 69.6% top-1 and 92.7% top-5 accuracy on the Something Something V2 dataset. |
vitdet_base_sa1b | VitDet | 89.67M | A base Detectron2 ViT backbone trained on the SA1B dataset. |
vitdet_huge_sa1b | VitDet | 637.03M | A huge Detectron2 ViT backbone trained on the SA1B dataset. |
vitdet_large_sa1b | VitDet | 308.28M | A large Detectron2 ViT backbone trained on the SA1B dataset. |
yolo_v8_xs_backbone_coco | YOLOV8 | 1.28M | An extra small YOLOV8 backbone pretrained on COCO. |
yolo_v8_s_backbone_coco | YOLOV8 | 5.09M | A small YOLOV8 backbone pretrained on COCO. |
yolo_v8_m_backbone_coco | YOLOV8 | 11.87M | A medium YOLOV8 backbone pretrained on COCO. |
yolo_v8_l_backbone_coco | YOLOV8 | 19.83M | A large YOLOV8 backbone pretrained on COCO. |
yolo_v8_xl_backbone_coco | YOLOV8 | 30.97M | An extra large YOLOV8 backbone pretrained on COCO. |
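A pretrained backbone can also be combined with a new task head. A minimal sketch, assuming the `keras_cv.models.ImageClassifier` task accepts a `backbone` and `num_classes` in the same constructor style shown earlier, and using a hypothetical 10-class target task:

```python
import keras_cv

# Load a pretrained backbone, then attach a randomly initialized
# classification head for a hypothetical 10-class problem.
backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")
classifier = keras_cv.models.ImageClassifier(
    backbone=backbone,
    num_classes=10,  # hypothetical target task
)
```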
Each of the following preset names corresponds to a configuration and weights for a task model. These models are application-ready, but can be further fine-tuned if desired. The names below can be used with the `from_preset()` constructor to get the corresponding task model.
```python
object_detector = keras_cv.models.RetinaNet.from_preset(
    "retinanet_resnet50_pascalvoc",
    bounding_box_format="xywh",
)
```
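The loaded detector can run inference immediately via the standard Keras `predict()` call. A minimal sketch with a random batch standing in for real images; KerasCV detectors return a dictionary of batched prediction tensors:

```python
import numpy as np

# Illustrative input; real use would pass actual preprocessed images.
images = np.random.uniform(0, 255, size=(1, 512, 512, 3)).astype("float32")

# Returns a dict of batched predictions; the boxes follow the
# "xywh" bounding_box_format requested above.
predictions = object_detector.predict(images)
```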
Note that all backbone presets are also applicable to the tasks. For example, you can use a `ResNetBackbone` preset directly with `RetinaNet`. In this case, fine-tuning is necessary, since the task-specific layers will be randomly initialized.

```python
object_detector = keras_cv.models.RetinaNet.from_preset(
    "resnet50_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)
```
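Because the detection head starts from random weights here, the model needs training before use. A minimal fine-tuning sketch, assuming KerasCV's string shorthands for the RetinaNet losses ("focal" for classification, "smoothl1" for box regression) and a hypothetical train_ds dataset of images and bounding boxes:

```python
object_detector.compile(
    classification_loss="focal",   # focal loss for the class head
    box_loss="smoothl1",           # smooth L1 loss for the box head
    optimizer="adam",
)
# train_ds is a hypothetical tf.data.Dataset yielding
# (images, bounding_boxes) in the "xywh" format.
object_detector.fit(train_ds, epochs=10)
```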
For brevity, backbone presets are not included in the table below.

Note: all pretrained weights should be used with unnormalized pixel intensities in the range [0, 255] if `include_rescaling=True`, or in the range [0, 1] if `include_rescaling=False`.
Preset name | Model | Parameters | Description |
---|---|---|---|
resnet50_v2_imagenet_classifier | ImageClassifier | 25.61M | ResNet classifier with 50 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). Trained on Imagenet 2012 classification task. |
efficientnetv2_s_imagenet_classifier | ImageClassifier | 21.61M | ImageClassifier using the EfficientNet small architecture. In this variant of the EfficientNet architecture, there are 6 convolutional blocks. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 83.9% top 1 accuracy and 96.7% top 5 accuracy on imagenet. |
efficientnetv2_b0_imagenet_classifier | ImageClassifier | 7.20M | ImageClassifier using the EfficientNet B0 architecture. In this variant of the EfficientNet architecture, there are 6 convolutional blocks. As with all of the B style EfficientNet variants, the number of filters in each convolutional block is scaled by width_coefficient=1.0 and depth_coefficient=1.0 . Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 77.1% top 1 accuracy and 93.3% top 5 accuracy on imagenet. |
efficientnetv2_b1_imagenet_classifier | ImageClassifier | 8.21M | ImageClassifier using the EfficientNet B1 architecture. In this variant of the EfficientNet architecture, there are 6 convolutional blocks. As with all of the B style EfficientNet variants, the number of filters in each convolutional block is scaled by width_coefficient=1.0 and depth_coefficient=1.1. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 79.1% top 1 accuracy and 94.4% top 5 accuracy on imagenet. |
efficientnetv2_b2_imagenet_classifier | ImageClassifier | 10.18M | ImageClassifier using the EfficientNet B2 architecture. In this variant of the EfficientNet architecture, there are 6 convolutional blocks. As with all of the B style EfficientNet variants, the number of filters in each convolutional block is scaled by width_coefficient=1.1 and depth_coefficient=1.2. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 80.1% top 1 accuracy and 94.9% top 5 accuracy on imagenet. |
mobilenet_v3_large_imagenet_classifier | ImageClassifier | 3.96M | ImageClassifier using the MobileNetV3Large architecture. This preset uses a Dense layer as a classification head instead of the typical fully-convolutional MobileNet head. As a result, it has fewer parameters than the original MobileNetV3Large model, which has 5.4 million parameters. Published weights are capable of scoring 69.4% top 1 accuracy and 89.4% top 5 accuracy on imagenet. |
videoswin_tiny_kinetics400_classifier | VideoClassifier | 28.16M | A tiny Video Swin architecture. It is pretrained on the ImageNet 1K dataset and then trained on the Kinetics 400 dataset. The published weights score 78.8% top-1 and 93.6% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_small_kinetics400_classifier | VideoClassifier | 49.82M | A small Video Swin architecture. It is pretrained on the ImageNet 1K dataset and then trained on the Kinetics 400 dataset. The published weights score 80.6% top-1 and 94.5% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_base_kinetics400_classifier | VideoClassifier | 89.07M | A base Video Swin architecture. It is pretrained on the ImageNet 1K dataset and then trained on the Kinetics 400 dataset. The published weights score 80.6% top-1 and 94.6% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_base_something_something_v2_classifier | VideoClassifier | 88.83M | A base Video Swin architecture. It is pretrained on the Kinetics 400 dataset and then trained on the Something Something V2 dataset. The published weights score 69.6% top-1 and 92.7% top-5 accuracy on the Something Something V2 dataset. |
clip-vit-base-patch16 | CLIP | 149.62M | The model uses a ViT-B/16 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 16 and input images of size (224, 224). |
clip-vit-base-patch32 | CLIP | 151.28M | The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224). |
clip-vit-large-patch14 | CLIP | 427.62M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224). |
clip-vit-large-patch14-336 | CLIP | 427.94M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336). |
retinanet_resnet50_pascalvoc | RetinaNet | 35.60M | RetinaNet with a ResNet50 v1 backbone. Trained on the PascalVOC 2012 object detection task, which consists of 20 classes. This model achieves a final mAP of 0.33 on the evaluation set. |
yolo_v8_m_pascalvoc | YOLOV8Detector | 25.90M | YOLOV8-M pretrained on the PascalVOC 2012 object detection task, which consists of 20 classes. This model achieves a final mAP of 0.45 on the evaluation set. |
deeplab_v3_plus_resnet50_pascalvoc | DeepLabV3Plus | 39.19M | DeepLabV3Plus with a ResNet50 v2 backbone. Trained on the PascalVOC 2012 semantic segmentation task, which consists of 20 classes and one background class. This model achieves a final categorical accuracy of 89.34% and mIoU of 0.6391 on the evaluation dataset. This preset is only compatible with Keras 3. |
segformer_b0_imagenet | SegFormerB0 | 3.72M | SegFormer model with a pretrained MiTB0 backbone. |
segformer_b0 | SegFormerB0 | 3.72M | SegFormer model with MiTB0 backbone. |
segformer_b1 | SegFormerB1 | 13.68M | SegFormer model with MiTB1 backbone. |
segformer_b2 | SegFormerB2 | 24.73M | SegFormer model with MiTB2 backbone. |
segformer_b3 | SegFormerB3 | 44.60M | SegFormer model with MiTB3 backbone. |
segformer_b4 | SegFormerB4 | 61.37M | SegFormer model with MiTB4 backbone. |
segformer_b5 | SegFormerB5 | 81.97M | SegFormer model with MiTB5 backbone. |
sam_base_sa1b | SAM | 93.74M | The base SAM model trained on the SA1B dataset. |
sam_large_sa1b | SAM | 312.34M | The large SAM model trained on the SA1B dataset. |
sam_huge_sa1b | SAM | 641.09M | The huge SAM model trained on the SA1B dataset. |
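Finally, a complete inference example using one of the classifier presets above. This is a minimal sketch with a random batch standing in for real images; the 224x224 input size is illustrative:

```python
import numpy as np
import keras_cv

classifier = keras_cv.models.ImageClassifier.from_preset(
    "efficientnetv2_b0_imagenet_classifier"
)

# Illustrative random batch; real inputs would be actual images.
images = np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype("float32")
scores = classifier.predict(images)  # one score per ImageNet class
```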