KerasCV Models

KerasCV contains end-to-end implementations of popular model architectures. These models can be created in two ways:

  • Through the from_preset() constructor, which instantiates an object with a pretrained configuration and (optionally) weights. Available preset names are listed on this page.
model = keras_cv.models.RetinaNet.from_preset(
    "resnet50_v2_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)
  • Through custom configuration controlled by the user. To do this, simply pass the desired configuration parameters to the default constructors of the symbols documented below.
backbone = keras_cv.models.ResNetBackbone(
    stackwise_filters=[64, 128, 256, 512],
    stackwise_blocks=[2, 2, 2, 2],
    stackwise_strides=[1, 2, 2, 2],
    include_rescaling=False,
)
model = keras_cv.models.RetinaNet(
    backbone=backbone,
    num_classes=20,
    bounding_box_format="xywh",
)

Backbone presets

Each of the following preset names corresponds to a configuration and weights for a backbone model.

The names below can be used with the from_preset() constructor to obtain the corresponding backbone model.

backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")
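
Backbone models load without a classification head, so they can also be dropped into your own Keras models as feature extractors. The following is a minimal sketch rather than part of the API reference; the 224x224 input size and the 10-class Dense head are placeholder choices.

import keras
import keras_cv

# Load a pretrained backbone; it outputs the final feature map, not logits.
backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")

# Placeholder input size and class count, for illustration only.
inputs = keras.Input(shape=(224, 224, 3))
features = backbone(inputs)                      # 4D feature map
pooled = keras.layers.GlobalAveragePooling2D()(features)
outputs = keras.layers.Dense(10, activation="softmax")(pooled)
classifier = keras.Model(inputs, outputs)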

For brevity, we do not include presets without pretrained weights in the following table.

Note: all pretrained weights should be used with unnormalized pixel intensities in the range [0, 255] if include_rescaling=True, or with pixel intensities in the range [0, 1] if include_rescaling=False.
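
As a concrete illustration of this note, the sketch below feeds raw pixels to a preset. Whether a particular preset was exported with include_rescaling=True is an assumption you should verify against its configuration.

import numpy as np
import keras_cv

backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")

# Assuming this preset was built with include_rescaling=True, raw RGB
# intensities in [0, 255] can be passed directly, with no manual scaling.
raw_images = np.random.uniform(0, 255, size=(2, 224, 224, 3)).astype("float32")
features = backbone(raw_images)

# For a backbone built with include_rescaling=False, scale inputs to [0, 1] first:
# features = backbone(raw_images / 255.0)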

Preset name Model Parameters Description
csp_darknet_l_imagenet CSPDarkNet 27.11M CSPDarkNet model with [128, 256, 512, 1024] channels and [3, 9, 9, 3] depths where the batch normalization and SiLU activation are applied after the convolution layers. Trained on Imagenet 2012 classification task.
csp_darknet_tiny_imagenet CSPDarkNet 2.38M CSPDarkNet model with [48, 96, 192, 384] channels and [1, 3, 3, 1] depths where the batch normalization and SiLU activation are applied after the convolution layers. Trained on Imagenet 2012 classification task.
densenet121_imagenet DenseNet Unknown DenseNet model with 121 layers. Trained on Imagenet 2012 classification task.
densenet169_imagenet DenseNet Unknown DenseNet model with 169 layers. Trained on Imagenet 2012 classification task.
densenet201_imagenet DenseNet Unknown DenseNet model with 201 layers. Trained on Imagenet 2012 classification task.
efficientnetv2_b0_imagenet EfficientNetV2 5.92M EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 77.1% top 1 accuracy and 93.3% top 5 accuracy on imagenet.
efficientnetv2_b1_imagenet EfficientNetV2 6.93M EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 79.1% top 1 accuracy and 94.4% top 5 accuracy on imagenet.
efficientnetv2_b2_imagenet EfficientNetV2 8.77M EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 80.1% top 1 accuracy and 94.9% top 5 accuracy on imagenet.
efficientnetv2_s_imagenet EfficientNetV2 20.33M EfficientNet architecture with 6 convolutional blocks. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 83.9% top 1 accuracy and 96.7% top 5 accuracy on imagenet.
mit_b0_imagenet MiT 3.32M MiT (MixTransformer) model with 8 transformer blocks. Pre-trained on ImageNet-1K and scores 69% top-1 accuracy on the validation set.
mobilenet_v3_large_imagenet MobileNetV3 2.99M MobileNetV3 model with 28 layers where the batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task.
mobilenet_v3_small_imagenet MobileNetV3 933.50K MobileNetV3 model with 14 layers where the batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task.
resnet50_imagenet ResNetV1 23.56M ResNet model with 50 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). Trained on Imagenet 2012 classification task.
resnet50_v2_imagenet ResNetV2 23.56M ResNet model with 50 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). Trained on Imagenet 2012 classification task.
videoswin_base_kinetics400 VideoSwinB 87.64M A base Video Swin backbone architecture. It is pretrained on ImageNet 1K dataset, and trained on Kinetics 400 dataset. Published weight is capable of scoring 80.6% top1 and 94.6% top5 accuracy on the Kinetics 400 dataset
videoswin_small_kinetics400 VideoSwinS 49.51M A small Video Swin backbone architecture. It is pretrained on ImageNet 1K dataset, and trained on Kinetics 400 dataset. Published weight is capable of scoring 80.6% top1 and 94.5% top5 accuracy on the Kinetics 400 dataset
videoswin_tiny_kinetics400 VideoSwinT 27.85M A tiny Video Swin backbone architecture. It is pretrained on ImageNet 1K dataset, and trained on Kinetics 400 dataset.
videoswin_base_kinetics400_imagenet22k VideoSwinB 87.64M A base Video Swin backbone architecture. It is pretrained on ImageNet 22K dataset, and trained on Kinetics 400 dataset. Published weight is capable of scoring 82.7% top1 and 95.5% top5 accuracy on the Kinetics 400 dataset
videoswin_base_kinetics600_imagenet22k VideoSwinB 87.64M A base Video Swin backbone architecture. It is pretrained on ImageNet 22K dataset, and trained on Kinetics 600 dataset. Published weight is capable of scoring 84.0% top1 and 96.5% top5 accuracy on the Kinetics 600 dataset
videoswin_base_something_something_v2 VideoSwinB 87.64M A base Video Swin backbone architecture. It is pretrained on Kinetics 400 dataset, and trained on Something Something V2 dataset. Published weight is capable of scoring 69.6% top1 and 92.7% top5 accuracy on the Kinetics 400 dataset
vitdet_base_sa1b VitDet 89.67M A base Detectron2 ViT backbone trained on the SA1B dataset.
vitdet_huge_sa1b VitDet 637.03M A huge Detectron2 ViT backbone trained on the SA1B dataset.
vitdet_large_sa1b VitDet 308.28M A large Detectron2 ViT backbone trained on the SA1B dataset.
yolo_v8_xs_backbone_coco YOLOV8 1.28M An extra small YOLOV8 backbone pretrained on COCO
yolo_v8_s_backbone_coco YOLOV8 5.09M A small YOLOV8 backbone pretrained on COCO
yolo_v8_m_backbone_coco YOLOV8 11.87M A medium YOLOV8 backbone pretrained on COCO
yolo_v8_l_backbone_coco YOLOV8 19.83M A large YOLOV8 backbone pretrained on COCO
yolo_v8_xl_backbone_coco YOLOV8 30.97M An extra large YOLOV8 backbone pretrained on COCO

Task presets

Each of the following preset names corresponds to a configuration and weights for a task model. These models are application-ready, but can be further fine-tuned if desired.

The names below can be used with the from_preset() constructor to obtain the corresponding task model.

object_detector = keras_cv.models.RetinaNet.from_preset(
    "retinanet_resnet50_pascalvoc",
    bounding_box_format="xywh",
)
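
Once loaded, a task preset can be used for inference directly. The following is a rough sketch on random data: the prediction dictionary keys shown are assumptions based on the KerasCV object detection API, and the 512x512 input size is an arbitrary choice.

import numpy as np

# Stand-in for a batch of unnormalized RGB images in [0, 255].
images = np.random.uniform(0, 255, size=(1, 512, 512, 3)).astype("float32")

# Detectors return a dictionary of padded tensors; the key names ("boxes",
# "classes", "confidence", "num_detections") are assumed here.
predictions = object_detector.predict(images)
print(predictions["boxes"].shape)  # (batch, max_detections, 4), "xywh" format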

Note that all backbone presets are also applicable to the tasks. For example, you can directly use a ResNetBackbone preset with the RetinaNet. In this case, fine-tuning is necessary, since the task-specific layers will be randomly initialized.

model = keras_cv.models.RetinaNet.from_preset(
    "resnet50_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)
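
Because the detection head above starts from random weights, the model must be fine-tuned on your own data before use. A minimal compile sketch follows: the "focal" and "smoothl1" loss identifiers mirror the KerasCV object detection guide, while the optimizer settings and the train_ds pipeline are placeholders.

import keras

model.compile(
    classification_loss="focal",
    box_loss="smoothl1",
    optimizer=keras.optimizers.SGD(global_clipnorm=10.0),
)
# Hypothetical tf.data pipeline yielding {"images": ..., "bounding_boxes": ...}:
# model.fit(train_ds, epochs=10)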

For brevity, we do not include the backbone presets in the following table.

Note: all pretrained weights should be used with unnormalized pixel intensities in the range [0, 255] if include_rescaling=True, or with pixel intensities in the range [0, 1] if include_rescaling=False.

Preset name Model Parameters Description
resnet50_v2_imagenet_classifier ImageClassifier 25.61M ResNet classifier with 50 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). Trained on Imagenet 2012 classification task.
efficientnetv2_s_imagenet_classifier ImageClassifier 21.61M ImageClassifier using the EfficientNet small architecture. In this variant of the EfficientNet architecture, there are 6 convolutional blocks. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 83.9% top 1 accuracy and 96.7% top 5 accuracy on imagenet.
efficientnetv2_b0_imagenet_classifier ImageClassifier 7.20M ImageClassifier using the EfficientNet B0 architecture. In this variant of the EfficientNet architecture, there are 6 convolutional blocks. As with all of the B style EfficientNet variants, the number of filters in each convolutional block is scaled by width_coefficient=1.0 and depth_coefficient=1.0. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 77.1% top 1 accuracy and 93.3% top 5 accuracy on imagenet.
efficientnetv2_b1_imagenet_classifier ImageClassifier 8.21M ImageClassifier using the EfficientNet B1 architecture. In this variant of the EfficientNet architecture, there are 6 convolutional blocks. As with all of the B style EfficientNet variants, the number of filters in each convolutional block is scaled by width_coefficient=1.0 and depth_coefficient=1.1. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 79.1% top 1 accuracy and 94.4% top 5 accuracy on imagenet.
efficientnetv2_b2_imagenet_classifier ImageClassifier 10.18M ImageClassifier using the EfficientNet B2 architecture. In this variant of the EfficientNet architecture, there are 6 convolutional blocks. As with all of the B style EfficientNet variants, the number of filters in each convolutional block is scaled by width_coefficient=1.1 and depth_coefficient=1.2. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 80.1% top 1 accuracy and 94.9% top 5 accuracy on imagenet.
mobilenet_v3_large_imagenet_classifier ImageClassifier 3.96M ImageClassifier using the MobileNetV3Large architecture. This preset uses a Dense layer as a classification head instead of the typical fully-convolutional MobileNet head. As a result, it has fewer parameters than the original MobileNetV3Large model, which has 5.4 million parameters. Published weights are capable of scoring 69.4% top-1 accuracy and 89.4% top 5 accuracy on imagenet.
videoswin_tiny_kinetics400_classifier VideoClassifier 28.16M A tiny Video Swin architecture. It is pretrained on ImageNet 1K dataset, and trained on Kinetics 400 dataset. Published weight is capable of scoring 78.8% top1 and 93.6% top5 accuracy on the Kinetics 400 dataset
videoswin_small_kinetics400_classifier VideoClassifier 49.82M A small Video Swin architecture. It is pretrained on ImageNet 1K dataset, and trained on Kinetics 400 dataset. Published weight is capable of scoring 80.6% top1 and 94.5% top5 accuracy on the Kinetics 400 dataset
videoswin_base_kinetics400_classifier VideoClassifier 89.07M A base Video Swin architecture. It is pretrained on ImageNet 1K dataset, and trained on Kinetics 400 dataset. Published weight is capable of scoring 80.6% top1 and 94.6% top5 accuracy on the Kinetics 400 dataset
videoswin_base_something_something_v2_classifier VideoClassifier 88.83M A base Video Swin architecture. It is pretrained on Kinetics 400 dataset, and trained on Something Something V2 dataset. Published weight is capable of scoring 69.6% top1 and 92.7% top5 accuracy on the Kinetics 400 dataset
clip-vit-base-patch16 CLIP 149.62M The model uses a ViT-B/16 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 16 and input images of size (224, 224)
clip-vit-base-patch32 CLIP 151.28M The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224)
clip-vit-large-patch14 CLIP 427.62M The model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224)
clip-vit-large-patch14-336 CLIP 427.94M The model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336)
retinanet_resnet50_pascalvoc RetinaNet 35.60M RetinaNet with a ResNet50 v1 backbone. Trained on PascalVOC 2012 object detection task, which consists of 20 classes. This model achieves a final MaP of 0.33 on the evaluation set.
yolo_v8_m_pascalvoc YOLOV8Detector 25.90M YOLOV8-M pretrained on PascalVOC 2012 object detection task, which consists of 20 classes. This model achieves a final MaP of 0.45 on the evaluation set.
deeplab_v3_plus_resnet50_pascalvoc DeepLabV3Plus 39.19M DeeplabV3Plus with a ResNet50 v2 backbone. Trained on PascalVOC 2012 Semantic segmentation task, which consists of 20 classes and one background class. This model achieves a final categorical accuracy of 89.34% and mIoU of 0.6391 on the evaluation dataset. This preset is only compatible with Keras 3.
segformer_b0_imagenet SegFormerB0 3.72M SegFormer model with a pretrained MiTB0 backbone.
segformer_b0 SegFormerB0 3.72M SegFormer model with MiTB0 backbone.
segformer_b1 SegFormerB1 13.68M SegFormer model with MiTB1 backbone.
segformer_b2 SegFormerB2 24.73M SegFormer model with MiTB2 backbone.
segformer_b3 SegFormerB3 44.60M SegFormer model with MiTB3 backbone.
segformer_b4 SegFormerB4 61.37M SegFormer model with MiTB4 backbone.
segformer_b5 SegFormerB5 81.97M SegFormer model with MiTB5 backbone.
sam_base_sa1b SAM 93.74M The base SAM model trained on the SA1B dataset.
sam_large_sa1b SAM 312.34M The large SAM model trained on the SA1B dataset.
sam_huge_sa1b SAM 641.09M The huge SAM model trained on the SA1B dataset.

API documentation

Tasks

Backbones