KerasCV contains end-to-end implementations of popular model architectures. These models can be created in two ways:

- Through the `from_preset()` constructor, which instantiates an object with a pre-trained configuration and, optionally, weights. Available preset names are listed on this page.

```python
model = keras_cv.models.RetinaNet.from_preset(
    "resnet50_v2_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)
```
- Through custom configuration controlled by the user:

```python
backbone = keras_cv.models.ResNetBackbone(
    stackwise_filters=[64, 128, 256, 512],
    stackwise_blocks=[2, 2, 2, 2],
    stackwise_strides=[1, 2, 2, 2],
    include_rescaling=False,
)
model = keras_cv.models.RetinaNet(
    backbone=backbone,
    num_classes=20,
    bounding_box_format="xywh",
)
```
Each of the following preset names corresponds to a configuration and weights for a backbone model. The names below can be used with the `from_preset()` constructor to get the corresponding backbone model.

```python
backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")
```

For brevity, presets without pretrained weights are not included in the table below.
Note: all pretrained weights should be used with unnormalized pixel intensities in the range [0, 255] if `include_rescaling=True`, or in the range [0, 1] if `include_rescaling=False`.
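To make the note concrete, here is a minimal sketch that loads a backbone preset and feeds it raw pixel values. It assumes the preset keeps the default `include_rescaling=True`; the input size and random batch are purely illustrative.

```python
import numpy as np
import keras_cv

# Assumes the preset defaults to include_rescaling=True, so raw
# [0, 255] pixel intensities can be passed without manual scaling.
backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")

images = np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype("float32")
features = backbone(images)  # feature maps from the backbone's final stage
```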
Preset name | Model | Parameters | Description |
---|---|---|---|
csp_darknet_l_imagenet | CSPDarkNet | 27.11M | CSPDarkNet model with [128, 256, 512, 1024] channels and [3, 9, 9, 3] depths where the batch normalization and SiLU activation are applied after the convolution layers. Trained on Imagenet 2012 classification task. |
csp_darknet_tiny_imagenet | CSPDarkNet | 2.38M | CSPDarkNet model with [48, 96, 192, 384] channels and [1, 3, 3, 1] depths where the batch normalization and SiLU activation are applied after the convolution layers. Trained on Imagenet 2012 classification task. |
densenet121_imagenet | Unknown | Unknown | DenseNet model with 121 layers. Trained on Imagenet 2012 classification task. |
densenet169_imagenet | Unknown | Unknown | DenseNet model with 169 layers. Trained on Imagenet 2012 classification task. |
densenet201_imagenet | Unknown | Unknown | DenseNet model with 201 layers. Trained on Imagenet 2012 classification task. |
efficientnetv2_b0_imagenet | EfficientNetV2 | 5.92M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 77.1% top 1 accuracy and 93.3% top 5 accuracy on imagenet. |
efficientnetv2_b1_imagenet | EfficientNetV2 | 6.93M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 79.1% top 1 accuracy and 94.4% top 5 accuracy on imagenet. |
efficientnetv2_b2_imagenet | EfficientNetV2 | 8.77M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 80.1% top 1 accuracy and 94.9% top 5 accuracy on imagenet. |
efficientnetv2_s_imagenet | EfficientNetV2 | 20.33M | EfficientNet architecture with 6 convolutional blocks. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 83.9% top 1 accuracy and 96.7% top 5 accuracy on imagenet. |
mit_b0_imagenet | MiT | 3.32M | MiT (MixTransformer) model with 8 transformer blocks. Pre-trained on ImageNet-1K and scores 69% top-1 accuracy on the validation set. |
mobilenet_v3_large_imagenet | MobileNetV3 | 2.99M | MobileNetV3 model with 28 layers where the batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task. |
mobilenet_v3_small_imagenet | MobileNetV3 | 933.50K | MobileNetV3 model with 14 layers where the batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task. |
resnet50_imagenet | ResNetV1 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). Trained on Imagenet 2012 classification task. |
resnet50_v2_imagenet | ResNetV2 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). Trained on Imagenet 2012 classification task. |
videoswin_base_kinetics400 | VideoSwinB | 87.64M | A base Video Swin backbone architecture. It is pretrained on the ImageNet 1K dataset and then trained on the Kinetics 400 dataset. The published weights score 80.6% top-1 and 94.6% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_small_kinetics400 | VideoSwinS | 49.51M | A small Video Swin backbone architecture. It is pretrained on the ImageNet 1K dataset and then trained on the Kinetics 400 dataset. The published weights score 80.6% top-1 and 94.5% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_tiny_kinetics400 | VideoSwinT | 27.85M | A tiny Video Swin backbone architecture. It is pretrained on the ImageNet 1K dataset and then trained on the Kinetics 400 dataset. |
videoswin_base_kinetics400_imagenet22k | VideoSwinB | 87.64M | A base Video Swin backbone architecture. It is pretrained on the ImageNet 22K dataset and then trained on the Kinetics 400 dataset. The published weights score 82.7% top-1 and 95.5% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_base_kinetics600_imagenet22k | VideoSwinB | 87.64M | A base Video Swin backbone architecture. It is pretrained on the ImageNet 22K dataset and then trained on the Kinetics 600 dataset. The published weights score 84.0% top-1 and 96.5% top-5 accuracy on the Kinetics 600 dataset. |
videoswin_base_something_something_v2 | VideoSwinB | 87.64M | A base Video Swin backbone architecture. It is pretrained on the Kinetics 400 dataset and then trained on the Something Something V2 dataset. The published weights score 69.6% top-1 and 92.7% top-5 accuracy on the Something Something V2 dataset. |
vitdet_base_sa1b | VitDet | 89.67M | A base Detectron2 ViT backbone trained on the SA1B dataset. |
vitdet_huge_sa1b | VitDet | 637.03M | A huge Detectron2 ViT backbone trained on the SA1B dataset. |
vitdet_large_sa1b | VitDet | 308.28M | A large Detectron2 ViT backbone trained on the SA1B dataset. |
yolo_v8_xs_backbone_coco | YOLOV8 | 1.28M | An extra small YOLOV8 backbone pretrained on COCO. |
yolo_v8_s_backbone_coco | YOLOV8 | 5.09M | A small YOLOV8 backbone pretrained on COCO. |
yolo_v8_m_backbone_coco | YOLOV8 | 11.87M | A medium YOLOV8 backbone pretrained on COCO. |
yolo_v8_l_backbone_coco | YOLOV8 | 19.83M | A large YOLOV8 backbone pretrained on COCO. |
yolo_v8_xl_backbone_coco | YOLOV8 | 30.97M | An extra large YOLOV8 backbone pretrained on COCO. |
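A pretrained backbone can also be combined with a new task head. A minimal sketch, assuming the `keras_cv.models.ImageClassifier` task accepts a `backbone` and `num_classes` in the same constructor style shown earlier, and using a hypothetical 10-class target task:

```python
import keras_cv

# Load a pretrained backbone, then attach a randomly initialized
# classification head for a hypothetical 10-class problem.
backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")
classifier = keras_cv.models.ImageClassifier(
    backbone=backbone,
    num_classes=10,  # hypothetical target task
)
```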
Each of the following preset names corresponds to a configuration and weights for a task model. These models are application-ready, but can be further fine-tuned if desired. The names below can be used with the `from_preset()` constructor to get the corresponding task model.
```python
object_detector = keras_cv.models.RetinaNet.from_preset(
    "retinanet_resnet50_pascalvoc",
    bounding_box_format="xywh",
)
```
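The loaded detector can run inference immediately via the standard Keras `predict()` call. A minimal sketch with a random batch standing in for real images; KerasCV detectors return a dictionary of batched prediction tensors:

```python
import numpy as np

# Illustrative input; real use would pass actual preprocessed images.
images = np.random.uniform(0, 255, size=(1, 512, 512, 3)).astype("float32")

# Returns a dict of batched predictions; the boxes follow the
# "xywh" bounding_box_format requested above.
predictions = object_detector.predict(images)
```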
Note that all backbone presets are also applicable to the tasks. For example, you can use a `ResNetBackbone` preset directly with `RetinaNet`. In this case, fine-tuning is necessary, since the task-specific layers will be randomly initialized.

```python
object_detector = keras_cv.models.RetinaNet.from_preset(
    "resnet50_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)
```
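Because the detection head starts from random weights here, the model needs training before use. A minimal fine-tuning sketch, assuming KerasCV's string shorthands for the RetinaNet losses ("focal" for classification, "smoothl1" for box regression) and a hypothetical train_ds dataset of images and bounding boxes:

```python
object_detector.compile(
    classification_loss="focal",   # focal loss for the class head
    box_loss="smoothl1",           # smooth L1 loss for the box head
    optimizer="adam",
)
# train_ds is a hypothetical tf.data.Dataset yielding
# (images, bounding_boxes) in the "xywh" format.
object_detector.fit(train_ds, epochs=10)
```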
For brevity, backbone presets are not included in the table below.

Note: all pretrained weights should be used with unnormalized pixel intensities in the range [0, 255] if `include_rescaling=True`, or in the range [0, 1] if `include_rescaling=False`.
Preset name | Model | Parameters | Description |
---|---|---|---|
resnet50_v2_imagenet_classifier | ImageClassifier | 25.61M | ResNet classifier with 50 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). Trained on Imagenet 2012 classification task. |
efficientnetv2_s_imagenet_classifier | ImageClassifier | 21.61M | ImageClassifier using the EfficientNet small architecture. In this variant of the EfficientNet architecture, there are 6 convolutional blocks. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 83.9% top 1 accuracy and 96.7% top 5 accuracy on imagenet. |
efficientnetv2_b0_imagenet_classifier | ImageClassifier | 7.20M | ImageClassifier using the EfficientNet B0 architecture. In this variant of the EfficientNet architecture, there are 6 convolutional blocks. As with all of the B style EfficientNet variants, the number of filters in each convolutional block is scaled by width_coefficient=1.0 and depth_coefficient=1.0 . Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 77.1% top 1 accuracy and 93.3% top 5 accuracy on imagenet. |
efficientnetv2_b1_imagenet_classifier | ImageClassifier | 8.21M | ImageClassifier using the EfficientNet B1 architecture. In this variant of the EfficientNet architecture, there are 6 convolutional blocks. As with all of the B style EfficientNet variants, the number of filters in each convolutional block is scaled by width_coefficient=1.0 and depth_coefficient=1.1. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 79.1% top 1 accuracy and 94.4% top 5 accuracy on imagenet. |
efficientnetv2_b2_imagenet_classifier | ImageClassifier | 10.18M | ImageClassifier using the EfficientNet B2 architecture. In this variant of the EfficientNet architecture, there are 6 convolutional blocks. As with all of the B style EfficientNet variants, the number of filters in each convolutional block is scaled by width_coefficient=1.1 and depth_coefficient=1.2. Weights are initialized to pretrained imagenet classification weights. Published weights are capable of scoring 80.1% top 1 accuracy and 94.9% top 5 accuracy on imagenet. |
mobilenet_v3_large_imagenet_classifier | ImageClassifier | 3.96M | ImageClassifier using the MobileNetV3Large architecture. This preset uses a Dense layer as a classification head instead of the typical fully-convolutional MobileNet head. As a result, it has fewer parameters than the original MobileNetV3Large model, which has 5.4 million parameters. Published weights are capable of scoring 69.4% top 1 accuracy and 89.4% top 5 accuracy on imagenet. |
videoswin_tiny_kinetics400_classifier | VideoClassifier | 28.16M | A tiny Video Swin architecture. It is pretrained on the ImageNet 1K dataset and then trained on the Kinetics 400 dataset. The published weights score 78.8% top-1 and 93.6% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_small_kinetics400_classifier | VideoClassifier | 49.82M | A small Video Swin architecture. It is pretrained on the ImageNet 1K dataset and then trained on the Kinetics 400 dataset. The published weights score 80.6% top-1 and 94.5% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_base_kinetics400_classifier | VideoClassifier | 89.07M | A base Video Swin architecture. It is pretrained on the ImageNet 1K dataset and then trained on the Kinetics 400 dataset. The published weights score 80.6% top-1 and 94.6% top-5 accuracy on the Kinetics 400 dataset. |
videoswin_base_something_something_v2_classifier | VideoClassifier | 88.83M | A base Video Swin architecture. It is pretrained on the Kinetics 400 dataset and then trained on the Something Something V2 dataset. The published weights score 69.6% top-1 and 92.7% top-5 accuracy on the Something Something V2 dataset. |
clip-vit-base-patch16 | CLIP | 149.62M | The model uses a ViT-B/16 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 16 and input images of size (224, 224). |
clip-vit-base-patch32 | CLIP | 151.28M | The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224). |
clip-vit-large-patch14 | CLIP | 427.62M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224). |
clip-vit-large-patch14-336 | CLIP | 427.94M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336). |
retinanet_resnet50_pascalvoc | RetinaNet | 35.60M | RetinaNet with a ResNet50 v1 backbone. Trained on the PascalVOC 2012 object detection task, which consists of 20 classes. This model achieves a final mAP of 0.33 on the evaluation set. |
yolo_v8_m_pascalvoc | YOLOV8Detector | 25.90M | YOLOV8-M pretrained on the PascalVOC 2012 object detection task, which consists of 20 classes. This model achieves a final mAP of 0.45 on the evaluation set. |
deeplab_v3_plus_resnet50_pascalvoc | DeepLabV3Plus | 39.19M | DeepLabV3Plus with a ResNet50 v2 backbone. Trained on the PascalVOC 2012 semantic segmentation task, which consists of 20 classes and one background class. This model achieves a final categorical accuracy of 89.34% and mIoU of 0.6391 on the evaluation dataset. This preset is only compatible with Keras 3. |
segformer_b0_imagenet | SegFormerB0 | 3.72M | SegFormer model with a pretrained MiTB0 backbone. |
segformer_b0 | SegFormerB0 | 3.72M | SegFormer model with MiTB0 backbone. |
segformer_b1 | SegFormerB1 | 13.68M | SegFormer model with MiTB1 backbone. |
segformer_b2 | SegFormerB2 | 24.73M | SegFormer model with MiTB2 backbone. |
segformer_b3 | SegFormerB3 | 44.60M | SegFormer model with MiTB3 backbone. |
segformer_b4 | SegFormerB4 | 61.37M | SegFormer model with MiTB4 backbone. |
segformer_b5 | SegFormerB5 | 81.97M | SegFormer model with MiTB5 backbone. |
sam_base_sa1b | SAM | 93.74M | The base SAM model trained on the SA1B dataset. |
sam_large_sa1b | SAM | 312.34M | The large SAM model trained on the SA1B dataset. |
sam_huge_sa1b | SAM | 641.09M | The huge SAM model trained on the SA1B dataset. |
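Finally, a complete inference example using one of the classifier presets above. This is a minimal sketch with a random batch standing in for real images; the 224x224 input size is illustrative:

```python
import numpy as np
import keras_cv

classifier = keras_cv.models.ImageClassifier.from_preset(
    "efficientnetv2_b0_imagenet_classifier"
)

# Illustrative random batch; real inputs would be actual images.
images = np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype("float32")
scores = classifier.predict(images)  # one score per ImageNet class
```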