📦 Segmentation Models¶

Unet¶

class segmentation_models_pytorch.Unet(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_use_batchnorm=True, decoder_channels=(256, 128, 64, 32, 16), decoder_attention_type=None, in_channels=3, classes=1, activation=None, aux_params=None)[source]¶

Unet is a fully convolutional neural network for image semantic segmentation. It consists of encoder and decoder parts connected with skip connections. The encoder extracts features of different spatial resolutions (skip connections), which are used by the decoder to produce an accurate segmentation mask. Concatenation is used to fuse decoder blocks with skip connections.

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • decoder_channels – List of integers specifying the in_channels parameter for convolutions used in the decoder. The length of the list should be the same as encoder_depth

  • decoder_use_batchnorm – If True, a BatchNorm2d layer is used between Conv2D and Activation layers. If “inplace”, InplaceABN will be used, which allows decreasing memory consumption. Available options are True, False, “inplace”

  • decoder_attention_type – Attention module used in decoder of the model. Available options are None and scse. SCSE paper - https://arxiv.org/abs/1808.08127

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

Unet

Return type

torch.nn.Module
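
A minimal usage sketch (the encoder name, class count, and input size are illustrative; spatial dimensions divisible by 32 are assumed for the default encoder_depth=5):

    import torch
    import segmentation_models_pytorch as smp

    # Unet with an ImageNet-pretrained ResNet-34 encoder and 2 output classes
    model = smp.Unet(encoder_name="resnet34", encoder_weights="imagenet",
                     in_channels=3, classes=2)

    x = torch.randn(1, 3, 256, 256)   # (N, C, H, W); H and W divisible by 32
    mask = model(x)                   # -> (1, 2, 256, 256), raw logits since activation=None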

Unet++¶

class segmentation_models_pytorch.UnetPlusPlus(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_use_batchnorm=True, decoder_channels=(256, 128, 64, 32, 16), decoder_attention_type=None, in_channels=3, classes=1, activation=None, aux_params=None)[source]¶

Unet++ is a fully convolutional neural network for image semantic segmentation. It consists of encoder and decoder parts connected with skip connections. The encoder extracts features of different spatial resolutions (skip connections), which are used by the decoder to produce an accurate segmentation mask. The decoder of Unet++ is more complex than in the usual Unet.

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • decoder_channels – List of integers specifying the in_channels parameter for convolutions used in the decoder. The length of the list should be the same as encoder_depth

  • decoder_use_batchnorm – If True, a BatchNorm2d layer is used between Conv2D and Activation layers. If “inplace”, InplaceABN will be used, which allows decreasing memory consumption. Available options are True, False, “inplace”

  • decoder_attention_type – Attention module used in decoder of the model. Available options are None and scse. SCSE paper - https://arxiv.org/abs/1808.08127

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

Unet++

Return type

torch.nn.Module

Reference:

https://arxiv.org/abs/1807.10165
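
A sketch showing the decoder attention option together with an auxiliary classification head (parameter values are illustrative; with aux_params set, the model returns a (mask, label) pair):

    import torch
    import segmentation_models_pytorch as smp

    # Unet++ with SCSE attention in the decoder and an auxiliary classification head
    model = smp.UnetPlusPlus(
        encoder_name="resnet34",
        decoder_attention_type="scse",
        classes=1,
        aux_params={"classes": 4, "pooling": "avg", "dropout": 0.2, "activation": None},
    )

    x = torch.randn(2, 3, 224, 224)
    mask, label = model(x)            # mask: (2, 1, 224, 224), label: (2, 4)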

MAnet¶

class segmentation_models_pytorch.MAnet(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_use_batchnorm=True, decoder_channels=(256, 128, 64, 32, 16), decoder_pab_channels=64, in_channels=3, classes=1, activation=None, aux_params=None)[source]¶

MAnet: Multi-scale Attention Net. MA-Net can capture rich contextual dependencies based on the attention mechanism, using two blocks:

  • Position-wise Attention Block (PAB), which captures the spatial dependencies between pixels in a global view

  • Multi-scale Fusion Attention Block (MFAB), which captures the channel dependencies between any feature map by multi-scale semantic feature fusion

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • decoder_channels – List of integers specifying the in_channels parameter for convolutions used in the decoder. The length of the list should be the same as encoder_depth

  • decoder_use_batchnorm – If True, a BatchNorm2d layer is used between Conv2D and Activation layers. If “inplace”, InplaceABN will be used, which allows decreasing memory consumption. Available options are True, False, “inplace”

  • decoder_pab_channels – A number of channels for PAB module in decoder. Default is 64.

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

MAnet

Return type

torch.nn.Module
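
A short sketch with a non-default PAB width (values are illustrative):

    import torch
    import segmentation_models_pytorch as smp

    # MAnet with a wider Position-wise Attention Block in the decoder
    model = smp.MAnet(encoder_name="resnet34", decoder_pab_channels=128,
                      in_channels=3, classes=3)

    mask = model(torch.randn(1, 3, 256, 256))   # -> (1, 3, 256, 256)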

Linknet¶

class segmentation_models_pytorch.Linknet(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_use_batchnorm=True, in_channels=3, classes=1, activation=None, aux_params=None)[source]¶

Linknet is a fully convolutional neural network for image semantic segmentation. It consists of encoder and decoder parts connected with skip connections. The encoder extracts features of different spatial resolutions (skip connections), which are used by the decoder to produce an accurate segmentation mask. Summation is used to fuse decoder blocks with skip connections.

Note

This implementation has 4 skip connections by default (the original has 3).

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • decoder_use_batchnorm – If True, a BatchNorm2d layer is used between Conv2D and Activation layers. If “inplace”, InplaceABN will be used, which allows decreasing memory consumption. Available options are True, False, “inplace”

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

Linknet

Return type

torch.nn.Module
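
A sketch with a single-channel input (values are illustrative; pretrained weights are disabled here since the input is not RGB):

    import torch
    import segmentation_models_pytorch as smp

    # Linknet for single-channel (e.g. grayscale) input; skip connections are fused by summation
    model = smp.Linknet(encoder_name="resnet34", encoder_weights=None,
                        in_channels=1, classes=1)

    mask = model(torch.randn(1, 1, 256, 256))   # -> (1, 1, 256, 256)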

FPN¶

class segmentation_models_pytorch.FPN(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_pyramid_channels=256, decoder_segmentation_channels=128, decoder_merge_policy='add', decoder_dropout=0.2, in_channels=3, classes=1, activation=None, upsampling=4, aux_params=None)[source]¶

FPN is a fully convolutional neural network for image semantic segmentation.

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • decoder_pyramid_channels – A number of convolution filters in Feature Pyramid of FPN

  • decoder_segmentation_channels – A number of convolution filters in segmentation blocks of FPN

  • decoder_merge_policy – Determines how to merge pyramid features inside FPN. Available options are add and cat

  • decoder_dropout – Spatial dropout rate in range (0, 1) for feature pyramid in FPN

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • upsampling – Final upsampling factor. Default is 4 to preserve input-output spatial shape identity

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

FPN

Return type

torch.nn.Module
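
A sketch using the cat merge policy instead of the default add (values are illustrative):

    import torch
    import segmentation_models_pytorch as smp

    # FPN that concatenates (rather than adds) pyramid features before the segmentation blocks
    model = smp.FPN(encoder_name="resnet34", decoder_merge_policy="cat",
                    decoder_dropout=0.2, classes=5)

    mask = model(torch.randn(1, 3, 256, 256))   # -> (1, 5, 256, 256), upsampled x4 to the input size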

PSPNet¶

class segmentation_models_pytorch.PSPNet(encoder_name='resnet34', encoder_weights='imagenet', encoder_depth=3, psp_out_channels=512, psp_use_batchnorm=True, psp_dropout=0.2, in_channels=3, classes=1, activation=None, upsampling=8, aux_params=None)[source]¶

PSPNet is a fully convolutional neural network for image semantic segmentation. It consists of an encoder and a Spatial Pyramid (decoder). The Spatial Pyramid is built on top of the encoder and does not use “fine” features (features of high spatial resolution). PSPNet can be used for multiclass segmentation of high-resolution images; however, it is not good at detecting small objects or producing accurate, pixel-level masks.

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 3

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • psp_out_channels – A number of filters in Spatial Pyramid

  • psp_use_batchnorm – If True, a BatchNorm2d layer is used between Conv2D and Activation layers. If “inplace”, InplaceABN will be used, which allows decreasing memory consumption. Available options are True, False, “inplace”

  • psp_dropout – Spatial dropout rate in [0, 1) used in Spatial Pyramid

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • upsampling – Final upsampling factor. Default is 8 to preserve input-output spatial shape identity

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

PSPNet

Return type

torch.nn.Module
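
A sketch reflecting PSPNet's shallower default encoder (encoder_depth=3) and x8 final upsampling (values are illustrative):

    import torch
    import segmentation_models_pytorch as smp

    # PSPNet: Spatial Pyramid on top of coarse encoder features, output upsampled x8
    model = smp.PSPNet(encoder_name="resnet34", psp_out_channels=512,
                       psp_dropout=0.2, classes=21)

    mask = model(torch.randn(1, 3, 384, 384))   # -> (1, 21, 384, 384)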

PAN¶

class segmentation_models_pytorch.PAN(encoder_name='resnet34', encoder_weights='imagenet', encoder_dilation=True, decoder_channels=32, in_channels=3, classes=1, activation=None, upsampling=4, aux_params=None)[source]¶

Implementation of PAN (Pyramid Attention Network).

Note

Currently works with input tensors of shape >= [B x C x 128 x 128] for pytorch <= 1.1.0 and of shape >= [B x C x 256 x 256] for pytorch == 1.3.1

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • encoder_dilation – Flag to use dilation in the last encoder layer. Doesn’t work with *ception*, vgg*, densenet* backbones. Default is True

  • decoder_channels – A number of convolution layer filters in decoder blocks

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • upsampling – Final upsampling factor. Default is 4 to preserve input-output spatial shape identity

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

PAN

Return type

torch.nn.Module
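
A sketch respecting the input-size note above (values are illustrative):

    import torch
    import segmentation_models_pytorch as smp

    # PAN with the default dilated last encoder stage
    model = smp.PAN(encoder_name="resnet34", decoder_channels=32, classes=2)

    # Per the note above, use inputs of at least 256 x 256 on recent PyTorch versions
    mask = model(torch.randn(1, 3, 256, 256))   # -> (1, 2, 256, 256)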

DeepLabV3¶

class segmentation_models_pytorch.DeepLabV3(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_channels=256, in_channels=3, classes=1, activation=None, upsampling=8, aux_params=None)[source]¶

DeepLabV3 implementation from “Rethinking Atrous Convolution for Semantic Image Segmentation”

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • decoder_channels – A number of convolution filters in ASPP module. Default is 256

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • upsampling – Final upsampling factor. Default is 8 to preserve input-output spatial shape identity

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

DeepLabV3

Return type

torch.nn.Module
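
A minimal sketch (values are illustrative):

    import torch
    import segmentation_models_pytorch as smp

    # DeepLabV3 with the ASPP decoder; output is upsampled x8 to match the input size
    model = smp.DeepLabV3(encoder_name="resnet34", decoder_channels=256, classes=4)

    mask = model(torch.randn(1, 3, 256, 256))   # -> (1, 4, 256, 256)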

DeepLabV3+¶

class segmentation_models_pytorch.DeepLabV3Plus(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', encoder_output_stride=16, decoder_channels=256, decoder_atrous_rates=(12, 24, 36), in_channels=3, classes=1, activation=None, upsampling=4, aux_params=None)[source]¶

DeepLabV3+ implementation from “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation”

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • encoder_output_stride – Downsampling factor for last encoder features (see original paper for explanation)

  • decoder_atrous_rates – Dilation rates for ASPP module (should be a tuple of 3 integer values)

  • decoder_channels – A number of convolution filters in ASPP module. Default is 256

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • upsampling – Final upsampling factor. Default is 4 to preserve input-output spatial shape identity

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

DeepLabV3Plus

Return type

torch.nn.Module

Reference:

https://arxiv.org/abs/1802.02611v3
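
A sketch with an explicit output stride and ASPP dilation rates, matching the defaults shown in the signature (values are illustrative):

    import torch
    import segmentation_models_pytorch as smp

    # DeepLabV3+ with output stride 16 and the default ASPP dilation rates
    model = smp.DeepLabV3Plus(
        encoder_name="resnet34",
        encoder_output_stride=16,
        decoder_atrous_rates=(12, 24, 36),
        classes=2,
    )

    mask = model(torch.randn(1, 3, 256, 256))   # -> (1, 2, 256, 256)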