📦 Segmentation Models¶

Unet¶

class segmentation_models_pytorch.Unet(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_use_batchnorm=True, decoder_channels=(256, 128, 64, 32, 16), decoder_attention_type=None, in_channels=3, classes=1, activation=None, aux_params=None)[source]¶

Unet is a fully convolutional neural network for image semantic segmentation. It consists of encoder and decoder parts connected with skip connections. The encoder extracts features of different spatial resolutions (skip connections), which are used by the decoder to produce an accurate segmentation mask. Concatenation is used to fuse decoder blocks with skip connections.

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • decoder_channels – List of integers specifying the in_channels parameter for convolutions used in the decoder. The length of the list should be the same as encoder_depth

  • decoder_use_batchnorm – If True, a BatchNorm2d layer is used between Conv2D and Activation layers. If “inplace”, InplaceABN will be used, which allows decreasing memory consumption. Available options are True, False, “inplace”

  • decoder_attention_type – Attention module used in decoder of the model. Available options are None and scse. SCSE paper - https://arxiv.org/abs/1808.08127

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

Unet

Return type

torch.nn.Module
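
A minimal usage sketch (the encoder name, class count, and input size are illustrative; spatial dimensions divisible by 32 are assumed for the default encoder_depth=5):

    import torch
    import segmentation_models_pytorch as smp

    # Unet with an ImageNet-pretrained ResNet-34 encoder and 2 output classes
    model = smp.Unet(encoder_name="resnet34", encoder_weights="imagenet",
                     in_channels=3, classes=2)

    x = torch.randn(1, 3, 256, 256)   # (N, C, H, W); H and W divisible by 32
    mask = model(x)                   # -> (1, 2, 256, 256), raw logits since activation=None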

Unet++¶

class segmentation_models_pytorch.UnetPlusPlus(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_use_batchnorm=True, decoder_channels=(256, 128, 64, 32, 16), decoder_attention_type=None, in_channels=3, classes=1, activation=None, aux_params=None)[source]¶

Unet++ is a fully convolutional neural network for image semantic segmentation. It consists of encoder and decoder parts connected with skip connections. The encoder extracts features of different spatial resolutions (skip connections), which are used by the decoder to produce an accurate segmentation mask. The decoder of Unet++ is more complex than in the usual Unet.

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • decoder_channels – List of integers specifying the in_channels parameter for convolutions used in the decoder. The length of the list should be the same as encoder_depth

  • decoder_use_batchnorm – If True, a BatchNorm2d layer is used between Conv2D and Activation layers. If “inplace”, InplaceABN will be used, which allows decreasing memory consumption. Available options are True, False, “inplace”

  • decoder_attention_type – Attention module used in decoder of the model. Available options are None and scse. SCSE paper - https://arxiv.org/abs/1808.08127

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

Unet++

Return type

torch.nn.Module

Reference:

https://arxiv.org/abs/1807.10165
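
A sketch showing the decoder attention option together with an auxiliary classification head (parameter values are illustrative; with aux_params set, the model returns a (mask, label) pair):

    import torch
    import segmentation_models_pytorch as smp

    # Unet++ with SCSE attention in the decoder and an auxiliary classification head
    model = smp.UnetPlusPlus(
        encoder_name="resnet34",
        decoder_attention_type="scse",
        classes=1,
        aux_params={"classes": 4, "pooling": "avg", "dropout": 0.2, "activation": None},
    )

    x = torch.randn(2, 3, 224, 224)
    mask, label = model(x)            # mask: (2, 1, 224, 224), label: (2, 4)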

MAnet¶

class segmentation_models_pytorch.MAnet(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_use_batchnorm=True, decoder_channels=(256, 128, 64, 32, 16), decoder_pab_channels=64, in_channels=3, classes=1, activation=None, aux_params=None)[source]¶

MAnet: Multi-scale Attention Net. MA-Net can capture rich contextual dependencies based on the attention mechanism, using two blocks:

  • Position-wise Attention Block (PAB), which captures the spatial dependencies between pixels in a global view

  • Multi-scale Fusion Attention Block (MFAB), which captures the channel dependencies between any feature map by multi-scale semantic feature fusion

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • decoder_channels – List of integers specifying the in_channels parameter for convolutions used in the decoder. The length of the list should be the same as encoder_depth

  • decoder_use_batchnorm – If True, a BatchNorm2d layer is used between Conv2D and Activation layers. If “inplace”, InplaceABN will be used, which allows decreasing memory consumption. Available options are True, False, “inplace”

  • decoder_pab_channels – A number of channels for PAB module in decoder. Default is 64.

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

MAnet

Return type

torch.nn.Module
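
A short sketch with a non-default PAB width (values are illustrative):

    import torch
    import segmentation_models_pytorch as smp

    # MAnet with a wider Position-wise Attention Block in the decoder
    model = smp.MAnet(encoder_name="resnet34", decoder_pab_channels=128,
                      in_channels=3, classes=3)

    mask = model(torch.randn(1, 3, 256, 256))   # -> (1, 3, 256, 256)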

Linknet¶

class segmentation_models_pytorch.Linknet(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_use_batchnorm=True, in_channels=3, classes=1, activation=None, aux_params=None)[source]¶

Linknet is a fully convolutional neural network for image semantic segmentation. It consists of encoder and decoder parts connected with skip connections. The encoder extracts features of different spatial resolutions (skip connections), which are used by the decoder to produce an accurate segmentation mask. Summation is used to fuse decoder blocks with skip connections.

Note

This implementation has 4 skip connections by default (the original has 3).

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • decoder_use_batchnorm – If True, a BatchNorm2d layer is used between Conv2D and Activation layers. If “inplace”, InplaceABN will be used, which allows decreasing memory consumption. Available options are True, False, “inplace”

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

Linknet

Return type

torch.nn.Module
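
A sketch with a single-channel input (values are illustrative; pretrained weights are disabled here since the input is not RGB):

    import torch
    import segmentation_models_pytorch as smp

    # Linknet for single-channel (e.g. grayscale) input; skip connections are fused by summation
    model = smp.Linknet(encoder_name="resnet34", encoder_weights=None,
                        in_channels=1, classes=1)

    mask = model(torch.randn(1, 1, 256, 256))   # -> (1, 1, 256, 256)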

FPN¶

class segmentation_models_pytorch.FPN(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_pyramid_channels=256, decoder_segmentation_channels=128, decoder_merge_policy='add', decoder_dropout=0.2, in_channels=3, classes=1, activation=None, upsampling=4, aux_params=None)[source]¶

FPN is a fully convolutional neural network for image semantic segmentation.

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • decoder_pyramid_channels – A number of convolution filters in Feature Pyramid of FPN

  • decoder_segmentation_channels – A number of convolution filters in segmentation blocks of FPN

  • decoder_merge_policy – Determines how to merge pyramid features inside FPN. Available options are add and cat

  • decoder_dropout – Spatial dropout rate in range (0, 1) for feature pyramid in FPN

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • upsampling – Final upsampling factor. Default is 4 to preserve input-output spatial shape identity

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

FPN

Return type

torch.nn.Module
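
A sketch using the cat merge policy instead of the default add (values are illustrative):

    import torch
    import segmentation_models_pytorch as smp

    # FPN that concatenates (rather than adds) pyramid features before the segmentation blocks
    model = smp.FPN(encoder_name="resnet34", decoder_merge_policy="cat",
                    decoder_dropout=0.2, classes=5)

    mask = model(torch.randn(1, 3, 256, 256))   # -> (1, 5, 256, 256), upsampled x4 to the input size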

PSPNet¶

class segmentation_models_pytorch.PSPNet(encoder_name='resnet34', encoder_weights='imagenet', encoder_depth=3, psp_out_channels=512, psp_use_batchnorm=True, psp_dropout=0.2, in_channels=3, classes=1, activation=None, upsampling=8, aux_params=None)[source]¶

PSPNet is a fully convolutional neural network for image semantic segmentation. It consists of an encoder and a Spatial Pyramid (decoder). The Spatial Pyramid is built on top of the encoder and does not use “fine” features (features of high spatial resolution). PSPNet can be used for multiclass segmentation of high-resolution images; however, it is not good at detecting small objects or producing accurate, pixel-level masks.

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 3

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • psp_out_channels – A number of filters in Spatial Pyramid

  • psp_use_batchnorm – If True, a BatchNorm2d layer is used between Conv2D and Activation layers. If “inplace”, InplaceABN will be used, which allows decreasing memory consumption. Available options are True, False, “inplace”

  • psp_dropout – Spatial dropout rate in [0, 1) used in Spatial Pyramid

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • upsampling – Final upsampling factor. Default is 8 to preserve input-output spatial shape identity

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

PSPNet

Return type

torch.nn.Module
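
A sketch reflecting PSPNet's shallower default encoder (encoder_depth=3) and x8 final upsampling (values are illustrative):

    import torch
    import segmentation_models_pytorch as smp

    # PSPNet: Spatial Pyramid on top of coarse encoder features, output upsampled x8
    model = smp.PSPNet(encoder_name="resnet34", psp_out_channels=512,
                       psp_dropout=0.2, classes=21)

    mask = model(torch.randn(1, 3, 384, 384))   # -> (1, 21, 384, 384)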

PAN¶

class segmentation_models_pytorch.PAN(encoder_name='resnet34', encoder_weights='imagenet', encoder_dilation=True, decoder_channels=32, in_channels=3, classes=1, activation=None, upsampling=4, aux_params=None)[source]¶

Implementation of PAN (Pyramid Attention Network).

Note

Currently works with input tensors of shape >= [B x C x 128 x 128] for pytorch <= 1.1.0 and of shape >= [B x C x 256 x 256] for pytorch == 1.3.1

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • encoder_dilation – Flag to use dilation in the last encoder layer. Doesn’t work with *ception*, vgg*, densenet* backbones. Default is True

  • decoder_channels – A number of convolution layer filters in decoder blocks

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • upsampling – Final upsampling factor. Default is 4 to preserve input-output spatial shape identity

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

PAN

Return type

torch.nn.Module
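
A sketch respecting the input-size note above (values are illustrative):

    import torch
    import segmentation_models_pytorch as smp

    # PAN with the default dilated last encoder stage
    model = smp.PAN(encoder_name="resnet34", decoder_channels=32, classes=2)

    # Per the note above, use inputs of at least 256 x 256 on recent PyTorch versions
    mask = model(torch.randn(1, 3, 256, 256))   # -> (1, 2, 256, 256)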

DeepLabV3¶

class segmentation_models_pytorch.DeepLabV3(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_channels=256, in_channels=3, classes=1, activation=None, upsampling=8, aux_params=None)[source]¶

DeepLabV3 implementation from “Rethinking Atrous Convolution for Semantic Image Segmentation”

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • decoder_channels – A number of convolution filters in ASPP module. Default is 256

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • upsampling – Final upsampling factor. Default is 8 to preserve input-output spatial shape identity

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

DeepLabV3

Return type

torch.nn.Module
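
A minimal sketch (values are illustrative):

    import torch
    import segmentation_models_pytorch as smp

    # DeepLabV3 with the ASPP decoder; output is upsampled x8 to match the input size
    model = smp.DeepLabV3(encoder_name="resnet34", decoder_channels=256, classes=4)

    mask = model(torch.randn(1, 3, 256, 256))   # -> (1, 4, 256, 256)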

DeepLabV3+¶

class segmentation_models_pytorch.DeepLabV3Plus(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', encoder_output_stride=16, decoder_channels=256, decoder_atrous_rates=(12, 24, 36), in_channels=3, classes=1, activation=None, upsampling=4, aux_params=None)[source]¶

DeepLabV3+ implementation from “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation”

Parameters
  • encoder_name – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution

  • encoder_depth – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g., for depth 0 we will have features with shapes [(N, C, H, W)], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)], and so on). Default is 5

  • encoder_weights – One of None (random initialization), “imagenet” (pre-training on ImageNet) and other pretrained weights (see table with available weights for each encoder_name)

  • encoder_output_stride – Downsampling factor for last encoder features (see original paper for explanation)

  • decoder_atrous_rates – Dilation rates for ASPP module (should be a tuple of 3 integer values)

  • decoder_channels – A number of convolution filters in ASPP module. Default is 256

  • in_channels – A number of input channels for the model, default is 3 (RGB images)

  • classes – A number of classes for output mask (or you can think as a number of channels of output mask)

  • activation – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None

  • upsampling – Final upsampling factor. Default is 4 to preserve input-output spatial shape identity

  • aux_params –

    Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:

    • classes (int): A number of classes

    • pooling (str): One of “max”, “avg”. Default is “avg”

    • dropout (float): Dropout factor in [0, 1)

    • activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits)

Returns

DeepLabV3Plus

Return type

torch.nn.Module

Reference:

https://arxiv.org/abs/1802.02611v3
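
A sketch with an explicit output stride and ASPP dilation rates, matching the defaults shown in the signature (values are illustrative):

    import torch
    import segmentation_models_pytorch as smp

    # DeepLabV3+ with output stride 16 and the default ASPP dilation rates
    model = smp.DeepLabV3Plus(
        encoder_name="resnet34",
        encoder_output_stride=16,
        decoder_atrous_rates=(12, 24, 36),
        classes=2,
    )

    mask = model(torch.randn(1, 3, 256, 256))   # -> (1, 2, 256, 256)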