🕸️ Segmentation Models#
Unet#
- class segmentation_models_pytorch.Unet(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_use_norm='batchnorm', decoder_channels=(256, 128, 64, 32, 16), decoder_attention_type=None, decoder_interpolation='nearest', in_channels=3, classes=1, activation=None, aux_params=None, **kwargs)[source]#
U-Net is a fully convolutional neural network architecture designed for semantic image segmentation.
It consists of two main parts:
- An encoder (downsampling path) that extracts increasingly abstract features
- A decoder (upsampling path) that gradually recovers spatial details
The key is the use of skip connections between corresponding encoder and decoder layers. These connections allow the decoder to access fine-grained details from earlier encoder layers, which helps produce more precise segmentation masks.
The skip connections work by concatenating feature maps from the encoder directly into the decoder at corresponding resolutions. This helps preserve important spatial information that would otherwise be lost during the encoding process.
- Parameters:
encoder_name (str) – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution
encoder_depth (int) – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g. for depth 0 we will have features with shapes [(N, C, H, W),], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)] and so on). Default is 5
encoder_weights (str | None) – One of None (random initialization), “imagenet” (pre-training on ImageNet), or other pretrained weights (see the table of available weights for each encoder_name)
decoder_channels (Sequence[int]) – List of integers which specify in_channels parameter for convolutions used in decoder. Length of the list should be the same as encoder_depth
decoder_use_norm (bool | str | Dict[str, Any]) –
Specifies normalization between Conv2D and activation. Accepts the following types:
- True: defaults to “batchnorm”.
- False: no normalization (nn.Identity).
- str: a normalization type using default parameters. Available values: “batchnorm”, “identity”, “layernorm”, “instancenorm”, “inplace”.
- dict: fully customizable normalization settings with the structure `{"type": <norm_type>, **kwargs}`, where "type" is one of the values above and the remaining kwargs are passed directly to the normalization layer as defined in the PyTorch documentation.
Example: `decoder_use_norm={"type": "layernorm", "eps": 1e-2}`
decoder_attention_type (str | None) – Attention module used in decoder of the model. Available options are None and scse (https://arxiv.org/abs/1808.08127).
decoder_interpolation (str) – Interpolation mode used in decoder of the model. Available options are “nearest”, “bilinear”, “bicubic”, “area”, “nearest-exact”. Default is “nearest”.
in_channels (int) – A number of input channels for the model, default is 3 (RGB images)
classes (int) – A number of classes for output mask (or you can think as a number of channels of output mask)
activation (str | Callable | None) – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None.
aux_params (dict | None) –
Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:
classes (int): A number of classes
pooling (str): One of “max”, “avg”. Default is “avg”
dropout (float): Dropout factor in [0, 1)
activation (str): An activation function to apply, “sigmoid”/“softmax” (could be None to return logits)
kwargs (dict[str, Any]) – Arguments passed to the encoder class __init__() function. Applies only to timm models. Keys with None values are pruned before passing.
- Returns:
Unet
- Return type:
torch.nn.Module
Example
```python
import torch
import segmentation_models_pytorch as smp

model = smp.Unet("resnet18", encoder_weights="imagenet", classes=5)
model.eval()

# generate random images
images = torch.rand(2, 3, 256, 256)

with torch.inference_mode():
    mask = model(images)

print(mask.shape)
# torch.Size([2, 5, 256, 256])
```
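When aux_params is provided, the model is expected to return a (mask, label) pair instead of a single mask. A minimal sketch of the auxiliary classification head described above; the 4-class setup and dropout value are illustrative:

```python
import torch
import segmentation_models_pytorch as smp

# illustrative 4-class model with an auxiliary classification head
model = smp.Unet(
    "resnet18",
    encoder_weights="imagenet",
    classes=4,
    aux_params={"classes": 4, "pooling": "avg", "dropout": 0.2, "activation": None},
)
model.eval()

images = torch.rand(2, 3, 256, 256)
with torch.inference_mode():
    mask, label = model(images)  # two outputs when aux_params is set

print(mask.shape, label.shape)
# expected: torch.Size([2, 4, 256, 256]) torch.Size([2, 4])
```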
Unet++#
- class segmentation_models_pytorch.UnetPlusPlus(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_use_norm='batchnorm', decoder_channels=(256, 128, 64, 32, 16), decoder_attention_type=None, decoder_interpolation='nearest', in_channels=3, classes=1, activation=None, aux_params=None, **kwargs)[source]#
Unet++ is a fully convolutional neural network for semantic image segmentation. It consists of encoder and decoder parts connected by skip connections. The encoder extracts features of different spatial resolutions (skip connections), which the decoder uses to produce an accurate segmentation mask. The decoder of Unet++ is more complex than that of a plain Unet.
- Parameters:
encoder_name (str) – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution
encoder_depth (int) – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g. for depth 0 we will have features with shapes [(N, C, H, W),], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)] and so on). Default is 5
encoder_weights (str | None) – One of None (random initialization), “imagenet” (pre-training on ImageNet), or other pretrained weights (see the table of available weights for each encoder_name)
decoder_channels (Sequence[int]) – List of integers which specify in_channels parameter for convolutions used in decoder. Length of the list should be the same as encoder_depth
decoder_use_norm (bool | str | Dict[str, Any]) –
Specifies normalization between Conv2D and activation. Accepts the following types:
- True: defaults to “batchnorm”.
- False: no normalization (nn.Identity).
- str: a normalization type using default parameters. Available values: “batchnorm”, “identity”, “layernorm”, “instancenorm”, “inplace”.
- dict: fully customizable normalization settings with the structure `{"type": <norm_type>, **kwargs}`, where "type" is one of the values above and the remaining kwargs are passed directly to the normalization layer as defined in the PyTorch documentation.
Example: `decoder_use_norm={"type": "layernorm", "eps": 1e-2}`
decoder_attention_type (str | None) – Attention module used in decoder of the model. Available options are None and scse (https://arxiv.org/abs/1808.08127).
decoder_interpolation (str) – Interpolation mode used in decoder of the model. Available options are “nearest”, “bilinear”, “bicubic”, “area”, “nearest-exact”. Default is “nearest”.
in_channels (int) – A number of input channels for the model, default is 3 (RGB images)
classes (int) – A number of classes for output mask (or you can think as a number of channels of output mask)
activation (str | Callable | None) – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None.
aux_params (dict | None) –
Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:
classes (int): A number of classes
pooling (str): One of “max”, “avg”. Default is “avg”
dropout (float): Dropout factor in [0, 1)
activation (str): An activation function to apply, “sigmoid”/“softmax” (could be None to return logits)
kwargs (dict[str, Any]) – Arguments passed to the encoder class __init__() function. Applies only to timm models. Keys with None values are pruned before passing.
- Returns:
Unet++
- Return type:
torch.nn.Module
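Example

A minimal usage sketch in the style of the Unet example above; the resnet18 encoder, the scse attention option, and the shapes are illustrative choices:

```python
import torch
import segmentation_models_pytorch as smp

model = smp.UnetPlusPlus(
    "resnet18", encoder_weights="imagenet", classes=5, decoder_attention_type="scse"
)
model.eval()

images = torch.rand(2, 3, 256, 256)
with torch.inference_mode():
    mask = model(images)

print(mask.shape)
# expected: torch.Size([2, 5, 256, 256])
```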
FPN#
- class segmentation_models_pytorch.FPN(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_pyramid_channels=256, decoder_segmentation_channels=128, decoder_merge_policy='add', decoder_dropout=0.2, decoder_interpolation='nearest', in_channels=3, classes=1, activation=None, upsampling=4, aux_params=None, **kwargs)[source]#
FPN is a fully convolutional neural network for semantic image segmentation.
- Parameters:
encoder_name (str) – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution
encoder_depth (int) – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g. for depth 0 we will have features with shapes [(N, C, H, W),], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)] and so on). Default is 5
encoder_weights (str | None) – One of None (random initialization), “imagenet” (pre-training on ImageNet), or other pretrained weights (see the table of available weights for each encoder_name)
decoder_pyramid_channels (int) – A number of convolution filters in the Feature Pyramid of FPN
decoder_segmentation_channels (int) – A number of convolution filters in the segmentation blocks of FPN
decoder_merge_policy (str) – Determines how to merge pyramid features inside FPN. Available options are add and cat
decoder_dropout (float) – Spatial dropout rate in range (0, 1) for the feature pyramid in FPN
decoder_interpolation (str) – Interpolation mode used in decoder of the model. Available options are “nearest”, “bilinear”, “bicubic”, “area”, “nearest-exact”. Default is “nearest”.
in_channels (int) – A number of input channels for the model, default is 3 (RGB images)
classes (int) – A number of classes for output mask (or you can think as a number of channels of output mask)
activation (str | None) – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None.
upsampling (int) – Final upsampling factor. Default is 4 to preserve input-output spatial shape identity
aux_params (dict | None) –
Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:
classes (int): A number of classes
pooling (str): One of “max”, “avg”. Default is “avg”
dropout (float): Dropout factor in [0, 1)
activation (str): An activation function to apply, “sigmoid”/“softmax” (could be None to return logits)
kwargs (dict[str, Any]) – Arguments passed to the encoder class __init__() function. Applies only to timm models. Keys with None values are pruned before passing.
- Returns:
FPN
- Return type:
torch.nn.Module
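Example

A minimal sketch in the style of the Unet example above; decoder_merge_policy="cat" is shown only to illustrate the non-default merge option:

```python
import torch
import segmentation_models_pytorch as smp

# "cat" concatenates pyramid features instead of the default "add"
model = smp.FPN("resnet18", encoder_weights="imagenet", classes=5, decoder_merge_policy="cat")
model.eval()

images = torch.rand(2, 3, 256, 256)
with torch.inference_mode():
    mask = model(images)

print(mask.shape)
# expected: torch.Size([2, 5, 256, 256])
```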
PSPNet#
- class segmentation_models_pytorch.PSPNet(encoder_name='resnet34', encoder_weights='imagenet', encoder_depth=3, psp_out_channels=512, decoder_use_norm='batchnorm', psp_dropout=0.2, in_channels=3, classes=1, activation=None, upsampling=8, aux_params=None, **kwargs)[source]#
PSPNet is a fully convolutional neural network for semantic image segmentation. It consists of an encoder and a Spatial Pyramid (decoder). The Spatial Pyramid is built on top of the encoder and does not use “fine” features (features of high spatial resolution). PSPNet can be used for multiclass segmentation of high-resolution images; however, it is not well suited to detecting small objects or producing accurate pixel-level masks.
- Parameters:
encoder_name (str) – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution
encoder_depth (int) – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g. for depth 0 we will have features with shapes [(N, C, H, W),], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)] and so on). Default is 3
encoder_weights (str | None) – One of None (random initialization), “imagenet” (pre-training on ImageNet), or other pretrained weights (see the table of available weights for each encoder_name)
psp_out_channels (int) – A number of filters in Spatial Pyramid
decoder_use_norm (bool | str | Dict[str, Any]) –
Specifies normalization between Conv2D and activation. Accepts the following types:
- True: defaults to “batchnorm”.
- False: no normalization (nn.Identity).
- str: a normalization type using default parameters. Available values: “batchnorm”, “identity”, “layernorm”, “instancenorm”, “inplace”.
- dict: fully customizable normalization settings with the structure `{"type": <norm_type>, **kwargs}`, where "type" is one of the values above and the remaining kwargs are passed directly to the normalization layer as defined in the PyTorch documentation.
Example: `decoder_use_norm={"type": "layernorm", "eps": 1e-2}`
psp_dropout (float) – Spatial dropout rate in [0, 1) used in Spatial Pyramid
in_channels (int) – A number of input channels for the model, default is 3 (RGB images)
classes (int) – A number of classes for output mask (or you can think as a number of channels of output mask)
activation (str | Callable | None) – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None.
upsampling (int) – Final upsampling factor. Default is 8 to preserve input-output spatial shape identity
aux_params (dict | None) –
Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:
classes (int): A number of classes
pooling (str): One of “max”, “avg”. Default is “avg”
dropout (float): Dropout factor in [0, 1)
activation (str): An activation function to apply, “sigmoid”/“softmax” (could be None to return logits)
kwargs (dict[str, Any]) – Arguments passed to the encoder class __init__() function. Applies only to timm models. Keys with None values are pruned before passing.
- Returns:
PSPNet
- Return type:
torch.nn.Module
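Example

A minimal sketch; the encoder and shapes are illustrative. With the default encoder_depth=3 and upsampling=8, the output is restored to the input resolution:

```python
import torch
import segmentation_models_pytorch as smp

# default encoder_depth=3 and upsampling=8 restore the input resolution
model = smp.PSPNet("resnet18", encoder_weights="imagenet", classes=5)
model.eval()

images = torch.rand(2, 3, 256, 256)
with torch.inference_mode():
    mask = model(images)

print(mask.shape)
# expected: torch.Size([2, 5, 256, 256])
```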
DeepLabV3#
- class segmentation_models_pytorch.DeepLabV3(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', encoder_output_stride=8, decoder_channels=256, decoder_atrous_rates=(12, 24, 36), decoder_aspp_separable=False, decoder_aspp_dropout=0.5, in_channels=3, classes=1, activation=None, upsampling=None, aux_params=None, **kwargs)[source]#
DeepLabV3 implementation from “Rethinking Atrous Convolution for Semantic Image Segmentation”
- Parameters:
encoder_name (str) – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution
encoder_depth (int) – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g. for depth 0 we will have features with shapes [(N, C, H, W),], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)] and so on). Default is 5
encoder_weights (str | None) – One of None (random initialization), “imagenet” (pre-training on ImageNet), or other pretrained weights (see the table of available weights for each encoder_name)
decoder_channels (int) – A number of convolution filters in ASPP module. Default is 256
encoder_output_stride (Literal[8, 16]) – Downsampling factor for last encoder features (see original paper for explanation)
decoder_atrous_rates (Iterable[int]) – Dilation rates for ASPP module (should be an iterable of 3 integer values)
decoder_aspp_separable (bool) – Use separable convolutions in ASPP module. Default is False
decoder_aspp_dropout (float) – Dropout rate used in the ASPP module projection layer. Default is 0.5
in_channels (int) – A number of input channels for the model, default is 3 (RGB images)
classes (int) – A number of classes for output mask (or you can think as a number of channels of output mask)
activation (str | None) – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None.
upsampling (int | None) – Final upsampling factor. Default is None to preserve input-output spatial shape identity
aux_params (dict | None) –
Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:
classes (int): A number of classes
pooling (str): One of “max”, “avg”. Default is “avg”
dropout (float): Dropout factor in [0, 1)
activation (str): An activation function to apply, “sigmoid”/“softmax” (could be None to return logits)
kwargs (dict[str, Any]) – Arguments passed to the encoder class __init__() function. Applies only to timm models. Keys with None values are pruned before passing.
- Returns:
DeepLabV3
- Return type:
torch.nn.Module
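Example

A minimal sketch; encoder_output_stride=16 is shown only to illustrate the non-default stride, and the encoder and shapes are illustrative:

```python
import torch
import segmentation_models_pytorch as smp

# output stride 16 trades some accuracy for speed versus the default of 8
model = smp.DeepLabV3("resnet18", encoder_weights="imagenet", classes=5, encoder_output_stride=16)
model.eval()

images = torch.rand(2, 3, 256, 256)
with torch.inference_mode():
    mask = model(images)

print(mask.shape)
# expected: torch.Size([2, 5, 256, 256])
```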
DeepLabV3+#
- class segmentation_models_pytorch.DeepLabV3Plus(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', encoder_output_stride=16, decoder_channels=256, decoder_atrous_rates=(12, 24, 36), decoder_aspp_separable=True, decoder_aspp_dropout=0.5, in_channels=3, classes=1, activation=None, upsampling=4, aux_params=None, **kwargs)[source]#
DeepLabV3+ implementation from “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation”
- Parameters:
encoder_name (str) – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution
encoder_depth (Literal[3, 4, 5]) – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g. for depth 0 we will have features with shapes [(N, C, H, W),], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)] and so on). Default is 5
encoder_weights (str | None) – One of None (random initialization), “imagenet” (pre-training on ImageNet), or other pretrained weights (see the table of available weights for each encoder_name)
encoder_output_stride (Literal[8, 16]) – Downsampling factor for last encoder features (see original paper for explanation)
decoder_atrous_rates (Iterable[int]) – Dilation rates for ASPP module (should be an iterable of 3 integer values)
decoder_aspp_separable (bool) – Use separable convolutions in ASPP module. Default is True
decoder_aspp_dropout (float) – Dropout rate used in the ASPP module projection layer. Default is 0.5
decoder_channels (int) – A number of convolution filters in ASPP module. Default is 256
in_channels (int) – A number of input channels for the model, default is 3 (RGB images)
classes (int) – A number of classes for output mask (or you can think as a number of channels of output mask)
activation (str | None) – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None.
upsampling (int) – Final upsampling factor. Default is 4 to preserve input-output spatial shape identity.
aux_params (dict | None) –
Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:
classes (int): A number of classes
pooling (str): One of “max”, “avg”. Default is “avg”
dropout (float): Dropout factor in [0, 1)
activation (str): An activation function to apply, “sigmoid”/“softmax” (could be None to return logits)
kwargs (dict[str, Any]) – Arguments passed to the encoder class __init__() function. Applies only to timm models. Keys with None values are pruned before passing.
- Returns:
DeepLabV3Plus
- Return type:
torch.nn.Module
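Example

A minimal sketch with default decoder settings; the encoder and shapes are illustrative:

```python
import torch
import segmentation_models_pytorch as smp

model = smp.DeepLabV3Plus("resnet18", encoder_weights="imagenet", classes=5)
model.eval()

images = torch.rand(2, 3, 256, 256)
with torch.inference_mode():
    mask = model(images)

print(mask.shape)
# expected: torch.Size([2, 5, 256, 256])
```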
Linknet#
- class segmentation_models_pytorch.Linknet(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_use_norm='batchnorm', in_channels=3, classes=1, activation=None, aux_params=None, **kwargs)[source]#
Linknet is a fully convolutional neural network for semantic image segmentation. It consists of encoder and decoder parts connected by skip connections. The encoder extracts features of different spatial resolutions (skip connections), which the decoder uses to produce an accurate segmentation mask. Summation is used to fuse decoder blocks with skip connections.
Note
This implementation has 4 skip connections by default (the original has 3).
- Parameters:
encoder_name (str) – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution
encoder_depth (int) – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g. for depth 0 we will have features with shapes [(N, C, H, W),], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)] and so on). Default is 5
encoder_weights (str | None) – One of None (random initialization), “imagenet” (pre-training on ImageNet), or other pretrained weights (see the table of available weights for each encoder_name)
decoder_use_norm (bool | str | Dict[str, Any]) –
Specifies normalization between Conv2D and activation. Accepts the following types:
- True: defaults to “batchnorm”.
- False: no normalization (nn.Identity).
- str: a normalization type using default parameters. Available values: “batchnorm”, “identity”, “layernorm”, “instancenorm”, “inplace”.
- dict: fully customizable normalization settings with the structure `{"type": <norm_type>, **kwargs}`, where "type" is one of the values above and the remaining kwargs are passed directly to the normalization layer as defined in the PyTorch documentation.
Example: `decoder_use_norm={"type": "layernorm", "eps": 1e-2}`
in_channels (int) – A number of input channels for the model, default is 3 (RGB images)
classes (int) – A number of classes for output mask (or you can think as a number of channels of output mask)
activation (str | Callable | None) – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None.
aux_params (dict | None) –
Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:
classes (int): A number of classes
pooling (str): One of “max”, “avg”. Default is “avg”
dropout (float): Dropout factor in [0, 1)
activation (str): An activation function to apply, “sigmoid”/“softmax” (could be None to return logits)
kwargs (dict[str, Any]) – Arguments passed to the encoder class __init__() function. Applies only to timm models. Keys with None values are pruned before passing.
- Returns:
Linknet
- Return type:
torch.nn.Module
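Example

A minimal sketch; the encoder and shapes are illustrative:

```python
import torch
import segmentation_models_pytorch as smp

model = smp.Linknet("resnet18", encoder_weights="imagenet", classes=5)
model.eval()

images = torch.rand(2, 3, 256, 256)
with torch.inference_mode():
    mask = model(images)

print(mask.shape)
# expected: torch.Size([2, 5, 256, 256])
```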
MAnet#
- class segmentation_models_pytorch.MAnet(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_use_norm='batchnorm', decoder_channels=(256, 128, 64, 32, 16), decoder_pab_channels=64, decoder_interpolation='nearest', in_channels=3, classes=1, activation=None, aux_params=None, **kwargs)[source]#
MAnet: Multi-scale Attention Net. MA-Net can capture rich contextual dependencies based on the attention mechanism, using two blocks:
- Position-wise Attention Block (PAB), which captures the spatial dependencies between pixels in a global view
- Multi-scale Fusion Attention Block (MFAB), which captures the channel dependencies between any feature map by multi-scale semantic feature fusion
- Parameters:
encoder_name (str) – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution
encoder_depth (int) – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g. for depth 0 we will have features with shapes [(N, C, H, W),], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)] and so on). Default is 5
encoder_weights (str | None) – One of None (random initialization), “imagenet” (pre-training on ImageNet), or other pretrained weights (see the table of available weights for each encoder_name)
decoder_channels (Sequence[int]) – List of integers which specify in_channels parameter for convolutions used in decoder. Length of the list should be the same as encoder_depth
decoder_use_norm (bool | str | Dict[str, Any]) –
Specifies normalization between Conv2D and activation. Accepts the following types:
- True: defaults to “batchnorm”.
- False: no normalization (nn.Identity).
- str: a normalization type using default parameters. Available values: “batchnorm”, “identity”, “layernorm”, “instancenorm”, “inplace”.
- dict: fully customizable normalization settings with the structure `{"type": <norm_type>, **kwargs}`, where "type" is one of the values above and the remaining kwargs are passed directly to the normalization layer as defined in the PyTorch documentation.
Example: `decoder_use_norm={"type": "layernorm", "eps": 1e-2}`
decoder_pab_channels (int) – A number of channels for PAB module in decoder. Default is 64.
decoder_interpolation (str) – Interpolation mode used in decoder of the model. Available options are “nearest”, “bilinear”, “bicubic”, “area”, “nearest-exact”. Default is “nearest”.
in_channels (int) – A number of input channels for the model, default is 3 (RGB images)
classes (int) – A number of classes for output mask (or you can think as a number of channels of output mask)
activation (str | Callable | None) – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None.
aux_params (dict | None) –
Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:
classes (int): A number of classes
pooling (str): One of “max”, “avg”. Default is “avg”
dropout (float): Dropout factor in [0, 1)
activation (str): An activation function to apply, “sigmoid”/“softmax” (could be None to return logits)
kwargs (dict[str, Any]) – Arguments passed to the encoder class __init__() function. Applies only to timm models. Keys with None values are pruned before passing.
- Returns:
MAnet
- Return type:
torch.nn.Module
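Example

A minimal sketch; the encoder and shapes are illustrative, and decoder_pab_channels is left at its default of 64:

```python
import torch
import segmentation_models_pytorch as smp

model = smp.MAnet("resnet18", encoder_weights="imagenet", classes=5)
model.eval()

images = torch.rand(2, 3, 256, 256)
with torch.inference_mode():
    mask = model(images)

print(mask.shape)
# expected: torch.Size([2, 5, 256, 256])
```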
PAN#
- class segmentation_models_pytorch.PAN(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', encoder_output_stride=16, decoder_channels=32, decoder_interpolation='bilinear', in_channels=3, classes=1, activation=None, upsampling=4, aux_params=None, **kwargs)[source]#
Implementation of PAN (Pyramid Attention Network).
Note
Currently works with input tensors of shape >= [B x C x 128 x 128] for PyTorch <= 1.1.0 and with shape >= [B x C x 256 x 256] for PyTorch == 1.3.1
- Parameters:
encoder_name (str) – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution
encoder_depth (Literal[3, 4, 5]) – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g. for depth 0 we will have features with shapes [(N, C, H, W),], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)] and so on). Default is 5
encoder_weights (str | None) – One of None (random initialization), “imagenet” (pre-training on ImageNet), or other pretrained weights (see the table of available weights for each encoder_name)
encoder_output_stride (Literal[16, 32]) – 16 or 32; if 16, dilation is used in the last encoder layer. Does not work with *ception*, vgg*, and densenet* backbones. Default is 16.
decoder_channels (int) – A number of convolution layer filters in decoder blocks
decoder_interpolation (str) – Interpolation mode used in decoder of the model. Available options are “nearest”, “bilinear”, “bicubic”, “area”, “nearest-exact”. Default is “bilinear”.
in_channels (int) – A number of input channels for the model, default is 3 (RGB images)
classes (int) – A number of classes for output mask (or you can think as a number of channels of output mask)
activation (str | Callable | None) – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None.
upsampling (int) – Final upsampling factor. Default is 4 to preserve input-output spatial shape identity
aux_params (dict | None) –
Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:
classes (int): A number of classes
pooling (str): One of “max”, “avg”. Default is “avg”
dropout (float): Dropout factor in [0, 1)
activation (str): An activation function to apply, “sigmoid”/“softmax” (could be None to return logits)
kwargs (dict[str, Any]) – Arguments passed to the encoder class __init__() function. Applies only to timm models. Keys with None values are pruned before passing.
- Returns:
PAN
- Return type:
torch.nn.Module
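Example

A minimal sketch; per the note above, the input is at least 256 x 256, and the encoder choice is illustrative:

```python
import torch
import segmentation_models_pytorch as smp

model = smp.PAN("resnet18", encoder_weights="imagenet", classes=5)
model.eval()

# PAN expects sufficiently large inputs (see the note above)
images = torch.rand(2, 3, 256, 256)
with torch.inference_mode():
    mask = model(images)

print(mask.shape)
# expected: torch.Size([2, 5, 256, 256])
```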
UPerNet#
- class segmentation_models_pytorch.UPerNet(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_channels=256, decoder_use_norm='batchnorm', in_channels=3, classes=1, activation=None, upsampling=4, aux_params=None, **kwargs)[source]#
UPerNet is a unified perceptual parsing network for image segmentation.
- Parameters:
encoder_name (str) – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution
encoder_depth (int) – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g. for depth 0 we will have features with shapes [(N, C, H, W),], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)] and so on). Default is 5
encoder_weights (str | None) – One of None (random initialization), “imagenet” (pre-training on ImageNet), or other pretrained weights (see the table of available weights for each encoder_name)
decoder_pyramid_channels – A number of convolution filters in Feature Pyramid, default is 256
decoder_segmentation_channels – A number of convolution filters in segmentation blocks, default is 64
decoder_use_norm (bool | str | Dict[str, Any]) –
Specifies normalization between Conv2D and activation. Accepts the following types:
- True: defaults to “batchnorm”.
- False: no normalization (nn.Identity).
- str: a normalization type using default parameters. Available values: “batchnorm”, “identity”, “layernorm”, “instancenorm”, “inplace”.
- dict: fully customizable normalization settings with the structure `{"type": <norm_type>, **kwargs}`, where "type" is one of the values above and the remaining kwargs are passed directly to the normalization layer as defined in the PyTorch documentation.
Example: `decoder_use_norm={"type": "layernorm", "eps": 1e-2}`
in_channels (int) – A number of input channels for the model, default is 3 (RGB images)
classes (int) – A number of classes for output mask (or you can think as a number of channels of output mask)
activation (str | Callable | None) – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None.
aux_params (dict | None) –
Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:
classes (int): A number of classes
pooling (str): One of “max”, “avg”. Default is “avg”
dropout (float): Dropout factor in [0, 1)
activation (str): An activation function to apply, “sigmoid”/“softmax” (could be None to return logits)
kwargs (dict[str, Any]) – Arguments passed to the encoder class __init__() function. Applies only to timm models. Keys with None values are pruned before passing.
decoder_channels (int)
upsampling (int)
- Returns:
UPerNet
- Return type:
torch.nn.Module
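Example

A minimal sketch; the encoder and shapes are illustrative:

```python
import torch
import segmentation_models_pytorch as smp

model = smp.UPerNet("resnet18", encoder_weights="imagenet", classes=5)
model.eval()

images = torch.rand(2, 3, 256, 256)
with torch.inference_mode():
    mask = model(images)

print(mask.shape)
# expected: torch.Size([2, 5, 256, 256])
```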
Segformer#
- class segmentation_models_pytorch.Segformer(encoder_name='resnet34', encoder_depth=5, encoder_weights='imagenet', decoder_segmentation_channels=256, in_channels=3, classes=1, activation=None, upsampling=4, aux_params=None, **kwargs)[source]#
Segformer is a simple and efficient design for semantic segmentation with Transformers.
- Parameters:
encoder_name (str) – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution
encoder_depth (int) – A number of stages used in the encoder, in range [3, 5]. Each stage generates features two times smaller in spatial dimensions than the previous one (e.g. for depth 0 we will have features with shapes [(N, C, H, W),], for depth 1 - [(N, C, H, W), (N, C, H // 2, W // 2)] and so on). Default is 5
encoder_weights (str | None) – One of None (random initialization), “imagenet” (pre-training on ImageNet), or other pretrained weights (see the table of available weights for each encoder_name)
decoder_segmentation_channels (int) – A number of convolution filters in segmentation blocks, default is 256
in_channels (int) – A number of input channels for the model, default is 3 (RGB images)
classes (int) – A number of classes for output mask (or you can think as a number of channels of output mask)
activation (str | Callable | None) – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None.
upsampling (int) – A number to upsample the output of the model, default is 4 (same size as input)
aux_params (dict | None) –
Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:
classes (int): A number of classes
pooling (str): One of “max”, “avg”. Default is “avg”
dropout (float): Dropout factor in [0, 1)
activation (str): An activation function to apply, “sigmoid”/“softmax” (could be None to return logits)
kwargs (dict[str, Any]) – Arguments passed to the encoder class __init__() function. Applies only to timm models. Keys with None values are pruned before passing.
- Returns:
Segformer
- Return type:
torch.nn.Module
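Example

A minimal sketch; the mit_b0 Mix Transformer encoder is an illustrative backbone choice, and the shapes are illustrative:

```python
import torch
import segmentation_models_pytorch as smp

# mit_b0 is one of the Mix Transformer encoders shipped with the library
model = smp.Segformer("mit_b0", encoder_weights="imagenet", classes=5)
model.eval()

images = torch.rand(2, 3, 256, 256)
with torch.inference_mode():
    mask = model(images)

print(mask.shape)
# expected: torch.Size([2, 5, 256, 256])
```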
DPT#
Note
See full list of DPT-compatible timm encoders in DPT Encoders.
Note
For some encoders, the model requires dynamic_img_size=True to be passed in order to work with resolutions different from those the encoder was trained on.
- class segmentation_models_pytorch.DPT(encoder_name='tu-vit_base_patch16_224.augreg_in21k', encoder_depth=4, encoder_weights='imagenet', encoder_output_indices=None, decoder_readout='cat', decoder_intermediate_channels=(256, 512, 1024, 1024), decoder_fusion_channels=256, in_channels=3, classes=1, activation=None, aux_params=None, **kwargs)[source]#
DPT is a dense prediction architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
It assembles tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combines them into full-resolution predictions using a convolutional decoder.
The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions than fully convolutional networks.
Note
Since this model uses a Vision Transformer backbone, it typically requires a fixed input image size. To handle variable input sizes, you can set dynamic_img_size=True in the model initialization (if supported by the specific timm encoder). You can check whether an encoder requires a fixed size via model.encoder.is_fixed_input_size and get the required input dimensions from model.encoder.input_size; however, there is no guarantee that this information is available.
- Parameters:
encoder_name (str) – Name of the classification model that will be used as an encoder (a.k.a backbone) to extract features of different spatial resolution.
encoder_depth (int) – A number of stages used in the encoder, in range [1, 4]. Each stage generates features smaller in spatial dimensions by a factor equal to the ViT model patch_size. Default is 4.
encoder_weights (str | None) – One of None (random initialization) or a non-None value, in which case pretrained weights are loaded according to the encoder_name (e.g. for "tu-vit_base_patch16_224.augreg_in21k" the "augreg_in21k" weights would be loaded).
encoder_output_indices (list[int] | None) – The indices of the encoder output features to use. If None, indices are sampled uniformly across the number of blocks in the encoder, e.g. if encoder_depth is 4 and the encoder has 20 blocks, then encoder_output_indices will be (4, 9, 14, 19). If specified, the number of indices should equal encoder_depth. Default is None.
decoder_readout (Literal['ignore', 'add', 'cat']) – The strategy to utilize the prefix tokens (e.g. cls_token) from the encoder. Can be one of “cat”, “add”, or “ignore”. Default is “cat”.
decoder_intermediate_channels (Sequence[int]) – The number of channels for the intermediate decoder layers. Reduce if you want to reduce the number of parameters in the decoder. Default is (256, 512, 1024, 1024).
decoder_fusion_channels (int) – The latent dimension to which the encoder features will be projected to before fusion. Default is 256.
in_channels (int) – Number of input channels for the model, default is 3 (RGB images)
classes (int) – Number of classes for output mask (or you can think as a number of channels of output mask)
activation (str | Callable | None) – An activation function to apply after the final convolution layer. Available options are “sigmoid”, “softmax”, “logsoftmax”, “tanh”, “identity”, callable and None. Default is None.
aux_params (dict | None) –
Dictionary with parameters of the auxiliary output (classification head). The auxiliary output is built on top of the encoder if aux_params is not None (default is None). Supported params:
classes (int): A number of classes;
pooling (str): One of “max”, “avg”. Default is “avg”;
dropout (float): Dropout factor in [0, 1);
activation (str): An activation function to apply “sigmoid”/”softmax” (could be None to return logits).
kwargs (dict[str, Any]) – Arguments passed to the encoder class __init__() function. Applies only to timm models. Keys with None values are pruned before passing. Specify dynamic_img_size=True to allow the model to handle images of different sizes.
- Returns:
DPT
- Return type:
torch.nn.Module
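Example

A minimal sketch; encoder_weights=None keeps this a quick smoke test without downloading weights, and the 224 x 224 input matches the resolution in the default encoder name:

```python
import torch
import segmentation_models_pytorch as smp

# fixed 224x224 input for this ViT encoder; pass dynamic_img_size=True
# (if the timm encoder supports it) to use other resolutions
model = smp.DPT("tu-vit_base_patch16_224.augreg_in21k", encoder_weights=None, classes=2)
model.eval()

images = torch.rand(2, 3, 224, 224)
with torch.inference_mode():
    mask = model(images)

print(mask.shape)
# expected: torch.Size([2, 2, 224, 224])
```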