Randomly zero out entire channels (a channel is a 2D feature map,
e.g., the :math:`j`-th channel of the :math:`i`-th sample in the
batched input is a 2D tensor :math:`\text{input}[i, j]`) of the input tensor.
- Each channel will be zeroed out independently on every forward call.
- with probability :attr:`p` using samples from a Bernoulli distribution.
+ Each channel will be zeroed out independently on every forward call with
+ probability :attr:`p` using samples from a Bernoulli distribution.
See :class:`~torch.nn.Dropout2d` for details.
Randomly zero out entire channels (a channel is a 3D feature map,
e.g., the :math:`j`-th channel of the :math:`i`-th sample in the
batched input is a 3D tensor :math:`\text{input}[i, j]`) of the input tensor.
- Each channel will be zeroed out independently on every forward call.
- with probability :attr:`p` using samples from a Bernoulli distribution.
+ Each channel will be zeroed out independently on every forward call with
+ probability :attr:`p` using samples from a Bernoulli distribution.
See :class:`~torch.nn.Dropout3d` for details.
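A minimal sketch of the channel-wise behaviour described above (the tensor sizes are illustrative, not from the patch)::

    >>> import torch
    >>> import torch.nn.functional as F
    >>> x = torch.ones(1, 3, 2, 2)                   # (N, C, H, W)
    >>> out = F.dropout2d(x, p=0.5, training=True)
    >>> # each of the 3 channels is now either all zeros or all 2.0,
    >>> # i.e. zeroed with probability p or rescaled by 1/(1-p)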
def embedding_bag(input, weight, offsets=None, max_norm=None, norm_type=2,
scale_grad_by_freq=False, mode='mean', sparse=False):
# type: (Tensor, Tensor, Optional[Tensor], Optional[float], float, bool, str, bool) -> Tensor
- r"""Computes sums, means or maxes of 'bags' of embeddings, without instantiating the
+ r"""Computes sums, means or maxes of `bags` of embeddings, without instantiating the
intermediate embeddings.
See :class:`torch.nn.EmbeddingBag` for more details.
+
.. include:: cuda_deterministic_backward.rst
Args:
- :attr:`input` (LongTensor) and :attr:`offsets` (LongTensor, optional)
- - If :attr:`input` is 2D of shape ``B x N``,
+ - If :attr:`input` is 2D of shape `(B, N)`,
it will be treated as ``B`` bags (sequences) each of fixed length ``N``, and
this will return ``B`` values aggregated in a way depending on the :attr:`mode`.
:attr:`offsets` is ignored and required to be ``None`` in this case.
- - If :attr:`input` is 1D of shape ``N``,
+ - If :attr:`input` is 1D of shape `(N)`,
it will be treated as a concatenation of multiple bags (sequences).
:attr:`offsets` is required to be a 1D tensor containing the
starting index positions of each bag in :attr:`input`. Therefore,
- for :attr:`offsets` of shape ``B``, :attr:`input` will be viewed as
+ for :attr:`offsets` of shape `(B)`, :attr:`input` will be viewed as
having ``B`` bags. Empty bags (i.e., having 0-length) will have
returned vectors filled by zeros.
- :attr:`weight` (Tensor): the learnable weights of the module of
- shape ``(num_embeddings x embedding_dim)``
+ shape `(num_embeddings, embedding_dim)`
- - :attr:`output`: aggregated embedding values of shape ``B x embedding_dim``
+ - :attr:`output`: aggregated embedding values of shape `(B, embedding_dim)`
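A minimal sketch of the 2D case above, using the signature shown in this hunk (the sizes and index values are illustrative only)::

    >>> import torch
    >>> import torch.nn.functional as F
    >>> weight = torch.randn(10, 3)                         # (num_embeddings, embedding_dim)
    >>> input = torch.tensor([[1, 2, 4, 5], [4, 3, 2, 9]])  # (B, N) = (2, 4); offsets stays None
    >>> F.embedding_bag(input, weight, mode='mean').shape
    torch.Size([2, 3])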
Examples::
function:
.. math::
- h_t = \text{tanh}(w_{ih} x_t + b_{ih} + w_{hh} h_{(t-1)} + b_{hh})
+ h_t = \text{tanh}(W_{ih} x_t + b_{ih} + W_{hh} h_{(t-1)} + b_{hh})
where :math:`h_t` is the hidden state at time `t`, :math:`x_t` is
the input at time `t`, and :math:`h_{(t-1)}` is the hidden state of the
previous layer at time `t-1` or the initial hidden state at time `0`.
- If :attr:`nonlinearity` is `'relu'`, then `ReLU` is used instead of `tanh`.
+ If :attr:`nonlinearity` is ``'relu'``, then `ReLU` is used instead of `tanh`.
Args:
input_size: The number of expected features in the input `x`
would mean stacking two RNNs together to form a `stacked RNN`,
with the second RNN taking in outputs of the first RNN and
computing the final results. Default: 1
- nonlinearity: The non-linearity to use. Can be either 'tanh' or 'relu'. Default: 'tanh'
+ nonlinearity: The non-linearity to use. Can be either ``'tanh'`` or ``'relu'``. Default: ``'tanh'``
bias: If ``False``, then the layer does not use bias weights `b_ih` and `b_hh`.
Default: ``True``
batch_first: If ``True``, then the input and output tensors are provided
Outputs: output, h_n
- **output** of shape `(seq_len, batch, num_directions * hidden_size)`: tensor
- containing the output features (`h_k`) from the last layer of the RNN,
- for each `k`. If a :class:`torch.nn.utils.rnn.PackedSequence` has
+ containing the output features (`h_t`) from the last layer of the RNN,
+ for each `t`. If a :class:`torch.nn.utils.rnn.PackedSequence` has
been given as the input, the output will also be a packed sequence.
For the unpacked case, the directions can be separated
using ``output.view(seq_len, batch, num_directions, hidden_size)``,
with forward and backward being direction `0` and `1` respectively.
Similarly, the directions can be separated in the packed case.
- - **h_n** (num_layers * num_directions, batch, hidden_size): tensor
- containing the hidden state for `k = seq_len`.
+ - **h_n** of shape `(num_layers * num_directions, batch, hidden_size)`: tensor
+ containing the hidden state for `t = seq_len`.
Like *output*, the layers can be separated using
``h_n.view(num_layers, num_directions, batch, hidden_size)``.
Attributes:
weight_ih_l[k]: the learnable input-hidden weights of the k-th layer,
- of shape `(hidden_size * input_size)` for `k = 0`. Otherwise, the shape is
- `(hidden_size * hidden_size)`
+ of shape `(hidden_size, input_size)` for `k = 0`. Otherwise, the shape is
+ `(hidden_size, num_directions * hidden_size)`
weight_hh_l[k]: the learnable hidden-hidden weights of the k-th layer,
- of shape `(hidden_size * hidden_size)`
+ of shape `(hidden_size, hidden_size)`
bias_ih_l[k]: the learnable input-hidden bias of the k-th layer,
of shape `(hidden_size)`
bias_hh_l[k]: the learnable hidden-hidden bias of the k-th layer,
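A minimal shape-check for the *output* and *h_n* contracts described above (hyper-parameters chosen only for illustration)::

    >>> import torch
    >>> import torch.nn as nn
    >>> rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2)
    >>> input = torch.randn(5, 3, 10)    # (seq_len, batch, input_size)
    >>> h0 = torch.randn(2, 3, 20)       # (num_layers * num_directions, batch, hidden_size)
    >>> output, h_n = rnn(input, h0)
    >>> output.shape, h_n.shape
    (torch.Size([5, 3, 20]), torch.Size([2, 3, 20]))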
is the hidden state of the layer at time `t-1` or the initial hidden
state at time `0`, and :math:`i_t`, :math:`f_t`, :math:`g_t`,
:math:`o_t` are the input, forget, cell, and output gates, respectively.
- :math:`\sigma` is the sigmoid function.
+ :math:`\sigma` is the sigmoid function, and :math:`*` is the Hadamard product.
- In a multilayer LSTM, the input :math:`i^{(l)}_t` of the :math:`l` -th layer
+ In a multilayer LSTM, the input :math:`x^{(l)}_t` of the :math:`l` -th layer
(:math:`l >= 2`) is the hidden state :math:`h^{(l-1)}_t` of the previous layer multiplied by
- dropout :math:`\delta^{(l-1)}_t` where each :math:`\delta^{(l-1)_t}` is a Bernoulli random
+ dropout :math:`\delta^{(l-1)}_t` where each :math:`\delta^{(l-1)}_t` is a Bernoulli random
variable which is :math:`0` with probability :attr:`dropout`.
Args:
:func:`torch.nn.utils.rnn.pack_sequence` for details.
- **h_0** of shape `(num_layers * num_directions, batch, hidden_size)`: tensor
containing the initial hidden state for each element in the batch.
- If the RNN is bidirectional, num_directions should be 2, else it should be 1.
+ If the LSTM is bidirectional, num_directions should be 2, else it should be 1.
- **c_0** of shape `(num_layers * num_directions, batch, hidden_size)`: tensor
containing the initial cell state for each element in the batch.
Outputs: output, (h_n, c_n)
- **output** of shape `(seq_len, batch, num_directions * hidden_size)`: tensor
containing the output features `(h_t)` from the last layer of the LSTM,
- for each t. If a :class:`torch.nn.utils.rnn.PackedSequence` has been
+ for each `t`. If a :class:`torch.nn.utils.rnn.PackedSequence` has been
given as the input, the output will also be a packed sequence.
For the unpacked case, the directions can be separated
Like *output*, the layers can be separated using
``h_n.view(num_layers, num_directions, batch, hidden_size)`` and similarly for *c_n*.
- - **c_n** (num_layers * num_directions, batch, hidden_size): tensor
- containing the cell state for `t = seq_len`
+ - **c_n** of shape `(num_layers * num_directions, batch, hidden_size)`: tensor
+ containing the cell state for `t = seq_len`.
Attributes:
weight_ih_l[k] : the learnable input-hidden weights of the :math:`\text{k}^{th}` layer
- `(W_ii|W_if|W_ig|W_io)`, of shape `(4*hidden_size x input_size)`
+ `(W_ii|W_if|W_ig|W_io)`, of shape `(4*hidden_size, input_size)` for `k = 0`.
+ Otherwise, the shape is `(4*hidden_size, num_directions * hidden_size)`
weight_hh_l[k] : the learnable hidden-hidden weights of the :math:`\text{k}^{th}` layer
- `(W_hi|W_hf|W_hg|W_ho)`, of shape `(4*hidden_size x hidden_size)`
+ `(W_hi|W_hf|W_hg|W_ho)`, of shape `(4*hidden_size, hidden_size)`
bias_ih_l[k] : the learnable input-hidden bias of the :math:`\text{k}^{th}` layer
`(b_ii|b_if|b_ig|b_io)`, of shape `(4*hidden_size)`
bias_hh_l[k] : the learnable hidden-hidden bias of the :math:`\text{k}^{th}` layer
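The same kind of shape-check for the LSTM contracts above (hyper-parameters chosen only for illustration)::

    >>> import torch
    >>> import torch.nn as nn
    >>> lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)
    >>> input = torch.randn(5, 3, 10)    # (seq_len, batch, input_size)
    >>> h0, c0 = torch.randn(2, 3, 20), torch.randn(2, 3, 20)
    >>> output, (h_n, c_n) = lstm(input, (h0, c0))
    >>> output.shape, h_n.shape, c_n.shape
    (torch.Size([5, 3, 20]), torch.Size([2, 3, 20]), torch.Size([2, 3, 20]))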
\begin{array}{ll}
r_t = \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\
z_t = \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{(t-1)} + b_{hz}) \\
- n_t = \tanh(W_{in} x_t + b_{in} + r_t (W_{hn} h_{(t-1)}+ b_{hn})) \\
+ n_t = \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)}+ b_{hn})) \\
h_t = (1 - z_t) * n_t + z_t * h_{(t-1)}
\end{array}
at time `t`, :math:`h_{(t-1)}` is the hidden state of the layer
at time `t-1` or the initial hidden state at time `0`, and :math:`r_t`,
:math:`z_t`, :math:`n_t` are the reset, update, and new gates, respectively.
- :math:`\sigma` is the sigmoid function.
+ :math:`\sigma` is the sigmoid function, and :math:`*` is the Hadamard product.
- In a multilayer GRU, the input :math:`i^{(l)}_t` of the :math:`l` -th layer
+ In a multilayer GRU, the input :math:`x^{(l)}_t` of the :math:`l` -th layer
(:math:`l >= 2`) is the hidden state :math:`h^{(l-1)}_t` of the previous layer multiplied by
- dropout :math:`\delta^{(l-1)}_t` where each :math:`\delta^{(l-1)_t}` is a Bernoulli random
+ dropout :math:`\delta^{(l-1)}_t` where each :math:`\delta^{(l-1)}_t` is a Bernoulli random
variable which is :math:`0` with probability :attr:`dropout`.
Args:
Outputs: output, h_n
- **output** of shape `(seq_len, batch, num_directions * hidden_size)`: tensor
-       containing the output features h_t from the last layer of the GRU,
+       containing the output features `h_t` from the last layer of the GRU,
- for each t. If a :class:`torch.nn.utils.rnn.PackedSequence` has been
+ for each `t`. If a :class:`torch.nn.utils.rnn.PackedSequence` has been
given as the input, the output will also be a packed sequence.
For the unpacked case, the directions can be separated
using ``output.view(seq_len, batch, num_directions, hidden_size)``,
Attributes:
weight_ih_l[k] : the learnable input-hidden weights of the :math:`\text{k}^{th}` layer
- (W_ir|W_iz|W_in), of shape `(3*hidden_size x input_size)`
+ (W_ir|W_iz|W_in), of shape `(3*hidden_size, input_size)` for `k = 0`.
+ Otherwise, the shape is `(3*hidden_size, num_directions * hidden_size)`
weight_hh_l[k] : the learnable hidden-hidden weights of the :math:`\text{k}^{th}` layer
- (W_hr|W_hz|W_hn), of shape `(3*hidden_size x hidden_size)`
+ (W_hr|W_hz|W_hn), of shape `(3*hidden_size, hidden_size)`
bias_ih_l[k] : the learnable input-hidden bias of the :math:`\text{k}^{th}` layer
(b_ir|b_iz|b_in), of shape `(3*hidden_size)`
bias_hh_l[k] : the learnable hidden-hidden bias of the :math:`\text{k}^{th}` layer
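And the corresponding shape-check for the GRU (hyper-parameters chosen only for illustration)::

    >>> import torch
    >>> import torch.nn as nn
    >>> gru = nn.GRU(input_size=10, hidden_size=20, num_layers=2)
    >>> input = torch.randn(5, 3, 10)    # (seq_len, batch, input_size)
    >>> h0 = torch.randn(2, 3, 20)       # (num_layers * num_directions, batch, hidden_size)
    >>> output, h_n = gru(input, h0)
    >>> output.shape, h_n.shape
    (torch.Size([5, 3, 20]), torch.Size([2, 3, 20]))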
.. math::
- h' = \tanh(w_{ih} x + b_{ih} + w_{hh} h + b_{hh})
+ h' = \tanh(W_{ih} x + b_{ih} + W_{hh} h + b_{hh})
- If :attr:`nonlinearity` is `'relu'`, then ReLU is used in place of tanh.
+ If :attr:`nonlinearity` is ``'relu'``, then ReLU is used in place of tanh.
hidden_size: The number of features in the hidden state `h`
bias: If ``False``, then the layer does not use bias weights `b_ih` and `b_hh`.
Default: ``True``
- nonlinearity: The non-linearity to use. Can be either 'tanh' or 'relu'. Default: 'tanh'
+ nonlinearity: The non-linearity to use. Can be either ``'tanh'`` or ``'relu'``. Default: ``'tanh'``
Inputs: input, hidden
- **input** of shape `(batch, input_size)`: tensor containing input features
Attributes:
weight_ih: the learnable input-hidden weights, of shape
- `(hidden_size x input_size)`
+ `(hidden_size, input_size)`
weight_hh: the learnable hidden-hidden weights, of shape
- `(hidden_size x hidden_size)`
+ `(hidden_size, hidden_size)`
bias_ih: the learnable input-hidden bias, of shape `(hidden_size)`
bias_hh: the learnable hidden-hidden bias, of shape `(hidden_size)`
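A minimal sketch of driving a single cell over a sequence, consistent with the `(batch, input_size)` / `(batch, hidden_size)` shapes above (sizes are illustrative)::

    >>> import torch
    >>> import torch.nn as nn
    >>> cell = nn.RNNCell(input_size=10, hidden_size=20)
    >>> input = torch.randn(6, 3, 10)    # (seq_len, batch, input_size)
    >>> hx = torch.randn(3, 20)          # (batch, hidden_size)
    >>> for t in range(6):
    ...     hx = cell(input[t], hx)      # one time step per call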
h' = o * \tanh(c') \\
\end{array}
- where :math:`\sigma` is the sigmoid function.
+ where :math:`\sigma` is the sigmoid function, and :math:`*` is the Hadamard product.
Args:
input_size: The number of expected features in the input `x`
hidden_size: The number of features in the hidden state `h`
- bias: If `False`, then the layer does not use bias weights `b_ih` and
+ bias: If ``False``, then the layer does not use bias weights `b_ih` and
`b_hh`. Default: ``True``
Inputs: input, (h_0, c_0)
If `(h_0, c_0)` is not provided, both **h_0** and **c_0** default to zero.
- Outputs: h_1, c_1
+ Outputs: (h_1, c_1)
- **h_1** of shape `(batch, hidden_size)`: tensor containing the next hidden state
for each element in the batch
- **c_1** of shape `(batch, hidden_size)`: tensor containing the next cell state
Attributes:
weight_ih: the learnable input-hidden weights, of shape
- `(4*hidden_size x input_size)`
+ `(4*hidden_size, input_size)`
weight_hh: the learnable hidden-hidden weights, of shape
- `(4*hidden_size x hidden_size)`
+ `(4*hidden_size, hidden_size)`
bias_ih: the learnable input-hidden bias, of shape `(4*hidden_size)`
bias_hh: the learnable hidden-hidden bias, of shape `(4*hidden_size)`
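A one-step sketch matching the `(h_1, c_1)` output contract above (sizes are illustrative)::

    >>> import torch
    >>> import torch.nn as nn
    >>> cell = nn.LSTMCell(input_size=10, hidden_size=20)
    >>> input = torch.randn(3, 10)       # (batch, input_size)
    >>> hx, cx = torch.randn(3, 20), torch.randn(3, 20)
    >>> hx, cx = cell(input, (hx, cx))   # next hidden and cell state
    >>> hx.shape, cx.shape
    (torch.Size([3, 20]), torch.Size([3, 20]))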
h' = (1 - z) * n + z * h
\end{array}
- where :math:`\sigma` is the sigmoid function.
+ where :math:`\sigma` is the sigmoid function, and :math:`*` is the Hadamard product.
Args:
input_size: The number of expected features in the input `x`
hidden_size: The number of features in the hidden state `h`
- bias: If `False`, then the layer does not use bias weights `b_ih` and
- `b_hh`. Default: `True`
+ bias: If ``False``, then the layer does not use bias weights `b_ih` and
+ `b_hh`. Default: ``True``
Inputs: input, hidden
- **input** of shape `(batch, input_size)`: tensor containing input features
Attributes:
weight_ih: the learnable input-hidden weights, of shape
- `(3*hidden_size x input_size)`
+ `(3*hidden_size, input_size)`
weight_hh: the learnable hidden-hidden weights, of shape
- `(3*hidden_size x hidden_size)`
+ `(3*hidden_size, hidden_size)`
bias_ih: the learnable input-hidden bias, of shape `(3*hidden_size)`
bias_hh: the learnable hidden-hidden bias, of shape `(3*hidden_size)`
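And the one-step sketch for the GRU cell (sizes are illustrative)::

    >>> import torch
    >>> import torch.nn as nn
    >>> cell = nn.GRUCell(input_size=10, hidden_size=20)
    >>> input = torch.randn(3, 10)       # (batch, input_size)
    >>> hx = torch.randn(3, 20)          # (batch, hidden_size)
    >>> hx = cell(input, hx)
    >>> hx.shape
    torch.Size([3, 20])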
Args:
embeddings (Tensor): FloatTensor containing weights for the Embedding.
- First dimension is being passed to Embedding as 'num_embeddings', second as 'embedding_dim'.
+ First dimension is being passed to Embedding as ``num_embeddings``, second as ``embedding_dim``.
freeze (boolean, optional): If ``True``, the tensor does not get updated in the learning process.
Equivalent to ``embedding.weight.requires_grad = False``. Default: ``True``
padding_idx (int, optional): See module initialization documentation.
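A minimal sketch of the factory described above (weight values chosen only for illustration)::

    >>> import torch
    >>> import torch.nn as nn
    >>> weights = torch.FloatTensor([[1.0, 2.3, 3.0], [4.0, 5.1, 6.3]])  # 2 embeddings of dim 3
    >>> embedding = nn.Embedding.from_pretrained(weights)  # frozen by default (freeze=True)
    >>> embedding(torch.LongTensor([1]))
    tensor([[4.0000, 5.1000, 6.3000]])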
For bags of constant length, this class
- * with ``mode="sum"`` is equivalent to :class:`~torch.nn.Embedding` followed by ``torch.sum(dim=1)``,
- * with ``mode="mean"`` is equivalent to :class:`~torch.nn.Embedding` followed by ``torch.mean(dim=1)``,
- * with ``mode="max"`` is equivalent to :class:`~torch.nn.Embedding` followed by ``torch.max(dim=1)``.
+ * with ``mode="sum"`` is equivalent to :class:`~torch.nn.Embedding` followed by ``torch.sum(dim=0)``,
+ * with ``mode="mean"`` is equivalent to :class:`~torch.nn.Embedding` followed by ``torch.mean(dim=0)``,
+ * with ``mode="max"`` is equivalent to :class:`~torch.nn.Embedding` followed by ``torch.max(dim=0)``.
However, :class:`~torch.nn.EmbeddingBag` is much more time and memory efficient than using a chain of these
operations.
supported when ``mode="max"``.
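A sketch of the per-bag reading of the equivalence above: for a single 1D bag, :class:`~torch.nn.Embedding` yields `(N, embedding_dim)` and the reduction runs over `dim=0` (the functional API is used here only for brevity; values are illustrative)::

    >>> import torch
    >>> import torch.nn.functional as F
    >>> weight = torch.randn(10, 3)
    >>> bag = torch.tensor([1, 2, 4, 5])   # one bag of indices
    >>> a = F.embedding_bag(bag, weight, offsets=torch.tensor([0]), mode='sum')
    >>> b = F.embedding(bag, weight).sum(dim=0)   # Embedding followed by sum over dim=0
    >>> torch.allclose(a.squeeze(0), b)
    True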
Attributes:
- weight (Tensor): the learnable weights of the module of shape ``(num_embeddings x embedding_dim)``
+ weight (Tensor): the learnable weights of the module of shape `(num_embeddings, embedding_dim)`
initialized from :math:`\mathcal{N}(0, 1)`.
Inputs: :attr:`input` (LongTensor) and :attr:`offsets` (LongTensor, optional)
- - If :attr:`input` is 2D of shape ``B x N``,
+ - If :attr:`input` is 2D of shape `(B, N)`,
it will be treated as ``B`` bags (sequences) each of fixed length ``N``, and
this will return ``B`` values aggregated in a way depending on the :attr:`mode`.
:attr:`offsets` is ignored and required to be ``None`` in this case.
- - If :attr:`input` is 1D of shape ``N``,
+ - If :attr:`input` is 1D of shape `(N)`,
it will be treated as a concatenation of multiple bags (sequences).
:attr:`offsets` is required to be a 1D tensor containing the
starting index positions of each bag in :attr:`input`. Therefore,
- for :attr:`offsets` of shape ``B``, :attr:`input` will be viewed as
+ for :attr:`offsets` of shape `(B)`, :attr:`input` will be viewed as
having ``B`` bags. Empty bags (i.e., having 0-length) will have
returned vectors filled by zeros.
- Output shape: ``B x embedding_dim``
+ Output shape: `(B, embedding_dim)`
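A minimal sketch of the 1D-input-plus-offsets case above (values chosen only for illustration)::

    >>> import torch
    >>> import torch.nn as nn
    >>> embedding_sum = nn.EmbeddingBag(10, 3, mode='sum')
    >>> input = torch.LongTensor([1, 2, 4, 5, 4, 3, 2, 9])  # concatenation of two bags
    >>> offsets = torch.LongTensor([0, 4])                  # each bag's start index, so B = 2
    >>> embedding_sum(input, offsets).shape
    torch.Size([2, 3])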
Examples::