Standardize bib references and Examples subsection in docstrings.

author Dustin Tran <trandustin@google.com>

Tue, 20 Mar 2018 00:41:25 +0000 (17:41 -0700)

committer TensorFlower Gardener <gardener@tensorflow.org>

Tue, 20 Mar 2018 00:45:20 +0000 (17:45 -0700)
diff --git a/tensorflow/contrib/distributions/python/ops/autoregressive.py b/tensorflow/contrib/distributions/python/ops/autoregressive.py

index 852298b..69f3d57 100644 (file)
--- a/tensorflow/contrib/distributions/python/ops/autoregressive.py
+++ b/tensorflow/contrib/distributions/python/ops/autoregressive.py
@@ -36,7 +36,8 @@ class Autoregressive(distribution_lib.Distribution):
      "Autoregressive models decompose the joint density as a product of
      conditionals, and model each conditional in turn. Normalizing flows
      transform a base density (e.g. a standard Gaussian) into the target density
-    by an invertible transformation with tractable Jacobian." [1]
+    by an invertible transformation with tractable Jacobian." [(Papamakarios et
+    al., 2017)][1]
  
    In other words, the "autoregressive property" is equivalent to the
    decomposition, `p(x) = prod{ p(x[i] | x[0:i]) : i=0, ..., d }`. The provided
@@ -45,17 +46,18 @@ class Autoregressive(distribution_lib.Distribution):
  
    Practically speaking the autoregressive property means that there exists a
    permutation of the event coordinates such that each coordinate is a
-  diffeomorphic function of only preceding coordinates. [2]
+  diffeomorphic function of only preceding coordinates
+  [(van den Oord et al., 2016)][2].
  
    #### Mathematical Details
  
-  The probability function is,
+  The probability function is
  
    ```none
    prob(x; fn, n) = fn(x).prob(x)
    ```
  
-  And a sample is generated by,
+  And a sample is generated by
  
    ```none
    x = fn(...fn(fn(x0).sample()).sample()).sample()
@@ -93,13 +95,15 @@ class Autoregressive(distribution_lib.Distribution):
  
    ```
  
-  [1]: "Masked Autoregressive Flow for Density Estimation."
-       George Papamakarios, Theo Pavlakou, Iain Murray. Arxiv. 2017.
-       https://arxiv.org/abs/1705.07057
+  #### References
  
-  [2]: "Conditional Image Generation with PixelCNN Decoders."
-       Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex
-       Graves, Koray Kavukcuoglu. Arxiv, 2016.
+  [1]: George Papamakarios, Theo Pavlakou, and Iain Murray. Masked
+       Autoregressive Flow for Density Estimation. In _Neural Information
+       Processing Systems_, 2017. https://arxiv.org/abs/1705.07057
+
+  [2]: Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt,
+       Alex Graves, and Koray Kavukcuoglu. Conditional Image Generation with
+       PixelCNN Decoders. In _Neural Information Processing Systems_, 2016.
         https://arxiv.org/abs/1606.05328
    """
  
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/affine.py b/tensorflow/contrib/distributions/python/ops/bijectors/affine.py

index 7fe73ad..bef7bbb 100644 (file)
--- a/tensorflow/contrib/distributions/python/ops/bijectors/affine.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/affine.py
@@ -62,7 +62,7 @@ class Affine(bijector.Bijector):
    matrices, i.e., the matmul is [matrix-free](
    https://en.wikipedia.org/wiki/Matrix-free_methods) when possible.
  
-  Examples:
+  #### Examples
  
    ```python
    # Y = X
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/batch_normalization.py b/tensorflow/contrib/distributions/python/ops/bijectors/batch_normalization.py

index be72ff3..33fdd32 100644 (file)
--- a/tensorflow/contrib/distributions/python/ops/bijectors/batch_normalization.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/batch_normalization.py
@@ -76,15 +76,16 @@ def _undo_batch_normalization(x,
  class BatchNormalization(bijector.Bijector):
    """Compute `Y = g(X) s.t. X = g^-1(Y) = (Y - mean(Y)) / std(Y)`.
  
-  Applies Batch Normalization [1] to samples from a data distribution. This can
-  be used to stabilize training of normalizing flows [2, 3].
+  Applies Batch Normalization [(Ioffe and Szegedy, 2015)][1] to samples from a
+  data distribution. This can be used to stabilize training of normalizing
+  flows ([Papamakarios et al., 2017][3]; [Dinh et al., 2017][2]).
  
    When training Deep Neural Networks (DNNs), it is common practice to
    normalize or whiten features by shifting them to have zero mean and
    scaling them to have unit variance.
  
-  The `inverse()` method of the BatchNorm bijector, which is used in the
-  log-likelihood computation of data samples, implements the normalization
+  The `inverse()` method of the `BatchNormalization` bijector, which is used in
+  the log-likelihood computation of data samples, implements the normalization
    procedure (shift-and-scale) using the mean and standard deviation of the
    current minibatch.
  
@@ -92,7 +93,6 @@ class BatchNormalization(bijector.Bijector):
    `X*std(Y) + mean(Y)` with the running-average mean and standard deviation
    computed at training-time. De-normalization is useful for sampling.
  
-
    ```python
  
    dist = tfd.TransformedDistribution(
@@ -112,19 +112,20 @@ class BatchNormalization(bijector.Bijector):
    `BatchNorm.forward(BatchNorm.inverse(...))` will be identical when
    `training=False` but may be different when `training=True`.
  
-  [1]: "Batch Normalization: Accelerating Deep Network Training by Reducing
-       Internal Covariate Shift."
-       Sergey Ioffe, Christian Szegedy. Arxiv. 2015.
-       https://arxiv.org/abs/1502.03167
+  #### References
  
-  [2]: "Density Estimation using Real NVP."
-     Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio. ICLR. 2017.
-     https://arxiv.org/abs/1605.08803
+  [1]: Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating
+       Deep Network Training by Reducing Internal Covariate Shift. In
+       _International Conference on Machine Learning_, 2015.
+       https://arxiv.org/abs/1502.03167
  
-  [3]: "Masked Autoregressive Flow for Density Estimation."
-       George Papamakarios, Theo Pavlakou, Iain Murray. Arxiv. 2017.
-       https://arxiv.org/abs/1705.07057
+  [2]: Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density Estimation
+       using Real NVP. In _International Conference on Learning
+       Representations_, 2017. https://arxiv.org/abs/1605.08803
  
+  [3]: George Papamakarios, Theo Pavlakou, and Iain Murray. Masked
+       Autoregressive Flow for Density Estimation. In _Neural Information
+       Processing Systems_, 2017. https://arxiv.org/abs/1705.07057
    """
  
    def __init__(self,
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/cholesky_outer_product.py b/tensorflow/contrib/distributions/python/ops/bijectors/cholesky_outer_product.py

index 43208ff..8f09e16 100644 (file)
--- a/tensorflow/contrib/distributions/python/ops/bijectors/cholesky_outer_product.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/cholesky_outer_product.py
@@ -57,7 +57,7 @@ class CholeskyOuterProduct(bijector.Bijector):
    that, if `I = L_3 @ L_3.T`, with L_3 being lower-triangular with positive-
    diagonal, then `L_3 = I`. Thus, `L_1 = L_2`, proving injectivity of g.
  
-  Examples:
+  #### Examples
  
    ```python
    bijector.CholeskyOuterProduct().forward(x=[[1., 0], [2, 1]])
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py b/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py

index 5251dbc..84b2340 100644 (file)
--- a/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py
@@ -45,14 +45,15 @@ __all__ = [
  class MaskedAutoregressiveFlow(bijector_lib.Bijector):
    """Affine MaskedAutoregressiveFlow bijector for vector-valued events.
  
-  The affine autoregressive flow [1] provides a relatively simple framework for
-  user-specified (deep) architectures to learn a distribution over vector-valued
-  events. Regarding terminology,
+  The affine autoregressive flow [(Papamakarios et al., 2017)][3] provides a
+  relatively simple framework for user-specified (deep) architectures to learn
+  a distribution over vector-valued events. Regarding terminology,
  
      "Autoregressive models decompose the joint density as a product of
      conditionals, and model each conditional in turn. Normalizing flows
      transform a base density (e.g. a standard Gaussian) into the target density
-    by an invertible transformation with tractable Jacobian." [1]
+    by an invertible transformation with tractable Jacobian."
+    [(Papamakarios et al., 2017)][3]
  
    In other words, the "autoregressive property" is equivalent to the
    decomposition, `p(x) = prod{ p(x[i] | x[0:i]) : i=0, ..., d }`. The provided
@@ -75,26 +76,26 @@ class MaskedAutoregressiveFlow(bijector_lib.Bijector):
  
    Given a `shift_and_log_scale_fn`, the forward and inverse transformations are
    (a sequence of) affine transformations. A "valid" `shift_and_log_scale_fn`
-  must compute each `shift` (aka `loc` or "mu" [2]) and `log(scale)` (aka
-  "alpha" [2]) such that each are broadcastable with the arguments to `forward`
-  and `inverse`, i.e., such that the calculations in `forward`, `inverse`
-  [below] are possible.
+  must compute each `shift` (aka `loc` or "mu" in [Germain et al. (2015)][1])
+  and `log(scale)` (aka "alpha" in [Germain et al. (2015)][1]) such that each
+  are broadcastable with the arguments to `forward` and `inverse`, i.e., such
+  that the calculations in `forward`, `inverse` [below] are possible.
  
    For convenience, `masked_autoregressive_default_template` is offered as a
    possible `shift_and_log_scale_fn` function. It implements the MADE
-  architecture [2]. MADE is a feed-forward network that computes a `shift` and
-  `log(scale)` using `masked_dense` layers in a deep neural network. Weights are
-  masked to ensure the autoregressive property. It is possible that this
-  architecture is suboptimal for your task. To build alternative networks,
-  either change the arguments to `masked_autoregressive_default_template`, use
-  the `masked_dense` function to roll-out your own, or use some other
-  architecture, e.g., using `tf.layers`.
+  architecture [(Germain et al., 2015)][1]. MADE is a feed-forward network that
+  computes a `shift` and `log(scale)` using `masked_dense` layers in a deep
+  neural network. Weights are masked to ensure the autoregressive property. It
+  is possible that this architecture is suboptimal for your task. To build
+  alternative networks, either change the arguments to
+  `masked_autoregressive_default_template`, use the `masked_dense` function to
+  roll-out your own, or use some other architecture, e.g., using `tf.layers`.
  
    Warning: no attempt is made to validate that the `shift_and_log_scale_fn`
    enforces the "autoregressive property".
  
    Assuming `shift_and_log_scale_fn` has valid shape and autoregressive
-  semantics, the forward transformation is,
+  semantics, the forward transformation is
  
    ```python
    def forward(x):
@@ -106,7 +107,7 @@ class MaskedAutoregressiveFlow(bijector_lib.Bijector):
      return y
    ```
  
-  and the inverse transformation is,
+  and the inverse transformation is
  
    ```python
    def inverse(y):
@@ -121,7 +122,7 @@ class MaskedAutoregressiveFlow(bijector_lib.Bijector):
    the "last" `y` used to compute `shift`, `log_scale`. (Roughly speaking, this
    also proves the transform is bijective.)
  
-  #### Example Use
+  #### Examples
  
    ```python
    tfd = tf.contrib.distributions
@@ -142,7 +143,8 @@ class MaskedAutoregressiveFlow(bijector_lib.Bijector):
    maf.log_prob(x)   # Almost free; uses Bijector caching.
    maf.log_prob(0.)  # Cheap; no `tf.while_loop` despite no Bijector caching.
  
-  # [1] also describes an "Inverse Autoregressive Flow", e.g.,
+  # [Papamakarios et al. (2017)][3] also describe an Inverse Autoregressive
+  # Flow [(Kingma et al., 2016)][2]:
    iaf = tfd.TransformedDistribution(
        distribution=tfd.Normal(loc=0., scale=1.),
        bijector=tfb.Invert(tfb.MaskedAutoregressiveFlow(
@@ -168,14 +170,20 @@ class MaskedAutoregressiveFlow(bijector_lib.Bijector):
        event_shape=[dims])
    ```
  
-  [1]: "Masked Autoregressive Flow for Density Estimation."
-       George Papamakarios, Theo Pavlakou, Iain Murray. Arxiv. 2017.
-       https://arxiv.org/abs/1705.07057
+  #### References
  
-  [2]: "MADE: Masked Autoencoder for Distribution Estimation."
-       Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle. ICML. 2015.
-       https://arxiv.org/abs/1502.03509
+  [1]: Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE:
+       Masked Autoencoder for Distribution Estimation. In _International
+       Conference on Machine Learning_, 2015. https://arxiv.org/abs/1502.03509
  
+  [2]: Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya
+       Sutskever, and Max Welling. Improving Variational Inference with Inverse
+       Autoregressive Flow. In _Neural Information Processing Systems_, 2016.
+       https://arxiv.org/abs/1606.04934
+
+  [3]: George Papamakarios, Theo Pavlakou, and Iain Murray. Masked
+       Autoregressive Flow for Density Estimation. In _Neural Information
+       Processing Systems_, 2017. https://arxiv.org/abs/1705.07057
    """
  
    def __init__(self,
@@ -329,11 +337,7 @@ def masked_dense(inputs,
                   **kwargs):
    """An autoregressively masked dense layer. Analogous to `tf.layers.dense`.
  
-  See [1] for detailed explanation.
-
-  [1]: "MADE: Masked Autoencoder for Distribution Estimation."
-       Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle. ICML. 2015.
-       https://arxiv.org/abs/1502.03509
+  See [Germain et al. (2015)][1] for detailed explanation.
  
    Arguments:
      inputs: Tensor input.
@@ -358,6 +362,12 @@ def masked_dense(inputs,
    Raises:
      NotImplementedError: if rightmost dimension of `inputs` is unknown prior to
        graph execution.
+
+  #### References
+
+  [1]: Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE:
+       Masked Autoencoder for Distribution Estimation. In _International
+       Conference on Machine Learning_, 2015. https://arxiv.org/abs/1502.03509
    """
    # TODO(b/67594795): Better support of dynamic shape.
    input_depth = inputs.shape.with_rank_at_least(1)[-1].value
@@ -398,23 +408,24 @@ def masked_autoregressive_default_template(
      name=None,
      *args,
      **kwargs):
-  """Build the MADE Model [1].
+  """Build the Masked Autoencoder for Distribution Estimation (Germain et al., 2015).
  
    This will be wrapped in a make_template to ensure the variables are only
-  created once. It takes the input and returns the `loc` ("mu" [1]) and
-  `log_scale` ("alpha" [1]) from the MADE network.
+  created once. It takes the input and returns the `loc` ("mu" in [Germain et
+  al. (2015)][1]) and `log_scale` ("alpha" in [Germain et al. (2015)][1]) from
+  the MADE network.
  
    Warning: This function uses `masked_dense` to create randomly initialized
    `tf.Variables`. It is presumed that these will be fit, just as you would any
    other neural architecture which uses `tf.layers.dense`.
  
-  #### About Hidden Layers:
+  #### About Hidden Layers
  
    Each element of `hidden_layers` should be greater than the `input_depth`
    (i.e., `input_depth = tf.shape(input)[-1]` where `input` is the input to the
    neural network). This is necessary to ensure the autoregressivity property.
  
-  #### About Clipping:
+  #### About Clipping
  
    This function also optionally clips the `log_scale` (but possibly not its
    gradient). This is useful because if `log_scale` is too small/large it might
@@ -427,11 +438,7 @@ def masked_autoregressive_default_template(
    `grad[exp(clip(x))] = grad[x] exp(clip(x))` rather than the usual
    `grad[clip(x)] exp(clip(x))`.
  
-  [1]: "MADE: Masked Autoencoder for Distribution Estimation."
-       Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle. ICML. 2015.
-       https://arxiv.org/abs/1502.03509
-
-  Arguments:
+  Args:
      hidden_layers: Python `list`-like of non-negative integer scalars
        indicating the number of units in each hidden layer. Default: `[512, 512]`.
      shift_only: Python `bool` indicating if only the `shift` term shall be
@@ -450,12 +457,20 @@ def masked_autoregressive_default_template(
      **kwargs: `tf.layers.dense` keyword arguments.
  
    Returns:
-    shift: `Float`-like `Tensor` of shift terms (the "mu" in [2]).
-    log_scale: `Float`-like `Tensor` of log(scale) terms (the "alpha" in [2]).
+    shift: `Float`-like `Tensor` of shift terms (the "mu" in
+      [Germain et al. (2015)][1]).
+    log_scale: `Float`-like `Tensor` of log(scale) terms (the "alpha" in
+      [Germain et al. (2015)][1]).
  
    Raises:
      NotImplementedError: if rightmost dimension of `inputs` is unknown prior to
        graph execution.
+
+  #### References
+
+  [1]: Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE:
+       Masked Autoencoder for Distribution Estimation. In _International
+       Conference on Machine Learning_, 2015. https://arxiv.org/abs/1502.03509
    """
  
    with ops.name_scope(name, "masked_autoregressive_default_template",
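
Note on the `MaskedAutoregressiveFlow` docstring above: the forward and inverse transformations it describes can be illustrated with a minimal pure-Python sketch. This is not the library implementation. `toy_shift_and_log_scale_fn` is a hypothetical stand-in whose `i`-th outputs depend only on `y[:i]`, and plain lists replace `Tensor`s.

```python
import math


def toy_shift_and_log_scale_fn(y):
    """Hypothetical autoregressive function: output i depends only on y[:i]."""
    shift, log_scale = [], []
    prefix = 0.0  # running sum of the preceding coordinates
    for value in y:
        shift.append(0.5 * prefix)
        log_scale.append(0.1 * math.tanh(prefix))
        prefix += value
    return shift, log_scale


def forward(x):
    """Map x to y; each pass fixes one more leading coordinate of y."""
    y = [0.0] * len(x)
    for _ in range(len(x)):
        shift, log_scale = toy_shift_and_log_scale_fn(y)
        y = [xi * math.exp(s) + m for xi, s, m in zip(x, log_scale, shift)]
    return y


def inverse(y):
    """Recover x from y in a single pass, since all of y is already known."""
    shift, log_scale = toy_shift_and_log_scale_fn(y)
    return [(yi - m) / math.exp(s) for yi, s, m in zip(y, log_scale, shift)]


x = [0.3, -1.2, 0.7]
y = forward(x)
x_recovered = inverse(y)
```

The asymmetry is the point of the sketch: `forward` needs one pass per event coordinate, while `inverse` needs only one pass in total.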
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/real_nvp.py b/tensorflow/contrib/distributions/python/ops/bijectors/real_nvp.py

index 2840f52..71ab369 100644 (file)
--- a/tensorflow/contrib/distributions/python/ops/bijectors/real_nvp.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/real_nvp.py
@@ -38,7 +38,7 @@ class RealNVP(bijector_lib.Bijector):
    """RealNVP "affine coupling layer" for vector-valued events.
  
    Real NVP models a normalizing flow on a `D`-dimensional distribution via a
-  single `D-d`-dimensional conditional distribution [1]:
+  single `D-d`-dimensional conditional distribution [(Dinh et al., 2017)][1]:
  
    `y[d:D] = x[d:D] * math_ops.exp(log_scale_fn(x[0:d])) + shift_fn(x[0:d])`
    `y[0:d] = x[0:d]`
@@ -51,31 +51,34 @@ class RealNVP(bijector_lib.Bijector):
  
    Masking is currently only supported for base distributions with
    `event_ndims=1`. For more sophisticated masking schemes like checkerboard or
-  channel-wise masking [2], use the `tfb.Permute` bijector to re-order desired
-  masked units into the first `d` units. For base distributions with
-  `event_ndims > 1`, use the `tfb.Reshape` bijector to flatten the event shape.
-
-  Recall that the MAF bijector [2] implements a normalizing flow via an
-  autoregressive transformation. MAF and IAF have opposite computational
-  tradeoffs - MAF can train all units in parallel but must sample units
-  sequentially, while IAF must train units sequentially but can sample in
-  parallel. In contrast, Real NVP can compute both forward and inverse
-  computations in parallel. However, the lack of an autoregressive
+  channel-wise masking [(Papamakarios et al., 2017)][4], use the `tfb.Permute`
+  bijector to re-order desired masked units into the first `d` units. For base
+  distributions with `event_ndims > 1`, use the `tfb.Reshape` bijector to
+  flatten the event shape.
+
+  Recall that the MAF bijector [(Papamakarios et al., 2017)][4] implements a
+  normalizing flow via an autoregressive transformation. MAF and IAF have
+  opposite computational tradeoffs - MAF can train all units in parallel but
+  must sample units sequentially, while IAF must train units sequentially but
+  can sample in parallel. In contrast, Real NVP can compute both forward and
+  inverse computations in parallel. However, the lack of an autoregressive
    transformation makes it less expressive on a per-bijector basis.
  
    A "valid" `shift_and_log_scale_fn` must compute each `shift` (aka `loc` or
-  "mu" [2]) and `log(scale)` (aka "alpha" [2]) such that each are broadcastable
-  with the arguments to `forward` and `inverse`, i.e., such that the
-  calculations in `forward`, `inverse` [below] are possible. For convenience,
+  "mu" in [Papamakarios et al. (2017)][4]) and `log(scale)` (aka "alpha" in
+  [Papamakarios et al. (2017)][4]) such that each are broadcastable with the
+  arguments to `forward` and `inverse`, i.e., such that the calculations in
+  `forward`, `inverse` [below] are possible. For convenience,
    `real_nvp_default_template` is offered as a possible `shift_and_log_scale_fn`
    function.
  
-  NICE [3] is a special case of the Real NVP bijector which discards the scale
-  transformation, resulting in a constant-time inverse-log-determinant-Jacobian.
-  To use a NICE bijector instead of Real NVP, `shift_and_log_scale_fn` should
-  return `(shift, None)`, and `is_constant_jacobian` should be set to `True` in
-  the `RealNVP` constructor. Calling `real_nvp_default_template` with
-  `shift_only=True` returns one such NICE-compatible `shift_and_log_scale_fn`.
+  NICE [(Dinh et al., 2014)][2] is a special case of the Real NVP bijector
+  which discards the scale transformation, resulting in a constant-time
+  inverse-log-determinant-Jacobian. To use a NICE bijector instead of Real
+  NVP, `shift_and_log_scale_fn` should return `(shift, None)`, and
+  `is_constant_jacobian` should be set to `True` in the `RealNVP` constructor.
+  Calling `real_nvp_default_template` with `shift_only=True` returns one such
+  NICE-compatible `shift_and_log_scale_fn`.
  
    Caching: the scalar input depth `D` of the base distribution is not known at
    construction time. The first call to any of `forward(x)`, `inverse(x)`,
@@ -103,23 +106,24 @@ class RealNVP(bijector_lib.Bijector):
    nvp.log_prob(0.)
    ```
  
-  For more examples, see [4].
+  For more examples, see [Jang (2018)][3].
  
-  [1]: "Density Estimation using Real NVP."
-       Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio. ICLR. 2017.
-       https://arxiv.org/abs/1605.08803
+  #### References
  
-  [2]: "Masked Autoregressive Flow for Density Estimation."
-       George Papamakarios, Theo Pavlakou, Iain Murray. Arxiv. 2017.
-       https://arxiv.org/abs/1705.07057
+  [1]: Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density Estimation
+       using Real NVP. In _International Conference on Learning
+       Representations_, 2017. https://arxiv.org/abs/1605.08803
  
-  [3]: "NICE: Non-linear Independent Components Estimation."
-       Laurent Dinh, David Krueger, Yoshua Bengio. ICLR. 2015.
-       https://arxiv.org/abs/1410.8516
+  [2]: Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear
+       Independent Components Estimation. _arXiv preprint arXiv:1410.8516_,
+       2014. https://arxiv.org/abs/1410.8516
  
-  [4]: "Normalizing Flows Tutorial, Part 2: Modern Normalizing Flows."
-       Eric Jang. Blog post. January 2018.
-       http://blog.evjang.com/2018/01/nf2.html
+  [3]: Eric Jang. Normalizing Flows Tutorial, Part 2: Modern Normalizing Flows.
+       _Technical Report_, 2018. http://blog.evjang.com/2018/01/nf2.html
+
+  [4]: George Papamakarios, Theo Pavlakou, and Iain Murray. Masked
+       Autoregressive Flow for Density Estimation. In _Neural Information
+       Processing Systems_, 2017. https://arxiv.org/abs/1705.07057
    """
  
    def __init__(self,
@@ -250,12 +254,20 @@ def real_nvp_default_template(
      **kwargs: `tf.layers.dense` keyword arguments.
  
    Returns:
-    shift: `Float`-like `Tensor` of shift terms (the "mu" in [2]).
-    log_scale: `Float`-like `Tensor` of log(scale) terms (the "alpha" in [2]).
+    shift: `Float`-like `Tensor` of shift terms ("mu" in
+      [Papamakarios et al. (2017)][1]).
+    log_scale: `Float`-like `Tensor` of log(scale) terms ("alpha" in
+      [Papamakarios et al. (2017)][1]).
  
    Raises:
      NotImplementedError: if rightmost dimension of `inputs` is unknown prior to
        graph execution.
+
+  #### References
+
+  [1]: George Papamakarios, Theo Pavlakou, and Iain Murray. Masked
+       Autoregressive Flow for Density Estimation. In _Neural Information
+       Processing Systems_, 2017. https://arxiv.org/abs/1705.07057
    """
  
    with ops.name_scope(name, "real_nvp_default_template"):
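
Note on the `RealNVP` docstring above: the affine coupling layer can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the library code. `toy_shift_and_log_scale_fn` is a hypothetical conditioner that returns one scalar shift and one scalar log-scale shared by all `D-d` transformed units, and plain lists stand in for `Tensor`s.

```python
import math


def toy_shift_and_log_scale_fn(x0):
    """Hypothetical conditioner mapping the first d units to (shift, log_scale)."""
    total = sum(x0)
    return 0.5 * total, 0.1 * math.tanh(total)


def forward(x, d):
    """Copy x[0:d] unchanged; scale and shift x[d:D] conditioned on x[0:d]."""
    shift, log_scale = toy_shift_and_log_scale_fn(x[:d])
    return x[:d] + [xi * math.exp(log_scale) + shift for xi in x[d:]]


def inverse(y, d):
    """Invert in one pass: y[0:d] equals x[0:d], so the conditioner is reusable."""
    shift, log_scale = toy_shift_and_log_scale_fn(y[:d])
    return y[:d] + [(yi - shift) / math.exp(log_scale) for yi in y[d:]]


def forward_log_det_jacobian(x, d):
    """Sum of log-scales over the D-d transformed units."""
    _, log_scale = toy_shift_and_log_scale_fn(x[:d])
    return (len(x) - d) * log_scale


x = [0.5, -0.25, 1.0, 2.0]
y = forward(x, d=2)
x_recovered = inverse(y, d=2)
```

Both directions take a single pass, unlike an autoregressive flow. A conditioner that always returns a log-scale of zero gives the shift-only NICE case, whose log-determinant is constant.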
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/square.py b/tensorflow/contrib/distributions/python/ops/bijectors/square.py

index 2831a92..1e9dbf3 100644 (file)
--- a/tensorflow/contrib/distributions/python/ops/bijectors/square.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/square.py
@@ -37,7 +37,7 @@ class Square(bijector.Bijector):
    g is a bijection between the non-negative real numbers (R_+) and the
    non-negative real numbers.
  
-  Examples:
+  #### Examples
  
    ```python
    bijector.Square().forward(x=[[1., 0], [2, 1]])
diff --git a/tensorflow/contrib/distributions/python/ops/kumaraswamy.py b/tensorflow/contrib/distributions/python/ops/kumaraswamy.py

index 120b38d..192dede 100644 (file)
--- a/tensorflow/contrib/distributions/python/ops/kumaraswamy.py
+++ b/tensorflow/contrib/distributions/python/ops/kumaraswamy.py
@@ -44,18 +44,16 @@ _kumaraswamy_sample_note = """Note: `x` must have dtype `self.dtype` and be in
  def _harmonic_number(x):
    """Compute the harmonic number from its analytic continuation.
  
-  Derivation from [1] and Euler's constant [2].
-  [1] -
-  https://en.wikipedia.org/wiki/Digamma_function#Relation_to_harmonic_numbers
-  [2] - https://en.wikipedia.org/wiki/Euler%E2%80%93Mascheroni_constant
-
+  Derivation from [here](
+  https://en.wikipedia.org/wiki/Digamma_function#Relation_to_harmonic_numbers)
+  and [Euler's constant](
+  https://en.wikipedia.org/wiki/Euler%E2%80%93Mascheroni_constant).
  
    Args:
      x: input float.
  
    Returns:
      z: The analytic continuation of the harmonic number for the input.
-
    """
    one = array_ops.ones([], dtype=x.dtype)
    return math_ops.digamma(x + one) - math_ops.digamma(one)
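
Note on `_harmonic_number` above: the identity `H(x) = digamma(x + 1) - digamma(1)` can be checked without TensorFlow. The sketch below is only illustrative. Its `digamma` is a hypothetical helper that uses the recurrence `digamma(x) = digamma(x + 1) - 1/x` followed by a truncated asymptotic series, so it is approximate and valid only for `x > 0`.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift the argument up until the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Truncated series: 1/(12 x^2) - 1/(120 x^4) + 1/(252 x^6)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def harmonic_number(x):
    """Analytic continuation of the harmonic number, for x >= 0."""
    return digamma(x + 1.0) - digamma(1.0)


h4 = harmonic_number(4.0)    # compare with 1 + 1/2 + 1/3 + 1/4
h_half = harmonic_number(0.5)  # compare with 2 - 2 * log(2)
```

For integer arguments the result agrees with the usual partial sum of the harmonic series up to the approximation error of the series.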
diff --git a/tensorflow/contrib/distributions/python/ops/moving_stats.py b/tensorflow/contrib/distributions/python/ops/moving_stats.py

index 20f8564..87d4080 100644 (file)
--- a/tensorflow/contrib/distributions/python/ops/moving_stats.py
+++ b/tensorflow/contrib/distributions/python/ops/moving_stats.py
@@ -47,9 +47,7 @@ def assign_moving_mean_variance(
    Note: `mean_var` is updated *after* `variance_var`, i.e., `variance_var` uses
    the lag-1 mean.
  
-  For derivation justification, see equation 143 of:
-    T. Finch, Feb 2009. "Incremental calculation of weighted mean and variance".
-    http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf
+  For derivation justification, see [Finch (2009; Eq. 143)][1].
  
    Args:
      mean_var: `float`-like `Variable` representing the exponentially weighted
@@ -72,6 +70,12 @@ def assign_moving_mean_variance(
      TypeError: if `mean_var` does not have float type `dtype`.
      TypeError: if `mean_var`, `variance_var`, `value`, `decay` have different
        `base_dtype`.
+
+  #### References
+
+  [1]: Tony Finch. Incremental calculation of weighted mean and variance.
+       _Technical Report_, 2009.
+       http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf
    """
    with ops.name_scope(name, "assign_moving_mean_variance",
                        [variance_var, mean_var, value, decay]):
@@ -183,9 +187,7 @@ def moving_mean_variance(value, decay, collections=None, name=None):
    Note: `mean_var` is updated *after* `variance_var`, i.e., `variance_var` uses
    the lag-`1` mean.
  
-  For derivation justification, see equation 143 of:
-    T. Finch, Feb 2009. "Incremental calculation of weighted mean and variance".
-    http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf
+  For derivation justification, see [Finch (2009; Eq. 143)][1].
  
    Unlike `assign_moving_mean_variance`, this function handles
    variable creation.
@@ -208,6 +210,12 @@ def moving_mean_variance(value, decay, collections=None, name=None):
    Raises:
      TypeError: if `value_var` does not have float type `dtype`.
      TypeError: if `value`, `decay` have different `base_dtype`.
+
+  #### References
+
+  [1]: Tony Finch. Incremental calculation of weighted mean and variance.
+       _Technical Report_, 2009.
+       http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf
    """
    if collections is None:
      collections = [ops.GraphKeys.GLOBAL_VARIABLES]
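
Note on the moving-statistics docstrings above: the update order ("`variance_var` uses the lag-1 mean") can be made concrete with a scalar sketch. It assumes the recurrence is `variance = decay * (variance + (1 - decay) * (value - mean)**2)` followed by `mean = decay * mean + (1 - decay) * value`. The function name is hypothetical, and plain floats replace `Variable`s.

```python
def update_moving_mean_variance(mean, variance, value, decay):
    """One step of an exponentially weighted mean/variance update (sketch)."""
    delta = value - mean  # uses the mean from before this step (lag-1 mean)
    variance = decay * (variance + (1.0 - decay) * delta * delta)
    mean = decay * mean + (1.0 - decay) * value  # mean is updated second
    return mean, variance


# Feed two observations through the update, starting from zero statistics.
mean, variance = 0.0, 0.0
history = []
for value in [2.0, 1.0]:
    mean, variance = update_moving_mean_variance(mean, variance, value, decay=0.5)
    history.append((mean, variance))
```

Swapping the two assignments would make the variance use the already-updated mean, which is a different recurrence from the one assumed here.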
diff --git a/tensorflow/contrib/distributions/python/ops/shape.py b/tensorflow/contrib/distributions/python/ops/shape.py

index 5fb6f0c..bac0b79 100644 (file)
--- a/tensorflow/contrib/distributions/python/ops/shape.py
+++ b/tensorflow/contrib/distributions/python/ops/shape.py
@@ -32,45 +32,50 @@ from tensorflow.python.ops.distributions import util as distribution_util
  class _DistributionShape(object):
    """Manage and manipulate `Distribution` shape.
  
-  Terminology:
-    Recall that a `Tensor` has:
-      - `shape`: size of `Tensor` dimensions,
-      - `ndims`: size of `shape`; number of `Tensor` dimensions,
-      - `dims`: indexes into `shape`; useful for transpose, reduce.
-
-    `Tensor`s sampled from a `Distribution` can be partitioned by `sample_dims`,
-    `batch_dims`, and `event_dims`. To understand the semantics of these
-    dimensions, consider when two of the three are fixed and the remaining
-    is varied:
-      - `sample_dims`: indexes independent draws from identical
-                       parameterizations of the `Distribution`.
-      - `batch_dims`:  indexes independent draws from non-identical
-                       parameterizations of the `Distribution`.
-      - `event_dims`:  indexes event coordinates from one sample.
-
-    The `sample`, `batch`, and `event` dimensions constitute the entirety of a
-    `Distribution` `Tensor`'s shape.
-
-    The dimensions are always in `sample`, `batch`, `event` order.
-
-  Purpose:
-    This class partitions `Tensor` notions of `shape`, `ndims`, and `dims` into
-    `Distribution` notions of `sample,` `batch,` and `event` dimensions. That
-    is, it computes any of:
+  #### Terminology
  
-    ```
-    sample_shape     batch_shape     event_shape
-    sample_dims      batch_dims      event_dims
-    sample_ndims     batch_ndims     event_ndims
-    ```
+  Recall that a `Tensor` has:
+    - `shape`: size of `Tensor` dimensions,
+    - `ndims`: size of `shape`; number of `Tensor` dimensions,
+    - `dims`: indexes into `shape`; useful for transpose, reduce.
+
+  `Tensor`s sampled from a `Distribution` can be partitioned by `sample_dims`,
+  `batch_dims`, and `event_dims`. To understand the semantics of these
+  dimensions, consider when two of the three are fixed and the remaining
+  is varied:
+    - `sample_dims`: indexes independent draws from identical
+                     parameterizations of the `Distribution`.
+    - `batch_dims`:  indexes independent draws from non-identical
+                     parameterizations of the `Distribution`.
+    - `event_dims`:  indexes event coordinates from one sample.
+
+  The `sample`, `batch`, and `event` dimensions constitute the entirety of a
+  `Distribution` `Tensor`'s shape.
+
+  The dimensions are always in `sample`, `batch`, `event` order.
+
+  #### Purpose
+
+  This class partitions `Tensor` notions of `shape`, `ndims`, and `dims` into
+  `Distribution` notions of `sample,` `batch,` and `event` dimensions. That
+  is, it computes any of:
+
+  ```
+  sample_shape     batch_shape     event_shape
+  sample_dims      batch_dims      event_dims
+  sample_ndims     batch_ndims     event_ndims
+  ```
  
-    for a given `Tensor`, e.g., the result of
-    `Distribution.sample(sample_shape=...)`.
+  for a given `Tensor`, e.g., the result of
+  `Distribution.sample(sample_shape=...)`.
  
-    For a given `Tensor`, this class computes the above table using minimal
-    information: `batch_ndims` and `event_ndims`.
+  For a given `Tensor`, this class computes the above table using minimal
+  information: `batch_ndims` and `event_ndims`.
+
+  #### Examples
+
+  We show examples of distribution shape semantics.
  
-  Examples of `Distribution` `shape` semantics:
      - Sample dimensions:
        Computing summary statistics, i.e., the average is a reduction over sample
        dimensions.
@@ -111,52 +116,54 @@ class _DistributionShape(object):
        tf.div(1., tf.reduce_prod(x, event_dims))
        ```
  
-  Examples using this class:
-    Write `S, B, E` for `sample_shape`, `batch_shape`, and `event_shape`.
-
-    ```python
-    # 150 iid samples from one multivariate Normal with two degrees of freedom.
-    mu = [0., 0]
-    sigma = [[1., 0],
-             [0,  1]]
-    mvn = MultivariateNormal(mu, sigma)
-    rand_mvn = mvn.sample(sample_shape=[3, 50])
-    shaper = DistributionShape(batch_ndims=0, event_ndims=1)
-    S, B, E = shaper.get_shape(rand_mvn)
-    # S = [3, 50]
-    # B = []
-    # E = [2]
-
-    # 12 iid samples from one Wishart with 2x2 events.
-    sigma = [[1., 0],
-             [2,  1]]
-    wishart = Wishart(df=5, scale=sigma)
-    rand_wishart = wishart.sample(sample_shape=[3, 4])
-    shaper = DistributionShape(batch_ndims=0, event_ndims=2)
-    S, B, E = shaper.get_shape(rand_wishart)
-    # S = [3, 4]
-    # B = []
-    # E = [2, 2]
-
-    # 100 iid samples from two, non-identical trivariate Normal distributions.
-    mu    = ...  # shape(2, 3)
-    sigma = ...  # shape(2, 3, 3)
-    X = MultivariateNormal(mu, sigma).sample(shape=[4, 25])
-    # S = [4, 25]
-    # B = [2]
-    # E = [3]
-    ```
-
-  Argument Validation:
-    When `validate_args=False`, checks that cannot be done during
-    graph construction are performed at graph execution. This may result in a
-    performance degradation because data must be switched from GPU to CPU.
-
-    For example, when `validate_args=False` and `event_ndims` is a
-    non-constant `Tensor`, it is checked to be a non-negative integer at graph
-    execution. (Same for `batch_ndims`). Constant `Tensor`s and non-`Tensor`
-    arguments are always checked for correctness since this can be done for
-    "free," i.e., during graph construction.
+  We show examples using this class.
+
+  Write `S, B, E` for `sample_shape`, `batch_shape`, and `event_shape`.
+
+  ```python
+  # 150 iid samples from one multivariate Normal with two degrees of freedom.
+  mu = [0., 0]
+  sigma = [[1., 0],
+           [0,  1]]
+  mvn = MultivariateNormal(mu, sigma)
+  rand_mvn = mvn.sample(sample_shape=[3, 50])
+  shaper = DistributionShape(batch_ndims=0, event_ndims=1)
+  S, B, E = shaper.get_shape(rand_mvn)
+  # S = [3, 50]
+  # B = []
+  # E = [2]
+
+  # 12 iid samples from one Wishart with 2x2 events.
+  sigma = [[1., 0],
+           [2,  1]]
+  wishart = Wishart(df=5, scale=sigma)
+  rand_wishart = wishart.sample(sample_shape=[3, 4])
+  shaper = DistributionShape(batch_ndims=0, event_ndims=2)
+  S, B, E = shaper.get_shape(rand_wishart)
+  # S = [3, 4]
+  # B = []
+  # E = [2, 2]
+
+  # 100 iid samples from two, non-identical trivariate Normal distributions.
+  mu    = ...  # shape(2, 3)
+  sigma = ...  # shape(2, 3, 3)
+  X = MultivariateNormal(mu, sigma).sample(shape=[4, 25])
+  # S = [4, 25]
+  # B = [2]
+  # E = [3]
+  ```
+
+  #### Argument Validation
+
+  When `validate_args=False`, checks that cannot be done during
+  graph construction are performed at graph execution. This may result in a
+  performance degradation because data must be switched from GPU to CPU.
+
+  For example, when `validate_args=False` and `event_ndims` is a
+  non-constant `Tensor`, it is checked to be a non-negative integer at graph
+  execution. (Same for `batch_ndims`). Constant `Tensor`s and non-`Tensor`
+  arguments are always checked for correctness since this can be done for
+  "free," i.e., during graph construction.
    """
  
    def __init__(self,
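
Note on the shape.py docstring above: the `sample`, `batch`, `event` partition it describes can be sketched in plain NumPy. This is only an illustration of the stated semantics (dimensions always in `sample`, `batch`, `event` order, recovered from `batch_ndims` and `event_ndims` alone). The helper `partition_shape` is a hypothetical name, not part of TensorFlow, and is not the `_DistributionShape` implementation.

```python
import numpy as np


def partition_shape(x, batch_ndims, event_ndims):
    """Split np.shape(x) into (sample_shape, batch_shape, event_shape).

    Hypothetical helper for illustration; assumes static (known) shapes.
    """
    shape = tuple(np.shape(x))
    # Whatever is not batch or event must be sample, and it comes first.
    sample_ndims = len(shape) - batch_ndims - event_ndims
    if sample_ndims < 0:
        raise ValueError("x has fewer dimensions than batch_ndims + event_ndims")
    split = sample_ndims + batch_ndims
    return shape[:sample_ndims], shape[sample_ndims:split], shape[split:]


# Stand-in for 100 draws from two non-identical trivariate Normals.
x = np.zeros([4, 25, 2, 3])
S, B, E = partition_shape(x, batch_ndims=1, event_ndims=1)

# Averaging is a reduction over the sample dimensions only.
sample_dims = tuple(range(len(S)))
mean = x.mean(axis=sample_dims)
```

Here `S`, `B`, `E` match the third docstring example, and `mean` keeps only the batch and event dimensions.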
diff --git a/tensorflow/contrib/distributions/python/ops/vector_diffeomixture.py b/tensorflow/contrib/distributions/python/ops/vector_diffeomixture.py

index 3208ecd..971d65c 100644 (file)
--- a/tensorflow/contrib/distributions/python/ops/vector_diffeomixture.py
+++ b/tensorflow/contrib/distributions/python/ops/vector_diffeomixture.py
@@ -248,11 +248,7 @@ class VectorDiffeomixture(distribution_lib.Distribution):
    The default quadrature scheme chooses `z_{N, n}` as `N` midpoints of
    the quantiles of `p(z)` (generalized quantiles if `K > 2`).
  
-  See [1] for more details.
-
-  [1]. "Quadrature Compound: An approximating family of distributions"
-       Joshua Dillon, Ian Langmore, arXiv preprints
-       https://arxiv.org/abs/1801.03080
+  See [Dillon and Langmore (2018)][1] for more details.
  
    #### About `Vector` distributions in TensorFlow.
  
@@ -313,6 +309,13 @@ class VectorDiffeomixture(distribution_lib.Distribution):
              is_positive_definite=True),
        ],
        validate_args=True)
+  ```
+
+  #### References
+
+  [1]: Joshua Dillon and Ian Langmore. Quadrature Compound: An approximating
+       family of distributions. _arXiv preprint arXiv:1801.03080_, 2018.
+       https://arxiv.org/abs/1801.03080
    """
  
    def __init__(self,
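
Note on the quadrature sentence above ("`N` midpoints of the quantiles of `p(z)`"): a minimal scalar sketch of one way to read it, assuming a standard Normal `p(z)` and evenly spaced interior probability levels. The function name `quantile_midpoints` and the choice of levels are assumptions made for illustration; the diff does not show how `VectorDiffeomixture` actually builds its grid.

```python
from statistics import NormalDist


def quantile_midpoints(n, dist=NormalDist()):
    """Return n grid points, each the midpoint of two adjacent quantiles.

    Hypothetical helper; the probability levels are an assumed choice.
    """
    # n + 1 interior levels, avoiding 0 and 1 where the quantile is infinite.
    levels = [(k + 1) / (n + 2) for k in range(n + 1)]
    edges = [dist.inv_cdf(p) for p in levels]
    # Adjacent pairs of the n + 1 edges give n midpoints.
    return [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]


grid = quantile_midpoints(4)
```

For a symmetric `p(z)` such as the standard Normal, the resulting grid is increasing and symmetric about zero.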
tensorflow/contrib/distributions/python/ops/autoregressive.py
tensorflow/contrib/distributions/python/ops/bijectors/affine.py
tensorflow/contrib/distributions/python/ops/bijectors/batch_normalization.py
tensorflow/contrib/distributions/python/ops/bijectors/cholesky_outer_product.py
tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py
tensorflow/contrib/distributions/python/ops/bijectors/real_nvp.py
tensorflow/contrib/distributions/python/ops/bijectors/square.py
tensorflow/contrib/distributions/python/ops/kumaraswamy.py
tensorflow/contrib/distributions/python/ops/moving_stats.py
tensorflow/contrib/distributions/python/ops/shape.py
tensorflow/contrib/distributions/python/ops/vector_diffeomixture.py