From df614371c7e5cc6788863606841f2a97834aa616 Mon Sep 17 00:00:00 2001
From: Xiang Gao
Date: Sat, 15 Dec 2018 00:07:37 -0800
Subject: [PATCH] Mention Jacobian-vector product in the doc of torch.autograd (#15197)

Summary:
A friend of mine is learning deep learning and PyTorch, and he is confused by the following piece of code from the tutorial https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#gradients :

```python
x = torch.randn(3, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

print(y)

gradients = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(gradients)

print(x.grad)
```

He doesn't know where the following line comes from:

```python
gradients = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
```

What are we computing? Why don't we compute "the gradient of `y` w.r.t. `x`"?

In the tutorial, it only says

> You can do many crazy things with autograd!

which does not explain anything. It seems to be hard for some deep learning beginners to understand why we would ever call backward with an external gradient fed in, and what doing so means.

So I modified the tutorial in https://github.com/pytorch/tutorials/pull/385 and, correspondingly, the docstrings in this PR, explaining the Jacobian-vector product. Please review this PR and https://github.com/pytorch/tutorials/pull/385 together.
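For reference, here is a minimal sketch (not part of the patch) of what that external gradient does. It drops the `while` loop from the tutorial snippet so that `y = 2 * x` and the Jacobian is simply `2 * I`; the name `v` is introduced here only to stand in for the tutorial's `gradients` tensor:

```python
import torch

# A toy function: y = 2 * x, so the Jacobian of y w.r.t. x is 2 * I.
x = torch.randn(3, requires_grad=True)
y = x * 2

# The "vector" in the Jacobian-vector product (the tutorial calls it `gradients`).
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)

# backward(v) does not compute "the gradient of y w.r.t. x" (a full Jacobian);
# it computes the Jacobian-vector product J^T @ v and accumulates it into x.grad.
y.backward(v)

print(x.grad)                         # equals 2 * v here
print(torch.allclose(x.grad, 2 * v))  # True
```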
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15197

Differential Revision: D13476513

Pulled By: soumith

fbshipit-source-id: bee62282e9ab72403247384e4063bcdf59d40c3c
---
 torch/autograd/__init__.py | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/torch/autograd/__init__.py b/torch/autograd/__init__.py
index 9b961c1..0fe63a8 100644
--- a/torch/autograd/__init__.py
+++ b/torch/autograd/__init__.py
@@ -40,10 +40,12 @@ def backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False,
 
     The graph is differentiated using the chain rule. If any of ``tensors``
     are non-scalar (i.e. their data has more than one element) and require
-    gradient, the function additionally requires specifying ``grad_tensors``.
-    It should be a sequence of matching length, that contains gradient of
-    the differentiated function w.r.t. corresponding tensors (``None`` is an
-    acceptable value for all tensors that don't need gradient tensors).
+    gradient, then the Jacobian-vector product would be computed, in this
+    case the function additionally requires specifying ``grad_tensors``.
+    It should be a sequence of matching length, that contains the "vector"
+    in the Jacobian-vector product, usually the gradient of the differentiated
+    function w.r.t. corresponding tensors (``None`` is an acceptable value for
+    all tensors that don't need gradient tensors).
 
     This function accumulates gradients in the leaves - you might need to zero
     them before calling it.
@@ -51,10 +53,11 @@
     Arguments:
         tensors (sequence of Tensor): Tensors of which the derivative will be
             computed.
-        grad_tensors (sequence of (Tensor or None)): Gradients w.r.t.
-            each element of corresponding tensors. None values can be specified for
-            scalar Tensors or ones that don't require grad. If a None value would
-            be acceptable for all grad_tensors, then this argument is optional.
+        grad_tensors (sequence of (Tensor or None)): The "vector" in the Jacobian-vector
+            product, usually gradients w.r.t. each element of corresponding tensors.
+            None values can be specified for scalar Tensors or ones that don't require
+            grad. If a None value would be acceptable for all grad_tensors, then this
+            argument is optional.
         retain_graph (bool, optional): If ``False``, the graph used to compute the grad
             will be freed. Note that in nearly all cases setting this option to ``True``
             is not needed and often can be worked around in a much more efficient
@@ -95,8 +98,9 @@ def grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=Fal
     r"""Computes and returns the sum of gradients of outputs w.r.t. the inputs.
 
     ``grad_outputs`` should be a sequence of length matching ``output``
-    containing the pre-computed gradients w.r.t. each of the outputs. If an
-    output doesn't require_grad, then the gradient can be ``None``).
+    containing the "vector" in Jacobian-vector product, usually the pre-computed
+    gradients w.r.t. each of the outputs. If an output doesn't require_grad,
+    then the gradient can be ``None``).
 
     If ``only_inputs`` is ``True``, the function will only return a list of gradients w.r.t
     the specified inputs. If it's ``False``, then gradient w.r.t. all remaining
@@ -107,10 +111,10 @@ def grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=Fal
         outputs (sequence of Tensor): outputs of the differentiated function.
         inputs (sequence of Tensor): Inputs w.r.t. which the gradient will be
             returned (and not accumulated into ``.grad``).
-        grad_outputs (sequence of Tensor): Gradients w.r.t. each output.
-            None values can be specified for scalar Tensors or ones that don't require
-            grad. If a None value would be acceptable for all grad_tensors, then this
-            argument is optional. Default: None.
+        grad_outputs (sequence of Tensor): The "vector" in the Jacobian-vector product.
+            Usually gradients w.r.t. each output. None values can be specified for scalar
+            Tensors or ones that don't require grad. If a None value would be acceptable
+            for all grad_tensors, then this argument is optional. Default: None.
         retain_graph (bool, optional): If ``False``, the graph used to compute the grad
             will be freed. Note that in nearly all cases setting this option to ``True``
             is not needed and often can be worked around in a much more efficient
-- 
2.7.4
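As a closing aside (illustrative only, not part of the patch), the same "vector" appears as ``grad_outputs`` in ``torch.autograd.grad``, which returns the Jacobian-vector product instead of accumulating it into ``.grad``. A minimal sketch using the same toy ``y = 2 * x`` setup as above:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2                             # Jacobian of y w.r.t. x is 2 * I

# The "vector" in the Jacobian-vector product, passed as grad_outputs.
v = torch.tensor([0.1, 1.0, 0.0001])

# torch.autograd.grad returns J^T @ v directly instead of writing it to x.grad.
(result,) = torch.autograd.grad(outputs=y, inputs=x, grad_outputs=v)

print(torch.allclose(result, 2 * v))  # True
```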