From: Ilqar Ramazanli
Date: Thu, 9 Sep 2021 14:03:49 +0000 (-0700)
Subject: [doc][hackathon] To add AdamW Optimizer to the documentation (#63252)
X-Git-Tag: accepted/tizen/8.0/unified/20231005.095509~345
X-Git-Url: http://review.tizen.org/git/?a=commitdiff_plain;h=5b21f172a4ecba1712ae5338e2c8095af48e19a3;p=platform%2Fupstream%2Fpytorch.git

[doc][hackathon] To add AdamW Optimizer to the documentation (#63252)

Summary:
It has been discussed before that adding descriptions of the optimization algorithms to the PyTorch Core documentation may result in a nice optimization research tutorial. In the following tracking issue we list all the necessary algorithms and links to the originally published papers: https://github.com/pytorch/pytorch/issues/63236.

In this PR we add a description of the AdamW algorithm to the documentation. For more details, we refer to the paper https://arxiv.org/abs/1711.05101

AdamWalgo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63252

Reviewed By: datumbox

Differential Revision: D30839685

Pulled By: iramazanli

fbshipit-source-id: 1a426c874ab86408d286a34f41aefcf5b21167c0
---

diff --git a/torch/optim/adamw.py b/torch/optim/adamw.py
index d2ea738..9d31999 100644
--- a/torch/optim/adamw.py
+++ b/torch/optim/adamw.py
@@ -6,8 +6,37 @@ from .optimizer import Optimizer
 class AdamW(Optimizer):
     r"""Implements AdamW algorithm.
 
-    The original Adam algorithm was proposed in `Adam: A Method for Stochastic Optimization`_.
-    The AdamW variant was proposed in `Decoupled Weight Decay Regularization`_.
+    .. math::
+       \begin{aligned}
+            &\rule{110mm}{0.4pt} \\
+            &\textbf{input} : \gamma \text{(lr)}, \: \beta_1, \beta_2
+                \text{(betas)}, \: \theta_0 \text{(params)}, \: f(\theta) \text{(objective)},
+                \: \epsilon \text{ (epsilon)} \\
+            &\hspace{13mm} \lambda \text{(weight decay)}, \: amsgrad \\
+            &\textbf{initialize} : m_0 \leftarrow 0 \text{ (first moment)}, v_0 \leftarrow 0
+                \text{ (second moment)}, \: \widehat{v_0}^{max}\leftarrow 0 \\[-1.ex]
+            &\rule{110mm}{0.4pt} \\
+            &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\
+            &\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\
+            &\hspace{5mm}\theta_t \leftarrow \theta_{t-1} - \gamma \lambda \theta_{t-1} \\
+            &\hspace{5mm}m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
+            &\hspace{5mm}v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g^2_t \\
+            &\hspace{5mm}\widehat{m_t} \leftarrow m_t/\big(1-\beta_1^t \big) \\
+            &\hspace{5mm}\widehat{v_t} \leftarrow v_t/\big(1-\beta_2^t \big) \\
+            &\hspace{5mm}\textbf{if} \: amsgrad \\
+            &\hspace{10mm}\widehat{v_t}^{max} \leftarrow \mathrm{max}(\widehat{v_t}^{max},
+                \widehat{v_t}) \\
+            &\hspace{10mm}\theta_t \leftarrow \theta_{t-1} - \gamma \widehat{m_t}/
+                \big(\sqrt{\widehat{v_t}^{max}} + \epsilon \big) \\
+            &\hspace{5mm}\textbf{else} \\
+            &\hspace{10mm}\theta_t \leftarrow \theta_{t-1} - \gamma \widehat{m_t}/
+                \big(\sqrt{\widehat{v_t}} + \epsilon \big) \\
+            &\rule{110mm}{0.4pt} \\[-1.ex]
+            &\bf{return} \: \theta_t \\[-1.ex]
+            &\rule{110mm}{0.4pt} \\[-1.ex]
+       \end{aligned}
+
+    For further details regarding the algorithm we refer to `Decoupled Weight Decay Regularization`_.
 
     Args:
         params (iterable): iterable of parameters to optimize or dicts defining
@@ -22,8 +51,6 @@ class AdamW(Optimizer):
             algorithm from the paper `On the Convergence of Adam and Beyond`_
             (default: False)
 
-    .. _Adam\: A Method for Stochastic Optimization:
-        https://arxiv.org/abs/1412.6980
     .. _Decoupled Weight Decay Regularization:
         https://arxiv.org/abs/1711.05101
     .. _On the Convergence of Adam and Beyond:
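
As a reading aid for the pseudocode added above, here is a minimal NumPy sketch of a single AdamW step, including the optional AMSGrad branch. The function name `adamw_step`, the explicit `state` dict, and the toy objective are illustrative assumptions and not part of the `torch.optim.AdamW` API; the default hyperparameters follow the optimizer's documented defaults.

```python
import numpy as np

def adamw_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-2, amsgrad=False):
    """One AdamW update on a NumPy array, mirroring the pseudocode above."""
    beta1, beta2 = betas
    state["step"] += 1
    t = state["step"]

    # Decoupled weight decay: shrink the parameter directly instead of
    # adding lambda * theta to the gradient (the key difference from Adam).
    param = param - lr * weight_decay * param

    # Exponential moving averages of the gradient and its elementwise square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad**2

    # Bias-corrected first and second moment estimates.
    m_hat = state["m"] / (1 - beta1**t)
    v_hat = state["v"] / (1 - beta2**t)

    if amsgrad:
        # Running maximum of v_hat, per "On the Convergence of Adam and Beyond".
        state["v_max"] = np.maximum(state["v_max"], v_hat)
        denom = np.sqrt(state["v_max"]) + eps
    else:
        denom = np.sqrt(v_hat) + eps

    return param - lr * m_hat / denom, state

# Toy example: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta.
theta = np.array([1.0, -2.0, 3.0])
state = {"step": 0,
         "m": np.zeros_like(theta),
         "v": np.zeros_like(theta),
         "v_max": np.zeros_like(theta)}
for _ in range(100):
    theta, state = adamw_step(theta, grad=theta, state=state)
```

In real training code one would instead construct the optimizer directly, e.g. `torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)`, and call `optimizer.step()` after `loss.backward()`.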