class SGD(Optimizer):
r"""Implements stochastic gradient descent (optionally with momentum).
+ .. math::
+ \begin{aligned}
+ &\rule{110mm}{0.4pt} \\
+ &\textbf{input} : \gamma \text{ (lr)}, \: \theta_0 \text{ (params)}, \: f(\theta)
+ \text{ (objective)}, \: \lambda \text{ (weight decay)}, \\
+ &\hspace{13mm} \:\mu \text{ (momentum)}, \:\tau \text{ (dampening)}, \:\textit{nesterov}\\[-1.ex]
+ &\rule{110mm}{0.4pt} \\
+ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\
+ &\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\
+ &\hspace{5mm}\textbf{if} \: \lambda \neq 0 \\
+ &\hspace{10mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\
+ &\hspace{5mm}\textbf{if} \: \mu \neq 0 \\
+ &\hspace{10mm}\textbf{if} \: t > 1 \\
+ &\hspace{15mm} \textbf{b}_t \leftarrow \mu \textbf{b}_{t-1} + (1-\tau) g_t \\
+ &\hspace{10mm}\textbf{else} \\
+ &\hspace{15mm} \textbf{b}_t \leftarrow g_t \\
+ &\hspace{10mm}\textbf{if} \: \textit{nesterov} \\
+ &\hspace{15mm} g_t \leftarrow g_t + \mu \textbf{b}_t \\
+ &\hspace{10mm}\textbf{else} \\[-1.ex]
+ &\hspace{15mm} g_t \leftarrow \textbf{b}_t \\
+ &\hspace{5mm}\theta_t \leftarrow \theta_{t-1} - \gamma g_t \\[-1.ex]
+ &\rule{110mm}{0.4pt} \\[-1.ex]
+ &\textbf{return} \: \theta_t \\[-1.ex]
+ &\rule{110mm}{0.4pt} \\[-1.ex]
+ \end{aligned}
+
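+ As a rough illustration, one step of the loop above can be sketched in plain
+ Python on scalar values (a minimal sketch only: the actual implementation
+ operates on tensors, and the helper name ``sgd_step`` is hypothetical)::
+
+     def sgd_step(params, grads, bufs, lr, weight_decay=0.0, momentum=0.0,
+                  dampening=0.0, nesterov=False):
+         # Updates ``params`` and ``bufs`` in place, one entry per parameter.
+         for i, (theta, g) in enumerate(zip(params, grads)):
+             if weight_decay != 0:
+                 g = g + weight_decay * theta  # g_t <- g_t + lambda * theta_{t-1}
+             if momentum != 0:
+                 if bufs[i] is None:  # t = 1: buffer starts at the raw gradient
+                     bufs[i] = g
+                 else:  # b_t <- mu * b_{t-1} + (1 - tau) * g_t
+                     bufs[i] = momentum * bufs[i] + (1 - dampening) * g
+                 if nesterov:
+                     g = g + momentum * bufs[i]  # g_t <- g_t + mu * b_t
+                 else:
+                     g = bufs[i]
+             params[i] = theta - lr * g  # theta_t <- theta_{t-1} - gamma * g_t
+
+ A typical invocation of the optimizer itself, assuming ``model``, ``loss_fn``,
+ ``input`` and ``target`` are already defined::
+
+     >>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
+     >>> optimizer.zero_grad()
+     >>> loss_fn(model(input), target).backward()
+     >>> optimizer.step()
+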
Nesterov momentum is based on the formula from
`On the importance of initialization and momentum in deep learning`__.
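+
+ For comparison, the update in that paper (shown here as a sketch, with the
+ notation adapted to the symbols above) folds the learning rate into the
+ momentum buffer:
+
+ .. math::
+     \begin{aligned}
+         b_t &\leftarrow \mu b_{t-1} - \gamma \nabla_{\theta} f(\theta_{t-1} + \mu b_{t-1}), \\
+         \theta_t &\leftarrow \theta_{t-1} + b_t,
+     \end{aligned}
+
+ whereas the pseudocode above keeps the buffer in gradient units and applies
+ :math:`\gamma` once, in the final parameter update.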