CMU 10-414/714: Deep Learning Systems (2020) - 02-03 Neural Networks
“Manual” Neural Networks / Backprop
Still reviewing ML fundamentals:
- From linear to nonlinear hypothesis classes
- Neural networks
- Backpropagation (i.e., computing gradients)
A hypothesis class is the family of functions used to fit the model.
Linear hypothesis classes vs. nonlinear hypothesis classes.
Models such as polynomial regression and neural networks are nonlinear hypothesis classes and can capture more complex relationships.
A simplified form of a two-layer neural network:
$$h_\theta(x) = W_2^T \sigma(W_1^T x)$$
where $W_1$ and $W_2$ are weight matrices and $\sigma$ is a nonlinear activation function such as ReLU or sigmoid.
nonlinear hypothesis classes
linear hypothesis classes: $h_\theta(x) = \theta^T x$
A linear classifier can fit some datasets, but when the classes are arranged as, say, concentric circles, no linear decision boundary separates them.
One idea: apply a linear classifier to some (potentially higher-dimensional) features of the data: $h_\theta(x) = \theta^T \phi(x), \theta \in R^{d\times k}, \phi: R^n \rightarrow R^d$
$\phi$ maps the $n$-dimensional input to a $d$-dimensional feature vector.
The hidden-layer activation functions (e.g., ReLU) can themselves be viewed as a nonlinear feature map; this is what lets a neural network capture complex nonlinear relationships in the data.
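As a small illustration (my own toy example, not from the lecture): two classes on concentric circles are not separable by any linear classifier in the raw coordinates $x \in R^2$, but a hand-designed feature map $\phi(x) = [x_1, x_2, x_1^2 + x_2^2]$ makes them separable by a simple threshold on the added feature.

```python
import numpy as np

# Hypothetical example: two classes on concentric circles of radius 1 and 3.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.where(np.arange(200) < 100, 1.0, 3.0)
X = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
y = (radii > 2).astype(float)          # label = which circle the point lies on

# No linear classifier theta^T x separates these classes in R^2, but the
# nonlinear feature map phi(x) = [x1, x2, x1^2 + x2^2] adds the squared radius,
# which a simple threshold separates perfectly.
def phi(X):
    return np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])

features = phi(X)
pred = (features[:, 2] > 5.0).astype(float)   # radius^2 is 1 or 9 here
print("accuracy with phi:", (pred == y).mean())  # 1.0
```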
How do we create features?
- manual engineering, the “old” way of doing machine learning
- In a way that itself is learned from data, the “new” way of doing ML
Traditional machine learning: manually engineer the features, e.g., floor area and number of rooms for house-price prediction.
Neural networks / deep learning
Neural network: a particular type of hypothesis class, consisting of multiple parameterized, differentiable functions (a.k.a. “layers”) composed together.
Deep network: synonym for “neural network”; since it composes together a lot of functions, “deep” is typically an appropriate qualifier.
Deep learning: essentially, machine learning that uses a neural network hypothesis class.
two layer neural network
$h_\theta(x) = W^T_2\sigma(W_1^Tx)$
where $\sigma: R \rightarrow R$ is a nonlinear function such as ReLU or sigmoid
$\theta = \{W_1, W_2\}$ are the parameters
batch matrix form:
$$ h_\theta(X) = \sigma(X W_1) W_2 $$
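A minimal NumPy sketch of this batch form (the sizes, the ReLU choice, and the random initialization are illustrative assumptions, not the course's reference code):

```python
import numpy as np

def relu(Z):
    return np.maximum(Z, 0)

# Illustrative sizes: m examples, n input features, d hidden units, k classes.
m, n, d, k = 32, 784, 100, 10
rng = np.random.default_rng(0)
X  = rng.normal(size=(m, n))
W1 = rng.normal(size=(n, d)) * 0.01
W2 = rng.normal(size=(d, k)) * 0.01

# h_theta(X) = sigma(X W1) W2   (batch matrix form)
logits = relu(X @ W1) @ W2      # shape (m, k)
```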
Why does adding a single nonlinearity let the hypothesis class represent so much more?
Mathematically: neural networks have been shown to be universal function approximators.
In theory, as long as the network is large enough (enough hidden units and layers), it can approximate any continuous function.
I studied a bit of numerical methods before; the functions fitted there were mostly linear, exponential, or piecewise/smooth ones, and I had not yet seen what nonlinear models can fit. Worth studying nonlinear fitting further.
Fully-connected deep networks
$L$-layer neural network – a.k.a. “multi-layer perceptron” (MLP)
$$Z_{i+1} = \sigma_i(Z_i W_i), \quad i = 1, \dots, L, \qquad Z_1 = X$$
$$h_\theta(X) = Z_{L+1}$$
Each layer's input is a nonlinear function applied to (the previous layer's output times a weight matrix).
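A sketch of this forward iteration, caching every $Z_i$ so they can be reused later in backprop (sizes are placeholders; ReLU hidden activations and a linear last layer are my assumptions):

```python
import numpy as np

def relu(Z):
    return np.maximum(Z, 0)

def mlp_forward(X, weights):
    """Z_{i+1} = sigma_i(Z_i W_i), with Z_1 = X; returns all cached Z_i."""
    Zs = [X]
    for i, W in enumerate(weights):
        pre = Zs[-1] @ W
        # use the identity for the last layer so the output is raw logits
        Zs.append(pre if i == len(weights) - 1 else relu(pre))
    return Zs          # Zs[-1] is h_theta(X) = Z_{L+1}

# Example: a 3-layer MLP with hidden sizes 100 and 50 (illustrative)
rng = np.random.default_rng(0)
sizes = [784, 100, 50, 10]
weights = [rng.normal(size=(a, b)) * 0.01 for a, b in zip(sizes, sizes[1:])]
Zs = mlp_forward(rng.normal(size=(32, 784)), weights)
```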
why deep networks?
work like the brain?
parity
The parity function can't be learned (efficiently by shallow networks)?
empirically it seems like they work better for a fixed parameter count
A multi-layer / deep structure distributes the parameters more evenly across layers.
Backpropagation
neural networks:
- Hypothesis Class: MLP
- Loss function: cross-entropy loss
- Optimization procedure: SGD
$$\min_\theta \frac{1}{m} \sum_{i=1}^m \ell_{ce}(h_\theta(x_i), y_i) $$
I know what these components are, but how do they actually work together?
The gradient(s) of a two-layer network
$$ \nabla_{W_2} \ell_{ce}(\sigma(XW_1)W_2, y) = \sigma(XW_1)^T (S - I_y) $$
$$ \nabla_{W_1} \ell_{ce}(\sigma(XW_1)W_2, y) = X^T \left[ (S - I_y) W_2^T \circ \sigma'(XW_1) \right] $$
where $S = \operatorname{softmax}(\sigma(XW_1)W_2)$ (row-wise) and $I_y$ is the matrix of one-hot labels.
The partial derivatives are computed with the chain rule.
A partial derivative is the rate of change of a multivariate function with respect to one variable while all other variables are held fixed.
The gradient is the vector of all partial derivatives of a multivariate function; it points in the direction of steepest increase, and its negative points in the direction of steepest decrease.
For a multivariate function, starting from an initial point $(x_0, y_0)$ we compute the gradient and update, e.g. $x_1 = x_0 - \eta \cdot \frac{\partial f}{\partial x}$, where $\eta$ is the learning rate. Repeating this until the gradient is small enough drives down the function value, i.e., the loss.
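Putting the pieces together, a hedged sketch of one SGD step on a minibatch of the two-layer network, using the gradient formulas above (softmax probabilities, ReLU, one-hot labels, and averaging over the batch size $m$ are the usual conventions; the function names are mine, not the course's reference code):

```python
import numpy as np

def relu(Z):
    return np.maximum(Z, 0)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)       # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def sgd_step(X, y, W1, W2, alpha=0.1):
    """One SGD step for h(X) = relu(X W1) W2 with cross-entropy loss.
    X: (m, n), y: (m,) integer labels."""
    m = X.shape[0]
    Z1 = relu(X @ W1)                           # hidden activations, (m, d)
    S = softmax(Z1 @ W2)                        # predicted probabilities, (m, k)
    Iy = np.zeros_like(S)
    Iy[np.arange(m), y] = 1                     # one-hot labels

    G2 = S - Iy                                 # dl/d(logits)
    grad_W2 = Z1.T @ G2 / m                     # sigma(X W1)^T (S - I_y), averaged
    grad_W1 = X.T @ ((G2 @ W2.T) * (Z1 > 0)) / m   # (Z1 > 0) is relu'(X W1)

    W1 -= alpha * grad_W1
    W2 -= alpha * grad_W2
    return W1, W2
```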
Backpropagation “in general”
consider our fully-connected network
$$ Z_{i+1} = \sigma_i(Z_i W_i), \quad i = 1, \dots, L $$
$$ \frac{\partial \ell}{\partial W_i} = \underbrace{\frac{\partial \ell}{\partial Z_{L+1}} \cdot \frac{\partial Z_{L+1}}{\partial Z_L} \cdot \frac{\partial Z_L}{\partial Z_{L-1}} \cdots \frac{\partial Z_{i+2}}{\partial Z_{i+1}}}_{G_{i+1} \,=\, \partial \ell / \partial Z_{i+1}} \cdot \frac{\partial Z_{i+1}}{\partial W_i} $$
Then we have a simple ”backward” iteration to compute the $G_i$’s
$$ G_i = G_{i+1} \cdot \frac{\partial Z_{i+1}}{\partial Z_i} $$
Computing the real gradients
$$ \nabla_{W_i} \ell = Z_i^T \left( G_{i+1} \circ \sigma_i'(Z_i W_i) \right) $$
and, in the same matrix form, the backward iteration itself is $G_i = \left( G_{i+1} \circ \sigma_i'(Z_i W_i) \right) W_i^T$.
This chain-rule matrix computation is the core idea of backpropagation: using the chain rule and matrix operations, the error signal is propagated backward layer by layer.
The $Z_i$ here can be reused, since they were already computed in the forward pass.
Backpropagation: Forward and backward passes
Why can the forward and backward computations share work like this?
- During the forward pass we have already cached each layer's output $Z_i$ and the activation derivative $\sigma'(Z_i W_i)$.
- During the backward pass these cached values are plugged directly into the gradient computation, avoiding recomputation of intermediate values (see the sketch below).
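Continuing the forward sketch above, a possible backward pass that reuses the cached $Z_i$ to run $G_i = (G_{i+1} \circ \sigma_i'(Z_i W_i)) W_i^T$ and accumulate $\nabla_{W_i}\ell$ (cross-entropy at the output and ReLU hidden layers are assumed; this is a sketch, not the course's reference implementation):

```python
import numpy as np

def mlp_backward(Zs, weights, y):
    """Backward pass for the MLP forward sketch above.
    Zs: list [Z_1 = X, ..., Z_{L+1} = logits] cached in the forward pass.
    y:  (m,) integer labels. Returns the gradient for each W_i."""
    m = Zs[0].shape[0]
    logits = Zs[-1]
    S = np.exp(logits - logits.max(axis=1, keepdims=True))
    S /= S.sum(axis=1, keepdims=True)            # softmax probabilities
    Iy = np.zeros_like(S)
    Iy[np.arange(m), y] = 1
    G = (S - Iy) / m                             # G_{L+1} = dl/dZ_{L+1}, batch-averaged

    grads = [None] * len(weights)
    for i in reversed(range(len(weights))):
        Zi = Zs[i]
        if i == len(weights) - 1:
            local = G                            # last layer was linear (sigma' = 1)
        else:
            local = G * (Zs[i + 1] > 0)          # G_{i+1} ∘ relu'(Z_i W_i)
        grads[i] = Zi.T @ local                  # grad_{W_i} = Z_i^T (G_{i+1} ∘ sigma'(Z_i W_i))
        G = local @ weights[i].T                 # G_i = (G_{i+1} ∘ sigma'(Z_i W_i)) W_i^T
    return grads
```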
What is really happening with the backward iteration?
$$ \frac{\partial \ell}{\partial W_i} = \underbrace{\frac{\partial \ell}{\partial Z_{L+1}} \cdot \frac{\partial Z_{L+1}}{\partial Z_L} \cdots \frac{\partial Z_{i+2}}{\partial Z_{i+1}}}_{G_{i+1}} \cdot \frac{\partial Z_{i+1}}{\partial W_i} $$
Each layer needs to be able to multiply the “incoming backward” gradient $G_{i+1}$ by its derivatives $\frac{\partial Z_{i+1}}{\partial W_i}$, an operation called the “vector-Jacobian product”.
automatic differentiation
Each layer's gradient $G_i$ is computed recursively from the next layer's gradient $G_{i+1}$.
vector-Jacobian product
Backpropagation generalizes to arbitrary computation graphs; this is the foundation of automatic differentiation.
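One way to see the generalization (a sketch of the idea, not the actual API of any autodiff library): each operation only has to expose its forward computation plus a vector-Jacobian product that multiplies the incoming backward gradient by that operation's local derivatives.

```python
import numpy as np

def relu_forward(Z):
    """Forward returns the output plus a closure computing the VJP."""
    out = np.maximum(Z, 0)
    def vjp(G_out):                  # multiply incoming gradient by d(out)/dZ
        return G_out * (Z > 0)
    return out, vjp

def matmul_forward(Z, W):
    out = Z @ W
    def vjp(G_out):                  # VJPs w.r.t. both inputs of the matmul
        return G_out @ W.T, Z.T @ G_out
    return out, vjp

# Chaining these closures backward through the graph reproduces one layer of
# the backward iteration:
#   G_i = (G_{i+1} ∘ sigma'(Z_i W_i)) W_i^T,  grad_{W_i} = Z_i^T (G_{i+1} ∘ sigma'(Z_i W_i))
```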