An Overview of Non-NN ML Math

01/07/2022

For anyone interested in the math behind linear classifiers and other simple topics, here's a quick writeup. I will also post some math for NN stuff. (Please excuse the weird formatting; apparently Webnode doesn't support LaTeX, so I had to convert it to HTML.)

ML stuff

Intro

What is ML?

3 types:

  • Supervised: access to labeled examples of the correct behaviour
    • ex: handwriting reader
  • Reinforcement: a learning agent interacts with the world and learns to maximize a reward signal
    • ex: mario player, DaVinci robot sim (ROP)
  • Unsupervised: no labeled examples - instead, looking for "interesting" patterns in the data
    • ex: customer segmentation

Preliminaries and Nearest Neighbour Methods

Machine learning algorithms need to handle lots of data, so we represent each input as a vector in $\R^d$.

Representation: mapping to another space that is easy to manipulate

(vectors are great since we can do linear algebra)

Supervised Learning Setup

Our training set consists of a collection of pairs of an input vector $\bold{x}\in\R^d$ and its corresponding target, or label, $t$.

2 types:

  • Regression

    $t$ is a real number

  • Classification

    $t$ is an element of a discrete set $\{1, ..., C\}$

These days, $t$ is often a highly structured object

The training set will look like this: $\{(\bold{x}^{(1)}, t^{(1)}), ..., (\bold{x}^{(N)}, t^{(N)})\}$

These superscripts have nothing to do with exponentials!

Nearest Neighbour

💡
The idea: find the nearest input vector to $\bold{x}$ in the training set and copy its label.

  • Closest (in Euclidean distance):
    $||\bold{x}^{(a)}-\bold{x}^{(b)}||_{2}=\sqrt{\sum^{d}_{j=1}(x_{j}^{(a)}-x_{j}^{(b)})^2}$

The Algorithm:

  1. Given input $\bold{x}$ and training set $\{(\bold{x}^{(1)}, t^{(1)}), ..., (\bold{x}^{(N)}, t^{(N)})\}$
  2. Return $y=t^{(*)}$, where $(\bold{x}^{(*)}, t^{(*)})$ is the training pair whose $\bold{x}^{(*)}$ minimizes the distance $||\bold{x}-\bold{x}^{(*)}||_{2}$
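
Here's a minimal sketch of 1-NN in NumPy (the array names `X_train` and `t_train` are mine, holding the training inputs and labels):

```python
import numpy as np

def nearest_neighbour(x, X_train, t_train):
    # Euclidean distance from x to every training point
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Copy the label of the closest training example
    return t_train[np.argmin(dists)]
```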

Estimating Model Accuracy

  • In essence:
    $\text{Accuracy}=\frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$
⚠️
BUT! we cannot use the training set!
  • We need to simulate new data. Using the data the model was trained on is not a realistic representation of unseen data.
  • We will always get perfect performance for 1-Nearest Neighbour.

We set aside a chunk of our data for testing, called the test set.

Problems

  • NN is sensitive to noise or mis-labeled data.
    • Solution...

k-Nearest Neighbours

Have the $k$ closest data points "vote".

The Algorithm:

  1. Given input $\bold{x}$ and training set $\{(\bold{x}^{(1)}, t^{(1)}), ..., (\bold{x}^{(N)}, t^{(N)})\}$
  2. Find the $k$ examples $(\bold{x}^{(*)}, t^{(*)})$ closest to $\bold{x}$
  3. Output the majority class among them:
    $y=\argmax\limits_{t^{(z)}}\sum^{k}_{i=1}\mathbb{I}\{t^{(z)}=t^{(i)}\}$
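
A sketch of the voting version, assuming integer class labels $0,...,C-1$ (array names are again mine):

```python
import numpy as np

def knn_predict(x, X_train, t_train, k=5):
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]   # indices of the k closest points
    # Majority vote: count each label among the k neighbours
    return np.bincount(t_train[nearest]).argmax()
```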

How do we choose k?

  • Small $k$:
    • Good at capturing fine-grained patterns
    • May overfit: be sensitive to random idiosyncrasies in the training data.
  • Large $k$:
    • Makes stable predictions by averaging over lots of examples
    • May underfit: fail to capture important regularities.
  • Balancing $k$:
    • The optimal choice of $k$ depends on the number of data points $n$.
    • Rule of thumb: $k=n^{\frac{2}{2+d}}$.

Choice of $k$: Generalization

$k$ is a hyperparameter: something we can't fit as part of the learning algorithm itself.

We need to tune it: vary $k$ and measure the generalization error using a validation set.

Not the test set: the test set is supposed to simulate unseen data, so we can't use it to tune our model in any way.

Training, test, and validation sets

Partition the data into three subsets.

  • Training set: used to train the model
  • Validation set: used to tune hyperparameters
  • Test set: used to evaluate final performance once, after the hyperparameters are chosen.

Pitfalls

Normalization

  • NN can be sensitive to the ranges of different features
  • Therefore, we normalize each dimension to be zero mean and unit variance:
    $\tilde{x}_j=\frac{x_j-\mu_j}{\sigma_j}$
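
In code this might look like the sketch below, assuming `X_train` and `X_test` are NumPy arrays; note that $\mu_j$ and $\sigma_j$ should be computed on the training set only, then reused for the other splits:

```python
# Per-feature mean and standard deviation, from the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the same transformation to every split
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma
```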

Computational Cost

  • Computations at training time: 0
  • Computations at test time per query:
    • Calculate $D$-dimensional Euclidean distances to $N$ data points: $\mathcal{O}(ND)$
    • Sort the distances: $\mathcal{O}(N\log N)$

    This must be done for each query! Very expensive!

  • We need to store the entire dataset in memory.

Modular Approach to ML Algorithm Design

  • Choose a model describing the relationships between variables of interest
  • Define a loss function quantifying how bad the fit to the data is
  • Choose a regularizer saying how much we prefer different candidate models (or explanations of data)
  • Fit the model that minimizes the loss function and satisfies the constraint/penalty imposed by the regularizer, possibly using an optimization algorithm.

Modular Approach to Supervised Learning

  • Training data: $\mathcal{D}=\{(\bold{x}^{(i)}, t^{(i)}) \text{ for } i=1,2,...,N\}$
  • Objective: learn the function $f$ such that:
    $t\approx y = f(x)$

    Here, $t$ is the ground truth and $y$ is the prediction.

  • A model is a set of assumptions (or restrictions) about the structure of $f$ in $t\approx y = f(x)$.
    • The model (or architecture) restricts the hypothesis space: the possible functions $f$ that we can learn.

Linear Regression Model

  • $f$ is restricted to linear functions of the input $\bold{x}=(x_1,...,x_D)$:
    $y=f(\bold{x})=\sum_{j}w_jx_j+b$
    • $y$ is the prediction
    • $\bold{w}$ is the weight vector
    • $b$ is the bias
      • $\bold{w}$ and $b$ together are the parameters

We want to choose good parameter values so that our prediction is close to the target: $y\approx t$.

  • We use the loss function $\mathcal{L}$ to quantify good or bad values of our parameters:
    $\argmin_{(\bold{w},b)}\sum^{N}_{i=1}\mathcal{L}(y^{(i)},t^{(i)})$

This means $y^{(i)}=\bold{w}^\top\bold{x}^{(i)}+b$ should be "close to" $t^{(i)}$ for all $i$.

Loss Function: Squared Error Loss

$\mathcal{L}(y,t)=\frac{1}{2}(y-t)^2$

$y-t$ is the residual, and we want to make its magnitude small.

Loss Function vs Cost Function

  • The cost function is the loss function averaged over all training examples:
    $\mathcal{J}(\bold{w},b)=\frac{1}{2N}\sum^{N}_{i=1}(y^{(i)}-t^{(i)})^2=\frac{1}{2N}\sum^{N}_{i=1}(\bold{w}^\top \bold{x}^{(i)}+b-t^{(i)})^2$

What now??

Finding the parameters

  • We find $(\bold{w}, b)$ by minimizing the summed loss over all the data points:
    $\argmin_{(\bold{w}, b)}\sum^{N}_{i=1}\mathcal{L}(y^{(i)},t^{(i)})$

Let's vectorize the notation.

Vectorizing Linear Regression

  • Organize all training examples into a design matrix $\bold X$ ($N\times D$) with one row per example and one column per feature:
    $\bold{X}=\begin{pmatrix}\bold{x}^{(1)\top} \\ \bold{x}^{(2)\top}\\ \vdots \\ \bold{x}^{(N)\top}\end{pmatrix}=\begin{pmatrix}x^{(1)}_{1} & x^{(1)}_{2} & \cdots & x^{(1)}_{D}\\ x^{(2)}_{1} & x^{(2)}_{2} & \cdots & x^{(2)}_{D}\\ \vdots & \vdots & \ddots & \vdots \\ x^{(N)}_{1} & x^{(N)}_{2} & \cdots & x^{(N)}_{D}\end{pmatrix}$
    $\bold{t}=\begin{pmatrix}t^{(1)}\\ \vdots \\ t^{(N)}\end{pmatrix}$
  • In vectorized form we get:
    $\bold{Xw}+b\bold{1}=\begin{pmatrix}\bold{w}^\top\bold{x}^{(1)}+b\\ \vdots \\ \bold{w}^\top\bold{x}^{(N)}+b\end{pmatrix}=\begin{pmatrix}y^{(1)}\\ \vdots \\ y^{(N)}\end{pmatrix}=\bold{y}$

    $\bold{1}$ is a column vector of ones

  • The squared error cost becomes:
    $\bold{y}=\bold{Xw}+b\bold{1}$
    $\mathcal{J}=\frac{1}{2N}||\bold{y-t}||^2$

    Note: sometimes we write $\mathcal{J}=\frac{1}{2}||\bold{y-t}||^2$, which corresponds to the sum of losses. The optimal $(\bold{w}, b)$ do not depend on $N$.

For convenience, we can add a column of 1s to the design matrix and fold the bias into the weights to get: $\bold{y}=\bold{Xw}$.

Solving the minimization problem

We would like to minimize the cost function. Recall: the minimum of a smooth function appears at a critical point, where the derivative is zero.

We'll find a point where the gradient (derivative) is (close to) zero:

  • Sometimes it is possible to directly find the parameters that make the gradient zero in a closed-form. We call this the direct solution.
  • We may use optimization techniques that iteratively get us closer to the solution.

Direct Solution

Partial derivatives: derivatives of a multivariate function with respect to one of its arguments. Pretend the other variables are constants.

  • Loss function:
    $\frac{\partial\mathcal{L}}{\partial w_j}=\frac{\partial\mathcal{L}}{\partial y}\frac{\partial y}{\partial w_j}=(y-t)x_j$
    $\frac{\partial\mathcal{L}}{\partial b}=y-t$
  • Cost function:
    $\frac{\partial \mathcal{J}}{\partial w_j}=\frac{1}{N}\sum^{N}_{i=1}(y^{(i)}-t^{(i)})x^{(i)}_{j}$
    $\frac{\partial \mathcal{J}}{\partial b}=\frac{1}{N}\sum^{N}_{i=1}(y^{(i)}-t^{(i)})$

The minimum must occur at a critical point, where the partial derivatives are zero: $\frac{\partial \mathcal{J}}{\partial w_j}=0$ (for all $j$) and $\frac{\partial \mathcal{J}}{\partial b}=0$.

It turns out this gives a system of linear equations, which we can solve efficiently: $\bold{w}^{\text{LS}}=(\bold{X}^\top \bold{X})^{-1}\bold{X}^\top \bold{t}$. Direct solution, yay!
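
A sketch of the direct solution in NumPy (the function name is mine; the bias is folded in via a column of ones, as described above):

```python
import numpy as np

def least_squares_fit(X, t):
    # Fold the bias into the weights with a dummy column of ones
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solve the normal equations (X^T X) w = X^T t;
    # np.linalg.solve is more stable than forming the inverse explicitly
    return np.linalg.solve(X1.T @ X1, X1.T @ t)
```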

Feature Mapping (Basis Expansion)

The relation between input and output may not be linear.

We can still use linear regression by mapping the input features to another (higher-dimensional) space using a feature mapping (or basis expansion) $\psi(\bold{x}):\R^D\rightarrow\R^d$ and treating the mapped features (in $\R^d$) as the inputs of a linear regression procedure.

Polynomial Feature Mapping and Regularization

Map features to polynomials of degree $M$: $y=w_0+w_1x+w_2x^2+...+w_Mx^M=\sum^{M}_{i=0}w_ix^i$ (here the superscripts really are exponents)

Here, the feature mapping is $\psi(x)=[1,x,x^2,...]^\top$

Notice $y=\psi(x)^\top \bold{w}$ is linear in $w_0,w_1,...$

$M$ is now a new hyperparameter; we tune it with the validation set. We don't want $M$ to be too big or our model will be too complex; rather, we want to favour simpler solutions within the same space of parameters. This is done through regularization (or penalization): $\mathcal{R}(\bold{w})=\frac{1}{2}||\bold{w}||^{2}_{2}=\frac{1}{2}\sum_{j}w^{2}_{j}$

  • The regularized cost function makes a tradeoff between fit to the data and the norm of the weights:
    $\mathcal{J}_\text{reg}(\bold{w})=\mathcal{J}(\bold{w})+\lambda\mathcal{R}(\bold{w})=\mathcal{J}(\bold w)+\frac{\lambda}{2}\sum_{j}w^{2}_{j}$
  • If you fit the data poorly, $\mathcal{J}$ is large. If your optimal weights have high values, $\mathcal{R}$ is large.
  • A large $\lambda$ penalizes weight values more. It's a new hyperparameter: tune it with the validation set.

For the least squares problem, we have $\mathcal{J}(\bold w)=\frac{1}{2N}||\bold{Xw-t}||^2$.

  • When $\lambda >0$, minimizing the regularized cost gives:
    $\bold{w}_\lambda^{\text{Ridge}}=\argmin_\bold{w}\mathcal{J}_\text{reg}(\bold w)=\argmin_\bold{w}\frac{1}{2N}||\bold{Xw-t}||^2_2+\frac{\lambda}{2}||\bold{w}||^2_2=(\bold{X}^\top \bold{X}+\lambda N\bold{I})^{-1}\bold{X}^\top \bold{t}$
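
Putting the two ideas together, here's a sketch of polynomial ridge regression using the closed form above (function names are mine; for simplicity this also regularizes the constant feature's weight $w_0$, which in practice is often left unpenalized):

```python
import numpy as np

def poly_features(x, M):
    # psi(x) = [1, x, x^2, ..., x^M] for a vector of scalar inputs x
    return np.vander(x, M + 1, increasing=True)

def ridge_fit(Psi, t, lam):
    N, D = Psi.shape
    # w = (Psi^T Psi + lambda * N * I)^{-1} Psi^T t
    return np.linalg.solve(Psi.T @ Psi + lam * N * np.eye(D), Psi.T @ t)
```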

Probabilistic Interpretation of the Squared Error

  • For least squares, we minimize the sum of the squares of the errors between the predictions for each data point $\bold{x}^{(i)}$ and the corresponding target values $t^{(i)}$:
    $\argmin_{(\bold{w},b)}\sum^{N}_{i=1}(\bold{w}^\top \bold{x}^{(i)}+b-t^{(i)})^2$

Why do we use the squared error loss to measure the quality of the fit? Here's a probabilistic perspective.

  • Suppose our model arose from a statistical model ($b=0$ for simplicity): $y^{(i)}=\bold{w}^\top \bold{x}^{(i)}+\epsilon^{(i)}$, where $\epsilon^{(i)}\sim\mathcal{N}(0, \sigma^2)$ is independent of everything else.
  • Thus, $y^{(i)}|\bold{x}^{(i)}\sim p(y|\bold{x}^{(i)}, \bold{w})=\mathcal{N}(\bold{w}^\top \bold{x}^{(i)},\sigma^2)$

Under these assumptions, the Least-Squares solution is the Maximum-Likelihood solution.

Cf: Maximum Likelihood Estimation (lec 2, page 33)

Gradient Descent

A way of minimizing the cost function that is more broadly applicable.

GD is an iterative algorithm. We initialize the weights to something reasonable (eg: all zeroes) and repeatedly adjust them in the direction of the steepest descent.

$x\leftarrow x - \alpha f'(x)$
  • if $\frac{\partial \mathcal{J}}{\partial w_j}>0$, then increasing $w_j$ increases $\mathcal{J}$
  • if $\frac{\partial \mathcal{J}}{\partial w_j}<0$, then increasing $w_j$ decreases $\mathcal{J}$
  • We update the weights using the cost function:
    $w_j\leftarrow w_j-\alpha\frac{\partial\mathcal{J}}{\partial w_j}=w_j-\frac{\alpha}{N}\sum^{N}_{i=1}(y^{(i)}-t^{(i)})x_j^{(i)}$
  • $\alpha$ is the learning rate. The larger it is, the faster $\bold{w}$ changes. Typical values are small, e.g., 0.01 or 0.0001. It's a hyperparameter.
    • Too small: slow progress → lots of computation
    • Too large: instability (might diverge), oscillations
  • The algorithm gets its name from the gradient:
    $\frac{\partial \mathcal{J}}{\partial \bold{w}}=\begin{pmatrix}\frac{\partial\mathcal{J}}{\partial w_1}\\ \vdots \\ \frac{\partial\mathcal{J}}{\partial w_D}\end{pmatrix}$
  • In vectorized form, we update all the weights at the same time:
    $\bold{w}\leftarrow \bold{w}-\alpha\frac{\partial\mathcal{J}}{\partial \bold{w}}=\bold{w}-\frac{\alpha}{N}\sum^{N}_{i=1}(y^{(i)}-t^{(i)})\bold{x}^{(i)}$

Notice that when it converges, we get a critical point.
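
A sketch of the full loop for linear regression (the hyperparameter values here are placeholders, not recommendations):

```python
import numpy as np

def gd_linear_regression(X, t, alpha=0.01, n_iters=1000):
    N, D = X.shape
    w, b = np.zeros(D), 0.0               # initialize to something reasonable
    for _ in range(n_iters):
        y = X @ w + b                     # predictions for all N examples
        err = y - t                       # residuals y - t
        w = w - alpha / N * (X.T @ err)   # vectorized weight update
        b = b - alpha / N * err.sum()     # bias update
    return w, b
```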

  • Why gradient descent?
    • GD can be applied to a much broader set of models
    • GD can be easier to implement than direct solutions
    • For regression in high-dimensional spaces, GD is actually more efficient than the direct solution

Gradient Descent under $\ell_2$ Regularization

  • The gradient descent update of the regularized cost $\mathcal{J}+\lambda\mathcal{R}$ has an interesting interpretation as weight decay:
    $\bold{w}\leftarrow\bold{w}-\alpha\left(\frac{\partial\mathcal{J}}{\partial\bold w}+\lambda\frac{\partial\mathcal{R}}{\partial\bold w}\right)=(1-\alpha\lambda)\bold{w}-\alpha\frac{\partial\mathcal{J}}{\partial \bold{w}}$

    The weights shrink a little at each iteration: they decay.

Linear Classification

Classification setup

  • Classification: predicting a discrete-valued target
    • Binary classification: predicting a binary-valued target
  • Recall the notation:
    • Training data: $(\bold{x}^{(1)}, t^{(1)}),(\bold{x}^{(2)}, t^{(2)}),...,(\bold{x}^{(N)}, t^{(N)})$
    • $\bold{x}^{(i)}$ are the inputs
    • $t^{(i)}$ are the (discrete-valued) targets; in binary classification $t\in\{0,1\}$, with 1 being positive examples and 0 being negative.
      • We use $\{0,1\}$ or $\{-1, +1\}$ depending on computational convenience
  • Linear: the model is a linear function of $\bold{x}$, followed by a threshold $r$:
    $z=\bold{w}^\top \bold{x}+b$
    $y=\begin{cases}1 &\text{if } z\geq r \\ 0 &\text{if } z<r\end{cases}$
  • We can eliminate the threshold:
    $\bold{w}^\top \bold{x}+b\geq r \Longleftrightarrow \bold{w}^\top \bold{x} + b - r \geq 0$
  • We can eliminate the bias by adding a dummy feature that is always 1 and putting the bias in the weight vector (same as with linear regression):
    $x_0=1$ and $w_0=b$
  • Simplified model:
    $z=\bold{w}^\top \bold{x}$
    $y=\begin{cases}1 &\text{if } z\geq 0 \\ 0 &\text{if } z< 0\end{cases}$

The geometric picture

The Input Space or Data Space

  • Training examples are points
  • Weights (hypotheses) $\bold w$ can be represented by half-spaces $H_+=\{\bold{x}:\bold{w}^\top \bold{x}\geq0\}$, $H_-=\{\bold{x}:\bold{w}^\top \bold{x}<0\}$

    The boundaries of these half-spaces pass through the origin.

  • The boundary is the decision boundary: $\{\bold{x}:\bold{w}^\top \bold{x}=0\}$

    In 2-D it's a line, but think of it as a hyperplane.

  • If the training examples can be perfectly separated by a linear decision rule, we say the data is linearly separable.
    • If the training set is separable, we can solve for $\bold w, b$ using linear programming. Otherwise, we'll have to resort to something like logistic regression.

The Weight Space

  • Weights (hypotheses) $\bold w$ are points
  • Each training example $\bold x$ specifies a half-space $\bold w$ must lie in to be correctly classified: $\bold{w}^\top \bold{x}>0$ if $t=1$.
  • The region satisfying all the constraints (put on by the examples) is the feasible region; if this region is nonempty, the problem is feasible. Otherwise, it's infeasible.

Loss Function

  • To find the parameters when the data is not linearly separable, we define a loss function and then try to minimize the resulting cost function.
    • Recall: cost is loss averaged (or summed) over the training set
  • Seemingly obvious loss function: 0-1 loss (different from square loss!)
    $z=\bold{w}^\top \bold{x}$
    $y=\begin{cases}1 &\text{if } z\geq 0 \\ 0 &\text{if } z< 0\end{cases}$
    $\mathcal{L}_{0-1}(y,t)=\begin{cases}0 &\text{if } y=t\\ 1 &\text{if } y\neq t\end{cases}=\mathbb{I}[y\neq t]$

    The problem is that this function has zero derivative almost everywhere, so it's useless for gradient descent.

  • Sometimes we can replace the loss function we care about with one that is easier to optimize. This is known as relaxation with a smooth surrogate loss function.
  • What happens if we use the squared error loss instead, applied directly to $z$ rather than $y$?
    $z=\bold{w}^\top \bold{x}+b$
    $\mathcal{L}_{\text{SE}}(z,t)=\frac{1}{2}(z-t)^2$

    In other words: even though the targets are binary, we treat them as continuous values.

    Problem: the loss function penalizes you when you make correct predictions with high confidence!

    With $t=1$, the loss is still larger when $z=10$ than when $z=0$.

There is obviously no reason to predict values outside of $[0,1]$. Let's squash $y$ into this interval (and keep the loss optimizable).

  • The logistic function is a kind of sigmoid:
    $y=\sigma(z)=\frac{1}{1+e^{-z}}$
    • $\sigma^{-1}(y)=\log(\frac{y}{1-y})$ is called the logit.
    • A linear model with a logistic nonlinearity is known as log-linear:
      $z=\bold{w}^\top \bold{x}+ b$
      $y=\sigma(z)$
      $\mathcal{L}_{\text{SE}}(y,t)=\frac{1}{2}(y-t)^2$
    • Used this way, $\sigma$ is called an activation function.
    • Problem: for $z\ll0$, we have $\sigma(z)\approx 0$, so the gradient is nearly zero.
      • If the prediction is really wrong, you should be far from a critical point! Here, you are very close to one.
  • Cross-entropy loss:
    • Because $y\in [0,1]$, we can interpret it as the estimated probability that $t=1$.
    $\mathcal{L}_\text{CE}(y,t)=\begin{cases}-\log y &\text{if } t=1\\ -\log(1-y)&\text{if } t=0\end{cases}=-t\log y - (1-t)\log(1-y)$

Logistic Regression (on cross-entropy loss)

$z=\bold{w}^\top \bold{x} + b$
$y=\sigma(z)=\frac{1}{1+e^{-z}}$
$\mathcal{L}_\text{CE}=-t\log y-(1-t)\log(1-y)$

  • Problem: let $t=1$ but $z\ll0$. Then, if $y$ is small enough, it may be numerically zero. This can cause very subtle, hard-to-find bugs. ($\log 0$ is undefined.)
  • So we combine the activation function with the loss into a single logistic-cross-entropy function:
    $\mathcal{L}_\text{LCE}(z,t)=\mathcal{L}_\text{CE}(\sigma(z), t)=t\log(1+e^{-z})+(1-t)\log(1+e^{z})$

    This gives us numerically stable computation.
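
As a sketch, NumPy's `logaddexp` computes $\log(e^a+e^b)$ stably, which is exactly what the $\log(1+e^{\pm z})$ terms need:

```python
import numpy as np

def logistic_cross_entropy(z, t):
    # L_LCE(z, t) = t*log(1 + e^{-z}) + (1 - t)*log(1 + e^{z});
    # np.logaddexp(0, x) = log(e^0 + e^x) = log(1 + e^x), computed stably
    return t * np.logaddexp(0, -z) + (1 - t) * np.logaddexp(0, z)
```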

Gradient Descent for Logistic Regression

$z=\bold{w}^\top \bold{x} + b$
$y=\sigma(z)=\frac{1}{1+e^{-z}}$
$\mathcal{L}_\text{CE}=-t\log y-(1-t)\log(1-y)$

So we get:

$\frac{\partial\mathcal{L}_\text{CE}}{\partial w_j}=\frac{\partial\mathcal{L}_\text{CE}}{\partial y}\cdot \frac{\partial y}{\partial z}\cdot \frac{\partial z}{\partial w_j}=\left(-\frac{t}{y}+\frac{1-t}{1-y}\right)\cdot y(1-y)\cdot x_j=(y-t)x_j$

Gradient descent (coordinatewise) update to find the weights of logistic regression:

$w_j\leftarrow w_j - \alpha\frac{\partial\mathcal{J}}{\partial w_j}=w_j-\frac{\alpha}{N}\sum^{N}_{i=1}(y^{(i)}-t^{(i)})x^{(i)}_j$

Notice that this update is the same as the one we saw for linear regression. This is no coincidence: both are examples of generalized linear models.

Stochastic Gradient Descent

So far, the cost function $\mathcal{J}$ has been the average loss over all the training examples.

Computing the gradient requires summing over all of the training examples. This is known as batch training, and it's impractical if you have a large dataset.

  • In Stochastic Gradient Descent (SGD), we update the parameters based on the gradient of a single training example:
    Choose $i$ uniformly at random
    $\theta\leftarrow\theta -\alpha\frac{\partial\mathcal{L}^{(i)}}{\partial\theta}$

    $\theta$ denotes all the model parameters.

SGD can make significant progress without even seeing all the data!

  • Mathematical justification: if you sample uniformly at random, the SGD gradient is an unbiased estimate of the batch gradient:
    $\mathbb{E}\left[\frac{\partial\mathcal{L}^{(i)}}{\partial\theta}\right]=\frac{1}{N}\sum^{N}_{i=1}\frac{\partial\mathcal{L}^{(i)}}{\partial\theta}=\frac{\partial\mathcal{J}}{\partial\theta}$
  • Problems:
    • Variance in this estimate may be high
    • If we only look at one example at a time, we can't exploit efficient vectorized operations

The compromise: mini-batches, i.e., random subsets $\mathcal{M}\subset\{1,...,N\}$ of training examples.

  • The size of the mini-batch $|\mathcal{M}|$ is now a hyperparameter
    • Too large: takes more computation and memory
    • Too small: can't exploit vectorization, has high variance
    • A reasonable value might be $|\mathcal{M}|=100$
  • Problem:
    • Bad local minima; but with SGD's noise there is a chance to jump out of them.
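
Here's a sketch of mini-batch SGD for logistic regression (the batch size and learning rate are placeholder values, and the function name is mine):

```python
import numpy as np

def sgd_logistic(X, t, alpha=0.1, batch_size=100, n_epochs=10):
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        idx = rng.permutation(N)               # reshuffle every epoch
        for start in range(0, N, batch_size):
            batch = idx[start:start + batch_size]
            z = X[batch] @ w + b
            y = 1.0 / (1.0 + np.exp(-z))       # sigmoid activation
            err = y - t[batch]                 # (y - t) on the mini-batch
            # unbiased estimate of the batch gradient from |M| examples
            w = w - alpha * (X[batch].T @ err) / len(batch)
            b = b - alpha * err.mean()
    return w, b
```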

Multi-Class Classification

Predicting a discrete-valued target with more than 2 classes.

  • Targets form a discrete set $\{1,...,K\}$
    • Represent each target as a one-hot vector, or a one-of-$K$ encoding:
      $\bold{t}=(0,...,0,1,0,...,0)\in\R^K$
  • Now there are $D$ input dimensions and $K$ output dimensions, so we need a $K\times D$ weight matrix $\bold W$ and a $K$-dimensional vector $\bold b$ of biases.
  • Linear predictions:
    $z_k=\sum^{D}_{j=1}w_{kj}x_j+b_k \text{ for class } k\in\{1,2,...,K\}$
  • Vectorized linear predictions:
    $\bold{z=Wx+b}$

  • We want the predictions to behave like probabilities: $1\geq y_k \geq 0$ and $\sum_k y_k=1$
  • For the activation function we use the softmax function, a multivariable generalization of the logistic function:
    $y_k=\text{softmax}(z_1,...,z_K)_k=\frac{e^{z_k}}{\sum_{k'}e^{z_{k'}}}$

    The inputs $z_k$ are called logits.

    Properties:

    • Outputs are positive and sum to 1 over all $K$ classes (like probabilities)
    • Softmax behaves like argmax: if one of the $z_k$ is much larger than the others, then $\text{softmax}(\bold z)_k\approx1$

  • We can use cross-entropy as the loss function, since the model outputs a vector of class probabilities:
    $\mathcal{L}_\text{CE}(\bold{y,t})=-\sum^{K}_{k=1}t_k\log{y_k}=-\bold{t}^\top(\log \bold y)$

    Remember, $\bold t$ is a one-hot vector: it is 0 for every class except the correct one.

  • Just like with logistic regression, we can combine the activation with the loss to get the softmax-cross-entropy function.

  • Softmax regression:
    $\bold{z=Wx+b}$
    $\bold{y}=\text{softmax}(\bold z)$
    $\mathcal{L}_\text{CE}=-\bold{t}^\top(\log \bold y)$
  • Gradient descent updates can be derived for each row of $\bold W$:
    $\frac{\partial\mathcal{L}_\text{CE}}{\partial\bold w_k}=\frac{\partial\mathcal{L}_\text{CE}}{\partial z_k}\cdot \frac{\partial z_k}{\partial\bold w_k}=(y_k-t_k)\,\bold{x}$
    $\bold{w}_k\leftarrow \bold{w}_k - \alpha\frac{1}{N}\sum^{N}_{i=1}(y^{(i)}_{k}-t^{(i)}_{k})\,\bold{x}^{(i)}$

    Again, notice the similarities with linear and logistic regression.
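
A sketch of one gradient step for softmax regression (shapes follow the notation above: `X` is $N\times D$, `T` is $N\times K$ one-hot, `W` is $K\times D$; the function names are mine):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # shift logits for stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def softmax_gd_step(W, b, X, T, alpha):
    Y = softmax(X @ W.T + b)               # N x K class probabilities
    # dJ/dW has one row per class: the average of (y_k - t_k) * x
    W = W - alpha * ((Y - T).T @ X) / X.shape[0]
    b = b - alpha * (Y - T).mean(axis=0)
    return W, b
```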

Limit of Linear Classification

Linear classification is pretty great, but it only works well with linearly separable data. And while we can try to feature-map some non-linearly separable data, this is not a general solution and can be difficult... so we turn to neural networks.

© 2022 Justin Regef. All rights reserved.