An Overview of Non-NN ML Math

01/07/2022

For anyone interested in the math behind linear classifiers and other simple topics, here's a quick writeup. I will also post some math for NN stuff. (Please excuse the weird formatting; apparently Webnode doesn't support LaTeX, so I had to convert it to HTML.)

ML stuff

Intro

What is ML?

3 types:

  • Supervised: access to labeled examples of the correct behaviour
    • ex: handwriting reader
  • Reinforcement: a learning agent interacts with the world and learns to maximize a reward signal
    • ex: mario player, DaVinci robot sim (ROP)
  • Unsupervised: no labeled examples - instead, looking for "interesting" patterns in the data
    • ex: customer segmentation

Preliminaries and Nearest Neighbour Methods

Machine learning algorithms need to handle lots of data, so we represent each input as a vector in $\R^d$.

Representation: mapping to another space that is easy to manipulate

(vectors are great since we can do linear algebra)

Supervised Learning Setup

Our training set consists of a collection of pairs of an input vector $\bold{x}\in\R^d$ and its corresponding target, or label, $t$.

2 types:

  • Regression

    $t$ is a real number

  • Classification

    $t$ is an element of a discrete set $\{1, ..., C\}$

These days, $t$ is often a highly structured object

The training set will look like this: $\{(\bold{x}^{(1)}, t^{(1)}), ..., (\bold{x}^{(N)}, t^{(N)})\}$

These superscripts have nothing to do with exponentials!

Nearest Neighbour

💡
The idea: find the nearest input vector to $\bold{x}$ in the training set and copy its label.

  • Closest (in Euclidean distance):
    $||\bold{x}^{(a)}-\bold{x}^{(b)}||_{2}=\sqrt{\sum^{d}_{j=1}(x_{j}^{(a)}-x_{j}^{(b)})^2}$

The Algorithm:

  1. Given input $\bold{x}$ and training set $\{(\bold{x}^{(1)}, t^{(1)}), ..., (\bold{x}^{(N)}, t^{(N)})\}$
  2. Return $y=t^{(*)}$, where $(\bold{x}^{(*)}, t^{(*)})$ is the training pair whose $\bold{x}^{(*)}$ minimizes the distance $||\bold{x}-\bold{x}^{(*)}||_{2}$
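
Here's a minimal sketch of 1-NN in NumPy (the array names `X_train` and `t_train` are mine, holding the training inputs and labels):

```python
import numpy as np

def nearest_neighbour(x, X_train, t_train):
    # Euclidean distance from x to every training point
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Copy the label of the closest training example
    return t_train[np.argmin(dists)]
```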

Estimating Model Accuracy

  • In essence:
    $\text{Accuracy}=\frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$
⚠️
BUT! we cannot use the training set!
  • We need to simulate new data. Using the data the model was trained on is not a realistic representation of unseen data.
  • We will always get perfect performance for 1-Nearest Neighbour.

We set aside a chunk of our data for testing, called the test set.

Problems

  • NN is sensitive to noise or mis-labeled data.
    • Solution...

k-Nearest Neighbours

Have the $k$ closest data points "vote".

The Algorithm:

  1. Given input $\bold{x}$ and training set $\{(\bold{x}^{(1)}, t^{(1)}), ..., (\bold{x}^{(N)}, t^{(N)})\}$
  2. Find the $k$ examples $(\bold{x}^{(*)}, t^{(*)})$ closest to $\bold{x}$
  3. Output the majority class among them:
    $y=\argmax\limits_{t^{(z)}}\sum^{k}_{i=1}\mathbb{I}\{t^{(z)}=t^{(i)}\}$
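
A sketch of the voting version, assuming integer class labels $0,...,C-1$ (array names are again mine):

```python
import numpy as np

def knn_predict(x, X_train, t_train, k=5):
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]   # indices of the k closest points
    # Majority vote: count each label among the k neighbours
    return np.bincount(t_train[nearest]).argmax()
```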

How do we choose k?

  • Small $k$:
    • Good at capturing fine-grained patterns
    • May overfit: be sensitive to random idiosyncrasies in the training data.
  • Large $k$:
    • Makes stable predictions by averaging over lots of examples
    • May underfit: fail to capture important regularities.
  • Balancing $k$:
    • The optimal choice of $k$ depends on the number of data points $n$.
    • Rule of thumb: $k=n^{\frac{2}{2+d}}$.

Choice of $k$: Generalization

$k$ is a hyperparameter: something we can't fit as part of the learning algorithm itself.

We need to tune it: vary $k$ and measure the generalization error using a validation set.

Not the test set: the test set is supposed to simulate unseen data, so we can't use it to tune our model in any way.

Training, test, and validation sets

Partition the data into three subsets.

  • Training set: used to train the model
  • Validation set: used to tune hyperparameters
  • Test set: used to evaluate final performance once, after the hyperparameters are chosen.

Pitfalls

Normalization

  • NN can be sensitive to the ranges of different features
  • Therefore, we normalize each dimension to be zero mean and unit variance:
    $\tilde{x}_j=\frac{x_j-\mu_j}{\sigma_j}$
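
In code this might look like the sketch below, assuming `X_train` and `X_test` are NumPy arrays; note that $\mu_j$ and $\sigma_j$ should be computed on the training set only, then reused for the other splits:

```python
# Per-feature mean and standard deviation, from the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the same transformation to every split
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma
```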

Computational Cost

  • Computations at training time: 0
  • Computations at test time per query:
    • Calculate $D$-dimensional Euclidean distances to $N$ data points: $\mathcal{O}(ND)$
    • Sort the distances: $\mathcal{O}(N\log N)$

    This must be done for each query! Very expensive!

  • We need to store the entire dataset in memory.

Modular Approach to ML Algorithm Design

  • Choose a model describing the relationships between variables of interest
  • Define a loss function quantifying how bad the fit to the data is
  • Choose a regularizer saying how much we prefer different candidate models (or explanations of data)
  • Fit the model that minimizes the loss function and satisfies the constraint/penalty imposed by the regularizer, possibly using an optimization algorithm.

Modular Approach to Supervised Learning

  • Training data: $\mathcal{D}=\{(\bold{x}^{(i)}, t^{(i)}) \text{ for } i=1,2,...,N\}$
  • Objective: learn the function $f$ such that:
    $t\approx y = f(x)$

    Here, $t$ is the ground truth and $y$ is the prediction.

  • A model is a set of assumptions (or restrictions) about the structure of $f$ in $t\approx y = f(x)$.
    • The model (or architecture) restricts the hypothesis space: the possible functions $f$ that we can learn.

Linear Regression Model

  • $f$ is restricted to linear functions of the input $\bold{x}=(x_1,...,x_D)$:
    $y=f(\bold{x})=\sum_{j}w_jx_j+b$
    • $y$ is the prediction
    • $\bold{w}$ is the weight vector
    • $b$ is the bias
      • $\bold{w}$ and $b$ together are the parameters

We want to choose good parameter values so that our prediction is close to the target: $y\approx t$.

  • We use the loss function $\mathcal{L}$ to quantify good or bad values of our parameters:
    $\argmin_{(\bold{w},b)}\sum^{N}_{i=1}\mathcal{L}(y^{(i)},t^{(i)})$

This means $y^{(i)}=\bold{w}^\top\bold{x}^{(i)}+b$ should be "close to" $t^{(i)}$ for all $i$.

Loss Function: Squared Error Loss

$\mathcal{L}(y,t)=\frac{1}{2}(y-t)^2$

$y-t$ is the residual, and we want to make its magnitude small.

Loss Function vs Cost Function

  • The cost function is the loss function averaged over all training examples:
    $\mathcal{J}(\bold{w},b)=\frac{1}{2N}\sum^{N}_{i=1}(y^{(i)}-t^{(i)})^2=\frac{1}{2N}\sum^{N}_{i=1}(\bold{w}^\top \bold{x}^{(i)}+b-t^{(i)})^2$

What now??

Finding the parameters

  • We find $(\bold{w}, b)$ by minimizing the summed loss over all the data points:
    $\argmin_{(\bold{w}, b)}\sum^{N}_{i=1}\mathcal{L}(y^{(i)},t^{(i)})$

Let's vectorize the notation.

Vectorizing Linear Regression

  • Organize all training examples into a design matrix $\bold X$ ($N\times D$) with one row per example and one column per feature:
    $\bold{X}=\begin{pmatrix}\bold{x}^{(1)\top} \\ \bold{x}^{(2)\top}\\ \vdots \\ \bold{x}^{(N)\top}\end{pmatrix}=\begin{pmatrix}x^{(1)}_{1} & x^{(1)}_{2} & \cdots & x^{(1)}_{D}\\ x^{(2)}_{1} & x^{(2)}_{2} & \cdots & x^{(2)}_{D}\\ \vdots & \vdots & \ddots & \vdots \\ x^{(N)}_{1} & x^{(N)}_{2} & \cdots & x^{(N)}_{D}\end{pmatrix}$
    $\bold{t}=\begin{pmatrix}t^{(1)}\\ \vdots \\ t^{(N)}\end{pmatrix}$
  • In vectorized form we get:
    $\bold{Xw}+b\bold{1}=\begin{pmatrix}\bold{w}^\top\bold{x}^{(1)}+b\\ \vdots \\ \bold{w}^\top\bold{x}^{(N)}+b\end{pmatrix}=\begin{pmatrix}y^{(1)}\\ \vdots \\ y^{(N)}\end{pmatrix}=\bold{y}$

    $\bold{1}$ is a column vector of ones

  • The squared error cost becomes:
    $\bold{y}=\bold{Xw}+b\bold{1}$
    $\mathcal{J}=\frac{1}{2N}||\bold{y-t}||^2$

    Note: sometimes we write $\mathcal{J}=\frac{1}{2}||\bold{y-t}||^2$, which corresponds to the sum of losses. The optimal $(\bold{w}, b)$ do not depend on $N$.

For convenience, we can add a column of 1s to the design matrix and fold the bias into the weights to get: $\bold{y}=\bold{Xw}$.

Solving the minimization problem

We would like to minimize the cost function. Recall: the minimum of a smooth function appears at a critical point, where the derivative is zero.

We'll find a point where the gradient (derivative) is (close to) zero:

  • Sometimes it is possible to directly find the parameters that make the gradient zero in a closed-form. We call this the direct solution.
  • We may use optimization techniques that iteratively get us closer to the solution.

Direct Solution

Partial derivatives: derivatives of a multivariate function with respect to one of its arguments. Pretend the other variables are constants.

  • Loss function:
    $\frac{\partial\mathcal{L}}{\partial w_j}=\frac{\partial\mathcal{L}}{\partial y}\frac{\partial y}{\partial w_j}=(y-t)x_j$
    $\frac{\partial\mathcal{L}}{\partial b}=y-t$
  • Cost function:
    $\frac{\partial \mathcal{J}}{\partial w_j}=\frac{1}{N}\sum^{N}_{i=1}(y^{(i)}-t^{(i)})x^{(i)}_{j}$
    $\frac{\partial \mathcal{J}}{\partial b}=\frac{1}{N}\sum^{N}_{i=1}(y^{(i)}-t^{(i)})$

The minimum must occur at a critical point, where the partial derivatives are zero: $\frac{\partial \mathcal{J}}{\partial w_j}=0$ (for all $j$) and $\frac{\partial \mathcal{J}}{\partial b}=0$.

It turns out this gives a system of linear equations, which we can solve efficiently: $\bold{w}^{\text{LS}}=(\bold{X}^\top \bold{X})^{-1}\bold{X}^\top \bold{t}$. Direct solution, yay!
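
A sketch of the direct solution in NumPy (the function name is mine; the bias is folded in via a column of ones, as described above):

```python
import numpy as np

def least_squares_fit(X, t):
    # Fold the bias into the weights with a dummy column of ones
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solve the normal equations (X^T X) w = X^T t;
    # np.linalg.solve is more stable than forming the inverse explicitly
    return np.linalg.solve(X1.T @ X1, X1.T @ t)
```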

Feature Mapping (Basis Expansion)

The relation between input and output may not be linear.

We can still use linear regression by mapping the input features to another (higher-dimensional) space using a feature mapping (or basis expansion) $\psi(\bold{x}):\R^D\rightarrow\R^d$ and treating the mapped features (in $\R^d$) as the inputs of a linear regression procedure.

Polynomial Feature Mapping and Regularization

Map features to polynomials of degree $M$: $y=w_0+w_1x+w_2x^2+...+w_Mx^M=\sum^{M}_{i=0}w_ix^i$ (here the superscripts really are exponents)

Here, the feature mapping is $\psi(x)=[1,x,x^2,...]^\top$

Notice $y=\psi(x)^\top \bold{w}$ is linear in $w_0,w_1,...$

$M$ is now a new hyperparameter; we tune it with the validation set. We don't want $M$ to be too big or our model will be too complex; rather, we want to favour simpler solutions within the same space of parameters. This is done through regularization (or penalization): $\mathcal{R}(\bold{w})=\frac{1}{2}||\bold{w}||^{2}_{2}=\frac{1}{2}\sum_{j}w^{2}_{j}$

  • The regularized cost function makes a tradeoff between fit to the data and the norm of the weights:
    $\mathcal{J}_\text{reg}(\bold{w})=\mathcal{J}(\bold{w})+\lambda\mathcal{R}(\bold{w})=\mathcal{J}(\bold w)+\frac{\lambda}{2}\sum_{j}w^{2}_{j}$
  • If you fit the data poorly, $\mathcal{J}$ is large. If your optimal weights have high values, $\mathcal{R}$ is large.
  • A large $\lambda$ penalizes weight values more. It's a new hyperparameter: tune it with the validation set.

For the least squares problem, we have $\mathcal{J}(\bold w)=\frac{1}{2N}||\bold{Xw-t}||^2$.

  • When $\lambda >0$, minimizing the regularized cost gives:
    $\bold{w}_\lambda^{\text{Ridge}}=\argmin_\bold{w}\mathcal{J}_\text{reg}(\bold w)=\argmin_\bold{w}\frac{1}{2N}||\bold{Xw-t}||^2_2+\frac{\lambda}{2}||\bold{w}||^2_2=(\bold{X}^\top \bold{X}+\lambda N\bold{I})^{-1}\bold{X}^\top \bold{t}$
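
Putting the two ideas together, here's a sketch of polynomial ridge regression using the closed form above (function names are mine; for simplicity this also regularizes the constant feature's weight $w_0$, which in practice is often left unpenalized):

```python
import numpy as np

def poly_features(x, M):
    # psi(x) = [1, x, x^2, ..., x^M] for a vector of scalar inputs x
    return np.vander(x, M + 1, increasing=True)

def ridge_fit(Psi, t, lam):
    N, D = Psi.shape
    # w = (Psi^T Psi + lambda * N * I)^{-1} Psi^T t
    return np.linalg.solve(Psi.T @ Psi + lam * N * np.eye(D), Psi.T @ t)
```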

Probabilistic Interpretation of the Squared Error

  • For least squares, we minimize the sum of the squares of the errors between the predictions for each data point $\bold{x}^{(i)}$ and the corresponding target values $t^{(i)}$:
    $\argmin_{(\bold{w},b)}\sum^{N}_{i=1}(\bold{w}^\top \bold{x}^{(i)}+b-t^{(i)})^2$

Why do we use the squared error loss to measure the quality of the fit? Here's a probabilistic perspective.

  • Suppose our model arose from a statistical model ($b=0$ for simplicity): $y^{(i)}=\bold{w}^\top \bold{x}^{(i)}+\epsilon^{(i)}$, where $\epsilon^{(i)}\sim\mathcal{N}(0, \sigma^2)$ is independent of everything else.
  • Thus, $y^{(i)}|\bold{x}^{(i)}\sim p(y|\bold{x}^{(i)}, \bold{w})=\mathcal{N}(\bold{w}^\top \bold{x}^{(i)},\sigma^2)$

Under these assumptions, the Least-Squares solution is the Maximum-Likelihood solution.

Cf: Maximum Likelihood Estimation (lec 2, page 33)

Gradient Descent

A way of minimizing the cost function that is more broadly applicable.

GD is an iterative algorithm. We initialize the weights to something reasonable (eg: all zeroes) and repeatedly adjust them in the direction of the steepest descent.

$x\leftarrow x - \alpha f'(x)$
  • if $\frac{\partial \mathcal{J}}{\partial w_j}>0$, then increasing $w_j$ increases $\mathcal{J}$
  • if $\frac{\partial \mathcal{J}}{\partial w_j}<0$, then increasing $w_j$ decreases $\mathcal{J}$
  • We update the weights using the cost function:
    $w_j\leftarrow w_j-\alpha\frac{\partial\mathcal{J}}{\partial w_j}=w_j-\frac{\alpha}{N}\sum^{N}_{i=1}(y^{(i)}-t^{(i)})x_j^{(i)}$
  • $\alpha$ is the learning rate. The larger it is, the faster $\bold{w}$ changes. Typical values are small, e.g., 0.01 or 0.0001. It's a hyperparameter.
    • Too small: slow progress → lots of computation
    • Too large: instability (might diverge), oscillations
  • The algorithm gets its name from the gradient:
    $\frac{\partial \mathcal{J}}{\partial \bold{w}}=\begin{pmatrix}\frac{\partial\mathcal{J}}{\partial w_1}\\ \vdots \\ \frac{\partial\mathcal{J}}{\partial w_D}\end{pmatrix}$
  • In vectorized form, we update all the weights at the same time:
    $\bold{w}\leftarrow \bold{w}-\alpha\frac{\partial\mathcal{J}}{\partial \bold{w}}=\bold{w}-\frac{\alpha}{N}\sum^{N}_{i=1}(y^{(i)}-t^{(i)})\bold{x}^{(i)}$

Notice that when it converges, we get a critical point.
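
A sketch of the full loop for linear regression (the hyperparameter values here are placeholders, not recommendations):

```python
import numpy as np

def gd_linear_regression(X, t, alpha=0.01, n_iters=1000):
    N, D = X.shape
    w, b = np.zeros(D), 0.0               # initialize to something reasonable
    for _ in range(n_iters):
        y = X @ w + b                     # predictions for all N examples
        err = y - t                       # residuals y - t
        w = w - alpha / N * (X.T @ err)   # vectorized weight update
        b = b - alpha / N * err.sum()     # bias update
    return w, b
```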

  • Why gradient descent?
    • GD can be applied to a much broader set of models
    • GD can be easier to implement than direct solutions
    • For regression in high-dimensional spaces, GD is actually more efficient than the direct solution

Gradient Descent under $\ell_2$ Regularization

  • The gradient descent update of the regularized cost $\mathcal{J}+\lambda\mathcal{R}$ has an interesting interpretation as weight decay:
    $\bold{w}\leftarrow\bold{w}-\alpha\left(\frac{\partial\mathcal{J}}{\partial\bold w}+\lambda\frac{\partial\mathcal{R}}{\partial\bold w}\right)=(1-\alpha\lambda)\bold{w}-\alpha\frac{\partial\mathcal{J}}{\partial \bold{w}}$

    The weights shrink a little at each iteration: they decay.

Linear Classification

Classification setup

  • Classification: predicting a discrete-valued target
    • Binary classification: predicting a binary-valued target
  • Recall the notation:
    • Training data: $(\bold{x}^{(1)}, t^{(1)}),(\bold{x}^{(2)}, t^{(2)}),...,(\bold{x}^{(N)}, t^{(N)})$
    • $\bold{x}^{(i)}$ are the inputs
    • $t^{(i)}$ are the (discrete-valued) targets; in binary classification $t\in\{0,1\}$, with 1 being positive examples and 0 being negative.
      • We use $\{0,1\}$ or $\{-1, +1\}$ depending on computational convenience
  • Linear: the model is a linear function of $\bold{x}$, followed by a threshold $r$:
    $z=\bold{w}^\top \bold{x}+b$
    $y=\begin{cases}1 &\text{if } z\geq r \\ 0 &\text{if } z<r\end{cases}$
  • We can eliminate the threshold:
    $\bold{w}^\top \bold{x}+b\geq r \Longleftrightarrow \bold{w}^\top \bold{x} + b - r \geq 0$
  • We can eliminate the bias by adding a dummy feature that is always 1 and putting the bias in the weight vector (same as with linear regression):
    $x_0=1$ and $w_0=b$
  • Simplified model:
    $z=\bold{w}^\top \bold{x}$
    $y=\begin{cases}1 &\text{if } z\geq 0 \\ 0 &\text{if } z< 0\end{cases}$

The geometric picture

The Input Space or Data Space

  • Training examples are points
  • Weights (hypotheses) $\bold w$ can be represented by half-spaces $H_+=\{\bold{x}:\bold{w}^\top \bold{x}\geq0\}$, $H_-=\{\bold{x}:\bold{w}^\top \bold{x}<0\}$

    The boundaries of these half-spaces pass through the origin.

  • The boundary is the decision boundary: $\{\bold{x}:\bold{w}^\top \bold{x}=0\}$

    In 2-D it's a line, but think of it as a hyperplane.

  • If the training examples can be perfectly separated by a linear decision rule, we say the data is linearly separable.
    • If the training set is separable, we can solve for $\bold w, b$ using linear programming. Otherwise, we'll have to resort to something like logistic regression.

The Weight Space

  • Weights (hypotheses) $\bold w$ are points
  • Each training example $\bold x$ specifies a half-space $\bold w$ must lie in to be correctly classified: $\bold{w}^\top \bold{x}>0$ if $t=1$.
  • The region satisfying all the constraints (put on by the examples) is the feasible region; if this region is nonempty, the problem is feasible. Otherwise, it's infeasible.

Loss Function

  • To find the parameters when the data is not linearly separable, we define a loss function and then try to minimize the resulting cost function.
    • Recall: cost is loss averaged (or summed) over the training set
  • Seemingly obvious loss function: 0-1 loss (different from square loss!)
    $z=\bold{w}^\top \bold{x}$
    $y=\begin{cases}1 &\text{if } z\geq 0 \\ 0 &\text{if } z< 0\end{cases}$
    $\mathcal{L}_{0-1}(y,t)=\begin{cases}0 &\text{if } y=t\\ 1 &\text{if } y\neq t\end{cases}=\mathbb{I}[y\neq t]$

    The problem is that this function has zero derivative almost everywhere, so it's useless for gradient descent.

  • Sometimes we can replace the loss function we care about with one that is easier to optimize. This is known as relaxation with a smooth surrogate loss function.
  • What happens if we use the squared error loss instead, applied directly to $z$ rather than $y$?
    $z=\bold{w}^\top \bold{x}+b$
    $\mathcal{L}_{\text{SE}}(z,t)=\frac{1}{2}(z-t)^2$

    In other words: even though the targets are binary, we treat them as continuous values.

    Problem: the loss function penalizes you when you make correct predictions with high confidence!

    With $t=1$, the loss is still larger when $z=10$ than when $z=0$.

There is obviously no reason to predict values outside of $[0,1]$. Let's squash $y$ into this interval (and keep the loss optimizable).

  • The logistic function is a kind of sigmoid:
    $y=\sigma(z)=\frac{1}{1+e^{-z}}$
    • $\sigma^{-1}(y)=\log(\frac{y}{1-y})$ is called the logit.
    • A linear model with a logistic nonlinearity is known as log-linear:
      $z=\bold{w}^\top \bold{x}+ b$
      $y=\sigma(z)$
      $\mathcal{L}_{\text{SE}}(y,t)=\frac{1}{2}(y-t)^2$
    • Used this way, $\sigma$ is called an activation function.
    • Problem: for $z\ll0$, we have $\sigma(z)\approx 0$, so the gradient is nearly zero.
      • If the prediction is really wrong, you should be far from a critical point! Here, you are very close to one.
  • Cross-entropy loss:
    • Because $y\in [0,1]$, we can interpret it as the estimated probability that $t=1$.
    $\mathcal{L}_\text{CE}(y,t)=\begin{cases}-\log y &\text{if } t=1\\ -\log(1-y)&\text{if } t=0\end{cases}=-t\log y - (1-t)\log(1-y)$

Logistic Regression (on cross-entropy loss)

$z=\bold{w}^\top \bold{x} + b$
$y=\sigma(z)=\frac{1}{1+e^{-z}}$
$\mathcal{L}_\text{CE}=-t\log y-(1-t)\log(1-y)$

  • Problem: let $t=1$ but $z\ll0$. Then, if $y$ is small enough, it may be numerically zero. This can cause very subtle, hard-to-find bugs. ($\log 0$ is undefined.)
  • So we combine the activation function with the loss into a single logistic-cross-entropy function:
    $\mathcal{L}_\text{LCE}(z,t)=\mathcal{L}_\text{CE}(\sigma(z), t)=t\log(1+e^{-z})+(1-t)\log(1+e^{z})$

    This gives us numerically stable computation.
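
As a sketch, NumPy's `logaddexp` computes $\log(e^a+e^b)$ stably, which is exactly what the $\log(1+e^{\pm z})$ terms need:

```python
import numpy as np

def logistic_cross_entropy(z, t):
    # L_LCE(z, t) = t*log(1 + e^{-z}) + (1 - t)*log(1 + e^{z});
    # np.logaddexp(0, x) = log(e^0 + e^x) = log(1 + e^x), computed stably
    return t * np.logaddexp(0, -z) + (1 - t) * np.logaddexp(0, z)
```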

Gradient Descent for Logistic Regression

$z=\bold{w}^\top \bold{x} + b$
$y=\sigma(z)=\frac{1}{1+e^{-z}}$
$\mathcal{L}_\text{CE}=-t\log y-(1-t)\log(1-y)$

So we get:

$\frac{\partial\mathcal{L}_\text{CE}}{\partial w_j}=\frac{\partial\mathcal{L}_\text{CE}}{\partial y}\cdot \frac{\partial y}{\partial z}\cdot \frac{\partial z}{\partial w_j}=\left(-\frac{t}{y}+\frac{1-t}{1-y}\right)\cdot y(1-y)\cdot x_j=(y-t)x_j$

Gradient descent (coordinatewise) update to find the weights of logistic regression:

$w_j\leftarrow w_j - \alpha\frac{\partial\mathcal{J}}{\partial w_j}=w_j-\frac{\alpha}{N}\sum^{N}_{i=1}(y^{(i)}-t^{(i)})x^{(i)}_j$

Notice that this update is the same as the one we saw for linear regression. This is no coincidence: both are examples of generalized linear models.

Stochastic Gradient Descent

So far, the cost function $\mathcal{J}$ has been the average loss over all the training examples.

Computing the gradient requires summing over all of the training examples. This is known as batch training, and it's impractical if you have a large dataset.

  • In Stochastic Gradient Descent (SGD), we update the parameters based on the gradient of a single training example:
    Choose $i$ uniformly at random
    $\theta\leftarrow\theta -\alpha\frac{\partial\mathcal{L}^{(i)}}{\partial\theta}$

    $\theta$ denotes all the model parameters.

SGD can make significant progress without even seeing all the data!

  • Mathematical justification: if you sample uniformly at random, the SGD gradient is an unbiased estimate of the batch gradient:
    $\mathbb{E}\left[\frac{\partial\mathcal{L}^{(i)}}{\partial\theta}\right]=\frac{1}{N}\sum^{N}_{i=1}\frac{\partial\mathcal{L}^{(i)}}{\partial\theta}=\frac{\partial\mathcal{J}}{\partial\theta}$
  • Problems:
    • Variance in this estimate may be high
    • If we only look at one example at a time, we can't exploit efficient vectorized operations

The compromise: mini-batches, i.e., random subsets $\mathcal{M}\subset\{1,...,N\}$ of training examples.

  • The size of the mini-batch $|\mathcal{M}|$ is now a hyperparameter
    • Too large: takes more computation and memory
    • Too small: can't exploit vectorization, has high variance
    • A reasonable value might be $|\mathcal{M}|=100$
  • Problem:
    • Bad local minima; but with SGD's noise there is a chance to jump out of them.
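
Here's a sketch of mini-batch SGD for logistic regression (the batch size and learning rate are placeholder values, and the function name is mine):

```python
import numpy as np

def sgd_logistic(X, t, alpha=0.1, batch_size=100, n_epochs=10):
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        idx = rng.permutation(N)               # reshuffle every epoch
        for start in range(0, N, batch_size):
            batch = idx[start:start + batch_size]
            z = X[batch] @ w + b
            y = 1.0 / (1.0 + np.exp(-z))       # sigmoid activation
            err = y - t[batch]                 # (y - t) on the mini-batch
            # unbiased estimate of the batch gradient from |M| examples
            w = w - alpha * (X[batch].T @ err) / len(batch)
            b = b - alpha * err.mean()
    return w, b
```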

Multi-Class Classification

Predicting a discrete-valued target with more than 2 classes.

  • Targets form a discrete set $\{1,...,K\}$
    • Represent each target as a one-hot vector, or a one-of-$K$ encoding:
      $\bold{t}=(0,...,0,1,0,...,0)\in\R^K$
  • Now there are $D$ input dimensions and $K$ output dimensions, so we need a $K\times D$ weight matrix $\bold W$ and a $K$-dimensional vector $\bold b$ of biases.
  • Linear predictions:
    $z_k=\sum^{D}_{j=1}w_{kj}x_j+b_k \text{ for class } k\in\{1,2,...,K\}$
  • Vectorized linear predictions:
    $\bold{z=Wx+b}$

  • We want the predictions to behave like probabilities: $1\geq y_k \geq 0$ and $\sum_k y_k=1$
  • For the activation function we use the softmax function, a multivariable generalization of the logistic function:
    $y_k=\text{softmax}(z_1,...,z_K)_k=\frac{e^{z_k}}{\sum_{k'}e^{z_{k'}}}$

    The inputs $z_k$ are called logits.

    Properties:

    • Outputs are positive and sum to 1 over all $K$ classes (like probabilities)
    • Softmax behaves like argmax: if one of the $z_k$ is much larger than the others, then $\text{softmax}(\bold z)_k\approx1$

  • We can use cross-entropy as the loss function, since the model outputs a vector of class probabilities:
    $\mathcal{L}_\text{CE}(\bold{y,t})=-\sum^{K}_{k=1}t_k\log{y_k}=-\bold{t}^\top(\log \bold y)$

    Remember, $\bold t$ is a one-hot vector: it is 0 for every class except the correct one.

  • Just like with logistic regression, we can combine the activation with the loss to get the softmax-cross-entropy function.

  • Softmax regression:
    $\bold{z=Wx+b}$
    $\bold{y}=\text{softmax}(\bold z)$
    $\mathcal{L}_\text{CE}=-\bold{t}^\top(\log \bold y)$
  • Gradient descent updates can be derived for each row of $\bold W$:
    $\frac{\partial\mathcal{L}_\text{CE}}{\partial\bold w_k}=\frac{\partial\mathcal{L}_\text{CE}}{\partial z_k}\cdot \frac{\partial z_k}{\partial\bold w_k}=(y_k-t_k)\,\bold{x}$
    $\bold{w}_k\leftarrow \bold{w}_k - \alpha\frac{1}{N}\sum^{N}_{i=1}(y^{(i)}_{k}-t^{(i)}_{k})\,\bold{x}^{(i)}$

    Again, notice the similarities with linear and logistic regression.
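
A sketch of one gradient step for softmax regression (shapes follow the notation above: `X` is $N\times D$, `T` is $N\times K$ one-hot, `W` is $K\times D$; the function names are mine):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # shift logits for stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def softmax_gd_step(W, b, X, T, alpha):
    Y = softmax(X @ W.T + b)               # N x K class probabilities
    # dJ/dW has one row per class: the average of (y_k - t_k) * x
    W = W - alpha * ((Y - T).T @ X) / X.shape[0]
    b = b - alpha * (Y - T).mean(axis=0)
    return W, b
```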

Limit of Linear Classification

Linear classification is pretty great, but it only works well with linearly separable data. And while we can try to feature-map some non-linearly separable data, this is not a general solution and can be difficult... so we turn to neural networks.

© 2022 Justin Regef. All rights reserved.