ML stuff
Intro
What is ML?
3 types:
Supervised: access to labeled examples of the correct behaviour
- ex: handwriting reader
Reinforcement: a learning agent interacts with the world and learns to maximize a reward signal
- ex: mario player, DaVinci robot sim (ROP)
Unsupervised: no labeled examples - instead, looking for "interesting" patterns in the data
- ex: customer segmentation
Preliminaries and Nearest Neighbour Methods
Machine learning algorithms need to handle lots of data, so we represent the input as a vector in $\mathbb{R}^d$.
Representation: mapping to another space that is easy to manipulate
(vectors are great since we can do linear algebra)
Supervised Learning Setup
Our training set consists of a collection of pairs $(\mathbf{x}^{(i)}, t^{(i)})$ of an input vector $\mathbf{x}^{(i)} \in \mathbb{R}^d$ and its corresponding target, or label, $t^{(i)}$.
2 types:
Regression
$t$ is a real number
Classification
$t$ is an element of a discrete set
These days, $t$ is often a highly structured object
Training set will look like this: $\mathcal{D} = \{(\mathbf{x}^{(1)}, t^{(1)}), \dots, (\mathbf{x}^{(N)}, t^{(N)})\}$
These superscripts index training examples; they have nothing to do with exponentials!
Nearest Neighbour
Closest (in Euclidean distance): $\|\mathbf{x}^{(a)} - \mathbf{x}^{(b)}\|_2 = \sqrt{\sum_{j=1}^{d}\left(x^{(a)}_j - x^{(b)}_j\right)^2}$
The Algorithm:
- Given input $\mathbf{x}$ and training set $\mathcal{D}$
- Return $y = t^{(i^*)}$, where $i^* = \arg\min_i \|\mathbf{x}^{(i)} - \mathbf{x}\|_2$
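A rough NumPy sketch of this (my own translation of the pseudocode; names are made up):

```python
import numpy as np

def nearest_neighbour(x, X_train, t_train):
    """Return the label of the training point closest to x."""
    # Squared Euclidean distances to every training point
    # (sqrt is monotonic, so we can skip it for the argmin).
    dists = np.sum((X_train - x) ** 2, axis=1)
    return t_train[np.argmin(dists)]
```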
Estimating Model Accuracy
In essence:
- We need to simulate new data. Using the data the model was trained on is not a realistic representation of unseen data.
- Measured on the training data, we will always get perfect performance for 1-Nearest Neighbour (every point is its own nearest neighbour), so training accuracy tells us nothing.
We set aside a chunk of our data for testing, called the test set.
Problems
- NN is sensitive to noise or mis-labeled data.
- Solution: see k-Nearest Neighbours below...
k-Nearest Neighbors
Have the k closest data points "vote".
The Algorithm:
- Given input $\mathbf{x}$ and training set $\mathcal{D}$
- Find the $k$ examples $(\mathbf{x}^{(i_1)}, t^{(i_1)}), \dots, (\mathbf{x}^{(i_k)}, t^{(i_k)})$ closest to $\mathbf{x}$
The classification output is the majority class: $y = \operatorname{mode}\{t^{(i_1)}, \dots, t^{(i_k)}\}$
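A rough NumPy sketch (my own; assumes integer class labels, and $k = 1$ recovers plain NN):

```python
import numpy as np

def knn_classify(x, X_train, t_train, k=3):
    """Classify x by a majority vote among the k nearest training points."""
    dists = np.sum((X_train - x) ** 2, axis=1)
    nearest = np.argpartition(dists, k)[:k]  # indices of the k smallest distances
    votes = t_train[nearest]
    return np.bincount(votes).argmax()       # majority class (ties go to the lowest label)
```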
How do we choose k?
Small $k$:
- Good at capturing fine grained patterns
- May overfit: be sensitive to random idiosyncrasies in the training data.
Large $k$:
- Makes stable predictions by averaging over lots of examples
- May underfit: fail to capture important regularities.
Balancing $k$:
- The optimal choice of $k$ depends on the number of data points $n$.
- Rule of thumb: choose $k < \sqrt{n}$.
Choice of : Generalization
$k$ is a hyperparameter, something we can't fit as part of the learning algorithm itself.
We need to tune it: vary $k$ and measure the generalization error using a validation set.
We don't use the test set, because the test set is supposed to simulate unseen data, so we can't use it to tune our model in any way.
Training, test, and validation sets
Partition the data into three subsets.
- Training set: used to train the model
- Validation set: used to tune hyperparameters
- Test set: used to evaluate final performance once, after the hyperparameters are chosen.
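A simple way to do the partition (my own sketch; the split fractions are arbitrary):

```python
import numpy as np

def split_data(X, t, frac_val=0.1, frac_test=0.1, seed=0):
    """Randomly partition (X, t) into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))  # shuffle so the split is random
    n_val, n_test = int(frac_val * len(X)), int(frac_test * len(X))
    val, test, train = idx[:n_val], idx[n_val:n_val + n_test], idx[n_val + n_test:]
    return (X[train], t[train]), (X[val], t[val]), (X[test], t[test])
```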
Pitfalls
Normalization
- NN can be sensitive to the ranges of different features
Therefore, we normalize each dimension to zero mean and unit variance: $\tilde{x}_j = \frac{x_j - \mu_j}{\sigma_j}$.
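Sketch (my own; `X_train`/`X_test` are placeholder names). One detail I'm fairly sure matters: compute the statistics on the training set only and reuse them for the other sets:

```python
# Standardize each feature to zero mean and unit variance.
mu = X_train.mean(axis=0)     # per-feature means
sigma = X_train.std(axis=0)   # per-feature standard deviations
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma   # reuse the training statistics
```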
Computational Cost
- Computations at training time: 0
- Computations at test time, per query:
- Calculate $D$-dimensional Euclidean distances to $N$ data points: $O(ND)$
- Sort the distances: $O(N \log N)$
This must be done for each query! Very expensive!
- We need to store the entire dataset in memory.
Modular Approach to ML Algorithm Design
- Choose a model describing the relationships between variables of interest
- Define a loss function quantifying how bad the fit to the data is
- Choose a regularizer saying how much we prefer different candidate models (or explanations of data)
- Fit the model that minimizes the loss function and satisfies the constraint/penalty imposed by the regularizer, possibly using an optimization algorithm.
Modular Approach to Supervised Learning
- Training data: $\mathcal{D} = \{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$
Objective: learn a function $f$ such that: $t \approx y = f(\mathbf{x})$
Here, $t$ is the ground truth and $y$ is the prediction.
- A model is a set of assumptions (or restrictions) about the structure of $f$.
- The model (or architecture) restricts the hypothesis space, the set of possible functions $f$ that we can learn. Aka, what $f$ could be.
Linear Regression Model
$f$ is restricted to linear functions of the input $\mathbf{x}$: $y = \mathbf{w}^\top\mathbf{x} + b$
- $y$ is the prediction
- $\mathbf{w}$ is the weight vector
- $b$ is the bias
- $\mathbf{w}$ and $b$ together are the parameters
We want to choose good parameter values so that our prediction is close to the target: $y \approx t$.
We use the loss function $\mathcal{L}(y, t)$ to quantify good or bad values of our parameters:
This means $y^{(i)}$ should be "close to" $t^{(i)}$ for all $i$.
Loss Function: Square Error Loss
$\mathcal{L}(y, t) = \frac{1}{2}(y - t)^2$
$y - t$ is the residual, and we want to make its magnitude small.
Loss Function vs Cost Function
The cost function is the loss function averaged over all training examples: $\mathcal{J}(\mathbf{w}, b) = \frac{1}{2N}\sum_{i=1}^{N}\left(y^{(i)} - t^{(i)}\right)^2 = \frac{1}{2N}\sum_{i=1}^{N}\left(\mathbf{w}^\top\mathbf{x}^{(i)} + b - t^{(i)}\right)^2$
What now??
Finding the parameters
We find $\mathbf{w}$ and $b$ by minimizing the cost over all the data points: $\min_{\mathbf{w}, b} \mathcal{J}(\mathbf{w}, b)$
Let's vectorize the notation.
Vectorizing Linear Regression
Organize all training examples into a design matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$ with one row per example, and one column per feature: the $i$-th row of $\mathbf{X}$ is $[\mathbf{x}^{(i)}]^\top$.
In vectorized form we get: $\mathbf{y} = \mathbf{X}\mathbf{w} + b\mathbf{1}$
$\mathbf{1}$ is a column vector of ones
Squared error cost becomes: $\mathcal{J} = \frac{1}{2N}\|\mathbf{y} - \mathbf{t}\|^2$
Note: sometimes we write $\mathcal{J} = \frac{1}{2}\|\mathbf{y} - \mathbf{t}\|^2$, which would correspond to the sum of losses. The optimal $\mathbf{w}, b$ do not depend on the constant $\frac{1}{N}$.
For convenience, we can add a column of 1s to the design matrix and combine the bias and weights to get: $\mathbf{y} = \mathbf{X}\mathbf{w}$.
Solving the minimization problem
We would like to minimize the cost function. Recall: the minimum of a smooth function appears at a critical point, where the derivative is zero.
We'll find a point where the gradient (derivative) is (close to) zero: $\nabla_{\mathbf{w}}\mathcal{J} \approx \mathbf{0}$
- Sometimes it is possible to directly find the parameters that make the gradient zero in closed form. We call this the direct solution.
- We may use optimization techniques that iteratively get us closer to the solution.
Direct Solution
Partial derivatives: derivatives of a multivariate function with respect to one of its arguments. Pretend the other variables are constants.
Loss function: $\mathcal{L} = \frac{1}{2}(wx + b - t)^2$, so $\frac{\partial\mathcal{L}}{\partial w} = (wx + b - t)\,x$ and $\frac{\partial\mathcal{L}}{\partial b} = wx + b - t$.
Cost function: $\frac{\partial\mathcal{J}}{\partial w} = \frac{1}{N}\sum_{i=1}^{N}\left(y^{(i)} - t^{(i)}\right)x^{(i)}$ and $\frac{\partial\mathcal{J}}{\partial b} = \frac{1}{N}\sum_{i=1}^{N}\left(y^{(i)} - t^{(i)}\right)$.
The minimum must occur at a critical point where the partial derivatives are zero: $\frac{\partial\mathcal{J}}{\partial w} = 0$ and $\frac{\partial\mathcal{J}}{\partial b} = 0$.
It turns out this gives a system of linear equations, which we can solve efficiently: $\mathbf{w}^{\mathrm{LS}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{t}$ (with the bias folded into $\mathbf{w}$ via the column of 1s). Direct solution, yay!
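A sketch of the direct solution in NumPy (my own; solving the linear system is preferable to explicitly inverting $\mathbf{X}^\top\mathbf{X}$):

```python
import numpy as np

def fit_linear_regression(X, t):
    """Closed-form least squares: solve (X^T X) w = X^T t."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend a column of 1s for the bias
    w = np.linalg.solve(X1.T @ X1, X1.T @ t)    # normal equations
    return w                                    # w[0] is the bias b
```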
Feature Mapping (Basis Expansion)
The relation between input and output may not be linear.
We can still use linear regression by mapping the input feature $\mathbf{x}$ to another (higher-dimensional) space using a feature mapping $\boldsymbol{\psi}(\mathbf{x})$ (or basis expansion) and treating the mapped feature $\boldsymbol{\psi}(\mathbf{x})$ as the input of a linear regression procedure.
Polynomial Feature Mapping and regularizing cost
Map features to polynomials of degree $M$: $y = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$ (notice these are exponentials)
Here, the feature mapping is $\boldsymbol{\psi}(x) = [1, x, x^2, \dots, x^M]^\top$
Notice $y = \mathbf{w}^\top\boldsymbol{\psi}(x)$ is linear in $\mathbf{w}$
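Building the polynomial design matrix is a one-liner (my own sketch):

```python
import numpy as np

def poly_features(x, M):
    """Map a 1-D input array to the polynomial basis [1, x, x^2, ..., x^M]."""
    return np.column_stack([x ** m for m in range(M + 1)])

# Then fit exactly as before, just on the mapped features:
# Psi = poly_features(x_train, M)
# w = np.linalg.solve(Psi.T @ Psi, Psi.T @ t_train)
```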
$M$ is now a new hyperparameter. We tune it with the validation set. We don't want $M$ to be too big or our model will be too complex; rather, we want to keep simpler solutions within the same space of parameters. This is done through regularization (or penalization):
The regularized cost function makes a tradeoff between fit to the data and the norm of the weights: $\mathcal{J}_{\text{reg}}(\mathbf{w}) = \mathcal{J}(\mathbf{w}) + \lambda\mathcal{R}(\mathbf{w})$
- If you fit your data poorly, $\mathcal{J}$ is large. If your optimal weights have high values, $\mathcal{R}$ is large.
- Large $\lambda$ penalizes weight values more. It's a new hyperparameter; tune it with the validation set.
For the least squares problem, we have $\mathcal{R}(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$.
When $\mathcal{R}(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$, the regularized cost gives the closed-form (ridge) solution: $\mathbf{w}^{\text{Ridge}} = \left(\mathbf{X}^\top\mathbf{X} + \lambda N\mathbf{I}\right)^{-1}\mathbf{X}^\top\mathbf{t}$ (the factor $N$ appears because $\mathcal{J}$ averages rather than sums the losses).
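A sketch of the regularized direct solution (my own; whether $\lambda$ gets scaled by $N$ depends on whether your cost averages or sums the losses):

```python
import numpy as np

def fit_ridge(X, t, lam):
    """Closed-form solution of the L2-regularized least squares cost."""
    N, D = X.shape
    # lam * N because the data term here is averaged (1/2N); drop the N if summed.
    w = np.linalg.solve(X.T @ X + lam * N * np.eye(D), X.T @ t)
    return w
```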
Probabilistic Interpretation of the Squared Error
For the least squares: we minimize the sum of the squares of the errors between the predictions $y^{(i)}$ for each data point $\mathbf{x}^{(i)}$ and the corresponding target values $t^{(i)}$: $\min_{\mathbf{w}}\sum_{i=1}^{N}\left(y^{(i)} - t^{(i)}\right)^2$
Why do we use the squared error loss to measure the quality of the fit? Here's a probabilistic perspective.
- Suppose our model arose from a statistical model ($b = 0$ for simplicity): $t = \mathbf{w}^\top\mathbf{x} + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is independent of anything else.
- Thus, $t \mid \mathbf{x} \sim \mathcal{N}(\mathbf{w}^\top\mathbf{x}, \sigma^2)$
Under these assumptions, the Least-Squares solution is the Maximum-Likelihood solution: the log-likelihood is $-\frac{1}{2\sigma^2}\sum_i\left(t^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\right)^2 + \text{const}$, so maximizing it is exactly minimizing the squared error.
Cf: Maximum Likelihood Estimation (lec 2, page 33)
Gradient Descent
A way of minimizing the cost function that is more broadly applicable.
GD is an iterative algorithm. We initialize the weights to something reasonable (eg: all zeroes) and repeatedly adjust them in the direction of the steepest descent.
- if $\frac{\partial\mathcal{J}}{\partial w_j} > 0$, then increasing $w_j$ increases $\mathcal{J}$
- if $\frac{\partial\mathcal{J}}{\partial w_j} < 0$, then increasing $w_j$ decreases $\mathcal{J}$
We update the weights using the cost function: $w_j \leftarrow w_j - \alpha\frac{\partial\mathcal{J}}{\partial w_j}$
$\alpha$ is the learning rate. The larger it is, the faster $\mathbf{w}$ changes. The values are typically small, eg: 0.01 or 0.0001. It's a hyperparameter.
- Too small: slow progress → lots of computation
- Too large: instability (might diverge), oscillations
The algorithm gets its name from the gradient: $\nabla_{\mathbf{w}}\mathcal{J} = \left[\frac{\partial\mathcal{J}}{\partial w_1}, \dots, \frac{\partial\mathcal{J}}{\partial w_D}\right]^\top$
In vectorized form we update all the weights at the same time: $\mathbf{w} \leftarrow \mathbf{w} - \alpha\nabla_{\mathbf{w}}\mathcal{J}$
Notice that when it converges, we get a critical point: $\nabla_{\mathbf{w}}\mathcal{J} = \mathbf{0}$.
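Putting the pieces together for linear regression (my own sketch; uses a fixed iteration count instead of a convergence check):

```python
import numpy as np

def gradient_descent(X, t, alpha=0.01, n_iters=1000):
    """Minimize J(w) = 1/(2N) ||Xw - t||^2 by gradient descent."""
    N, D = X.shape
    w = np.zeros(D)               # initialize to something reasonable
    for _ in range(n_iters):
        y = X @ w                 # predictions
        grad = X.T @ (y - t) / N  # gradient of the averaged cost
        w = w - alpha * grad      # step in the direction of steepest descent
    return w
```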
Why gradient descent?
- GD can be applied to a much broader set of models
- GD can be easier to implement than direct solutions
- For regression in high-dimensional spaces, GD is actually more efficient than direct solution
Gradient Descent under $\ell_2$ Regularization
The gradient descent update of the regularized cost has an interesting interpretation as weight decay: $\mathbf{w} \leftarrow \mathbf{w} - \alpha\left(\nabla_{\mathbf{w}}\mathcal{J} + \lambda\mathbf{w}\right) = (1 - \alpha\lambda)\mathbf{w} - \alpha\nabla_{\mathbf{w}}\mathcal{J}$
The weight $\mathbf{w}$ is multiplied by $(1 - \alpha\lambda) < 1$ at each iteration; it shrinks, it decays.
Linear Classification
Classification setup
- Classification: predicting a discrete-valued target
- Binary classification: predicting a binary-valued target
- Recall the notation:
- Training data: $\mathcal{D} = \{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$
- $\mathbf{x}^{(i)}$ are the inputs
- $t^{(i)}$ are the (discrete-valued) targets; in binary classification, $t \in \{0, 1\}$, with 1 being positive examples and 0 being negative
- or $t \in \{-1, +1\}$ for computational convenience
Linear: the model is a linear function of $\mathbf{x}$, followed by a threshold $r$: $z = \mathbf{w}^\top\mathbf{x} + b$, with $y = 1$ if $z \geq r$ and $y = 0$ otherwise
We can eliminate the threshold by folding it into the bias ($b \leftarrow b - r$), so WLOG $r = 0$
We can eliminate the bias by adding a dummy feature $x_0$ that is always 1 and putting the bias in the weight vector (same as with linear regression)
Simplified model: $z = \mathbf{w}^\top\mathbf{x}$, $y = \mathbb{I}(z \geq 0)$
The geometric picture
The Input Space or Data Space
- Training examples are points
- Weights (hypotheses) $\mathbf{w}$ can be represented by half-spaces: $H_+ = \{\mathbf{x} : \mathbf{w}^\top\mathbf{x} \geq 0\}$ and $H_- = \{\mathbf{x} : \mathbf{w}^\top\mathbf{x} < 0\}$
The boundaries of these half-spaces pass through the origin (since the bias is now part of $\mathbf{w}$).
- The boundary is the decision boundary: $\{\mathbf{x} : \mathbf{w}^\top\mathbf{x} = 0\}$
In 2-D it's a line, but think of it as a hyperplane.
- If the training examples can be perfectly separated by a linear decision rule, we say the data is linearly separable.
- If the training set is separable, we can solve for $\mathbf{w}$ using linear programming. Otherwise, you'll have to resort to something like logistic regression.
The Weight Space
- Weights (hypotheses) are points
- Each training example $(\mathbf{x}, t)$ specifies a half-space that $\mathbf{w}$ must lie in to be correctly classified: $\mathbf{w}^\top\mathbf{x} \geq 0$ if $t = 1$, and $\mathbf{w}^\top\mathbf{x} < 0$ if $t = 0$.
- The region satisfying all the constraints (put on by the examples) is the feasible region; if this region is nonempty, the problem is feasible. Otherwise it's infeasible.
Loss Function
- To solve for the parameters when the set is not linearly separable, we define a loss function and then try to minimize the resulting cost function.
- Recall: cost is loss averaged (or summed) over the training set
Seemingly obvious loss function: the 0-1 loss (different from square loss!): $\mathcal{L}_{0\text{-}1}(y, t) = \mathbb{I}(y \neq t)$
The problem is that this function has zero derivative almost everywhere; it's useless for gradient descent.
- Sometimes we can replace the loss function we care about with one that is easier to optimize. This is known as relaxation with a smooth surrogate loss function.
What happens if we use the square error loss instead, applied directly to $z$ and not $y$? $\mathcal{L}_{\mathrm{SE}}(z, t) = \frac{1}{2}(z - t)^2$
Aka: it doesn't matter that the targets are binary, treat them as continuous values.
Problem: the loss function penalizes you when you make correct predictions with high confidence!
With $t = 1$, the loss is still larger when $z = 10$ than when $z = 0$.
There is obviously no reason to predict values outside of $[0, 1]$. Let's squash $y$ into this interval (and keep the loss optimizable).
The logistic function is a kind of sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
- $z$ is called the logit.
A linear model with a logistic nonlinearity is known as log-linear: $z = \mathbf{w}^\top\mathbf{x} + b$, $y = \sigma(z)$, $\mathcal{L}_{\mathrm{SE}}(y, t) = \frac{1}{2}(y - t)^2$
- Used this way, $\sigma$ is called an activation function.
- Problem: for $z \ll 0$, we have $\sigma(z) \approx 0$, so $\frac{\partial\mathcal{L}}{\partial w_j} \approx 0$.
- If the prediction is really wrong, you should be far from a critical point! Here, you are very close.
Logistic Regression (on cross-entropy loss)
The cross-entropy loss: $\mathcal{L}_{\mathrm{CE}}(y, t) = -t\log y - (1 - t)\log(1 - y)$
- Problem: let $t = 1$ but $z \ll 0$. Then, if $y = \sigma(z)$ is small enough, it may be numerically zero. This can cause very subtle, hard-to-find bugs. ($\log 0$ is undefined)
So we combine the activation function with the loss into a single logistic-cross-entropy function: $\mathcal{L}_{\mathrm{LCE}}(z, t) = \mathcal{L}_{\mathrm{CE}}(\sigma(z), t) = t\log(1 + e^{-z}) + (1 - t)\log(1 + e^{z})$
This gives us numerically stable computation.
Gradient Descent for Logistic Regression
So we get: $\frac{\partial\mathcal{L}_{\mathrm{LCE}}}{\partial w_j} = (y - t)\,x_j$
Gradient descent (coordinate-wise) update to find the weights of logistic regression: $w_j \leftarrow w_j - \frac{\alpha}{N}\sum_{i=1}^{N}\left(y^{(i)} - t^{(i)}\right)x_j^{(i)}$
Notice the GD update is the same as previously seen (with linear regression). This is normal, as they are both examples of generalized linear models.
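A sketch of logistic regression trained by GD (my own; `np.logaddexp(0, z)` computes $\log(1 + e^{z})$ stably, which is the trick behind the numerically stable loss above):

```python
import numpy as np

def logistic_cross_entropy(z, t):
    """Stable L_LCE(z, t) = t*log(1+e^-z) + (1-t)*log(1+e^z); useful for monitoring."""
    return t * np.logaddexp(0, -z) + (1 - t) * np.logaddexp(0, z)

def fit_logistic_regression(X, t, alpha=0.1, n_iters=1000):
    """GD on the averaged logistic-cross-entropy cost."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-(X @ w)))      # sigmoid activations
        w = w - alpha * X.T @ (y - t) / len(t)  # same (y - t) x form as linear regression
    return w
```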
Stochastic Gradient Descent
So far, the cost function has been the average loss over all the training examples: $\mathcal{J}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}^{(i)}$
Computing the gradient requires summing over all of the training examples. This is known as batch training. It's impractical if you have a large dataset.
In Stochastic Gradient Descent (SGD) we update the parameters based on the gradient of a single, randomly chosen training example: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}$
$\boldsymbol{\theta}$ denotes all the parameters of the model (eg: $\mathbf{w}$ and $b$)
SGD can make significant progress without even seeing all the data!
Mathematical justification: if you sample $i$ uniformly at random, the SGD gradient is an unbiased estimate of the batch gradient: $\mathbb{E}_i\left[\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)}\right] = \frac{1}{N}\sum_{i=1}^{N}\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(i)} = \nabla_{\boldsymbol{\theta}}\mathcal{J}$
- Problems:
- Variance in this estimate may be high
- If we only look at one example at a time, we can't exploit efficient vectorized operations
The compromise: mini-batches. Compute the gradient on a medium-sized random subset $\mathcal{M}$ of training examples.
The size of the mini-batch, $|\mathcal{M}|$, is now a hyperparameter
- Too large: takes more computation and memory
- Too small: can't exploit vectorization, has high variance
- A reasonable value might be $|\mathcal{M}| \approx 100$
- Problem: bad (local) minima. But with SGD's noisy updates, there is a chance to jump out of them.
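A sketch of mini-batch SGD (my own; `grad_fn` is a placeholder for whichever model's gradient you plug in):

```python
import numpy as np

def sgd(X, t, grad_fn, alpha=0.1, batch_size=100, n_epochs=10, seed=0):
    """Mini-batch SGD: estimate the batch gradient from random subsets."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        idx = rng.permutation(len(X))  # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            w = w - alpha * grad_fn(w, X[batch], t[batch])
    return w

# e.g. linear regression: grad_fn = lambda w, X, t: X.T @ (X @ w - t) / len(t)
```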
Multi-Class Classification
Predicting a discrete-valued target with more than 2 possible values.
Targets form a discrete set: $t \in \{1, \dots, K\}$
- Represent each target as a one-hot vector, or a one-of-$K$ encoding: $\mathbf{t} = (0, \dots, 0, 1, 0, \dots, 0)$, with the 1 in the entry for the correct class
- Now there are $D$ input dimensions and $K$ output dimensions, so we need a $K \times D$ weight matrix $\mathbf{W}$. And a $K$-dimensional vector of biases $\mathbf{b}$.
Linear predictions: $z_k = \sum_{j=1}^{D} w_{kj}x_j + b_k$
Vectorized linear predictions: $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$
- Predictions should be like probabilities: $y_k \geq 0$ and $\sum_k y_k = 1$
For the activation function we use the softmax function, a multivariable generalization of the logistic function: $y_k = \mathrm{softmax}(z_1, \dots, z_K)_k = \frac{e^{z_k}}{\sum_{k'} e^{z_{k'}}}$
The inputs $z_k$ are called logits
Properties:
- Outputs are positive and sum to 1 over all classes (like probabilities)
- softmax behaves like argmax: if one of the $z_k$ is much larger than the others, then $y_k \approx 1$
We can use cross-entropy as the loss function since the model outputs a vector of class probabilities: $\mathcal{L}_{\mathrm{CE}}(\mathbf{y}, \mathbf{t}) = -\sum_{k=1}^{K} t_k\log y_k = -\mathbf{t}^\top(\log\mathbf{y})$
Remember, $\mathbf{t}$ is a one-hot vector, so $t_k$ will be 0 for all $k$ except the correct class.
- Just like logistic regression, we can combine activation with loss to get the softmax-cross-entropy function: $\mathcal{L}_{\mathrm{SCE}}(\mathbf{z}, \mathbf{t}) = \mathcal{L}_{\mathrm{CE}}(\mathrm{softmax}(\mathbf{z}), \mathbf{t})$
Softmax regression: $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$, $\mathbf{y} = \mathrm{softmax}(\mathbf{z})$, $\mathcal{L}_{\mathrm{CE}} = -\mathbf{t}^\top(\log\mathbf{y})$
Gradient descent updates can be derived for each row $\mathbf{w}_k$ of $\mathbf{W}$: $\mathbf{w}_k \leftarrow \mathbf{w}_k - \frac{\alpha}{N}\sum_{i=1}^{N}\left(y_k^{(i)} - t_k^{(i)}\right)\mathbf{x}^{(i)}$
Again, notice the similarities with linear and logistic regression.
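A sketch of softmax regression trained by GD (my own; the bias is omitted by assuming a column of 1s has been added to `X`):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row max keeps exp from overflowing."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, T, alpha=0.1, n_iters=1000):
    """GD on softmax-cross-entropy; T is an N x K matrix of one-hot targets."""
    N, D = X.shape
    K = T.shape[1]
    W = np.zeros((K, D))
    for _ in range(n_iters):
        Y = softmax(X @ W.T)               # N x K class probabilities
        W = W - alpha * (Y - T).T @ X / N  # (y - t) x, one row per class
    return W
```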
Limits of Linear Classification
Linear classification is pretty great, but it only works well with linearly separable data. And while we can try to feature-map non-linearly-separable data into a space where it becomes separable, this is not a general solution and can be difficult... so we turn to Neural Networks.