# Machine Learning - CS7641

# Lesson 1: ML is the ROX

- What is Machine Learning?
- Building computational artifacts that learn over time is one definition. Other definitions include computational statistics.
- Supervised Learning, unsupervised learning, reinforcement learning is how the class is structured.
- Supervised Learning (function approximation)
- Taking labeled data sets, performing analysis on them so that you can create new labeled data sets.
- A lot about induction as opposed to deduction.
- Induction is about taking examples and going to a general rule. Specifics to generalities.
- Deducation is about going from general rule to specific instances. Generalities to specific instances.

- Unsupervised Learning
- Given some inputs we have to derive some relation based on the inputs.
- Animal classification is an example.
- This is about description. How to take a set of data and divide it up.
- Looking at a crowd of people how would you divide them up? By hair, ethnicity, clothing, etc.. That is what we are doing with unsupervised learning.
- Nothing is telling you how it can be divided up.
- Pixels -> Function Approximation -> Labels of input
- There is no signal telling you which method is better as far as dividing the input up goes; whereas in supervised learning, we do have that signal

- Reinforcement Learning
- Learning from delayed reward. The feedback may come several steps after a decision is made.
- An example is tic tac toe. You want to know which decisions will lead to a win.]
- This is harder because you don’t know what is the end output and there are no signals coming in.

- Comparison
- In supervised learning we want to find a function that scores data well
- In unsupervised learning we want to find a cluster that scores well
- In reinforcement learning we want to find behavior that scores well

# Lesson 2: Decision Trees

- Supervised Learning
- Two types of supervised learning are:
- Classification
- Regression

- Classification is the process of taking an input
`x`

and mapping it to some**discrete**label like`True`

or`False`

. For example, given a picture is this person a male or female is the classification problem. - Regression is about
**continuous**value functions. Given a specific point, calculate a point later. - To figure out the difference between the two, you must look at the output. Is the output a discreet quantity like yes/no or is it a continuous quantity like a range.
- Classification Learning
- Terms:
**Instances**: Input vectors of values such as pixels, credit scores, etc.**Concept**: The function that maps the instances to outputs such as true or false. This is an idea that describes a set of things.**Target Concept**: The thing we are trying to find. The answer.**Hypothesis Class**: Set of all concepts you are willing to entertain. Could be all possible functions but it would be hard to determine which are right given finite data.**Sample**: Training set of data. Set of all inputs paired with the correct output. Example is {picture, credit worthy or not}. Hot dog or not hot dog. Example of inductive learning (generalize from given examples).**Candidate**: The concept that you think might be the target concept. You could say that someone with curly hair will be credit worthy and that is your function so you try to see if that function holds based on the testing set.**Testing Set**: Looks like a training set similar to the Sample and it will determine if it is correct. This is not the same as the sample/training set.

- Terms:
- Decision Trees
- Representation vs Algorithm
- A decision tree has a decision node with an attribute that asks a question. Theses values are represented by the edges. Squares are the output and final decision.

- Learning
- Pick Best attribute that splits the data
- Ask Question
- Depending on the answer, pick another attribute. Follow the answer path.
- Go to step 1 until you have an answer

- Expressiveness
- Simple Boolean functions
- AND

{ % asset_img N5K5DO.png %}- It is communtative so you could switch the A and B nodes around.

- OR

{ % asset_img Uu7Bqs.png %}- You can swap the nodes as well.

- XOR

{ % asset_img Dv2PDR.png %}

- AND
- N-attributes
- You need
`n`

nodes, one for each attributes, to represent the OR for`n`

attributes. This is linear. - For XOR this is represented by parity. If the number of attributes that are true are odd then the output of the function is true else false; if you define odd parity. You need
`2^N`

to represent this. This is exponential.

{ % asset_img LbNapT.png %}- This is a hard problem.

- You need

- Simple Boolean functions

- Representation vs Algorithm
- ID3
- This algorithm is an efficient way to solve the decision tree problem

{ % asset_img 7in5FS.png %} - What is a best attribute?
- Most common is information gain which is a mathematical way to describe the amount of information gained.

{ % asset_img M3blbA.png %} - Choose the attribute that will maximize information gain as defined by the formula above.

- Most common is information gain which is a mathematical way to describe the amount of information gained.
- Bias
- Inductive Bias of ID3
- ID3 will prefer trees that has good splits near the top
- It prefers correct over incorrect models.
- Prefers shorter trees but this is a byproduct of the first point of prefering good splits.

- Restriction Bias: Hypothesis set, all possible decision trees,
- Preference Bias: Which hypothesis we prefer from the hypothesis set.

- Inductive Bias of ID3
- Other considerations
- What happens if we have continuous attributes like age, weight, distance? Use ranges. You can ask the same question again in continuous input sets to trim the ranges.
- When do we stop?
- When everything is classified correctly
- No more attributes.
- No overfitting (tree is too big)
- Pruning to get a smaller tree

- Regression
- Splitting? Variance?
- Output: average, local linear fit

- Splitting? Variance?

- This algorithm is an efficient way to solve the decision tree problem

- Two types of supervised learning are:

# Lesson 3: Regression & Classification

- What is Regression?
- Regression to the mean/average.

- Cross Validation
- This is a way to validate that the polynomial/best fitting function you select will fit the test data without actually using the test data.
- The idea is to split the training data and take out a part of it so that you can compare how your model did to predict the data set you took out as shown in the image below. You are going to pick the polynomial with the lowest error.

- The input can be scalar and output continuous.
- The input can be a vector of features with a continuous output.

# Lesson 4: Neural Networks

- Perceptron Rule
- The purpose is to generate a half plane that will separate the input if it is linearly separable. If it is not linearly separable then you cannot run the algorithm for perceptron rule. But we don’t know if a problem is linearly separable.
- The loop will continue to run while there is an error.

- Gradient Descent
- Useful for non-linearly separable inputs.
- This is not thresholded like the previous.
- Define an error metric on the weight vector.

- Comparing learning rules
- The gradient descent rule is more robust but will converge only to local optimum

- Sigmoid - Differentiable threshold http://mathworld.wolfram.com/SigmoidFunction.html
- Backpropagation is using computationally beneficial organization of the chain rule to produce what you want it to. The errors are flowing backwards.
- There are other methods other than gradient descent. Some people believe learning is just optimization. You can use momentum terms, higher order derivatives, randomized optimization, penalty for “complexity”.
- Restriction Bias
- What are neural nets best used for?
- Representational power and set of hypothesis we will consider.
- Perceptrons are half spaces
- Sigmoids: much more complex. not more restriction.
- Boolean: network of threshold-like units can be represented
- Continuous: As the input and output changes smoothly with no discontinuity, can it be modeled? Yes, as long as it has no jumps then it can have one hidden layers. Each hidden layer will worry about one specific patch of the input.
- Arbitrary: We can stitch things together using two hidden layers.
- Neural Networks are not very restrictive in their bias as long at it is sufficiently complex network structures (multiple hidden layers).
- There is a danger in overfitting. Cross validation can help alieviate this concern.
- Errors are going down as you are iterating and training. The magnitudes of the weights is important too so that you don’t go overboard.

- Preference Bias
- Given two representations why prefer one over the other by the algorithm?
- What algorithm? How do the weights start off in gradient descent?
- Typical to use small random values for the initial weights. Helps avoid local minimum. It also gives some variability. We use small values because if the weights are very big then you can overfit and it keeps low complexity. We prefer simple explanations. Prefer correct over incorrect but all things being equal prefer the simple explanation.

- Occam’s Razor: Entities should not be multiplied unnecessarily. It is necessary if you are getting better at fitting the data if you are not fitting better then you are making it more complex by multiplying. If errors are the same then you are not making it simpler.

# Lesson 5: Instance Based Learning

Another type of learning algorithm.

Previously we were given a bunch of training data and then we would make a line that best fits the data. Alternatively we could take the training data and put it in a database and when we get a new datapoint we look it up. This method won’t forget previous instances. There is no wasted time learning and rapidly puts it in the database. This is also simple.

`f(x)=lookup(x)`

, but if it hasn’t seen something before then it won’t generalize a solution. It models exactly what it was given and leads to overfitting; believing the data too much and having too much noise. If you have the same x, but different y, then it will return both or prefer one of them.The example shown is a bunch of houses on a map. We need to figure out the cost of some houses. Red are expensive houses, blue moderate, and green are cheap. For black dots, find the cost of the houses. These points are not in the database, so you have to guess. Looking at the houses near it is a good way to guess (nearest neighbor). The black one surrounded by the red would be red as a guess. But the middle black one is close the red, green, and blue point; so maybe look at more neighbors than the closest ones.

When you look at the closest ones then you should determine the cost perhaps by distance. How many neighbors should you select? This is the

*k*-nearest neighbors. You could also use square feet comparison or anything else but distance is a stand-in for similarity between all points; features that are consistent across all points for comparison.Algorithm

*K*-Nearest-Neighbor- For a classification problem return, you should vote. Find the plurality (which occurs the most) of the NN. If there are ties, just pick one.
- For a regression case you are dealing with numbers, you should return the mean of the numbers and you won’t have to worry about a tie.
- A lot of the decisions are left up to the designer to determine the distance metric, voting manner, how to determine mean, etc. and all of this will get you different answers. You could add in weighting too to prefer certain neighbors more than others; maybe something like 1/distance_metric()
- Linear Regression is eager learner and K-NN is lazy learner at query time.
- Preference Bias is our belief about what makes a good hypothesis and prefering one over another.
- Locality -> near points are similar. This is embedded by the distance function. Euclidean distance makes a different assumption to Manhattan distance. So each function will give a different assumption. There are good and bad distance functions and how you choose them is bias.
- Smoothness -> averaging. By choosing to average we are expecting function to behave smoothly.
- All features matter equally

The Curse of Dimensionality

- As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.

Some other stuff

- Some distance metrics are Euclidean, manhattan, weighted versions of those. Choice of distance function really matters! Weight is one way to deal with curse of dimensionality to favor one feature over others. Distances might be mismatches.
- No good way to pick k so it is arbitrary. But if k = n then you get a constant function but if you did a weighted average then it will vary.

# Lesson 6: Ensemble Learning

- Boosting is a type of ensemble learning algorithm
- Think about Spam email. Don’t think of complicated rules to identify spam email. Try simple rules.
- The videos say that having “manly” in the body of the message is weird so probably spam.

- From: spouse is not indicative of spam, most likely.

- Message length is short
- Just a single image
- p0rn and misspelled word. a blacklist of words
- “make.money.fast” was a spam message

- You take a bunch of simple rules and create a complex rule by combining them in some way.
- Basic algorithm
- Learn over a subset of data and generate a rule
- Repeat 1 until a certain number of rules
- Combine all rules

- How do you choose subsets? One of the simplest ways is to pick uniformly randomly choose some of the data and apply a learner.
- How do you combine? Take the mean of the result.
- Randomly picking and combining by taking the mean will give you bagging.
- There is an alternative:
- Rather than picking randomly pick a subset of points that are “hard”.
- For combining we will still use mean but it will be weighted mean.

- Error: This is the number of mismatches
- Learning only happens if you distribution is the same between the training and validation set.
- The distribution is D. The h is the hypothesis that the learner has output. C is the true underlying concept.
- A weak learner is a learning algorithm that is weak in the sense that it is a learner that no matter what the distribution is on the data, it will do better than chance. An error rate will always be < 0.5.
- Boosting in Code
- How do you construct the distribution?
- Distribution starts over all of the examples starting at 1. If there are n examples, you have 1/n distributions.
- At every time step T, you construct a new distribution as shown below.
- When there is agreement, 1 otherwise -1 for the yi*ht part. alpha_t is always a positive number.
- This will favor harder problems.

- How do you get hypothesis?

- How do you construct the distribution?

# Lesson 7: Kernel Methods and Support Vector Machines

- Don’t overfit by selecting the other lines. The middle line is the only one that doesn’t over commit to the minus and the positive sides. If new data came in, then it is equally likely to fall on either side. Finding the line of least commitment is the basic idea behind SVM.
- The 3 horizontal lines can have a distance between the one closest to the minus and the line closest to the positives.
- y is the classification label. Positive, yes you are a part of the class, negative you are not.
- The orange line is as far away from the data as possible while still being consistent. We want to find what that line is.

- 2/||w|| where ||w|| is the length of w. We want to maximize this while classifying everything correctly.

# Lesson 8: Computational Learning Theory

Know what kind of problem you are solving to know which algorithm to choose. Using the best algorithm isn’t necessarily the best approach.

SVMs are going to find linear separation so you’ll only have a single line. Also same for perceptrons.

k-nearest neighbor will produce a separate into regions that are very similar so all +s and -s are grouped and separated.

DTTs give a split.

Define a learning problem, what should a learning algorithm do

Once we know that we can show specific algorithms working and that some problems are fundamentally hard for algorithms in a certain class of algorithms.

- This is basically algorithm analysis for machine learning as opposed to other CS algorithms analysis.

Usually we analyze algorithms in time and space complexity. In computational learning theory the following resources matter: time, space, data/samples/examples. Sample size impacts machine learning algorithm performance in terms of accuracy.

Defining Inductive Learning (Learning from examples)

- Probability of successful training
`1 - delta`

as probablility of success - Number of examples to train on.
`m`

- Compelxity of hypothesis class
`complexity(H)`

high complexity will result in overfitting. - Accuracy to which target concept is approximated.
`E`

epsilon - Manner in which training examples presented.
`batch/online`

batch means that the training set is fixed and provided to the algo. Online means that you are presented samples one at a time. - Manner in which training examples selected.

The Learner will asks questions to the teacher. Asks what is c(x)

Teacher will give examples to the learner to help the learner. Teacher chooses x, tells c(x)

Fixed distribution where x chosen from D by nature which is what we have been doing so far.

Evil or worst distribution.

If the teacher asks the questions then they would ask the question that would give the best answer as fast as possible.

If the learner asks the questions then you basically have a tree where each node is a question and then each branch is yes or no then you can get log_2(|H|) where H is the set of possible people.

A teacher

*could*ask in the set of all possible questions but in real life you will be constrained. This is constrained queriesFor learner with constrained queries it will take exponential time to get the hypothesis class that gives you a 1.

- But why do you get
`log_2(|H|)`

above? That is because you are asking if x1 is positive, absent, or negated for all variables. Negations give a lot of guesses that are useless that don’t give you a 1.

- But why do you get
What if you use a learner with mistake bounds?

- The learner sits around then an input arrives and the learner can then guess the answer. If the learner is wrong then it is charged a point and it will go back to 1. We will bound the total number of mistakes to a certain amount. Now the learner can guess answers and learn a lot from the wrong answers. You can start learning better once you get a 1.
- Algorithm:
- Assume that each variable is positive and negated.
- Given input, compute the output
- If wrong, set all positive variables that were 0 to absent, negate variables that were 1 to absent. Go to 2.

Definitions

- Learner choses
- Teacher chooses
- Nature chooses
- Mean teacher
- With mistake bound learner, it is still efficent.
**Computational Complexity**How much computational effort is needed for a learner to converge to an answer?**Sample Complexity**In the case of a batch, how many training examples are needed for a learner to create a successful hypothesis?**Mistake Bounds**In the case of online examples, how many misclassifications acan a learner make over an infinite run?

Version Spaces

- A consistent learner produces hypothesis classes that are consistent with the data.
- Version space is the set of hyptohesis that are consistent with the data samples.

PAC (Probably Approximately Correct) Learning - Error of h(ypothesis)

- Training Error - fraction of training examples that are misclassified by h.
- True Error - fraction of examples that would be misclassified on sample drawn from D.
- Epsilon exhausted version space
- A version space for a sample space is exhausted if and only if for all the hypothesis in the version space the error(h) <= epsilon. Everything you might possibly chooses has an error less than epsilon, if even 1 is > epsilon then it is not epsilon exhausted because you can possibly make an error.

Haussler Theorem - Bound True Error

- Get all the hypothesis that have high true error, i.e. > epsilon, on the training set. How much data do we need to confirm this data is bad?
- Probably that you are wrong is > epsilon and probabily that you are right is <= 1 - epsilon.

- Probability of successful training

# Lesson 9 - VC Dimesions

- Infinite Hypothesis Spaces
`m >= 1/e (ln |H| + ln 1/sigma)`

- What happens if the hypothesis space is very large?
- You can track on an attribute and create an upper/lower bound.

- Power of a hypothesis space
- What is the largest set of inputs that the hypothesis class can label in all possible ways?
- There is no way to label a pair of inputs in all four ways. Theta will only separate down the middle.

- Labeling in all possible ways is called shattering. The quantity of largest set of inputs is called VC (vapnik-chervonekis) dimension. What is the amount of data needed to learn is what the VC dimension tries to answer.
- VC dimension is often the number of parameters but for each dimension you would go up.
- What is VC of finite H?

# Lesson 10: Bayesian Learning

- Learn the best hypothesis given data and some domain knowledge
- Learn the most probable H given data and domain knowledge
`argmax Pr(h|D)`

- Bayes Rule
Pr(D) means the prior belief on the data.

Pr(D|h) means the probability of data given the hypothesis. What is the likely hood of seeing a given data with a certain hypothesis. D is the training data which is a set of data with labels D={(x_i,d_i),…}

Pr(h) is the prior on the hypothesis drawn from the hypothesis space. Which hypothesis is more likely compared to others? The prior is the domain knowledge. - Bayseian learning is about calculating P(h|D) for each h in H. And then outputting the max Probability. Computationally it is not possible because the hypothesis space might be too huge and infinite.
- Baysian Learning in Action
- Given {<x_i, d_i>} as examples. They are noise free and true examples that tell you what c is.
- c is in H. c is the true concept.
- uniform prior. No hypothesis is more likely than any other.
- We want to compute bayes rule on this.
- Pr(h) = 1/|H|
- Pr(D|h) = { 1 if d_i = h(x_i) for all x_i,d_i in D else 0} what is the probability that I would see the label h on the data?
- Pr(D) = sum h_i in H P(D|h_i)P(h_i) = sum h_i in VS_H,D * 1/|H| = |VS| / |H|
- Pr(h|D) = 1 / |VS| for h in |VS|

# Lesson 12: Randomized Optimization

There is some input space X.

There is an objective function (fitness function) that maps the input (x) to a score. f: x -> R

Goal is to find some value, x* in the input space, such that the fitness value for x* is the closest to the max x f(x).

Approaches to optimization:

- Generate and test if you have small input space or complex function
- Calculus: function has to have a derivative. Derivative must be solveable = 0.
- Newton’s method: Function has a derivative and iteratively solveable. You want a single optimum and don’t want to get stuck in a local maximas.

- What if these assumptions don’t hold such as if you have a big input space, complex function, no derivative or hard to find derivative. You have many local optimas. This is when you use
**Randomized optimization**

Hill Climbing

- Guess x in X
- Repeat:
- let n* = argmax n in N(x) for f(n)

if f(n*) > f(x) : x = n

else stop In the picture above you select the random point at the tick mark and then start moving left and right of that point. If the point to the left or right are > x, then you make that your new value for x. If shows you keep going up and then you stop at the top where left and right are both < x.

- let n* = argmax n in N(x) for f(n)
- A problem is that you can get stuck at any local optima.

Random restart hillclimbing helps with that sticking issue in the local mins.

- Once a local optimum is reached, restart from another randomly chosen x.
- Advantages? Multiple tries to find a good starting point. Not much more expensive (constant factor). local improvements give global improvements.

- Once a local optimum is reached, restart from another randomly chosen x.
Simulated Annealing

Don’t always improve - somtimes you need to search. (Exploit and explore) Visit more of the space hoping you can find something better.

Exploiting is like overfitting. Overly believing you are heading in the right direction. Search is believing nothing and taking advantage of nothing but you should exploit a little otherwise you learn nothing.

Related to Metropolis-Hastings and metallugary. It’s kind of heating and cooling rapidly.

Algorithm

For a finite set of iterations:

a. Sample new point x_t in N(x)

b. Jump to new sample with probablility given by an acceptance probability function P(x, x_t, T)

c. Decrease temperature T > 0.

P(x, x_t, T) = {`1 if f(x_t) >= f(x), e**(f(x_t)-f(x) / T), otherwise }`

Big T means you are willing to take lower steps and likely to accept.

Small T mean you have a very large e so you are only going uphill.

T->0: like hill climbing

T->inf: random walk- You want to decrease T slowly for a chance to explore. Need to bridge between local optima.
- Pr(ending at x) = e**(f(x)/T) / Z_T

Genetic Algorithms

- This is like a contour map. We are trying to find the peaks. Guess the green points. Increasing on dimension 1 or 2 will give you a better value (higher). What if you moved in dimension 1 and 2 which will get you to top left?
- Think of it all as a population of individuals. a Local search is a mutation N(X). the notion of cross over is that the population holds information so you can combine their attributes to create something better like how you go into the top left. Generations is an iterations of improvement.
- GA Skeleton