Coursera Machine Learning Lecture Notes

=Introduction=

Lecture 3 - Supervised Learning
Supervised learning refers to algorithms that are given the correct answers for their training data. A line fit is an example, because you provide the "right answers": in the example given, the fitting algorithm received the X values (house sizes) and the Y values (home prices), and used them to predict the price for a house size that does not appear in the data.

Regression: Continuous valued output
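
As a rough illustration of the line fit described above, the sketch below fits a straight line by ordinary least squares and predicts a price for a size that is not in the data. The house sizes and prices are made-up numbers, and the names `fit_line` and `predict` are hypothetical. The closed-form slope and intercept formulas are standard, but the lecture notes do not say which fitting method was used, so treat this as one possible approach.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(x, intercept, slope):
    """Evaluate the fitted line at a new input x."""
    return intercept + slope * x


# Hypothetical training data: the "right answers" supplied to the algorithm.
sizes = [1000, 1500, 2000, 2500]   # house size in ft^2
prices = [200, 300, 400, 500]      # price in 1000s of dollars

intercept, slope = fit_line(sizes, prices)
new_price = predict(1750, intercept, slope)   # 1750 ft^2 is not in the data
print(slope, intercept, new_price)
```

Because the made-up data lie exactly on a line, the fitted slope comes out at about 0.2 (thousand dollars per ft^2) and the prediction for 1750 ft^2 at about 350.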

In the next example the outputs were discrete: the values plotted were 0 and 1 (no and yes), although this extends to an arbitrary number of discrete output values. This is defined as:

Classification: Discrete valued output

An alternative plotting method for classification problems is to use different symbols to denote each discrete result type. In this case, various input value types can be on each axis, with the result denoted by the symbol type.
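
To make the idea of a discrete-valued output concrete, here is a minimal sketch of a classifier over two input features. It assumes a 1-nearest-neighbour rule, which the notes do not mention; it is swapped in only because it is short. The data points and the name `classify_nearest` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def classify_nearest(training_set, query):
    """Return the label of the training example closest to the query point."""
    _, label = min(
        training_set,
        key=lambda example: squared_distance(example[0], query),
    )
    return label


# Hypothetical training set: ((feature 1, feature 2), label), label is 0 or 1.
# On a plot, each label would be drawn with its own symbol.
training_set = [
    ((1.0, 1.0), 0),
    ((1.5, 2.0), 0),
    ((2.0, 1.0), 0),
    ((5.0, 5.0), 1),
    ((6.0, 5.5), 1),
    ((5.5, 6.5), 1),
]

print(classify_nearest(training_set, (1.2, 1.5)))  # falls in the label-0 group
print(classify_nearest(training_set, (5.8, 6.0)))  # falls in the label-1 group
```

Unlike the regression case, the output here is always one of the discrete labels present in the training set.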

Problem: The examples so far have used only a small number of features, but the real power of machine learning lies in handling much larger (~infinite) numbers of features. An example of such an algorithm will be provided later.

Lecture 4 - Unsupervised Learning
In unsupervised learning, we are given a data set but are not told what to do with it or what the label of each example is. We are handed some data and told to "figure out what you can learn from it".

Clustering Algorithm: Look for data that is somehow grouped together, and organize it in this way. For example, Google News clusters related news articles.

Second example - DNA coding. Table of individuals, with a mapping of whether or not each individual has a certain gene. Algorithm clusters these individuals into different groups based on the patterns of genes they show - there is no "right answer" given up front.
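
The gene-table example can be sketched with a small clustering routine. The notes do not name a specific algorithm, so the code below assumes k-means with two clusters purely for illustration. The gene table, the choice of starting centroids, and the function names are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(point, centroids[k]))


def k_means(points, initial_centroids, max_iters=20):
    """Alternate assignment and centroid-update steps until nothing changes."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iters):
        new_assignments = [nearest_centroid(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # converged: no individual changed cluster
        assignments = new_assignments
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # leave an empty cluster's centroid where it is
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical table: one row per individual, 1 = has the gene, 0 = does not.
gene_table = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]

# Assumption: start from two of the rows; no labels are supplied anywhere.
assignments, centroids = k_means(gene_table, [gene_table[0], gene_table[2]])
print(assignments)
```

The cluster indices carry no meaning of their own; the algorithm only reports which individuals were grouped together.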

Other examples: organizing computing clusters, social network analysis (which groups of friends all know each other?), market segmentation (group customers into different market segments), astronomical data analysis.

Cocktail Party Problem: Two speakers, two microphones - each microphone records a different overlapping combination of both speakers' voices at different volumes. Hand data to a cocktail party algorithm, and tell it to "find structure". Algorithm is able to separate out the two separate voices.

single line of code!

singular value decomposition (SVD) - linear algebra routine, built into Octave (GNU Matlab clone)

=Linear Regression with One Variable=

Lecture 5 - Model Representation
Example: Housing price data from Portland, OR, giving size in ft^2 and price in 1000s of dollars. To reiterate, this is a supervised learning problem, and specifically a regression problem (it predicts a real-valued output).

Given: A Training Set of housing prices, a table of size vs price

Notation:

$$m$$ = number of training examples

$$x$$ = "input" variable / features

$$y$$ = "output" variable / "target" variable

$$(x,y)$$ = a single training example pair

$$(x^{(i)},y^{(i)})$$ = the $$i$$th training example (the superscript $$(i)$$ is an index into the training set, not an exponent)

$$h$$ - output of learning algorithm, stands for "hypothesis" - a function that maps from $$x$$'s to $$y$$'s. This is somewhat of a misnomer, since it's not a traditional "scientific" hypothesis, but it's a historical artifact of the early days of machine learning as a field.

For linear regression, we can write:

$$h(x) =\theta_0 + \theta_1 x$$

Specifically, this is univariate (one-variable) linear regression.
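
The hypothesis is just a function of $$x$$ with two parameters, which can be written as the short sketch below. The function name `hypothesis` and the parameter values are made up for illustration.

```python
def hypothesis(x, theta0, theta1):
    """Univariate linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Hypothetical parameters: intercept of 50 (thousand dollars) and
# slope of 0.1 (thousand dollars per ft^2).
predicted_price = hypothesis(2000, theta0=50.0, theta1=0.1)
print(predicted_price)  # roughly 250, i.e. about $250,000 for 2000 ft^2
```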

Lecture 6 - Cost Function
Different parameter values $$\theta_0$$ and $$\theta_1$$ give different hypotheses, i.e. lines with different intercepts and slopes.

To fit the line to the training set, we solve a minimization problem: choose $$\theta_0$$ and $$\theta_1$$ to minimize the cost function

$$J(\theta_0,\theta_1) = \frac{1}{2m}\sum\limits_{i=1}^m \left(h(x^{(i)}) - y^{(i)}\right)^2$$

This is sometimes also called the squared error function. It is the most common cost function for regression problems, especially linear ones.
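
The squared error cost translates almost directly into code. The sketch below assumes the linear hypothesis $$h(x) = \theta_0 + \theta_1 x$$ from the previous lecture. The three-point training set and the name `compute_cost` are hypothetical.

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = theta0 + theta1 * x   # h(x) for this training example
        total += (prediction - y) ** 2
    return total / (2 * m)


# Hypothetical training set lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

print(compute_cost(0.0, 1.0, xs, ys))  # perfect fit, so the cost is zero
print(compute_cost(0.0, 0.5, xs, ys))  # too shallow a slope, cost is positive
print(compute_cost(0.0, 0.0, xs, ys))  # flat line, cost is larger still
```

The worse the line fits the training set, the larger the value of $$J$$.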

Lecture 7 - Cost Function Intuition I
Example: use a simplified hypothesis with $$\theta_0 = 0$$, so the cost function depends on $$\theta_1$$ alone:

$$h(x) = \theta_1 x$$

We can plot $$J(\theta_1)$$ as a function of $$\theta_1$$, and visually locate the minimum.
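
Locating the minimum "visually" amounts to evaluating the cost at many values of $$\theta_1$$ and picking the lowest. The sketch below does that numerically instead of plotting. The training set, the range of $$\theta_1$$ values, and the function name are assumptions made for illustration.

```python
def cost_one_parameter(theta1, xs, ys):
    """Cost J(theta1) for the simplified hypothesis h(x) = theta1 * x."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Hypothetical training set lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

# Sweep theta1 from -0.5 to 2.5 in steps of 0.25.
theta1_values = [i / 4 for i in range(-2, 11)]
costs = [cost_one_parameter(t, xs, ys) for t in theta1_values]

# The pair with the lowest cost is the bottom of the curve.
best_cost, best_theta1 = min(zip(costs, theta1_values))

for t, j in zip(theta1_values, costs):
    print(f"theta1 = {t:5.2f}   J = {j:.4f}")
print("minimum at theta1 =", best_theta1)
```

For this data the costs rise symmetrically on either side of the lowest point, which is the parabola shape seen in the plot.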

Lecture 8 - Cost Function Intuition II
Keep both $$\theta_0$$ and $$\theta_1$$, so now we are minimizing $$J$$ over two parameters. Instead of an X-Y plot, we can create a 3-D surface plot with $$\theta_0$$ and $$\theta_1$$ on the X and Y axes and $$J$$ on the Z axis. This is typically a bowl-shaped surface.

Rather than 3-D surface plots, we can also use contour plots to visualize the cost function $$J$$. Contour plots show lines, often in the form of ellipses for cost functions, each of which represents a set of parameter values that all result in the same $$J$$ value.
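
The data behind a surface or contour plot is a grid of $$J$$ values, one per $$(\theta_0, \theta_1)$$ pair. The sketch below builds such a grid without plotting it, and checks that two different parameter pairs can share the same cost, i.e. lie on the same contour line. The training set, the grid ranges, and the names used are hypothetical.

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared error cost for the hypothesis h(x) = theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Hypothetical training set lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

theta0_values = [i / 2 for i in range(-2, 3)]   # -1.0, -0.5, ..., 1.0
theta1_values = [i / 2 for i in range(0, 5)]    #  0.0,  0.5, ..., 2.0

# One J value per grid point; a plotting library would draw this as a
# surface (J as height) or as contour lines (curves of equal J).
grid = {
    (t0, t1): compute_cost(t0, t1, xs, ys)
    for t0 in theta0_values
    for t1 in theta1_values
}

best = min(grid, key=grid.get)   # bottom of the bowl on this grid

# Two different parameter pairs with equal cost sit on the same contour.
cost_a = grid[(0.5, 1.0)]
cost_b = grid[(-0.5, 1.0)]
print("grid minimum at", best)
print("J(0.5, 1.0) =", cost_a, " J(-0.5, 1.0) =", cost_b)
```

A finer grid gives a smoother picture, at the price of more cost evaluations.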

Lecture 12 - What's Next
=Linear Algebra Review=