Mar 08, 2019
#cs #math


This semester, I am taking 6.036: Introduction to Machine Learning. In the first two lectures, we covered the perceptron algorithm (Rosenblatt, 1957). For some reason, I had always thought that perceptron was a synonym for the sigmoid function. I guess because perceptron sounds like neuron and the sigmoid function kind of looks like a neuron action potential (but not really)? In any case, the perceptron algorithm is a really cool algorithm that finds linear binary classifiers.

What is a linear binary classifier?

In two dimensions, if we are given points on a plane that are each labeled \(+1\) or \(-1\), a linear binary classifier is a line such that all the points on one side of the line are labeled \(+1\) and all the points on the other side are labeled \(-1\). If such a line exists, we call the original dataset linearly separable.

In general, we are given a dataset \(D_n\) of \(n\) points \(x_i \in \mathbb{R}^d\), each with a label \(y_i \in \{-1, 1\}\). Classifiers are represented as a dimension-\((d-1)\) hyperplane defined by \( \theta \cdot x + \theta_0 = 0\). A classifier predicts that the label for a point \( x_i \) is \( \text{sign}(\theta \cdot x_i + \theta_0) \), where \( \text{sign}(x) = 1 \) if \( x > 0 \) and \( -1 \) otherwise. Thus, the classifier represented by the hyperplane successfully classifies the dataset if \(y_i \cdot (\theta \cdot x_i + \theta_0) > 0\) for all \(1 \le i \le n\). This is because the quantities \( y_i \) and \( \theta \cdot x_i + \theta_0 \) have the same sign for a correctly classified point, so their product is positive. That was a mouthful!

Note that \(\theta \in \mathbb{R}^d\) and \(\theta_0\) is a scalar. For the above example, the linear classifier shown is \( \theta = (-1, 1)\) and \(\theta_0 = 0\). For a given dataset, many different linear classifiers can exist.

What is the perceptron algorithm?

Given a linearly separable dataset, the perceptron algorithm finds values of \(\theta\) and \(\theta_0\) that successfully classify the data. If the dataset is not linearly separable, then the perceptron algorithm will not terminate.

Here is the pseudocode for the algorithm:

def perceptron(data, labels):
    Pseudo-python for perceptron algorithm
    d, n = data.dims # data is d x n array
    th = [0] * n # initialize theta to vector of zeros
    th_0 = 0
    while True:
        done = True
        for i in range(d):
            if labels[i] * (th * data[i] + th_0) <= 0: # check for misclassification
                th += labels[i] * data[i]
                th_0 += labels[i]
                done = False
        if done: break # stop when no mistakes are made on entire dataset
    return (th, th_0)

Explanation of Algorithm

The algorithm is repeatedly iterating over each point in the dataset. The condition labels[i] * (th * data[i] + th_0) <= 0 checks if the \(i\)-th data point is correctly classified. To see why, observe that the classifier predicts that the label of the \(i\)-th data point is the sign of th * data[i] + th_0, so the product of this quantity and labels[i] is only positive if the prediction is correct. If it isn't, then the algorithm adds a correction term to th and th_0. This effectively brings the linear classier "closer" to that misclassified data point, so that it will hopefully be correctly identified in the future. Once the algorithm completes a full iteration of the dataset without any misclassifications, it returns the classifier.

Pre-proof of Algorithm

It should still be somewhat shocking that the algorithm works. Specifically, why do the chosen correction terms work, and why is the algorithm guaranteed to terminate?

Before continuing with a rigorous proof of the algorithm, we will make a simplifying assumption - namely, that there exists a linear classifier for the input dataset that passes through the origin. In other words, \(\theta_0 = 0\). This is a valid assumption because if we are given a dataset \(x \in \mathbb{R}^d\), we can convert it to a dataset \(x' \in \mathbb{R}^{d+1}\) with \(x'_i = (x_i, 1)\). If \(\theta, \theta_0\) is a valid classifier for \(x\), then \(\theta' = (\theta, \theta_0)\) will be a valid classifer for \(x'\) because \(\theta \cdot x + \theta_0 = \theta' \cdot x'\). If our perceptron algorithm can find \(\theta'\), which is a linear classifier that passes through the origin, then that gives us \(\theta\) and \(\theta_0\), which is a linear classifer for the original dataset. This assumption simplifies the proof because we can ignore the quantity \( \theta_0 \).

To begin the proof, we will start with a few more definitions. Define the signed distance of a data point \((x_i, y_i)\) with respect to a classifier \(\theta\) as \(y_i \frac{\theta \cdot x_i}{|\theta|}\). The magnitude of this number is the distance from the point to the hyperplane. The number is positive if the point is correctly classified, and negative otherwise. You should verify that the formula is correct before moving on.

Define the margin of a classifer as the minimum value of the signed distance over all points in the dataset. Note that if a classifier has a positive margin, then it correctly classifies all the points.

Since we are assuming the dataset is linearly separable, there exists a classifier \(\theta^*\) with margin \(\gamma > 0\).

Finally, let \(R\) be an upper bound for the magnitude of our data points. Since each data point is a vector in \(\mathbb{R}^d\), the magnitude of a data point is just the length of the vector. Mathematically, \(|x_i| \le R\) for all \(i\).

That's all the definitions we need! To recap, we showed that it is valid to assume that \(\theta_0=0\). We also defined the terms signed distance, margin, \(\theta^*\), \(\gamma\), and \(R\). Make sure you understand what each term means before moving forward.

Actual Proof

Let \(\theta^{(k)}\) be the value of the classifier after making the \(k\)-th mistake. A mistake happens when labels[i] * (th * data[i] + th_0) <= 0 in the if statement of our for loop. Note that \(\theta^{(0)} = 0\). WLOG, assume that the \(k\)-th mistake occurs at the data point \((x_i, y_i)\).

Consider the quantity \(\cos(\theta^{(k)}, \theta^*) = \)\(\frac{\theta^{(k)} \cdot \theta^*}{|\theta^{(k)}| |\theta^*|}\), where \(\cos(\alpha, \beta)\) is defined to be the cosine of the angle between \(\alpha\) and \(\beta\). We will prove that if \(k\) can become sufficiently large, the RHS will eventually be greater than one. However, since the cosine of the angle between \(\theta^{(k)}\) and \(\theta^*\) cannot be greater than one, this places an upper bound on \(k\). This implies that the algorithm will only make a finite number of mistakes, which means it will eventually find a classifier.

First, let's look at the term \( \frac{ \theta^{(k)} \cdot \theta^* }{ | \theta^* | } \). We have the following derivation: \[ \frac{ \theta^{(k)} \cdot \theta^* }{ | \theta^* | } = \frac{ (\theta^{(k - 1)} + y_ix_i) \cdot \theta^* }{ | \theta^* | } \] \[ = \frac{ \theta^{(k - 1)} \cdot \theta^* }{ | \theta^* | } + y_i \frac{ x_i \cdot \theta^* }{ | \theta^* | } \] \[ \ge \frac{ \theta^{(k - 1)} \cdot \theta^* }{ | \theta^* | } + \gamma \] \[ \ge k\gamma \] The last step follows from induction.

Next, let's look at the term \( \frac{ 1 }{ | \theta^{(k)} | } \). We have the derivation: \[ | \theta^{(k)} |^2 = | \theta^{(k - 1)} + y_ix_i |^2 \] \[ = |\theta^{(k - 1)}|^2 + 2y_i \theta^{(k - 1)} \cdot x_i + |y_i|^2 |x_i|^2 \] \[ \le |\theta^{(k - 1)}|^2 + R^2 \] \[ \le kR^2 \] In the second to last step, we use the fact that \(|y_i|^2 = 1\) because \(|y_i| = 1\), and \(y_i \cdot \theta^{(k - 1)}\cdot x_i < 0\) because the \(i\)-th data point was misclassified by \(\theta^{(k - 1)}\). Note that \(y_i \cdot \theta^{(k - 1)}\cdot x_i\) has the same sign as the signed distance of \((x_i, y_i)\). The last step follows from induction.

Combining these two results together gives us that \(\cos(\theta^{(k)}, \theta^*) \ge \sqrt{k} \frac{\gamma}{R} \). We also know that \(\cos(x) \le 1\), which implies that \( k \le (\frac{R}{\gamma})^2 \).

This completes the proof! We have shown that the perceptron algorithm cannot make mistakes without bound, because it would violate the range of \(\cos(x)\). Specifically, we proved that the algorithm will make at most \( (\frac{R}{\gamma})^2 \) mistakes before finding a valid linear classifier. As a sanity check, \(k\) increases with larger \(R\) because it should be harder to find a classifier that passes through the origin if the data points are far away, and \(k\) increases with smaller \(\gamma\) becuase it should be harder to find a classifer if the positive and negative points are close together, ie. there is a smaller margin.

As a side note, if we define \(n_i\) as the number of mistakes the algorithm makes on the \(i\)-th data point, we can write the classifier as \(\sum_{i = 0}^{n} n_iy_ix_i \). In other words, it is always possible to build a valid classifer for a linearly-separable dataset by taking a linear combination of the input \(x_i\). In some sense, this means that the granularity of the input data specifies the granularity for the classifier.

Here is a GIF showing the iterations of the perceptron algorithm on a dataset [source]:


Thanks Rohan Joshi for helping me proofread this post.