A Guide to Convolutional Neural Networks for Computer Vision. Salman KhanЧитать онлайн книгу.

(2.13)), but with all x_i replaced with ϕ(x_i), where ϕ provides a higher-dimensional mapping.

subject to:

Figure 2.7: For two classes, separable training datasets, such as the one shown in (a), there are lots of possible linear separators as shown with the blue lines in (a). Intuitively, a separating hyperplane (also called decision boundary) drawn in the middle of the void between the data samples of the two classes (the bold blue line in (b)) seems better than the ones shown in (a). SVM defines the criterion for a decision boundary that is maximally far away from any data point. This distance from the decision surface to the closest data point determines the margin of the classifier as shown in (b). In the hard-margin SVM, (b), a single outlier can determine the decision boundary, which makes the classifier overly sensitive to the noise in the data. However, a soft margin SVM classifier, shown in (c), allows some samples of each class to appear on the other side of the decision boundary by introducing slack variables for each sample. (d) Shows an example where classes are not separable by a linear decision boundary. Thus, as shown in (e), the original input space is projected onto , where a linear decision boundary can be found, i.e., using the kernel trick.

Dual Form of SVM

In the case that D ≫ d, there are many more parameters to learn for w. In order to avoid that, the dual form of SVM is used for optimization problem.

subject to:

where C is a hyper-parameter which controls the degree of misclassification of the model, in case classes are not linearly separable.

Kernel Trick

Since ϕ(x_i) is in a high-dimensional space (even infinite-dimensional space), calculating ϕ(x_i)^T · ϕ(x_j) maybe intractable. However, there are special kernel functions, such as linear, polynomial, Gaussian, and Radial Basis Function (RBF), which operate on the lower dimension vectors x_i and x_j to produce a value equivalent to the dot-product of the higher-dimensional vectors. For example, consider the function , where

It is interesting to note that for the given function ϕ in Eq. (2.18),

2. FEATURES AND CLASSIFIERS

Thus, instead of calculating ϕ(x_i)^T · ϕ(x_j), the polynomial kernel function K(x_i, x_j) = (1 + x_i^T x_j)² is calculated to produce a value equivalent to the dot-product of the higher-dimensional vectors, ϕ(x_i)^T · ϕ(x_j).

Note that the dual optimization problem is exactly the same, except that the dot product ϕ(x_i)^T · ϕ(x_j) is replaced by a kernel K(x_i, x_j) which corresponds to the dot product of ϕ(x_i) and ϕ(x_j) in the new space.

subject to:

In summary, linear SVMs can be thought of as a single layer classifier and kernel SVMs can be thought of as a 2 layer neural network. However, unlike SVMs, Chapter 3 shows that deep neural networks are typically built by concatenating several nonlinear hidden layers, and thus, can extract more complex pattern from data samples.

2.3.2 RANDOM DECISION FOREST

A random decision forest [Breiman, 2001, Quinlan, 1986] is an ensemble of decision trees. As shown in Fig. 2.8a, each tree consists of split and leaf nodes. A split node performs binary classification based on the value of a particular feature of the features vector. If the value of the particular feature is less than a threshold, then the sample is assigned to the left partition, else to the right partition. Figure 2.8b shows an illustrative decision tree used to figure out whether a photo represents and indoor or an outdoor scene. If the classes are linearly separable, after log₂(c) decisions each sample class will get separated from the remaining c – 1 classes and reach a leaf node. For a given feature vector f, each tree predicts independently its label and a majority voting scheme is used to predict the final label of the feature vector. It has been shown that random decision forests are fast and effective multi-class classifiers [Shotton et al., 2011].

Training

Each tree is trained on a randomly selected samples of the training data (usually samples of the training data). The remaining samples of the training data are used for validation. A subset of features is randomly selected for each split node. Then, we search for the best feature f[i] and an associated threshold τi that maximize the information gain of the training data after partitioning. Let H(Q) be the original entropy of the training data and H(Q |{f[i], τi}) the entropy of Q after partitioning it into “left” and “right” partitions, Ql; and Qr. The information gain, G, is given by

Figure 2.8: (a) A decision tree is a set of nodes and edges organized in a hierarchical fashion. The split (or internal) nodes are denoted with circles and the leaf (or terminal) nodes with squares. (b) A decision tree is a tree where each split node stores a test function which is applied to the input data. Each leaf stores the final label (here whether “indoor” or “outdoor”).

where

and |Ql| and |Qr| denote the number of data samples

Скачать книгу