Friday, June 1, 2012

Predictive Analytics: NeuralNet, Bayesian, SVM, KNN

Continuing from my previous blog post walking down the list of Machine Learning techniques, in this post we'll cover Neural Network, Support Vector Machine, Naive Bayes and Nearest Neighbor.  Again, we'll use the same iris data set that we prepared in the last post.

Neural Network

A Neural Network emulates how the human brain works: a network of interconnected neurons that send stimulating signals to each other.

In the Neural Network model, each neuron is equivalent to a logistic regression unit.  Neurons are organized in multiple layers, where every neuron at layer i connects to every neuron at layer i+1 and nothing else.
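To make the "logistic regression unit" analogy concrete, here is a minimal sketch of what a single neuron computes: a sigmoid applied to a weighted sum of its inputs plus a bias.  The function names and the weight and bias values are mine, purely for illustration.

sigmoid <- function(z) 1 / (1 + exp(-z))
neuron  <- function(x, w, b) sigmoid(sum(w * x) + b)
# One iris observation fed through a neuron with hypothetical weights and bias
neuron(x = c(5.1, 3.5, 1.4, 0.2), w = c(0.2, -0.1, 0.4, 0.3), b = -0.5)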


The tuning parameters of a Neural Network include the number of hidden layers, the number of neurons in each layer, and the learning rate.

There are no fixed rules for setting these parameters; they depend a lot on the problem domain.  My default choice is to use a single hidden layer and set the number of neurons equal to the number of input variables.  The number of neurons at the output layer depends on how many binary outputs need to be learned; in a classification problem, this is typically the number of possible values of the output category.

Learning happens via an iterative feedback mechanism in which the error on the training output is used to adjust the corresponding input weights.  This adjustment is propagated back through the previous layers, hence the name of the learning algorithm: back-propagation.
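In standard notation (my notation, not something printed by the neuralnet package used below), each weight w_ij is adjusted against the gradient of the error E, scaled by the learning rate:

$$ w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial E}{\partial w_{ij}} $$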

Here is an example in R

> library(neuralnet)
> nnet_iristrain <-iristrain
> #Binarize the categorical output
> nnet_iristrain <- cbind(nnet_iristrain, 
                          iristrain$Species == 'setosa')
> nnet_iristrain <- cbind(nnet_iristrain,
                          iristrain$Species == 'versicolor')
> nnet_iristrain <- cbind(nnet_iristrain, 
                          iristrain$Species == 'virginica')
> names(nnet_iristrain)[6] <- 'setosa'
> names(nnet_iristrain)[7] <- 'versicolor'
> names(nnet_iristrain)[8] <- 'virginica'
> nn <- neuralnet(setosa+versicolor+virginica ~ 
                  Sepal.Length+Sepal.Width
                              +Petal.Length
                              +Petal.Width,
                  data=nnet_iristrain, 
                  hidden=c(3))
> plot(nn)
> mypredict <- compute(nn, iristest[-5])$net.result
> # Convert the multiple binary outputs back to a categorical output
> maxidx <- function(arr) {
      return(which(arr == max(arr)))
  }
> idx <- apply(mypredict, c(1), maxidx)
> prediction <- c('setosa', 'versicolor', 'virginica')[idx]
> table(prediction, iristest$Species)
            
prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         3
  virginica       0          0         7


Here is the plot of the Neural Network we learned.

Neural Networks are very good at learning non-linear functions, and multiple outputs can be learned at the same time.  However, training time is relatively long and the training is susceptible to local minimum traps.  This can be mitigated by doing multiple rounds of training and picking the best learned model.
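The neuralnet package supports this directly through its rep argument, which retrains the network from different random starting weights; a minimal sketch building on the call above (the choice of 5 repetitions is arbitrary):

# Train several repetitions and keep the one with the smallest error
nn_multi <- neuralnet(setosa+versicolor+virginica ~
                        Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,
                      data=nnet_iristrain, hidden=c(3), rep=5)
best <- which.min(nn_multi$result.matrix["error", ])    # best repetition
mypredict <- compute(nn_multi, iristest[-5], rep=best)$net.result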

Support Vector Machine

A Support Vector Machine provides a binary classification mechanism based on finding a hyperplane that divides a set of samples into +ve and -ve outputs.  In its basic form, it assumes the data is linearly separable.

The problem can be structured as maximizing the margin subject to a set of linear constraints (i.e., data on one side of the hyperplane must be +ve while data on the other side must be -ve).  This is a quadratic programming optimization problem that standard solvers can handle.


If the data is not linearly separable due to noise (the majority is still linearly separable), an error term is added to penalize misclassified points in the optimization.
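In standard notation (not taken from the R code below), the resulting "soft margin" problem minimizes

$$ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, $$

where the slack variables \xi_i absorb the noisy points and C is the penalty/cost parameter that gets tuned in the R example below.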


If the data distribution is fundamentally non-linear, the trick is to transform the data into a higher dimension where, hopefully, it becomes linearly separable.  The optimization only involves dot products of the transformed points in the high-dimensional space, which turn out to be equivalent to evaluating a kernel function in the original (pre-transformation) space.


The kernel function therefore provides a cheap way to work as if the original points had been transformed into the high-dimensional space (we never actually transform them) while performing the quadratic optimization in that space.
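As a concrete example, the radial (RBF) kernel used in the svm() call below can be evaluated directly in the original four-dimensional iris space; this is an illustrative sketch, with gamma playing the same role as the gamma parameter tuned in the next code block:

# RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)
rbf_kernel <- function(x, z, gamma) exp(-gamma * sum((x - z)^2))
# Similarity between two training observations (gamma value is illustrative)
rbf_kernel(unlist(iristrain[1, -5]), unlist(iristrain[2, -5]), gamma=0.001)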

There are a couple of tuning parameters (such as gamma and the penalty/cost), so the usual approach is two steps: search for the optimal parameters first, then train the SVM model using them.  Here is the example code in R.

> library(e1071)
> tune <- tune.svm(Species~., 
                   data=iristrain, 
                   gamma=10^(-6:-1), 
                   cost=10^(1:4))
> summary(tune)
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation 
- best parameters:
 gamma  cost
 0.001 10000
- best performance: 0.03333333 
> model <- svm(Species~., 
               data=iristrain, 
               method="C-classification", 
               kernel="radial", 
               probability=T, 
               gamma=0.001, 
               cost=10000)
> prediction <- predict(model, iristest)
> table(iristest$Species, prediction)
            prediction
             setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          3         7
>



SVM with a kernel function is a highly effective model and works well across a wide range of problem sets.  Although it is fundamentally a binary classifier, it can easily be extended to multi-class classification by training a group of binary classifiers and combining them with “one vs all” or “one vs one” voting; this is in fact how the svm() call above handles the three iris species, since the underlying libsvm library trains “one vs one” classifiers internally.

SVM predicts the output based on the distance to the dividing hyperplane, which doesn’t directly provide a probability estimate for its prediction.  A calibration technique can be used: learn a logistic regression model from the distance to the hyperplane to the binary output, and then use that logistic regression to estimate the probability.
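Setting probability=T in the svm() call above asks e1071 to do this kind of calibration internally, but the idea can be sketched by hand for a binary problem.  The two-class data frame bin_train and its factor response y below are hypothetical, not part of the iris example:

# Sketch: learn a logistic model from hyperplane distance to the binary label
svm_bin <- svm(y ~ ., data=bin_train, kernel="radial")
pred    <- predict(svm_bin, bin_train, decision.values=TRUE)
dist    <- as.numeric(attr(pred, "decision.values"))   # signed distances
is_pos  <- bin_train$y == levels(bin_train$y)[1]       # binary target
calib   <- glm(is_pos ~ dist, family=binomial)         # calibration model
head(predict(calib, type="response"))                  # estimated probabilities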

SVM is a very powerful technique and performs well on a wide range of non-linear classification problems.  It works best when you have a small set of input features, because it effectively expands the features into a higher-dimensional space, provided that you also have a good amount of training data (otherwise, overfitting can happen).  However, SVM does not scale well to very large training sets (billions of examples), so in that case Logistic Regression with a manually expanded feature set is more pragmatic. 

Naive Bayes

From a probabilistic viewpoint, the predictive problem can be viewed as conditional probability estimation: finding the Y for which P(Y | X) is maximized.

From Bayes' rule, P(Y | X) == P(X | Y) * P(Y) / P(X)

Since P(X) does not depend on Y, this is equivalent to finding the Y for which P(X | Y) * P(Y) is maximized.

Let's say the input X contains 3 categorical features: X1, X2, X3.  In the general case, we assume each variable can potentially influence any other variable, and therefore the joint distribution becomes:
P(X | Y) = P(X1 | Y) * P(X2 | X1, Y) * P(X3 | X1, X2, Y)

Notice that the last term of the above equation requires a number of entries that grows exponentially with the number of input variables.


To reduce the complexity, we can make some independence assumptions based on domain knowledge and remove some edges, forming a Bayesian Network in which the arrows indicate causal relationships.


In the Naïve Bayes model, we take the independence assumption even further and assume the input variables are all independent of each other given Y.


Then P(X | Y) == P(X1 | Y) * P(X2 | Y) * P(X3 | Y),
and the problem becomes finding the Y that maximizes P(X1 | Y) * P(X2 | Y) * P(X3 | Y) * P(Y).

Each term can be learned by counting the training data, so we can estimate P(Y | X) and pick the Y that maximizes its value.

It is possible that some patterns never show up in the training data, i.e., P(X1=a | Y=y) is 0.  To deal with this situation, we apply a smoothing technique: pretend that we have seen each possible value one more time than we actually have.  Now ...

P(X1=a | Y=y) == (count(a, y) + 1) / (count(y) + m)
where m is the number of possible values in X1.
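In the e1071 implementation used below, this is exposed through the laplace argument of naiveBayes() (add-one smoothing when laplace=1).  Note that it only affects categorical features, so the hypothetical data frame cat_train here stands in for a data set with categorical inputs:

# Add-one (Laplace) smoothing of the categorical feature counts
model_smooth <- naiveBayes(Y ~ ., data=cat_train, laplace=1)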

When an input feature is numeric, say a = 2.75, we can assume X1 follows a normal distribution: find the mean and standard deviation of X1, and then estimate P(X1=a) using the normal density function.
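Here is a minimal sketch of that estimate for one feature and one class, using the iristrain data from the previous post (the value 2.75 is just the example above):

# Gaussian estimate of P(Sepal.Length = 2.75 | Species = 'setosa')
x <- iristrain$Sepal.Length[iristrain$Species == 'setosa']
dnorm(2.75, mean=mean(x), sd=sd(x))   # normal density at a = 2.75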

Here is how we use Naïve Bayes in R
> library(e1071)
> # Can handle both categorical and numeric input, 
> # but output must be categorical
> model <- naiveBayes(Species~., data=iristrain)
> prediction <- predict(model, iristest[,-5])
> table(prediction, iristest[,5])
            
prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         2
  virginica       0          0         8
 

Notice that the independence assumption is rarely true in practice; nevertheless, the model still performs remarkably well.  One strength of Naïve Bayes is that it is highly scalable and can learn incrementally, because all we do is count the observed variables and update the probability distributions.

K Nearest Neighbor

K Nearest Neighbor is also called instance-based learning, in contrast to model-based learning, because it does not learn a model at all; the training process is basically memorizing all the training data.  To predict a new data point, we find the closest K (a tunable parameter) neighbors in the training set and let them vote on the final prediction.

To determine the “nearest neighbors”, a distance function needs to be defined (e.g. Euclidean distance is a common choice for numeric input variables).  The voting can also be weighted among the K neighbors based on their distance from the new data point, as sketched below.
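The knn() function from the class package (used below) does plain majority voting; here is an illustrative base-R sketch, not taken from any package, of how distance-weighted voting could work for a single test point:

# Classify one point by inverse-distance weighted vote of its k nearest neighbours
weighted_knn_one <- function(train_x, train_y, new_x, k=5) {
  d     <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))   # Euclidean distances
  nn    <- order(d)[1:k]                               # k closest training rows
  w     <- 1 / (d[nn] + 1e-6)                          # closer neighbours weigh more
  votes <- tapply(w, train_y[nn], sum)                 # total weight per class
  names(which.max(votes))
}
weighted_knn_one(as.matrix(iristrain[,-5]), iristrain$Species,
                 unlist(iristest[1,-5]))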


Here is the R code for using K Nearest Neighbor for classification.

> library(class)
> train_input <- as.matrix(iristrain[,-5])
> train_output <- as.vector(iristrain[,5])
> test_input <- as.matrix(iristest[,-5])
> prediction <- knn(train_input, test_input, 
                    train_output, k=5)
> table(prediction, iristest$Species)
            
prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         1
  virginica       0          0         9
> 


The strength of K Nearest Neighbor is its simplicity, as no model needs to be trained. Incremental learning is automatic as more data arrives (and old data can be deleted as well). The weakness of KNN is that it doesn't handle a high number of dimensions well.

In my next post, I will finish up the remaining machine learning techniques: decision trees and ensemble methods.

4 comments:

Snehasish said...

How many hidden layers are you using for the Neural Network?

Taguri said...

thanks for this article, it helped me :)

Unknown said...

Hi...Great blog...I needed a help in neural net package in R...it would be greatful if I can have your Email ID at umeshdhanwal@gmail.com

Unknown said...

Hi,i tried to implement the algorithms for my data having 4 classes and 5 features.

But i could not get the required accuracy in SVM and neural network.

What is the optimal value for gamma and cost?

Please do reply.