Binary classification using radial basis kernel SVM with a single feature [closed] - machine-learning

Is there any interpretation (graphical or otherwise) of a radial basis kernel SVM being trained with a single feature? I can visualize the effect in two dimensions: the result is a separation boundary that is curved rather than a straight line (e.g. http://en.wikipedia.org/wiki/File:Kernel_Machine.png).
I'm having trouble thinking of what this would be like if the original data only had a single feature. What would the decision boundary look like in this case?

In one dimension your data points are just numbers, and the decision boundary is simply a finite set of numbers (thresholds): a finite set of intervals classified as one class and a finite set of intervals classified as the other.
In fact, the decision boundary in R^2 is the set of points for which the weighted sum of Gaussians centred at the support vectors (where the alpha_i are the weights) equals b (the intercept/threshold term). You can actually draw this function as a surface in 3D. Similarly, in 1D you get an analogous function, which can be drawn in 2D, and the decision is based on whether this function is bigger or smaller than b.
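To make this concrete, here is a minimal sketch (not part of the original answer) using scikit-learn on synthetic 1D data; it plots the decision function f(x) = sum_i alpha_i K(x, x_i) + b together with the threshold, so the 1D boundary shows up as the finite set of crossing points described above.

# Minimal sketch on synthetic data: visualize a 1D RBF-SVM decision
# function and where it crosses the threshold.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Class 1 sits in the middle of the line, class 0 on both sides.
X = np.concatenate([rng.normal(-3, 0.5, 30),
                    rng.normal(0, 0.5, 30),
                    rng.normal(3, 0.5, 30)]).reshape(-1, 1)
y = np.array([0] * 30 + [1] * 30 + [0] * 30)

clf = SVC(kernel='rbf', gamma=1.0, C=1.0).fit(X, y)

# decision_function(x) = sum_i alpha_i * K(x, x_i) + b; the 1D "boundary" is
# the finite set of points where it crosses zero, i.e. interval endpoints.
grid = np.linspace(-5, 5, 1000).reshape(-1, 1)
f = clf.decision_function(grid)

plt.plot(grid, f, label='decision function')
plt.axhline(0.0, color='k', linestyle='--', label='threshold')
plt.scatter(X, np.zeros_like(X), c=y, cmap='bwr', s=10)
plt.legend()
plt.show()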

This video shows what happens in a kernel mapping; it does not use the RBF kernel, but the idea is the same:
http://www.youtube.com/watch?v=3liCbRZPrZA
As for the 1D case, there is not much difference; it would be something like this:

It would look like a line that switches back and forth between two different colors (one color for each class). Nothing special happens in 1D, other than an SVM being overkill.

Related

How to enforce rules like move legality in chess at the output of a neural network? [closed]

How do I apply rules, like chess rules, to a neural network, so the network doesn't predict/train invalid moves?
In the example of AlphaZero Chess, the network's output shape allows for all possible moves for any pieces starting on any square.
From the paper Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm:
A move in chess may be described in two parts: selecting the piece to move, and then
selecting among the legal moves for that piece. We represent the policy π(a|s) by a 8 × 8 × 73
stack of planes encoding a probability distribution over 4,672 possible moves. Each of the 8 × 8
positions identifies the square from which to “pick up” a piece. The first 56 planes encode
possible ‘queen moves’ for any piece: a number of squares [1..7] in which the piece will be
moved, along one of eight relative compass directions {N, N E, E, SE, S, SW, W, N W }. The
next 8 planes encode possible knight moves for that piece. The final 9 planes encode possible underpromotions for pawn moves or captures in two possible diagonals, to knight, bishop or
rook respectively. Other pawn moves or captures from the seventh rank are promoted to a
queen.
So, for example, the network is allowed to output a positive probability for the move g1-f3 even if there isn't a knight on g1, for e8=Q even if there isn't a pawn on e7, or for d1-h5 if there is a queen on d1 but another piece is blocking the diagonal.
The key is that it outputs a probability distribution over possible moves, and since it is trained by playing against itself where only legal moves are allowed, it will learn to output very low or zero probabilities for illegal moves.
More precisely, after a set number of self-play games, the network is trained using supervised learning to predict the probability and value of moves given a board position. At the very beginning of self-play the network has random weights and it will output significant probabilities for lots of impossible moves, but after one or more iterations of supervised learning the move output probabilities will start to look much more reasonable.
The reason the AlphaZero team chose this architecture over something that enforces the rules inside the network is simple: the output must have a fixed size, since there must be a fixed number of output neurons. It wouldn't make sense to have a varying number of output neurons corresponding to a varying number of legal moves. Nor would it make sense to zero out outputs for illegal moves inside the network, because this would be a highly non-standard operation that would probably be a nightmare to backpropagate through. You would need to differentiate a chess move generator!
Furthermore, when the network uses its policy output to play games, it can simply normalize each output over only legal moves. In this way we are enforcing move legality within the self-play system, but not within the neural network architecture itself. This would be done with the aid of a move generator.
Since you are asking about Keras specifically, you could represent such an output layer as:
model.add(Dense(4672, activation='softmax'))
In summary: it is not necessary to enforce move legality in the architecture of a neural network for predicting chess moves. We can allow all possible moves (including illegal ones) and train the network to output low or zero probabilities for the illegal ones. Then, when we use the move probabilities for playing, we normalize over only the legal moves to get the desired result, but this happens outside of the neural network.
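As an illustration of that last step, here is a minimal sketch (not AlphaZero's actual code) of masking and renormalizing a 4672-way policy vector over the legal moves, assuming the 0/1 legal-move mask comes from an external move generator:

# Sketch only: renormalize a policy vector over legal moves. The 4672-way
# output and the legal-move mask are assumptions; the mask would come from
# a move generator such as python-chess, outside the network.
import numpy as np

def masked_policy(policy_probs, legal_mask):
    # Zero out illegal moves and renormalize over the legal ones.
    masked = policy_probs * legal_mask
    total = masked.sum()
    if total == 0:
        # Degenerate case (all mass on illegal moves): uniform over legal moves.
        return legal_mask / legal_mask.sum()
    return masked / total

# Dummy example: 4672 raw softmax outputs, 30 of which are legal.
policy = np.random.dirichlet(np.ones(4672))
mask = np.zeros(4672)
mask[np.random.choice(4672, 30, replace=False)] = 1.0
print(masked_policy(policy, mask).sum())  # ~1.0, all mass on legal moves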

What is the most appropriate ML technology for classifying/recognizing 2D scatter plot shapes? [closed]

I would like to be able to automatically classify an input scatter plot into a limited, predefined set of 2-D scatter plots (see attached image), such as a Circle, a Cross, a Straight Line and a Curvy Line - such that, given any new scatter plot as input, the system can correctly categorize it by finding the closest category match.
Ideally, the classification process should also be scale-, translation- and rotation-invariant.
Can anyone suggest an appropriate technology for the training and classification of such 2-D patterns?
You don't need a supervised classifier for this. An unsupervised method such as spectral clustering is designed for exactly this kind of nonlinear clustering problem. The scattered dots are assumed to lie on a manifold rather than in a flat Euclidean space, and any curvy line can be treated as such a manifold. With a manifold kernel, geodesic distance is used for the clustering instead of ball-shaped Euclidean distance.
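As a rough sketch of that idea (synthetic data, not the asker's plots), scikit-learn's SpectralClustering with a nearest-neighbors affinity approximates the manifold/geodesic notion of distance described above:

# Sketch only: separate two concentric circles with spectral clustering.
# A nearest-neighbors graph stands in for the manifold/geodesic structure.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering

X, _ = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

labels = SpectralClustering(n_clusters=2,
                            affinity='nearest_neighbors',
                            n_neighbors=10,
                            random_state=0).fit_predict(X)
print(np.bincount(labels))  # roughly 200 points per ring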

Feature scaling (normalization) in multiple regression analysis with normal equation method? [closed]

I am doing linear regression with multiple features. I decided to use the normal equation method to find the coefficients of the linear model. If we use gradient descent for linear regression with multiple variables, we typically do feature scaling to speed up convergence. For now, I am going to use the normal equation formula, theta = (X^T X)^(-1) X^T y.
I have two contradictory sources of information. The first states that no feature scaling is required for the normal equation. The other says that feature normalization has to be done.
Sources:
http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex3/ex3.html
http://puriney.github.io/numb/2013/07/06/normal-equations-gradient-descent-and-linear-regression/
At the end of each of these two articles, information concerning feature scaling with the normal equation is presented.
The question is: do we need to do feature scaling before applying the normal equation?
You may indeed not need to scale your features; from a theoretical point of view you get the solution in just one "step". In practice, however, things might be a bit different.
Notice the matrix inversion in your formula. Inverting a matrix is not a trivial computational operation. In fact, there is a measure of how hard it is to invert a matrix (and perform some other computations), called the condition number:
If the condition number is not too much larger than one (but it can still be a multiple of one), the matrix is well conditioned which means its inverse can be computed with good accuracy. If the condition number is very large, then the matrix is said to be ill-conditioned. Practically, such a matrix is almost singular, and the computation of its inverse, or solution of a linear system of equations is prone to large numerical errors. A matrix that is not invertible has condition number equal to infinity.
P.S. A large condition number is actually the same problem that slows down gradient descent's convergence.
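To see this numerically, here is a small sketch with made-up features on very different scales; it only illustrates how scaling changes the condition number of X^T X and is not taken from either linked article:

# Sketch with synthetic data: compare the condition number of X^T X before
# and after feature scaling. The solution is the same in exact arithmetic,
# but the unscaled system is numerically much harder.
import numpy as np

rng = np.random.RandomState(0)
n = 200
x1 = rng.uniform(0, 1, n)      # feature on a small scale
x2 = rng.uniform(0, 1e6, n)    # feature on a huge scale
X = np.column_stack([np.ones(n), x1, x2])
y = 3 + 2 * x1 + 5e-6 * x2 + rng.normal(0, 0.1, n)

print(np.linalg.cond(X.T @ X))  # very large: ill-conditioned

# Standardize the non-constant columns and recompute.
Xs = X.copy()
Xs[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)
print(np.linalg.cond(Xs.T @ Xs))  # orders of magnitude smaller

# Either way, prefer a linear solver over an explicit matrix inverse:
theta = np.linalg.solve(Xs.T @ Xs, Xs.T @ y)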
You don't need to perform feature scaling when using the normal equation; it is useful only for the gradient descent method, to speed up its convergence. The article from Stanford University provides the correct information.
Of course you can scale the features in this case as well, but it will not bring you any advantages (and will cost you some additional calculations).

Why do we use gradient descent in linear regression? [closed]

In some machine learning classes I took recently, I've covered gradient descent to find the best fit line for linear regression.
In some statistics classes, I have learnt that we can compute this line using statistical analysis, using the mean and standard deviation - this page covers this approach in detail. Why is this seemingly simpler technique not used in machine learning?
My question is, is gradient descent the preferred method for fitting linear models? If so, why? Or did the professor simply use gradient descent in a simpler setting to introduce the class to the technique?
The example you gave is one-dimensional, which is not usually the case in machine learning, where you have multiple input features.
In that case, you need to invert a matrix to use that simple approach, which can be expensive or numerically ill-conditioned.
Usually the problem is formulated as a least squares problem, which is slightly easier. There are standard least squares solvers which could be used instead of gradient descent (and often are). If the number of data points is very high, using a standard least squares solver might be too expensive, and (stochastic) gradient descent might give you a solution that is as good in terms of test-set error as a more precise solution, with a run-time that is orders of magnitude smaller (see this great chapter by Leon Bottou).
If your problem is small enough that it can be efficiently solved by an off-the-shelf least squares solver, you should probably not do gradient descent.
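For the small-problem case, the off-the-shelf route is essentially a one-liner; the following is just a sketch on synthetic data, using np.linalg.lstsq as one such least squares solver:

# Sketch: fit a linear model with an off-the-shelf least squares solver
# instead of gradient descent (fine for small and medium-sized problems).
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 + rng.normal(0, 0.1, 500)

A = np.column_stack([np.ones(len(X)), X])  # add an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approximately [0.3, 1.5, -2.0, 0.5]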
Basically the 'gradient descent' algorithm is a general optimization technique and can be used to optimize ANY cost function. It is often used when the optimum point cannot be estimated in a closed form solution.
So let's say we want to minimize a cost function. What ends up happening in gradient descent is that we start from some random initial point and we try to move in the 'gradient direction' in order to decrease the cost function. We move step by step until there is no decrease in the cost function. At this time we are at the minimum point. To make it easier to understand, imagine a bowl and a ball. If we drop the ball from some initial point on the bowl it will move until it is settled at the bottom of the bowl.
As gradient descent is a general algorithm, one can apply it to any problem that requires optimizing a cost function. In the regression problem, the cost function that is most often used is the mean squared error (MSE). Finding the closed-form solution requires inverting a matrix that is often ill-conditioned (its determinant is very close to zero, so it does not give a robust inverse). To circumvent this problem, people often take the gradient descent approach to find the solution, which does not suffer from the ill-conditioning problem.
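For comparison, here is a bare-bones sketch of batch gradient descent on the MSE cost for linear regression; the learning rate and iteration count are arbitrary illustrative choices, not recommendations:

# Sketch: plain batch gradient descent on the MSE cost, on synthetic data.
import numpy as np

rng = np.random.RandomState(1)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 3))])
y = X @ np.array([0.3, 1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 500)

theta = np.zeros(X.shape[1])
lr = 0.1
for _ in range(2000):
    grad = 2.0 / len(y) * X.T @ (X @ theta - y)  # gradient of the MSE
    theta -= lr * grad
print(theta)  # converges to approximately [0.3, 1.5, -2.0, 0.5]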

Histogram of Oriented Gradients object detection [closed]

HOG is popular for human detection. Can it be used for detecting objects like a cup in an image, for example?
I am sorry for not asking a programming question, but what I want to know is whether I can use HOG to extract object features.
Based on the research I have done over the past few days, I feel the answer is yes, but I am not sure.
Yes, HOG (Histogram of Oriented Gradients) can be used to detect any kind of objects, as to a computer, an image is a bunch of pixels and you may extract features regardless of their contents. Another question, though, is its effectiveness in doing so.
HOG, SIFT, and other such feature extractors are methods used to extract relevant information from an image to describe it in a more meaningful way. When you want to detect an object or person in an image with thousands (and maybe millions) of pixels, it is inefficient to simply feed a vector with millions of numbers to a machine learning algorithm because:
It will take a large amount of time to complete
There will be a lot of noisy information (background, blur, lighting and rotation changes) which we do not wish to regard as important
The HOG algorithm, specifically, creates histograms of edge orientations from certain patches in images. A patch may come from an object, a person, meaningless background, or anything else, and is merely a way to describe an area using edge information. As mentioned previously, this information can then be used to feed a machine learning algorithm such as the classical support vector machines to train a classifier able to distinguish one type of object from another.
The reason HOG has had so much success with pedestrian detection is because a person can greatly vary in color, clothing, and other factors, but the general edges of a pedestrian remain relatively constant, especially around the leg area. This does not mean that it cannot be used to detect other types of objects, but its success can vary depending on your particular application. The HOG paper shows in detail how these descriptors can be used for classification.
It is worthwhile to note that for several applications, the results obtained by HOG can be greatly improved using a pyramidal scheme. This works as follows: Instead of extracting a single HOG vector from an image, you can successively divide the image (or patch) into several sub-images, extracting from each of these smaller divisions an individual HOG vector. The process can then be repeated. In the end, you can obtain a final descriptor by concatenating all of the HOG vectors into a single vector, as shown in the following image.
This has the advantage that in larger scales the HOG features provide more global information, while in smaller scales (that is, in smaller subdivisions) they provide more fine-grained detail. The disadvantage is that the final descriptor vector grows larger, thus taking more time to extract and to train using a given classifier.
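As a rough sketch of that pyramidal scheme (the 2x2 split and the HOG parameters below are arbitrary illustrative choices, not the settings from the paper), using scikit-image:

# Sketch: compute a HOG vector for the whole image, then one per quadrant,
# and concatenate everything into a single pyramidal descriptor.
import numpy as np
from skimage import data, color, transform
from skimage.feature import hog

img = color.rgb2gray(data.astronaut())
img = transform.resize(img, (128, 128))

def hog_vec(patch):
    return hog(patch, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2))

h, w = img.shape
parts = [img] + [img[i:i + h // 2, j:j + w // 2]  # whole image + 4 quadrants
                 for i in (0, h // 2) for j in (0, w // 2)]
descriptor = np.concatenate([hog_vec(p) for p in parts])
print(descriptor.shape)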
In short: Yes, you can use them.
