Classification and prediction: decision tree complexity (machine learning)

The computational complexity of the algorithm given training set D is O(n × |D| × log(|D|)), where n is the number of attributes describing the tuples in D and |D| is the number of training tuples in D. This means that the computational cost of growing a tree is at most n × |D| × log(|D|) with |D| tuples. I am not able to understand the log(|D|) part specifically.
Reference: Data Mining: Concepts and Techniques, 2nd edition, page 296.
Topic: Classification and Prediction (Chapter 6).

The height of a balanced tree over |D| tuples is at most O(log(|D|)). Is your tree balanced?
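One way to see where the log(|D|) factor comes from (assuming the tree stays roughly balanced): at any single level of the tree, the training tuples are partitioned among the nodes at that level, so processing the whole level touches each of the |D| tuples once, and each tuple is examined against up to n attributes. A balanced tree built over |D| tuples has O(log(|D|)) levels, hence:
total cost = (work per level) × (number of levels)
           = O(n × |D|) × O(log(|D|))
           = O(n × |D| × log(|D|))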

How to combine various distance functions into one given the following dataset?

I have a few distance functions which return the distance between two images, and I want to combine them into a single distance using a weighted score, e.g. a*x1 + b*x2 + c*x3 + d*x4. I want to learn these weights automatically such that my test error is minimised.
For this purpose I have a labeled dataset consisting of triplets of images (a, b, c) such that a has a smaller distance to b than it has to c,
i.e. d(a,b) < d(a,c)
I want to learn weights so that this ordering of the triplets is preserved as accurately as possible (i.e. the weighted linear score is smaller for a & b than for a & c).
What sort of machine learning algorithm can be used for this task, and how can it be achieved?
Hopefully I understand your question correctly, but it seems that this could be solved more easily with constrained optimization directly, rather than with classical machine learning (whose algorithms are themselves often implemented via constrained optimization; see e.g. SVMs).
As an example, a possible objective function could be:
argmin_{w} || e ||_2 + lambda || w ||_2
where w is your weight vector (Oh god why is there no latex here), e is the vector of errors (one component per training triplet), lambda is some tunable regularizer constant (could be zero), and your constraints could be:
max{ d(I_p,I_r) - d(I_p,I_q), 0 } <= e_j
for the jth triplet (p,q,r) in T, whose label says d(I_p,I_r) <= d(I_p,I_q); here I_i is image i, T is the training set, and
d(u,v) = sum_{w_i in w} w_i * d_i(u,v)
with d_i being your ith distance function.
Notice that e is measuring how far your chosen weights are from satisfying all the chosen triplets in the training set. If the weights preserve ordering of label j, then d(I_p,I_r)-d(I_p,I_q) < 0 and so e_j = 0. If they don't, then e_j will measure the amount of violation of training label j. Solving the optimization problem would give the best w; i.e. the one with the lowest error.
If you're not familiar with linear/quadratic programming, convex optimization, etc... then start googling :) Many libraries exist for this type of thing.
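Here is a minimal, purely illustrative sketch of that formulation in Python using the cvxpy library (my choice of solver; any convex/QP package would do). The matrix G is hypothetical example data: one row per training triplet, one column per base distance function, with entry (j, i) equal to d_i(I_p,I_r) - d_i(I_p,I_q) for the jth triplet.
import numpy as np
import cvxpy as cp

# Hypothetical data: 50 training triplets, 4 base distance functions.
# Row j of G holds d_i(I_p,I_r) - d_i(I_p,I_q) for each base distance d_i;
# a triplet is satisfied when this row, dotted with the weights, is <= 0.
rng = np.random.default_rng(0)
G = rng.normal(size=(50, 4))
lam = 0.1                                   # regularizer constant lambda

w = cp.Variable(4, nonneg=True)             # weights of the base distances
e = cp.Variable(50, nonneg=True)            # per-triplet violations ("errors")

objective = cp.Minimize(cp.norm(e, 2) + lam * cp.norm(w, 2))
constraints = [G @ w <= e,                  # e_j >= max{d(I_p,I_r) - d(I_p,I_q), 0}
               cp.sum(w) == 1]              # normalization, see note below
problem = cp.Problem(objective, constraints)
problem.solve()

print("learned weights:", w.value)
The nonnegativity of w and the sum(w) == 1 constraint are additions of mine: without some such normalization, w = 0 makes every combined distance zero and trivially satisfies all of the constraints with zero error.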
On the other hand, if you would prefer a machine learning approach, you may be able to adapt some metric learning approaches to your problem.

Weird behavior while training an SVM classifier

I am searching for the best value of C (Cost parameter) for training my SVM classifier. Here is my code:
clear all; close all; clc
% Load training features and labels
[y, x] = libsvmread('training_data.train'); % the training dataset is named training_data.train
cost = [2^-7, 2^-5, 2^-3, 2^-1, 2^1, 2^3, 2^5, 2^7, 2^9, 2^11, 2^13, 2^15];
accuracy = zeros(1, length(cost)); % stores the cross-validation accuracy for each element of the cost array
for i = 1:length(cost)
    % -c sets the cost parameter C, -v 3 requests 3-fold cross-validation;
    % %g prints the fractional cost values cleanly
    opt = sprintf('-c %g -v 3', cost(i));
    accuracy(i) = svmtrain(y, x, opt); % with -v, svmtrain returns the CV accuracy
end
accuracy % display the results
I am using the LIBSVM library. When I run this program, the accuracy array is populated with pretty weird values:
Here is the output:
Columns 1 through 8:
67.335 93.696 91.404 92.550 93.696 93.553 93.553 93.553
Columns 9 through 12:
93.553 93.553 93.553 93.553
This means that I get the highest cross-validation accuracy at 2^-5. Should I get the highest accuracy at the highest value of C? (As far as I understand, it is a penalty factor for misclassification.) Is this behavior expected?
(I am building a classifier for breast cancer identification using the UCI ML repository.)
Should I get the highest accuracy at the highest value of C? (As far as I understand, it is a penalty factor for misclassification.)
No, there is no guarantee. The SVM training objective is not accuracy-based; it uses a surrogate loss function that only roughly tracks accuracy, so some fluctuation in cross-validation accuracy is normal. In general you can expect good accuracy for high values of C, but not necessarily the highest accuracy at the highest C.
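For reference, the objective libsvm minimizes for the C-SVM, written with the hinge loss as the surrogate, is
min_{w,b} (1/2) ||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i + b))
The second term penalizes margin violations rather than misclassifications directly, and the first term still pulls toward a large margin, which is why cross-validation accuracy need not increase monotonically with C.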
Is this behavior expected? (I am building a classifier for breast cancer identification using the UCI ML repository.)
Yes, it is a possible outcome.

Optimal Range for Universal Scalability Law (USL)

I'm writing a report and need to test the scalability of a mind-map database software design idea. I want to use the USL equation to get a quantifiable metric for scalability, but I have no idea what range is considered good for the USL. Any help would be appreciated :)
The USL equation:
C(N) = N / (1 + α(N − 1) + βN(N − 1))
The three terms in the denominator are associated respectively with the three Cs: the level of concurrency, a contention penalty (with strength α), and a coherency penalty (with strength β). The parameter values are defined in the range 0 ≤ α, β < 1. The independent variable N can represent either the number of concurrent users (software scalability) or the number of processors (hardware scalability).
Do you mean the number of measurements by "what range"? If so, you cannot know the required number of measured data points beforehand. Keep adding data points until the predicted maximum concurrency stops changing when more points are included.
The estimated parameters, and the predictions made from them, are not reliable if you use the MS Excel spreadsheet method explained in the book "Guerrilla Capacity Planning". Check out the paper "Mythbuster for the Guerrillas" to understand why, and how to get reliable results. It might also be worth reading the paper "Better Prediction Using the Super-serial Scalability Law Explained by the Least Square Error Principle and the Machine Repairman Model".
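For what it's worth, a direct least-squares fit of the USL curve to measured throughput is easy to set up; below is a minimal sketch in Python with scipy, where the measurement arrays are made-up placeholders to be replaced with your own data. Re-running the fit as you add measurements lets you watch whether the predicted peak stabilizes, as suggested above.
import numpy as np
from scipy.optimize import curve_fit

def usl(N, alpha, beta):
    # Universal Scalability Law: relative capacity C(N)
    return N / (1.0 + alpha * (N - 1) + beta * N * (N - 1))

# Made-up measurements: N = load (users or processors), C = throughput
# normalized so that C(1) = 1.  Replace with your own data.
N = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
C = np.array([1.0, 1.9, 3.5, 6.0, 9.0, 11.0, 10.5])

(alpha, beta), _ = curve_fit(usl, N, C, p0=[0.01, 0.0001],
                             bounds=([0.0, 0.0], [1.0, 1.0]))
print("alpha (contention) =", alpha, " beta (coherency) =", beta)

# The fitted model predicts peak concurrency at N* = sqrt((1 - alpha) / beta)
if beta > 0:
    print("predicted peak at N* =", np.sqrt((1.0 - alpha) / beta))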

Cross-validation is very slow in grid search (libsvm)

I am using libsvm on 62 classes with 2000 samples each. The problem is that I want to optimize my parameters using grid search. I set the ranges to C = [0.0313, 0.125, 0.5, 2, 8] and gamma = [0.0313, 0.125, 0.5, 2, 8] with 5 folds. The cross-validation does not finish even for the first two parameter combinations. Is there a faster way to do the optimization? Can I reduce the number of folds to 3, for instance? The number of iterations printed stays in the (1629, 1630, 1627) range; I don't know if that is related:
optimization finished,
#iter = 1629 nu = 0.997175 obj = -81.734944, rho = -0.113838 nSV = 3250, nBSV = 3247
Finding a good model here is simply an expensive task. Let's do some calculations:
62 classes x 5 folds x 5 values of C x 5 values of gamma = 7750 SVM trainings (counting one binary model per class)
You can always reduce the number of folds, which will lower the quality of the search, but going from 5 folds to 3 cuts the number of trained SVMs by about 40%.
The most expensive part is the fact that the SVM is not well suited to multi-class classification. It needs to train at least O(log n) models (in the error-correcting-code scenario), O(n) models (one-vs-rest), or even O(n^2) models (one-vs-one, which is what libsvm uses and which tends to achieve the best results).
Maybe it would be more valuable to switch to some fast multi-class model, for example an ELM (Extreme Learning Machine)?
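One practical way to cheapen the search: run a coarse grid with 3 folds on a random subsample first, then refine a narrow grid around the winner on the full data. Below is a sketch of that idea in Python with scikit-learn's SVC (which wraps libsvm) rather than the MATLAB/command-line interface used in the question; X and y are random placeholders for your 62-class data.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Random placeholders standing in for the real 62-class dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = rng.integers(0, 62, size=5000)

# 1) Coarse search: 3 folds on a 2000-sample subsample keeps the cost down.
idx = rng.choice(len(X), size=2000, replace=False)
coarse = GridSearchCV(SVC(kernel="rbf"),
                      {"C": [0.0313, 0.5, 8], "gamma": [0.0313, 0.5, 8]},
                      cv=StratifiedKFold(n_splits=3), n_jobs=-1)
coarse.fit(X[idx], y[idx])
print("coarse best:", coarse.best_params_)

# 2) Refinement: a narrow grid around the coarse winner, on the full data.
C0, g0 = coarse.best_params_["C"], coarse.best_params_["gamma"]
fine = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [C0 / 2, C0, C0 * 2], "gamma": [g0 / 2, g0, g0 * 2]},
                    cv=3, n_jobs=-1)
fine.fit(X, y)
print("refined best:", fine.best_params_, "accuracy:", fine.best_score_)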

Neural Networks: What does "linearly separable" mean?

I am currently reading the Machine Learning book by Tom Mitchell. When talking about neural networks, Mitchell states:
"Although the perceptron rule finds a successful weight vector when
the training examples are linearly separable, it can fail to converge
if the examples are not linearly separable. "
I am having problems understanding what he means with "linearly separable"? Wikipedia tells me that "two sets of points in a two-dimensional space are linearly separable if they can be completely separated by a single line."
But how does this apply to the training set for neural networks? How can inputs (or action units) be linearly separable or not?
I'm not the best at geometry and maths - could anybody explain it to me as though I were 5? ;) Thanks!
Suppose you want to write an algorithm that decides, based on two parameters, size and price, whether a house will sell in the same year it was put on sale or not. So you have 2 inputs, size and price, and one output, will sell or will not sell. Now, when you receive your training sets, it can happen that the data are not arranged in a way that makes the prediction easy (can you tell, based on the first graph, whether X will be an N or an S? How about in the second graph?):
^
| N S N
s| S X N
i| N N S
z| S N S N
e| N S S N
+----------->
price
^
| S S N
s| X S N
i| S N N
z| S N N N
e| N N N
+----------->
price
Where:
S-sold,
N-not sold
As you can see in the first graph, you can't really separate the two possible outputs (sold/not sold) by a straight line: no matter how you try, there will always be both S and N on both sides of the line. This means your algorithm has many possible lines but no ultimate, correct line to split the two outputs (and, of course, to predict new ones, which is the goal from the very beginning). That's why linearly separable data sets (the second graph) are much easier to predict.
This means that there is a hyperplane (which splits your input space into two half-spaces) such that all points of the first class are in one half-space and those of the second class are in the other half-space.
In two dimensions, that means that there is a line which separates points of one class from points of the other class.
For example, if blue circles represent points from one class and red circles points from the other, and you can draw a single straight line with all the blue circles on one side and all the red circles on the other, then those points are linearly separable.
In three dimensions, it means that there is a plane which separates points of one class from points of the other class.
In higher dimensions, it's similar: there must exist a hyperplane which separates the two sets of points.
You mention that you're not good at math, so I'm not writing the formal definition, but let me know (in the comments) if that would help.
Look at the following two data sets:
^ ^
| X O | AA /
| | A /
| | / B
| O X | A / BB
| | / B
+-----------> +----------->
The left data set is not linearly separable (without using a kernel). The right one is separable into the two parts A and B by the indicated line.
I.e. you cannot draw a straight line in the left image such that all the X are on one side and all the O are on the other. That is why it is called "not linearly separable": there exists no linear manifold separating the two classes.
Now the famous kernel trick (which will certainly be discussed in the book next) actually allows many linear methods to be used for non-linear problems by virtually adding additional dimensions to make a non-linear problem linearly separable.
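To make the "fails to converge" point concrete, here is a small sketch of mine using scikit-learn's Perceptron (an implementation of the perceptron rule Mitchell describes) on a linearly separable AND-style dataset versus the classic non-separable XOR dataset:
import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])   # AND: linearly separable
y_xor = np.array([0, 1, 1, 0])   # XOR: not linearly separable

for name, y in [("AND", y_and), ("XOR", y_xor)]:
    clf = Perceptron(max_iter=1000, tol=None).fit(X, y)
    print(name, "training accuracy:", clf.score(X, y))

# Expected outcome: the perceptron fits AND perfectly (one straight line
# separates the classes), but it can never reach 100% on XOR, because no
# straight line puts both 1s on one side and both 0s on the other.
An SVM with a non-linear kernel, by contrast, classifies XOR perfectly, which is exactly the kernel-trick point made above.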
