Figuring out attribute value combinations from regression model - machine-learning

I have a regression related question, but I am not sure how to proceed. Consider the following dataset, with A, B, C, and D as the attributes (features) and a decision variable Dec for each row:
A B C D Dec
a1 b1 c1 d1 Y
a1 b2 c2 d2 N
a2 b2 c3 d2 N
a2 b1 c3 d1 N
a1 b3 c2 d3 Y
a1 b1 c1 d2 N
a1 b1 c4 d1 Y
Given such data, I want to figure out most compact rules for which Dec evaluates to Y.
For example, A=a1 AND B=b1 AND D=d1 => Y.
I would prefer specifying the thresholds for the Precision of these rules, so that I can filter them out as per my requirement. For example, I would like to see all rules which provide at least 90% precision. This can provide me better compaction of the rules. The above mentioned rule provides 100% precision, whereas B=b1 AND D=d1 => Y has 66% precision (it errs on the 4th row).
Vaguely, I can see that this is similar to building a decision tree and finding out paths which end in Y. If I understand correctly, building a regression model would provide me the attributes which matter the most, but I need combinations of actual values from the attributes which lead to Y.
The attribute values are multi-valued, but that is not a hard constraint. I can even assume them to be boolean.
Is there any library in existing tools such as Weka or R that can help me?
Regards

I don't think this is a regression problem. This seems like a classification problem where you are trying to classify Y or N. You could build ensemble learners like Adaboost and see how the decisions vary from tree to tree or you could do something like elastic net logistic regression and see what the final weights are.

Related

How many equivalence classes in the RL relation for {w in {a, b}* | (#a(w) mod m) = ((#b(w)+1) mod m)}

How many equivalence classes in the RL relation for
{w in {a, b}* | (#a(w) mod m) = ((#b(w)+1) mod m)}
I am looking at a past test question which gives me the options
m(m+1)
2m
m^2
m^2+1
infinite
However, i claim that its m, and I came up with an automaton that I believe accepts this language which contains 3 states (for m=3).
Am I right?
Actually you're right. To see this, observe that the difference of #a(w) and #b(w), #a(w) - #b(w) modulo m, is all that matters; and there are only m possible values of this difference modulo m. So, m states are always sufficient to accept a language of this form: simply make the state corresponding to the appropriate difference the accepting state.
In your DFA, a2 corresponds to a difference of zero, a1 to a difference of one and a3 to a difference of two.

Where does the graph of the loss function in machine learning come from?

Where does the graph of the loss function in machine learning come from?
I am studying about machine learning. I sometimes don't understand models that have been optimized using regularization terms.
In the explanation of regularization, the following figure may appear.
Here is an example of the L1 regularization term. I have assumed that the model has two weight parameters w1, w2. That is, the equation of model y is expressed by the following equation.
y = w1x1 + w2x2
For simplicity, I ignored the bias term.
The red squares represent regularization terms. And the blue ellipses are represents the loss function without the regularization term.
The regularization term is given by
| w1 | ^ q + | w2 | ^ q = r ^ q (r is const.)
Therefore, the equation of the graph at w1> 0 and w2> 0 is expressed as follows.
w2 = (r ^ q-| w1 | ^ q) ^ (1 / q)
By substituting w1 for this equation (q = 0 at Lasso), you can draw a graph of the regularized term.
On the other hand, I could not draw a graph of the loss function. Perhaps you need more than one piece of data to draw this graph. For simplicity, I have assumed that I have only two pieces of data. I define them as (x11, x12, t1), (x21, x22, t2). When the loss function is MSE, it is expressed by the following equation.
Ed = 1/2 * {(t1-w1x11-w2x12) + (t1-w1x21-w2x22)}
If I simplify this, it is expressed as
Ed = a*w1^2 + b*w1 + c*w2^2 + d*w2 + e*w1*w2 + f
Here, a, b, c, d, e, and f are functions represented by all or part of x11, x12, x21, and x22. After finding a, b, c, d, e, and f, I thought that if we substitute w1 for this equation, we could draw a graph of the loss function. However, I cannot draw well.
Is the above understanding correct? Thank you.
To visualize the loss function, Ed which is a function of w1 and w2, we should visualize it as a 3 dimensional plot. For example, you can use Geogebra to visualize a 3 dimensional surface plot.
Here is an example, where a=3, b=-1, c=1, d =-1 , e=2.
The 2D plot that you see is called a countor plot. This link enables you to draw it online.
To draw a contour plot manually, you fix the value of Ed, then you obtained a quadratic equation, after which, as you varies w1, you can solve for your w2, for each w1, you can obtain up to 2 w2 as it is quadratic.
Remark: If you are looking for closed form expression in terms of arbitrary q, that could be more challenging.

Comparing feature descriptor test?

I would like to test different descriptors (like SIFT, SURF, ORB, LATCH etc.) in terms of precision-recall and computation time for my image dataset in order to understand which one is more suitable.
There is any pre-built tester in OpenCV for this purpose? Any other alternative or guideline?
You can use the code in the foolowing link to compute recall vs. precision curves:
http://www.robots.ox.ac.uk/~vgg/research/affine/desc_evaluation.html#code
In order to plot them, you need to detect keypoint and extract descriptors in each image in the dataset. Next, you write the descriptors for each image in the following format:
descriptor_size
nbr_of_regions
x1 y1 a1 b1 c1 d1 d2 d3 ...
x2 y2 a2 b2 c2 d1 d2 d3 ...
....
....
x, y - center coordinates
a, b, c - ellipse parameters ax^2+2bxy+cy^2=1
d1 d2 d3 ... - descriptor values, binary values in case of ORB and LATCH

Weighing Samples in a Decision Tree

I've constructed a decision tree that takes every sample equally weighted. Now to construct a decision tree which gives different weights to different samples. Is the only change that I need to make is in finding Expected Entropy before calculating information gain. I'm a little confused how to proceed, plz explain....
For example: Consider a node containing p positive node and n negative nodes.So the nodes entropy will be -p/(p+n)log(p/(p+n)) -n/(p+n)log(n/(p+n)). Now if a split is found somehow dividing the parent node in two child nodes.Suppose the child 1 contains p' positives and n' negatives(so child 2 contains p-p' and n-n').Now for child 1 we will calculate entropy as calculated for parent and take the probability of reaching it i.e. (p'+n')/(p+n). Now expected reduction in entropy will be entropy(parent)-(prob of reaching child1*entropy(child1)+prob of reaching child2*entropy(child2)). And the split with max info gain will be chosen.
Now to do this same procedure when we have weights available for each sample.What changes need to be made? What changes need to be made specifically for adaboost(using stumps only)?
(I guess this is the same idea as in some comments, e.g., #Alleo)
Suppose you have p positive examples and n negative examples. Let's denote the weights of examples to be:
a1, a2, ..., ap ---------- weights of the p positive examples
b1, b2, ..., bn ---------- weights of the n negative examples
Suppose
a1 + a2 + ... + ap = A
b1 + b2 + ... + bn = B
As you pointed out, if the examples have unit weights, the entropy would be:
p p n n
- _____ log (____ ) - ______log(______ )
p + n p + n p + n p + n
Now you only need to replace p with A and replace n with B and then you can obtain the new instance-weighted entropy.
A A B B
- _____ log (_____) - ______log(______ )
A + B A + B A + B A + B
Note: nothing fancy here. What we did is just to figure out the weighted importance of the group of positive and negative examples. When examples are equally weighted, the importance of positive examples is proportional to the ratio of positive numbers w.r.t number of all examples. When examples are non-equally weighted, we just perform a weighted average to get the importance of positive examples.
Then you follow the same logic to choose the attribute with largest Information Gain by comparing entropy before splitting and after splitting on an attribute.

Estimating dependent variable as sum of functions of independent variables

I have a training data of 5 columns, where c1 is the dependent variable and columns c2, c3, c4, c5 are independent variables.
I want to estimate c1 as sum of functions of ci (where i = 2, 3, 4, 5) in a way which minimizes residual error.
c1 ~ ( a2 F2(c2) + a3 F3(c3) + a4 F4(c4) + a5 F5(c5) )
I wrote a Python script to read columns c2, c3, c4, c5 for training. Now if I input a new row with c2, c3, c4, c5, then my script can generate c1 for this row.
But actually, what am I supposed to do by this statement "estimate c1 as sum of functions of ci (where i = 2, 3, 4, 5) in a way which minimizes residual error" ? I have no idea of ML, and would appreciate if somebody throws light on what is meant by estimating c1.
Thank you.

Resources