I have training data with 5 columns, where c1 is the dependent variable and columns c2, c3, c4, c5 are independent variables.
I want to estimate c1 as a sum of functions of ci (where i = 2, 3, 4, 5) in a way that minimizes the residual error.
c1 ~ ( a2 F2(c2) + a3 F3(c3) + a4 F4(c4) + a5 F5(c5) )
I wrote a Python script that reads columns c2, c3, c4, c5 for training. Now if I input a new row with c2, c3, c4, c5, my script can generate c1 for that row.
But what am I actually supposed to do by the statement "estimate c1 as a sum of functions of ci (where i = 2, 3, 4, 5) in a way which minimizes residual error"? I have no background in ML, and I would appreciate it if somebody could shed some light on what is meant by estimating c1.
Thank you.
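For what it's worth, here is a minimal sketch of what fitting such an additive model can look like, assuming each Fi is approximated by a low-degree polynomial and all coefficients are found jointly by least squares (the toy data and helper names below are illustrative only, not from the question):

import numpy as np

def design(X, degree=3):
    # stack polynomial features of every independent column (c2..c5)
    return np.hstack([X ** d for d in range(1, degree + 1)])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                            # toy stand-in for c2..c5
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + X[:, 2] - X[:, 3]   # toy c1

coef, *_ = np.linalg.lstsq(design(X), y, rcond=None)     # minimizes squared residuals

def predict(row):
    # estimated c1 for a new row of c2..c5
    return design(np.asarray(row)[None, :]) @ coef

print(predict([0.1, 0.2, 0.3, 0.4]))

"Minimizing residual error" then just means choosing the coefficients so that the sum of squared differences between predicted and actual c1 over the training rows is as small as possible, which is exactly what the least-squares call does.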
This question further extends the idea from the question:
Cypher: how to find all the chains of single nodes not repeated?
For example, in a graph like this:
(a1:TestNode)-[:REL]->(r1:Route)-[:REL]->(a2:TestNode)-[:REL]->(s1:Route)-[:REL]->(a1:TestNode)
(a2:TestNode)-[:REL]->(r2:Route)-[:REL]->(a3:TestNode)-[:REL]->(s2:Route)-[:REL]->(a2:TestNode)
(a3:TestNode)-[:REL]->(r3:Route)-[:REL]->(a4:TestNode)-[:REL]->(s3:Route)-[:REL]->(a3:TestNode)
Graphically:
s3 ← a4
 ↙    ↗
s2 ← a3 → r3
 ↙    ↗
s1 ← a2 → r2
 ↙    ↗
a1 → r1
Cypher code:
CREATE (a1:TestNode {name:'a1'})-[:REL]->(r1:Route {name:'r1'})-[:REL]->(a2:TestNode {name:'a2'})-[:REL]->(s1:Route {name:'s1'})-[:REL]->(a1),
(a2)-[:REL]->(r2:Route {name:'r2'})-[:REL]->(a3:TestNode {name:'a3'})-[:REL]->(s2:Route {name:'s2'})-[:REL]->(a2),
(a3)-[:REL]->(r3:Route {name:'r3'})-[:REL]->(a4:TestNode {name:'a4'})-[:REL]->(s3:Route {name:'s3'})-[:REL]->(a3)
Afterwards, we can find a route from a4 to a1 with this query:
MATCH p = (a4:TestNode {name: 'a4'})-[r:REL*]->(a1:TestNode {name: 'a1'})
WITH [a4] + nodes(p) AS ns, p
WHERE ALL (n IN ns
           WHERE 1 = SIZE(FILTER(m IN TAIL(ns)
                                 WHERE m = n)))
RETURN p
Questions:
1. If I extend the above create query to have 2,000 'a' nodes, i.e. up to
(a2000)-[:REL]->(r2000:Route {name:'r2000'})-[:REL]->(a2001:TestNode {name:'a2001'})-[:REL]->(s2000:Route {name:'s2000'})-[:REL]->(a2000),
I found that my computer becomes very slow and neo4j occupies 2 GB of memory. Is that normal?
2. Then I want to find a route from a2001 to a1. The system cannot find the solution (which is obviously a2001 -> a2000 -> a1999 -> ... -> a1). I guess this is because of the loops in between. In the previous question mentioned above, the query avoided loops because duplicate nodes are not allowed.
My purpose is to extend this idea so that the possible routes between two locations can be identified on a connected graph. Many thanks.
I would like to test different descriptors (like SIFT, SURF, ORB, LATCH etc.) in terms of precision-recall and computation time for my image dataset in order to understand which one is more suitable.
Is there any pre-built tester in OpenCV for this purpose? Are there any alternatives or guidelines?
You can use the code at the following link to compute recall vs. precision curves:
http://www.robots.ox.ac.uk/~vgg/research/affine/desc_evaluation.html#code
In order to plot them, you need to detect keypoints and extract descriptors in each image in the dataset. Next, you write the descriptors for each image in the following format:
descriptor_size
nbr_of_regions
x1 y1 a1 b1 c1 d1 d2 d3 ...
x2 y2 a2 b2 c2 d1 d2 d3 ...
....
....
x, y - center coordinates
a, b, c - ellipse parameters ax^2+2bxy+cy^2=1
d1 d2 d3 ... - descriptor values, binary values in case of ORB and LATCH
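For example, here is a rough sketch (my own, untested against the Oxford evaluation code) of how ORB output from OpenCV's Python bindings could be exported into that format; the circle-to-ellipse conversion and the file names are assumptions:

import cv2
import numpy as np

img = cv2.imread('image1.png', cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create()
keypoints, descriptors = orb.detectAndCompute(img, None)

bits = np.unpackbits(descriptors, axis=1)   # binary values, as noted above
with open('image1.orb', 'w') as f:
    f.write('%d\n' % bits.shape[1])         # descriptor_size
    f.write('%d\n' % len(keypoints))        # nbr_of_regions
    for kp, d in zip(keypoints, bits):
        x, y = kp.pt
        r = kp.size / 2.0                   # ORB regions are circular
        a, b, c = 1.0 / r ** 2, 0.0, 1.0 / r ** 2   # circle as ax^2+2bxy+cy^2=1
        f.write('%f %f %f %f %f %s\n'
                % (x, y, a, b, c, ' '.join(map(str, d))))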
I've constructed a decision tree that takes every sample with equal weight. Now I want to construct a decision tree which gives different weights to different samples. Is the only change I need to make in the computation of the expected entropy before calculating information gain? I'm a little confused about how to proceed; please explain.
For example, consider a node containing p positive and n negative samples, so the node's entropy is -p/(p+n) log(p/(p+n)) - n/(p+n) log(n/(p+n)). Now suppose a split is found that divides the parent node into two child nodes, where child 1 contains p' positives and n' negatives (so child 2 contains p - p' and n - n'). For child 1 we calculate the entropy just as for the parent and weight it by the probability of reaching it, i.e. (p' + n')/(p + n). The expected reduction in entropy is then entropy(parent) - (prob. of reaching child 1 * entropy(child 1) + prob. of reaching child 2 * entropy(child 2)), and the split with maximum information gain is chosen.
Now, what changes need to be made to carry out the same procedure when we have weights available for each sample? And what changes are needed specifically for AdaBoost (using stumps only)?
(I guess this is the same idea as in some comments, e.g., @Alleo)
Suppose you have p positive examples and n negative examples. Let's denote the weights of examples to be:
a1, a2, ..., ap ---------- weights of the p positive examples
b1, b2, ..., bn ---------- weights of the n negative examples
Suppose
a1 + a2 + ... + ap = A
b1 + b2 + ... + bn = B
As you pointed out, if the examples have unit weights, the entropy would be:
- p/(p+n) log(p/(p+n)) - n/(p+n) log(n/(p+n))
Now you only need to replace p with A and n with B, and you obtain the new instance-weighted entropy:
- A/(A+B) log(A/(A+B)) - B/(A+B) log(B/(A+B))
Note: nothing fancy here. All we did was figure out the weighted importance of the groups of positive and negative examples. When examples are equally weighted, the importance of the positive examples is proportional to the ratio of the number of positives to the number of all examples. When examples are non-equally weighted, we simply take a weighted average to get the importance of the positive examples.
Then you follow the same logic to choose the attribute with the largest information gain, by comparing the entropy before and after splitting on an attribute.
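For illustration, a minimal sketch of the weighted computation, assuming binary 0/1 labels (the names and toy data are my own, not canonical code):

import numpy as np

def weighted_entropy(w, y):
    # each example counts by its weight instead of by 1
    A = w[y == 1].sum()               # total weight of positives
    B = w[y == 0].sum()               # total weight of negatives
    ent = 0.0
    for part in (A, B):
        if part > 0:
            q = part / (A + B)
            ent -= q * np.log2(q)
    return ent

def information_gain(w, y, mask):
    # children are weighted by their share of total weight, not by counts
    gain = weighted_entropy(w, y)
    for side in (mask, ~mask):
        if w[side].sum() > 0:
            gain -= w[side].sum() / w.sum() * weighted_entropy(w[side], y[side])
    return gain

w = np.array([0.4, 0.1, 0.3, 0.2])    # sample weights
y = np.array([1, 1, 0, 0])            # labels
print(information_gain(w, y, np.array([True, True, False, False])))  # 1.0

For AdaBoost with stumps nothing else changes: in each round you pick the stump with the best weighted gain (equivalently, the lowest weighted error), and AdaBoost then re-weights the samples for the next round.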
I have a regression-related question, but I am not sure how to proceed. Consider the following dataset, with A, B, C, and D as the attributes (features) and a decision variable Dec for each row:
A B C D Dec
a1 b1 c1 d1 Y
a1 b2 c2 d2 N
a2 b2 c3 d2 N
a2 b1 c3 d1 N
a1 b3 c2 d3 Y
a1 b1 c1 d2 N
a1 b1 c4 d1 Y
Given such data, I want to figure out the most compact rules for which Dec evaluates to Y.
For example, A=a1 AND B=b1 AND D=d1 => Y.
I would prefer to specify thresholds for the precision of these rules, so that I can filter them according to my requirements. For example, I would like to see all rules which provide at least 90% precision; this gives me a better compaction of the rules. The rule above provides 100% precision, whereas B=b1 AND D=d1 => Y has 66% precision (it errs on the 4th row).
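To make the precision measure concrete, here is a quick sketch of how it can be computed (pandas is just my choice for illustration):

import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a1', 'a2', 'a2', 'a1', 'a1', 'a1'],
                   'B': ['b1', 'b2', 'b2', 'b1', 'b3', 'b1', 'b1'],
                   'C': ['c1', 'c2', 'c3', 'c3', 'c2', 'c1', 'c4'],
                   'D': ['d1', 'd2', 'd2', 'd1', 'd3', 'd2', 'd1'],
                   'Dec': ['Y', 'N', 'N', 'N', 'Y', 'N', 'Y']})

match = (df['B'] == 'b1') & (df['D'] == 'd1')     # rule antecedent
print((df.loc[match, 'Dec'] == 'Y').mean())       # 0.666..., errs on the 4th row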
Vaguely, I can see that this is similar to building a decision tree and finding the paths which end in Y. If I understand correctly, building a regression model would tell me which attributes matter the most, but I need the combinations of actual attribute values which lead to Y.
The attribute values are multi-valued, but that is not a hard constraint. I can even assume them to be boolean.
Is there any library in existing tools such as Weka or R that can help me?
Regards
I don't think this is a regression problem; it looks like a classification problem where you are trying to classify Y or N. You could build ensemble learners like AdaBoost and see how the decisions vary from tree to tree, or you could do something like elastic-net logistic regression and see what the final weights are.
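To make the tree idea from the question concrete, here is a hedged sketch: fit a decision tree on one-hot encoded attributes and read conjunctive rules off the paths that end in Y (scikit-learn is my choice here; the AdaBoost and elastic-net variants mentioned above would be set up similarly):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({'A': ['a1', 'a1', 'a2', 'a2', 'a1', 'a1', 'a1'],
                   'B': ['b1', 'b2', 'b2', 'b1', 'b3', 'b1', 'b1'],
                   'C': ['c1', 'c2', 'c3', 'c3', 'c2', 'c1', 'c4'],
                   'D': ['d1', 'd2', 'd2', 'd1', 'd3', 'd2', 'd1'],
                   'Dec': ['Y', 'N', 'N', 'N', 'Y', 'N', 'Y']})

X = pd.get_dummies(df[['A', 'B', 'C', 'D']])   # one-hot encode attribute values
clf = DecisionTreeClassifier(random_state=0).fit(X, df['Dec'])

# Every root-to-leaf path that predicts Y is a conjunctive rule; the
# class counts at each leaf give the rule's precision on the training
# data, so leaves can be filtered against a precision threshold.
print(export_text(clf, feature_names=list(X.columns)))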
Given two BoolExprs b1 and b2, say
b1 = x1 >= 4 && x2 >= 5
b2 = x2 >= 5 && x1 >= 4
can we use the .NET API for Z3 to determine whether b1 and b2 are actually the same constraint (meaning that the values of x1 and x2 allowed by b1 and b2 are the same)?
Yes. You want to prove that b1 equals b2, which you can do by showing that the negation of b1 == b2 is unsatisfiable. Here's an example showing how to do this in Z3Py; you can use basically the same steps in the .NET API: http://rise4fun.com/Z3Py/M4R1
from z3 import *

x1, x2 = Reals('x1 x2')
b1 = And(x1 >= 4, x2 >= 5)
b2 = And(x2 >= 5, x1 >= 4)
s = Solver()
proposition = b1 == b2      # the claim: b1 and b2 are equivalent
s.add(Not(proposition))
# the proposition is proved if its negation is unsat
print(s.check())            # unsat

b1 = And(x1 >= 3, x2 >= 5)  # note the difference
b2 = And(x2 >= 5, x1 >= 4)
s = Solver()
proposition = b1 == b2
s.add(Not(proposition))
print(s.check())            # sat
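As a side note, Z3Py also has a prove() helper that performs the same negate-and-check-unsat steps in one call (a small convenience, assuming the definitions above):

prove(And(x1 >= 4, x2 >= 5) == And(x2 >= 5, x1 >= 4))  # prints: proved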