I'm interested in learning machine learning, but only if I know it's possible to find the maximum of an unknown function that takes some numbers, returns a number, and for which a maximum exists.
This is a very unspecific question, and it is not exclusively related to machine learning and AI.
Let's say you have a "black-box" function f(x1, x2, ..., xn) which takes n input variables and returns a scalar value. You don't know the function itself; all you know is that it has a maximum.
What you can do is use so-called Evolution Strategy optimization techniques such as CMA-ES. You sample random points x from a sampling distribution (e.g. a Gaussian), evaluate the points, rank them by their evaluation, adapt your sampling distribution (e.g. by a maximum-likelihood fit of your k best points), and resample. After repeating this over and over, the mean of your Gaussian will converge to a maximum (though not necessarily the global one!).
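For illustration, here is a minimal sketch of that sample-rank-refit loop in Python. It adapts only the mean and a diagonal spread, so it is not CMA-ES itself (which also adapts a full covariance matrix and step size), and the toy objective function is made up for the example.

```python
import numpy as np

def es_maximize(f, dim, iters=100, pop=50, elite=10, seed=0):
    """Maximize a black-box function f with a simple evolution strategy:
    sample, rank, refit the Gaussian to the best (elite) points."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)          # initial guess
    std = np.ones(dim)            # initial (diagonal) spread
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, dim))
        scores = np.array([f(x) for x in samples])
        best = samples[np.argsort(scores)[-elite:]]   # k best points
        mean = best.mean(axis=0)                      # maximum-likelihood fit
        std = best.std(axis=0) + 1e-8                 # avoid collapsing to zero
    return mean

# Toy objective with its maximum at (1, -2).
f = lambda x: -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2
print(es_maximize(f, dim=2))      # approximately [1, -2]
```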
This, however, has nothing to do with machine learning or AI. In ML/AI you usually have points x and results y, and you try to find the function f(x) = y that matches your points.
While reading the paper "Tactile-based active object discrimination and target object search in an unknown workspace", there is something that I just cannot understand:
The paper is about finding an object's position and other properties using only tactile information. In section 4.1.2, the author says that he uses GPR to guide the exploratory process, and in section 4.1.4 he describes how he trained his GPR:
Using the example from section 4.1.2, the input is (x, z) and the output is y.
Whenever there is a contact, the corresponding y-value is stored.
This procedure is repeated several times.
This trained GPR is used to estimate the next exploration point, which is the point where the variance is maximal.
You can also see the demonstration at the following link: https://www.youtube.com/watch?v=ZiLq3i-BJcA&t=177s. In the first part of the video (0:24-0:29), the first initialization takes place, where the robot samples 4 times. Then, in the next 25 seconds, the robot explores from the corresponding direction. I do not understand how this tiny initialization of the GPR can guide the exploratory process. Could someone please explain how the input points (x, z) for the first exploration phase could be estimated?
Any regression algorithm simply maps the input (x, z) to an output y in some way unique to the specific algorithm. For a new input (x0, z0), the algorithm will likely predict something very close to the true output y0 if many similar data points were included in the training. If training data was only available in a vastly different region, the predictions will likely be very bad.
GPR includes a measure of confidence in the predictions, namely the variance. The variance will naturally be very high in regions where no training data has been seen before, and very low close to already-seen data points. If the 'experiment' takes much longer than evaluating the Gaussian Process, you can use the Gaussian Process fit to make sure you sample regions where you are very uncertain of your answer.
If the goal is to fully explore the entire input space, you could draw a lot of random values of (x, z) and evaluate the variance at these values. Then you could perform the costly experiment at the input point where you are most uncertain about y, retrain the GPR with all the data explored so far, and repeat the process.
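A minimal sketch of that loop, using scikit-learn's GaussianProcessRegressor; the run_experiment function stands in for the costly experiment and is made up for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical stand-in for the costly experiment at a point (x, z).
def run_experiment(p):
    return np.sin(p[0]) * np.cos(p[1])

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 5, size=(4, 2))                  # 4 initial samples
y_train = np.array([run_experiment(p) for p in X_train])

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)

for step in range(20):
    gpr.fit(X_train, y_train)
    candidates = rng.uniform(0, 5, size=(500, 2))          # random (x, z) values
    _, std = gpr.predict(candidates, return_std=True)
    next_point = candidates[np.argmax(std)]                # most uncertain point
    X_train = np.vstack([X_train, next_point])
    y_train = np.append(y_train, run_experiment(next_point))
```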
For optimization problems (Not the OP's question)
If you wish to find the lowest value of y across the input space, you are not interested in doing the experiment in regions that you already know give high values of y, even if you are uncertain exactly how high those values will be. So instead of choosing the (x, z) point with the highest variance, you might choose the point where the predicted value of y minus one standard deviation is lowest, trading predicted value against uncertainty. Optimizing this way is called Bayesian Optimization, and this specific scheme is the confidence-bound acquisition, known as Upper Confidence Bound (UCB) in the maximization setting (or Lower Confidence Bound when minimizing). Expected Improvement (EI), the expected amount by which the best score seen so far would be improved, is also commonly used.
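A minimal sketch of the minimization variant with a confidence-bound acquisition; expensive_experiment and the exploration weight kappa are made up for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical expensive function we want to minimize.
def expensive_experiment(x):
    return (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2

rng = np.random.default_rng(1)
X = rng.uniform(-5, 5, size=(5, 2))
y = np.array([expensive_experiment(p) for p in X])
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)

kappa = 1.0                                        # exploration weight (one std here)
for _ in range(30):
    gpr.fit(X, y)
    cand = rng.uniform(-5, 5, size=(1000, 2))
    mean, std = gpr.predict(cand, return_std=True)
    x_next = cand[np.argmin(mean - kappa * std)]   # confidence-bound acquisition
    X = np.vstack([X, x_next])
    y = np.append(y, expensive_experiment(x_next))

print(X[np.argmin(y)])   # should move toward the minimizer near (2, -1)
```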
I have 6 variables with values somewhere between 0 and 2, and a function that takes these variables as input. It predicts the result of a football match by looking at the past matches of both teams.
Depending on the variables, the output of the function obviously changes. Each variable determines how much a certain part of the algorithm is weighted (e.g. how much less a game from 6 months ago should count compared to a game from a week ago).
My goal now is to find the ideal ratios between the different variables, and thus between the different parts of the algorithm, so that my algorithm predicts as many matches as possible correctly. Is there any way of achieving that?
I thought of doing something like this with machine learning, something similar to linear/polynomial regression.
To determine how close a tip (prediction) is, I thought of giving:
2 points when the tendency is right (predicted that Team A would win and Team A did win),
4 points when the goal difference is right (prediction: Team A wins 2:1, actual result: 1:0),
5 points when the exact result is predicted correctly (predicted result: 2:1 and actual result: 2:1).
This would give a loss function of:
maximal points for a game (which is 5) - points awarded for the predicted result
If I am able to minimize that, then hopefully, after looking at some training sets (past seasons), it will score the highest number of points when given a new season, together with the variables computed beforehand, as input.
Now I'm trying to find out by how much, and in which direction, I have to change each of my variables so that the loss gets smaller each time a new training set is looked at.
You probably have to look at how big the loss is, but I don't know how to find out which variable to change and in which direction. Is that possible, and if so, how do I do it?
Currently I'm using JavaScript.
I am assuming that you are trying to use gradient descent to train your regression model.
Loss functions that you can use with gradient descent have to be differentiable, so simply awarding or subtracting points based on certain properties of the prediction is not possible.
A loss function that may be suitable for this task is the mean squared error, which is simply computed by averaging the squared differences between predicted and expected values. Your expected values would then just be the scores of both teams in the game.
You would then have to compute the gradient of the loss of your prediction with respect to the weights that the prediction function uses to compute its outputs. This can be done using backpropagation (details of which are too broad for this answer; there are many tutorials available on the web).
The intuition behind the gradient of a function is that it points in the direction of steepest ascent of the function. If you update your parameters in that direction, the output of your function will grow. If this value is the loss of your prediction function, you want it to be smaller, so you take a small step in the opposite direction of the gradient.
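As a minimal sketch, assuming (purely for illustration) that the prediction is a weighted sum of the 6 variables; a real prediction function will differ, and you then need its gradient via backpropagation or a numerical approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(100, 6))      # 100 past games, 6 variables each (toy data)
true_w = np.array([1.5, 0.2, 0.8, 1.9, 0.1, 1.0])
y = X @ true_w + rng.normal(0, 0.1, 100)  # "actual" outcomes (toy data)

w = np.ones(6)                            # initial guess for the weights
lr = 0.01                                 # learning rate (step size)
for _ in range(2000):
    pred = X @ w
    grad = 2 * X.T @ (pred - y) / len(y)  # gradient of the MSE with respect to w
    w -= lr * grad                        # step against the gradient
print(w)                                  # close to true_w
```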
I came up with the following result, tested on many data sets, but I do not have a formal proof yet:
Theorem: The width $L$ of any confidence interval is asymptotically equal (as $n$ tends to infinity) to a power function of $n$, namely $L = A / n^B$, where $A$ and $B$ are two positive constants depending on the data set and $n$ is the sample size.
See here and here for details. The $B$ exponent seems to be very similar to the Hurst exponent in time series, not only in terms of what it represents, but also in the values that it takes: $B = 1/2$ corresponds to perfect data (no auto-correlation or undesirable features) and $B = 1$ corresponds to "bad data", typically with strong auto-correlations.
Note that $B = 1/2$ is what everyone uses nowadays, assuming observations are independently and identically distributed with an underlying normal distribution. I also devised a method to make the interval width converge to zero faster: $O(1/n)$ rather than $O(1/\sqrt{n})$. This is also described in section 3.3 of my article on re-sampling (here), and my approach in this context seems very much related to what are called second-order accurate intervals (usually achieved with modern versions of bootstrapping, see here).
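As a quick numerical sanity check of the power law in the simplest case, one can simulate i.i.d. normal data, compute the usual 95% confidence interval width for the mean at several sample sizes, and fit $B$ on a log-log scale; the data below are synthetic and only cover the i.i.d. case, where $B$ should come out near $1/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = np.array([100, 200, 400, 800, 1600, 3200])
widths = []
for n in sizes:
    x = rng.normal(0, 1, size=(2000, n))                # 2000 replicated samples of size n
    w = 2 * 1.96 * x.std(axis=1, ddof=1) / np.sqrt(n)   # 95% CI width for the mean
    widths.append(w.mean())                             # average interval width

# Fit log(L) = log(A) - B * log(n) by least squares.
slope, intercept = np.polyfit(np.log(sizes), np.log(widths), 1)
print("B =", -slope, "A =", np.exp(intercept))          # B should be close to 0.5
```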
My question is whether my theorem is original, ground-breaking, and correct, and how someone would prove it (or refute it).
Example of Confidence Interval
Perl code to produce confidence intervals for the correlation
The first problem is: what do you mean by a confidence interval?
Let's say I do non-parametric estimation of a probability density function with a kernel density estimator.
A confidence interval has no direct meaning in this setting. However, you can compute something like the "speed" of convergence of your kernel density estimator to your target function. Depending on the distance you choose between functions, you get different speeds of convergence. For example, the best speed with the $L^{\infty}$ distance involves a $\log(n)$ factor.
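For concreteness, a standard example of such a convergence speed is the sup-norm rate for kernel density estimation over a Hölder smoothness class $\Sigma(\beta, L)$ (the smoothness assumption $\beta$ is introduced here purely for illustration):
$$\sup_{f \in \Sigma(\beta, L)} \mathbb{E}_f \, \| \hat{f}_n - f \|_{\infty} \;\asymp\; \left( \frac{\log n}{n} \right)^{\beta/(2\beta+1)},$$
which, because of the $\log(n)$ factor, is not a pure power of $n$.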
By the way, you give a counterexample yourself in your first article.
So, for me, your theorem cannot hold as stated, for two reasons:
It is not clear: you need to specify exactly what you mean by a confidence interval, and what you mean by "depending on the data set" (does it depend on $n$, the number of observations?).
There is "counter example", since asymptotic speed of convergence of estimators can be more complicated than what you say.
I'm a little confused about the integral over theta in the marginal likelihood function (http://en.wikipedia.org/wiki/Marginal_likelihood, section "Applications" - "Bayesian model comparison", the third equation on this page):
Why does the probability of x given M equal the integral, and how can the equation be derived?
This integral is nothing more than the law of total probability in continuous form. Thus it can be derived directly from the probability axioms. Given the second formula in the link (Wikipedia), the only thing you have to do to arrive at the formula you are looking for is to replace the sum over discrete states by an integral.
So, what does it mean intuitively? You assume a model for your data X, which depends on a variable theta. For a given theta, the probability of a dataset X is thus p(X|theta). As you are not sure about the exact value of theta, you choose it to follow a distribution p(theta|alpha) specified by a (constant) parameter alpha. Now, the distribution of X is directly determined by alpha (this should be clear ... just ask yourself whether there is anything else it might depend on ... and find nothing). Therefore, you can calculate its exact influence by integrating out the variable theta. This is what the law of total probability states.
If this explanation doesn't do it for you, I suggest you play around a bit with conditional probabilities for discrete states, which often leads to intuitive results. The extension to the continuous case is then straightforward.
EDIT: The third equation shows the same thing that I tried to explain above. You have a model M. This model has parameters theta distributed according to p(theta|M) -- you could also write this as p_M(theta), for example.
These parameters determine the distribution of the data X via p(X|theta, M) ... i.e. each theta gives a different distribution of X (for a chosen model M). This form, however, is not convenient to work with. What you want is a summarized statement about the model M, not about its various possible choices of theta. So, in a way, you now want to know the average of X given a model M (note that the model M also includes a chosen distribution of its parameters. For example, M does not simply mean "Neural Network", but rather something like "Neural Network with weights uniformly distributed in [-1,1]").
Obtaining this "average" requires only basic statistics: just take the model, p(X|theta, M), multiply it by the density p(theta|M), and integrate over theta. This is essentially what you do for any average in statistics. Altogether, you arrive at the marginalization p(X|M).
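Written out, the whole computation is just
$$p(X \mid M) \;=\; \int p(X \mid \theta, M)\, p(\theta \mid M)\, d\theta,$$
which is the third equation on the Wikipedia page: the law of total probability in continuous form, with everything conditioned on M.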
This is a supervised learning problem.
I have a directed acyclic graph (DAG). Each edge has a vector of features X, and each node (vertex) has a label 0 or 1. The task is to find a cost function w(X), so that the shortest path between any pair of nodes has the highest ratio of 1s to 0s (minimum classification error).
The solution must generalize well. I tried logistic regression, and the learned logistic function predicts the label of a node fairly well given the features of an incoming edge. However, that approach does not take the graph's topology into account, so the solution over the whole graph is non-optimal. In other words, the logistic function is not a good weight function given the problem setup above.
Although my problem is not the typical binary classification setup, here is a good intro to it:
http://en.wikipedia.org/wiki/Supervised_learning#How_supervised_learning_algorithms_work
Here are some more details:
Each feature vector X is a d-dimensional list of real numbers.
Each edge has a vector of features. That is, given the set of edges E = {e1, e2, ..., en} and the set of feature vectors F = {X1, X2, ..., Xn}, edge ei is associated with vector Xi.
It is possible to come up with a function f(X), so that f(Xi) gives the likelihood that edge ei points to a node labeled 1. An example of such a function is the one I mentioned above, found through logistic regression. However, as I mentioned, such a function is non-optimal.
SO THE QUESTION IS:
Given the graph, a start node and a finish node, how do I learn the optimal cost function w(X), so that the ratio of 1-nodes to 0-nodes along the shortest path is maximized (minimum classification error)?
This is not really an answer, but we need to clarify the question. I might come back later for a possible answer though.
Below is an example DAG.
Suppose the red node is the starting node and the yellow one is the end node. How do you define the shortest path in terms of the highest ratio of 1s to 0s (minimum classification error)?
Edit: I added names for each node and two example names for the top two edges.
It seems to me that you cannot learn such a cost function, one that takes feature vectors as input and whose output (edge weights, or whatever) can guide you to take a shortest path toward any node while respecting the graph topology. The reason is stated below:
Let's assume you don't have the feature vectors you stated. Given a graph as above, if you want to find all-pairs shortest paths with respect to the ratio of 1s to 0s, it is perfectly fine to use the Bellman equation or, more specifically, Dijkstra plus a proper heuristic function (e.g. the percentage of 1s in the path). Another possible model-free approach is Q-learning, in which we get a reward of +1 for visiting a 1-node and -1 for visiting a 0-node. We learn a lookup Q-table for each target node, one at a time. Finally, we have all-pairs shortest paths once all nodes have been treated as target nodes.
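As a minimal sketch of the graph-search part on a toy DAG (the adjacency and labels are made up; stepping into a 0-labelled node is charged cost 1 and a 1-labelled node cost 0, a simple additive proxy for the ratio of 1s to 0s, since the ratio itself is not additive over a path):

```python
import heapq

# Toy DAG: adjacency list plus a 0/1 label per node (all made up).
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
label = {"a": 1, "b": 0, "c": 1, "d": 1}

def best_path(start, goal):
    # Standard Dijkstra; edge costs are non-negative, so it applies directly.
    queue = [(0, start, [start])]
    seen = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen and seen[node] <= cost:
            continue
        seen[node] = cost
        for nxt in graph[node]:
            # Cost 1 for stepping into a 0-node, 0 for a 1-node.
            heapq.heappush(queue, (cost + (1 - label[nxt]), nxt, path + [nxt]))
    return None

print(best_path("a", "d"))   # prefers a -> c -> d, which avoids the 0-node
```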
Now suppose you magically obtained the feature vectors. Since you are able to find the optimal solution without those vectors, how would they help when they exist?
There is one possible condition under which you can use the feature vectors to learn a cost function that optimizes the edge weights: namely, that the feature vectors depend on the graph topology (the links between nodes and the positions of the 1s and 0s). But I did not see this dependency in your description at all, so I guess it does not exist.
This looks like a problem where a genetic algorithm has excellent potential. If you define the desired function as e.g. (but not limited to) a linear combination of the features (you could add quadratic terms, then cubic, ad infinitum), then the gene is the vector of coefficients. The mutator can be just a random offset of one or more coefficients within a reasonable range. The evaluation function is just the average ratio of 1s to 0s along shortest paths for all pairs according to the current mutation. At each generation, pick the best few genes as ancestors and mutate them to form the next generation. Repeat until the über gene is at hand.
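A minimal sketch of this loop; evaluate_gene is a placeholder that you would replace with the average ratio of 1s to 0s along all-pairs shortest paths computed under edge weights w(X) = coefficients · X (here it is faked so the sketch runs):

```python
import numpy as np

def evaluate_gene(coeffs):
    # Placeholder fitness: plug in the real graph-dependent score here.
    # Faked as distance to a made-up "good" coefficient vector.
    target = np.array([0.5, -1.0, 2.0])
    return -np.sum((coeffs - target) ** 2)

rng = np.random.default_rng(0)
dim, pop_size, elite, generations = 3, 40, 5, 200
population = rng.normal(0, 1, size=(pop_size, dim))

for _ in range(generations):
    fitness = np.array([evaluate_gene(g) for g in population])
    parents = population[np.argsort(fitness)[-elite:]]           # best few genes
    children = parents[rng.integers(0, elite, pop_size)]         # copy ancestors
    population = children + rng.normal(0, 0.1, (pop_size, dim))  # random mutation

best = population[np.argmax([evaluate_gene(g) for g in population])]
print(best)
```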
I believe your question is very close to the field of Inverse Reinforcement Learning, where you take certain "expert demonstrations" of optimal paths and try to learn a cost function such that your planner (A* or some reinforcement learning agent) outputs the same path as the expert demonstration. This training is done in an iterative way. I think that in your case, the expert demonstrations could be created by you to be paths that go through the maximum number of nodes labelled 1. Here is a link to a good paper on the same: Learning to Search: Functional Gradient Techniques for Imitation Learning. It is from the robotics community, where motion planning is usually set up as a graph-search problem and learning cost functions is essential for demonstrating desired behavior.