Let's say that we have an algorithm that given a dataset point, it runs some analysis on it and returns the results. The algorithm has a user-defined parameter X that affects the run-time of the algorithm (result of the algorithm is always constant for the same input point). Also, we already know that there's a relation between dataset point and the parameter X. For instance, if two dataset points are close to each other, their parameter X will also be the same.
Can we say that in this example we have the following and thus can use Q-Learning to find the best parameter X given any dataset point?
Initial state: dataset point, current value of X (for initial state = 0)
Terminal state: dataset point, current value of X (the value chosen based on action)
Actions: Different values that X can have
Reward: -1 if execution time decreases, +1 if it increases, 0 if it stays the same
Is it correct if we define different input dataset points as episodes and different values of X as the steps in each episode (where in each step, an action is chosen either randomly or via the network)? In this case, what would be the input to the neural network?
Since all of the examples and implementations I've seen so far are containing several states where each state is dependent on the previous one, I'm confused with my scenario where I only have two states.
Related
I have a neural network with an input layer having 10 nodes, some hidden layers and an output layer with only 1 node. Then I put a pattern in the input layer, and after some processing, it outputs the value in the output neuron which is a number from 1 to 10. After the training this model is able to get the output , provided the input pattern.
Now, my question is, if it is possible to calculate the inverse model: This means, that I provide a number from output side, (i.e. using output side as input) and then getting the random pattern from those 10 input neurons (i.e. using input as output side).
I want to do this because I will first train a network on basis of difficulty of pattern (input is the pattern and output is difficulty to understand the pattern). Then I want to feed the network with a number so it creates the random patterns on basis of difficulty.
I hope I understood your problem correctly, so I will summarize it in my own words: You have a given model, and want to determine the input which yields a given output.
Supposed, that this is correct, there is at least one way I know of, how you can do this approximately. This way is very easy to implement, but might take a while to calculate a value - probably there are better ways to do this, but I am not sure. (I needed this technique some weeks ago in the topic of reinforcement learning, and did not find anything better, compared to this): Lets assume that your Model maps an input to an output . We now have to create a new model, which we will call : This model will later on calculate the inverse of the model , so that it gives you the input which yields a specific output. To construct we will create a new model, which consists of one plain Dense layer which has the same dimension m as the input. This layer will be connected to the input of the model now. Next, you make all weights of non-trainable (this is very important!).
Now we are setup to find an inverse value already: Assuming you want to find the input corresponding (corresponding means here: it creates the output, but is not unique) to the output y. You have to create a new input vector v which is the unity of . Then you create a input-output data pair consisting of (v, y). Now you use any optimizer you wish to let the input-output-trainingdata propagate through your network, until the error converges to zero. Once this has happend, you can calculate the real input, which gives the output y by doing this: Supposed, that the weights if the new input layer are called w, and the bias is b, the desired input u is u = w*1 + b (whereby 1 )
You might be asking for the reason why this equation holds, so let me try to answer it: You model will try to learn the weights of your new input layer, so that the unity as an input will create the given output. As only the newly added input layer is trainable, only this weights will be changed. Therefore, each weight in this vector will represent the corresponding component of the desired input vector. By using an optimizer and minimizing the l^2 distance between the wanted output and the output of our inverse-model , we will finally determine a set of weights, which will give you a good approximation for the input vector.
Data: When I have N rows of data like this: (x,y,z) where logically f(x,y)=z, that is z is dependent on x and y, like in my case (setting1, setting2 ,signal) . Different x's and y's can lead to the same z, but the z's wouldn't mean the same thing.
There are 30 unique setting1, 30 setting2 and 1 signal for each (setting1, setting2)-pairing, hence 900 signal values.
Data set: These [900,3] data points are considered 1 data set. I have many samples of these data sets.
I want to make a classification based on these data sets, but I need to flatten the data (make them all into one row). If I flatten it, I will duplicate all the setting values (setting1 and setting2) 30 times, i.e. I will have a row with 3x900 columns.
Question:
Is it correct to keep all the duplicate setting1,setting2 values in the data set? Or should I remove them and only include the unique values a single time?, i.e. have a row with 30 + 30 + 900 columns. I'm worried, that the logical dependency of the signal to the settings will be lost this way. Is this relevant? Or shouldn't I bother including the settings at all (e.g. due to correlations)?
If I understand correctly, you are training NN on a sample where each observation is [900,3].
You are flatning it and getting an input layer of 3*900.
Some of those values are a result of a function on others.
It is important which function, as if it is a liniar function, NN might not work:
From here:
"If inputs are linearly dependent then you are in effect introducing
the same variable as multiple inputs. By doing so you've introduced a
new problem for the network, finding the dependency so that the
duplicated inputs are treated as a single input and a single new
dimension in the data. For some dependencies, finding appropriate
weights for the duplicate inputs is not possible."
Also, if you add dependent variables you risk the NN being biased towards said variables.
E.g. If you are running LMS on [x1,x2,x3,average(x1,x2)] to predict y, you basically assign a higher weight to the x1 and x2 variables.
Unless you have a reason to believe that those weights should be higher, don't include their function.
I was not able to find any link to support, but my intuition is that you might want to decrease your input layer in addition to omitting the dependent values:
From professor A. Ng's ML Course I remember that the input should be the minimum amount of values that are 'reasonable' to make the prediction.
Reasonable is vague, but I understand it so: If you try to predict the price of a house include footage, area quality, distance from major hub, do not include average sun spot activity during the open home day even though you got that data.
I would remove the duplicates, I would also look for any other data that can be omitted, maybe run PCA over the full set of Nx[3,900].
I am having a problem at hand where,
I need to classify the input data to one or more of the labels S1, S2, S3, S4
There is a relationship between the labels S1, S2, S3 and S4 which is,
If input is labelled Sn it must be labelled S1..Sn.
S1, S2, S3 and S4 are like different stages for an entity X to pass through. Based on input data X might get through one or many of the stages, X must pass through S1 to go to S2, S2 to go to S3 and so on
We want to ensure that only those X are allowed to pass which reach S3, so based on input data we decide whether to allow X to go through S1 or not
What machine learning models can we choose to predict if X reaches S3 if we have information like, input data and what stages X has passed for that input data
I am thinking in direction of a multi label classification There might be some relationship between input data stage S1 and S2
Update: I have to train with examples like
1. Input data is s1
2. Input data is s2
3. ..
4 ..
Some doubts
Your question is far from being clear, for example:
We want to optimize that most X reaches S3, so based on input data we decide whether to allow X to go through S1 or not
Actually suggest, that the best model would be "always answer yes" ,as it maximized number of objects reaching S3 (as it simply lets any object reach this point)
General ideas
I assume two possible interpretations:
You have a labels "pipeline", which simply means, that object cannot be labelled S_n if it has not been already labelled with all S_i for i < n
This does not seem to be the problem for one single model, you can pipeline models in a natural way, ie. train a model 1 which regognizes, if object x should have label S_1. Next, you train a model 2 on all data that has label S_1 in the training set and predict label S_2, and so on. During execution you simply ask each model i if it accepts (labels) the incoming object x, and stop when the first one says "no"
You have some more complex constraints on the labels, which may be strict or not.For such cases, you should try one of many methods of multi label classification with constraints, in particular there is a tech report regarding this aspect of ML.
Solution 1 - approximating test functions
If your problem can be described as:
You have data points X, such that for each of them you know the maximum number of some pipelineable tests T_i which x passes
You want to train a classifier able to predict, what is the maximum number of consequtive tests that your point x passes
You do not have access to actual tests T_i or they are very inefficient
Then the simplest way would be to apply the following training procedure instead of one classifier:
Take all your data points, label those with y=0 as 0 and those with y>=1 as 1 and train some binary classifier (for example SVM). So you simply temporarly relabel your data so it shows points that pass the first test and those who don't. Lets call this classifier cl_1
Now take your data points, label those with y=1 as 0 and those with y>=2 as 1 and again train binary classifier, and call it cl_2
Repest until all tests have their classifier, in general in we call the classifier cl_i when it can distinguish between points labeled with y=i-1 and those with y>=i.
Now, to classify your new point, you simply check iteratively all your cl_i for i=1,..,tests and answer with the largest such i that cl_i(x)=1. So you "simulate" your tests with classifiers, and simply say how many this tests' approximations it passed.
To sum up: each test can be approximated with one binary classifier, and then the question of "What is the biggest consequtive test number that our point passes" is approximated with "what is the biggest consequtive classifier number that out point is classified as true".
Solution 2 - simple regression
You can also simply apply regression from your input space into the number of tests it reaches. Regression actually has an imprinted assumption, that the output values are correlated. So if you train your data with pairs (x,y) where y is the number of last test passed by x, then you are actually using the fact, that the output y=3 is highly related to first getting y=2 in the computations. Such regression (non-linear!) could be simply done using neural networks (possibly regularized)
I've been reading some papers on CRFs and am slightly confused about the feature functions. Unary (node) and binary (edge) features f are normally of the form
f(yc, xc) = 1{yc=y ̃c}fg(xc).
where {.} is the indicator function evaluating to 1 if the condition enclosed is true, and 0 otherwise. fg is a function of the data xc which extracts useful attributes (features) from the data.
Now it seems to me that to create CRF features the true labels (yc) must be known. This is true for training but for the testing phase the true class labels are unknown (since we are trying to determine their most likely value).
Am I missing something? How can this be correctly implemented?
The idea with the CRF is that it assigns a score to each setting of the labels. So what you do, notionally, is compute the scores for all possible label assignments and then whichever labeling gets the biggest score is what the CRF predicts/outputs. This is only going to make sense if the CRF gives different scores to different label assignments. When you think of it that way it's clear that the labels must be involved in the feature functions for this to work.
So lets say the log probability function for your CRF is F(x,y). So it assigns a number to each combination of a data sample x and a labeling y. So when you get a new data sample the predicted label during test time is just argmax_y F(new_x, y). That is, you find the value of y that makes F(new_x,y) the biggest and that's the predicted labeling.
For standard Q-Learning combined with a neural network things are more or less easy.
One stores (s,a,r,s’) during interaction with the environment and use
target = Qnew(s,a) = (1 - alpha) * Qold(s,a) + alpha * ( r + gamma * max_{a’} Qold(s’, a’) )
as target values for the neural network approximation the Q-function. So the input of the ANN is (s,a) and the output is the scalar Qnew(s,a). Deep Q-Learning papers/tutorials change the structure of the Q-function. Instead of providing a single Q-value for the pair (s,a) it should now provide the Q-Values of all possible actions for a state s, so it is Q(s) instead of Q(s,a).
There comes my problem. The data base filled with (s,a,r,s’) does for a specific state s does not contain the reward for all actions. Only for some, maybe just one action. So how to setup the target values for the network Q(s) = [Q(a_1), …. , Q(a_n) ] without having all rewards for the state s in the database? I have seen different loss functions/target values but all contain the reward.
As you see; I am puzzled. Does someone help me? There are a lot of tutorials on the web but this step is in general poorly descried and even less motivated looking at the theory…
You just get the target value corresponding to the action which exists on the observation s,a,r,s'. Basically you get the target value for all actions, and then choose the maximum of them as you wrote yourself: max_{a'} Qold(s', a'). Then, it is added to the r(s,a) and the result is the target value. For example, assume you have 10 actions, and observation is (s_0, a=5, r(s_0,a=5)=123, s_1). Then, the target value is r(s_0,a=5)+ \gamma* \max_{a'} Q_target(s_1,a'). For example, with tensorflow it could be something like:
Q_Action = tf.reduce_sum(tf.multiply(Q_values,tf.one_hot(action,output_dim)), axis = 1) # dim: [batchSize , ]
in which Q_values is of size batchSize, output_dim. So, the output is a vector of size batchSize, and then there exists a vector of same size obtained as the target value. The loss is the square of the differences of them.
When you calculate loss value, also you only run the backward on the existing action and the gradient from the other action is just zero.
So, you only need the reward of the existing action.