Conditional Random Field feature functions

I've been reading some papers on CRFs and am slightly confused about the feature functions. Unary (node) and binary (edge) features f are normally of the form
f(y_c, x_c) = 1{y_c = ỹ_c} · f_g(x_c),
where 1{·} is the indicator function, evaluating to 1 if the enclosed condition is true and 0 otherwise, and f_g is a function of the data x_c which extracts useful attributes (features) from it.
Now it seems to me that to create CRF features the true labels (y_c) must be known. This is true for training, but in the testing phase the true class labels are unknown (since we are trying to determine their most likely value).
Am I missing something? How can this be correctly implemented?

The idea of a CRF is that it assigns a score to each setting of the labels. What you do, notionally, is compute the scores for all possible label assignments; whichever labeling gets the biggest score is what the CRF predicts/outputs. This only makes sense if the CRF gives different scores to different label assignments, and when you think of it that way it is clear that the labels must be involved in the feature functions for this to work.
Let's say the log-probability function for your CRF is F(x, y): it assigns a number to each combination of a data sample x and a labeling y. When you get a new data sample, the predicted labeling at test time is just argmax_y F(new_x, y). That is, you find the value of y that makes F(new_x, y) the biggest, and that is the predicted labeling.
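To make this concrete, here is a toy brute-force sketch of that argmax (all names here are illustrative, not from any particular CRF library; real CRFs replace the enumeration with dynamic programming such as the Viterbi algorithm):

from itertools import product

LABELS = ["A", "B"]

def F(x, y, feature_fns, weights):
    # Log-score of labeling y for input x: a weighted sum of feature
    # functions, each of which sees both the data and the candidate labels.
    return sum(w * f(x, y) for f, w in zip(feature_fns, weights))

def predict(x, feature_fns, weights):
    # argmax_y F(x, y): enumerate every possible label assignment.
    # This is exponential in len(x); it is only for illustration.
    candidates = product(LABELS, repeat=len(x))
    return max(candidates, key=lambda y: F(x, y, feature_fns, weights))

# Example feature function of the form 1{y_t = "A"} * f_g(x_t):
f1 = lambda x, y: sum(1.0 for t in range(len(x)) if y[t] == "A" and x[t] > 0)

print(predict([0.5, -1.2, 3.0], [f1], [2.0]))  # -> ('A', 'A', 'A')

For a length-n sequence with k labels this scores k**n candidates, which is why exact inference in chain CRFs is done with Viterbi rather than enumeration.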


Does Q-Learning apply here?

Let's say that we have an algorithm that, given a dataset point, runs some analysis on it and returns a result. The algorithm has a user-defined parameter X that affects its run time (the result of the algorithm is always the same for the same input point). Also, we already know that there is a relation between the dataset point and the parameter X: for instance, if two dataset points are close to each other, their parameter X will also be the same.
Can we say that in this example we have the following and thus can use Q-Learning to find the best parameter X given any dataset point?
Initial state: dataset point, current value of X (for initial state = 0)
Terminal state: dataset point, current value of X (the value chosen based on action)
Actions: Different values that X can have
Reward: -1 if execution time decreases, +1 if it increases, 0 if it stays the same
Is it correct to define different input dataset points as episodes and different values of X as the steps in each episode (where at each step an action is chosen either randomly or via the network)? In that case, what would be the input to the neural network?
Since all of the examples and implementations I've seen so far contain several states, each dependent on the previous one, I'm confused by my scenario, where I only have two states.
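For what it's worth, here is a minimal tabular sketch of the setup as described (every name here is illustrative: run_algorithm stands in for the analysis routine and baseline_time for a reference run time). With only an initial and a terminal state per episode, the Q-learning update degenerates to a one-step rule with no bootstrap term, which is essentially a contextual bandit:

import random
from collections import defaultdict

ACTIONS = [1, 2, 4, 8]      # candidate values of the parameter X (illustrative)
ALPHA, EPSILON = 0.1, 0.2   # learning rate and exploration rate
Q = defaultdict(float)      # Q-table keyed by (dataset point, action)

def run_episode(point, run_algorithm, baseline_time):
    # Choose X epsilon-greedily for this dataset point.
    if random.random() < EPSILON:
        x = random.choice(ACTIONS)
    else:
        x = max(ACTIONS, key=lambda a: Q[(point, a)])
    elapsed = run_algorithm(point, x)  # returns the measured run time
    # Reward scheme exactly as stated in the question.
    if elapsed < baseline_time:
        reward = -1
    elif elapsed > baseline_time:
        reward = 1
    else:
        reward = 0
    # One-step episode: the terminal state follows immediately, so there
    # is no discounted max-over-next-actions term in the update.
    Q[(point, x)] += ALPHA * (reward - Q[(point, x)])
    return x, elapsed

If a neural network replaces the table, its natural input would be the dataset point's features, with one output per candidate value of X.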

Using sample_weights with fit_generator()

In an autoregressive continuous problem, when zeros take up too much of the output, it is possible to treat the situation as a zero-inflated problem (i.e. ZIB). In other words, instead of working to fit f(x), we want to fit g(x)*f(x), where f(x) is the function we want to approximate, i.e. y, and g(x) is a function that outputs a value between 0 and 1 depending on whether a value is zero or non-zero.
Currently, I have two models: one model which gives me g(x) and another model which fits g(x)*f(x).
The first model gives me a set of weights. This is where I need your help. I can use the sample_weight argument with model.fit(), but as I work with a tremendous amount of data, I need to use model.fit_generator(). However, fit_generator() does not have a sample_weight argument.
Is there a workaround to use sample weights with fit_generator()? Otherwise, how can I fit g(x)*f(x), knowing that I already have a trained model for g(x)?
You can provide sample weights as the third element of the tuple returned by the generator. From Keras documentation on fit_generator:
generator: A generator or an instance of Sequence (keras.utils.Sequence) object in order to avoid duplicate data when using multiprocessing. The output of the generator must be either
a tuple (inputs, targets)
a tuple (inputs, targets, sample_weights).
Update: Here is a rough sketch of a generator that returns the input samples and targets as well as the sample weights obtained from model g(x):
def gen(args):
    while True:
        for i in range(num_batches):
            # get the i-th batch of data
            inputs = ...
            targets = ...
            # get the sample weights from the trained model g(x)
            weights = g.predict(inputs)
            yield inputs, targets, weights

model.fit_generator(gen(args), steps_per_epoch=num_batches, ...)

Are data dependencies relevant when preparing data for neural network?

Data: I have N rows of data of the form (x, y, z), where logically f(x, y) = z, i.e. z depends on x and y; in my case (setting1, setting2, signal). Different x's and y's can lead to the same z, but those z's wouldn't mean the same thing.
There are 30 unique setting1, 30 setting2 and 1 signal for each (setting1, setting2)-pairing, hence 900 signal values.
Data set: These [900,3] data points are considered 1 data set. I have many samples of these data sets.
I want to make a classification based on these data sets, but I need to flatten the data (turn each data set into one row). If I flatten it, I will duplicate all the setting values (setting1 and setting2) 30 times, i.e. I will have a row with 3x900 = 2700 columns.
Question:
Is it correct to keep all the duplicate setting1/setting2 values in the data set, or should I remove them and include the unique values only once, i.e. have a row with 30 + 30 + 900 columns? I'm worried that the logical dependency of the signal on the settings will be lost this way. Is this relevant? Or shouldn't I bother including the settings at all (e.g. due to correlations)?
If I understand correctly, you are training a NN on a sample where each observation is [900, 3].
You are flattening it and getting an input layer of 3*900.
Some of those values are the result of a function of the others.
It matters which function: if it is a linear function, the NN might not work well:
From here:
"If inputs are linearly dependent then you are in effect introducing
the same variable as multiple inputs. By doing so you've introduced a
new problem for the network, finding the dependency so that the
duplicated inputs are treated as a single input and a single new
dimension in the data. For some dependencies, finding appropriate
weights for the duplicate inputs is not possible."
Also, if you add dependent variables, you risk biasing the NN towards those variables.
E.g., if you run LMS on [x1, x2, x3, average(x1,x2)] to predict y, you effectively assign a higher weight to the x1 and x2 variables.
Unless you have a reason to believe those weights should be higher, don't include a function of them. A quick sketch below illustrates the point.
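Here is a minimal numpy sketch of that point (hypothetical data, not from the question): adding average(x1, x2) as a fourth input makes the design matrix rank-deficient, so least squares no longer has a unique way to split the weight between x1, x2 and their average:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # x1, x2, x3
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Append the dependent column average(x1, x2).
X_dep = np.hstack([X, X[:, :2].mean(axis=1, keepdims=True)])

print(np.linalg.matrix_rank(X_dep))            # 3, not 4: columns are dependent
w, *_ = np.linalg.lstsq(X_dep, y, rcond=None)
print(w)  # one of infinitely many weight vectors that fit equally well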
I was not able to find a link to support this, but my intuition is that you might want to shrink your input layer in addition to omitting the dependent values:
From Professor A. Ng's ML course I remember that the input should be the minimum set of values that are 'reasonable' for making the prediction.
'Reasonable' is vague, but I understand it like this: if you try to predict the price of a house, include footage, area quality, and distance from a major hub; do not include average sunspot activity during the open-house day, even though you have that data.
I would remove the duplicates, and I would also look for any other data that can be omitted, maybe by running PCA over the full set of N x [900, 3] (a rough sketch follows).
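As a rough sketch of that last suggestion (shapes assumed from the question, random data as a stand-in for your samples):

import numpy as np
from sklearn.decomposition import PCA

datasets = np.random.rand(50, 900, 3)         # stand-in for N data sets of [900, 3]
X_flat = datasets.reshape(len(datasets), -1)  # one row per data set: (N, 2700)

pca = PCA(n_components=0.95)                  # keep 95% of the variance
X_reduced = pca.fit_transform(X_flat)
print(X_reduced.shape)                        # typically far fewer than 2700 columns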

How to decide the numClasses parameter to be passed to the Random Forest algorithm in Spark MLlib with PySpark

I am working on classification using the Random Forest algorithm in Spark MLlib and have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row will serve as the label and the rest serve as features. But I want to treat the label as a category and not as a number, so 165.22352099 will denote a category and so will -552.8234. For this I have encoded my features as well as my label into categorical data. What I am now having difficulty with is deciding what I should pass for the numClasses parameter of the Random Forest algorithm in Spark MLlib. Should it be equal to the number of unique values in my label? My label has around 10000 unique values, so if I put 10000 as the value of numClasses, wouldn't that decrease the performance dramatically?
Here is the typical signature of building a model for Random Forest in MlLib:
model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something you should not do. Your problem is clearly regression/ranking, not classification. Why would you think about it as classification? Try to answer these two questions:
Do you have at least 100 samples per value (10,000 * 100 = 1,000,000 samples)?
Is there really no structure in the classes? For example, are objects with value "200" no more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered yes twice, then the answer is: "yes, you should encode each distinct value as a different class", leading to 10000 unique classes, which in turn leads to:
extremely imbalanced classification (RF without a balancing meta-learner will nearly always fail in such a scenario)
an extreme number of classes (there are no models able to solve this; RF certainly will not)
an extremely small dimensionality of the problem: given how few features you have, I would be surprised if you could predict even a binary classification from them. And you can see how irregular these values are: you have 3 points which differ only in the first value and get completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up: with nearly 100% certainty this is not a classification problem. You should either:
regress on the last value (keyword: regression; see the sketch below)
build a ranking (keyword: learning to rank)
bucket your values into at most 10 different values and then classify (keywords: imbalanced classification, sparse binary representation)
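For the regression route, here is a minimal sketch using the same MLlib API as the trainClassifier call above (assuming trainingData is an RDD of LabeledPoint whose label is the raw numeric value, not an encoded category):

from pyspark.mllib.tree import RandomForest

# Note there is no numClasses here: the regressor predicts the numeric
# value directly, and 'variance' replaces 'gini' as the impurity measure.
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)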

Machine Learning Model for Multi-Label Classification where we know relationship between the labels

I have a problem at hand where:
I need to classify input data into one or more of the labels S1, S2, S3, S4.
There is a relationship between the labels S1, S2, S3 and S4, which is:
if an input is labelled Sn, it must also be labelled S1..Sn.
S1, S2, S3 and S4 are like different stages for an entity X to pass through. Based on the input data, X may get through one or many of the stages; X must pass through S1 to go to S2, through S2 to go to S3, and so on.
We want to ensure that only those X which will reach S3 are allowed to pass, so based on the input data we decide whether to let X go through S1 or not.
What machine learning models can we choose to predict whether X reaches S3, given information like the input data and the stages X has passed for that input data?
I am thinking in the direction of multi-label classification. There might be some relationship between the input data and stages S1 and S2.
Update: I have to train with examples like:
1. Input data is S1
2. Input data is S2
3. ..
4. ..
Some doubts
Your question is far from clear; for example:
"We want to optimize that most X reach S3, so based on input data we decide whether to allow X to go through S1 or not"
actually suggests that the best model would be "always answer yes", as that maximizes the number of objects reaching S3 (it simply lets every object reach this point).
General ideas
I assume two possible interpretations:
You have a label "pipeline", which simply means that an object cannot be labelled S_n if it has not already been labelled with all S_i for i < n.
This does not call for one single model; you can pipeline models in a natural way, i.e. train a model 1 which recognizes whether object x should have label S_1. Next, train a model 2 on all data that has label S_1 in the training set, to predict label S_2, and so on. At prediction time you simply ask each model i whether it accepts (labels) the incoming object x, and stop when the first one says "no".
You have some more complex constraints on the labels, which may be strict or not. For such cases, you should try one of the many methods of multi-label classification with constraints; in particular, there is a tech report regarding this aspect of ML.
Solution 1 - approximating test functions
If your problem can be described as:
You have data points X such that for each of them you know the maximum number of some pipelineable tests T_i which x passes
You want to train a classifier able to predict the maximum number of consecutive tests that a point x passes
You do not have access to the actual tests T_i, or they are very inefficient
Then the simplest way would be to apply the following training procedure instead of one classifier:
Take all your data points, label those with y=0 as 0 and those with y>=1 as 1, and train some binary classifier (for example an SVM). You simply temporarily relabel your data so it marks the points that pass the first test and those that don't. Let's call this classifier cl_1.
Now take your data points, label those with y=1 as 0 and those with y>=2 as 1, and again train a binary classifier; call it cl_2.
Repeat until all tests have their classifier; in general we call the classifier cl_i when it distinguishes between points labelled with y=i-1 and those with y>=i.
Now, to classify a new point, you simply check all your cl_i iteratively for i=1,...,tests, and answer with the largest i such that cl_i(x)=1. So you "simulate" your tests with classifiers, and simply report how many of these test approximations the point passed.
To sum up: each test can be approximated with one binary classifier, and the question "what is the biggest consecutive test number that our point passes" is approximated with "what is the biggest consecutive classifier number for which our point is classified as true". A sketch of this cascade follows.
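A minimal sketch of the cascade (illustrative only; SVMs stand in for whatever binary classifier you prefer, and X, y are assumed to be a feature matrix and the number of consecutive tests each point passes):

from sklearn.svm import SVC

def train_cascade(X, y, num_tests):
    classifiers = []
    for i in range(1, num_tests + 1):
        # cl_i separates points with y = i-1 (label 0) from y >= i (label 1).
        # Points with y < i-1 never reach this stage, so they are excluded.
        mask = y >= i - 1
        cl_i = SVC().fit(X[mask], (y[mask] >= i).astype(int))
        classifiers.append(cl_i)
    return classifiers

def predict_tests_passed(classifiers, x):
    # Answer with the largest i such that cl_i(x) = 1.
    passed = 0
    for cl_i in classifiers:
        if cl_i.predict(x.reshape(1, -1))[0] == 0:
            break
        passed += 1
    return passed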
Solution 2 - simple regression
You can also simply apply regression from your input space to the number of tests the point passes. Regression has a built-in assumption that the output values are related, so if you train on pairs (x, y), where y is the number of the last test passed by x, you are actually exploiting the fact that the output y=3 is closely related to first achieving y=2. Such a regression (non-linear!) could simply be done using neural networks (possibly regularized); a sketch follows.
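A matching sketch of the regression route (again illustrative; X, y and num_tests are assumed to be as in the cascade sketch above):

import numpy as np
from sklearn.neural_network import MLPRegressor

# Regularized (alpha) non-linear regression from inputs to the number of
# consecutive tests passed; predictions are rounded back to a valid count.
reg = MLPRegressor(hidden_layer_sizes=(32,), alpha=1e-3, max_iter=1000)
reg.fit(X, y)
pred = np.clip(np.round(reg.predict(X)), 0, num_tests).astype(int)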
