Weka OneR gives ? as classifier model - machine-learning

My Weka OneR models are all returning what seems like an overfit set, concluding with a question mark leading to a certain results like so:
FollowersMeanCoords_Col:
< 0.33340000000000003 -> False
>= 0.33340000000000003 -> True
? -> False
(114357/163347 instances correct)
Is this OneR simply saying "I can't find anything, so we assume the rest is false"? But then, why is there a clear cut in the date (everything below 0.33 is False, above or equal is True)? And is there a way to prevent this?
Thanks in advance!

The ? refers to missing values - your training data must have some values of FollowersMeanCoords_Col missing for some instances.
The model in your question says that if FollowersMeanCoords_Col for an instance (data point) is less than 0.3334..., or is missing, it will classify the instance as False, otherwise it will classify it as True.
OneR is a very simple classification algorithm which works by finding the one attribute from the training data that gives the least error when used to make a classification rule. For OneR to overfit there would need to be an attribute that happened to classify the training data well, but didn't generalise to future test data. It's more likely that OneR will give you models that are robust but inaccurate.

Related

what does lightgbm python Dataset reference parameter mean?

I am trying to figure out how to train a gbdt classifier with lightgbm in python, but getting confused with the example provided on the official website.
Following the steps listed, I find that the validation_data comes from nowhere and there is no clue about the format of the valid_data nor the merit or avail of training model with or without it.
Another question comes with it is that, in the documentation, it is said that "the validation data should be aligned with training data", while I look into the Dataset details, I find that there is another statement shows that "If this is Dataset for validation, training data should be used as reference".
My final questions are, why should validation data be aligned with training data? what is the meaning of reference in Dataset and how is it used during training? is the alignment goal accomplished with reference set to training data? what is the difference between this "reference" strategy and cross-validation?
Hope someone could help me out of this maze, thanks!
The idea of "validation data should be aligned with training data" is simple :
every preprocessing you do to the training data, you should do it the same way for validation data and in production of course. This apply to every ML algorithm.
For example, for neural network, you will often normalize your training inputs (substract by mean and divide by std).
Suppose you have a variable "age" with mean 26yo in training. It will be mapped to "0" for the training of your neural network. For validation data, you want to normalize in the same way as training data (using mean of training and std of training) in order that 26yo in validation is still mapped to 0 (same value -> same prediction).
This is the same for LightGBM. The data will be "bucketed" (in short, every continuous value will be discretized) and you want to map the continuous values to the same bins in training and in validation. Those bins will be calculated using the "reference" dataset.
Regarding training without validation, this is something you don't want to do most of the time! It is very easy to overfit the training data with boosted trees if you don't have a validation to adjust parameters such as "num_boost_round".
still everything is tricky
can you share full example with using and without using this "reference="
for example
will it be different
import lightgbm as lgbm
importance_type_LGB = 'gain'
d_train = lgbm.Dataset(train_data_with_NANs, label= target_train)
d_valid = lgbm.Dataset(train_data_with_NANs, reference= target_train)
lgb_clf = lgbm.LGBMClassifier(class_weight = 'balanced' ,importance_type = importance_type_LGB)
lgb_clf.fit(test_data_with_NANs,target_train)
test_data_predict_proba_lgb = lgb_clf.predict_proba(test_data_with_NANs)
from
import lightgbm as lgbm
importance_type_LGB = 'gain'
lgb_clf = lgbm.LGBMClassifier(class_weight = 'balanced' ,importance_type = importance_type_LGB)
lgb_clf.fit(test_data_with_NANs,target_train)
test_data_predict_proba_lgb = lgb_clf.predict_proba(test_data_with_NANs)

Keras LSTM: Injecting already-known *future* values into prediction

I've built an LSTM In Keras with the goal of predicting future values of a time-series from a high-dimensional, time-index input.
However, there's a unique requirement: for certain time points in the future, we know with certainty what some values of the input series will be. For example:
model = SomeLSTM()
trained_model = model.train(train_data)
known_data = [(24, {feature: 2, val: 7.0}), (25, {feature: 2, val: 8.0})]
predictions = trained_model(look_ahead=48, known_data=known_data)
Which would train the model up to time t (the end of training), and predict forward 48 time periods from time t, but substituting known_data values for feature 2 at times 24 and 25.
How exactly can I explicitly inject this into the LSTM at some time?
For reference, here's the model:
model = Sequential()
model.add(LSTM(hidden, input_shape=(look_back, num_features)))
model.add(Dropout(dropout))
model.add(Dense(look_ahead))
model.add(Activation('linear'))
This may be a result of my un-intuitive grasp of LSTMs, and I'd appreciate any clarification. I've dived into the Keras source code, and my first guess is to inject it right into the LSTM state variable, but I'm unsure how to do that at time t (or even if that is correct.)
I think a clean way of doing this is to introduce 2*look_ahead new features, where for each 0 <= i < look_ahead 2*i-th feature is an indicator whether the value of the i-th time step is known and (2*i+1)-th is the value itself (0 if not known). Accordingly, you can generate training data with these features to make your model take into account these known values.
I am not exactly sure what you are trying to do, but maybe create your own layer to go at the end that sets the data to the known values, similar to how dropout sets random values to zero. As a side note, I have had better results with pooling than dropout, so maybe try switching that out and training it. Here is a good guide on how to do it. https://www.tutorialspoint.com/keras/keras_customized_layer.htm

Using machine learning to estimate likelihood of an even occurrence given a stream of data

I have a stream of data (e.g. 3D position) generating by a system which it looks like:
(pos1, time1) (Pos2, time2) (pos3, time3) ...
I want to use a machine learning technique to estimate the likelihood (or detect) of a particular event from given stream of data.
What I have done:
I've tagged my data at every frame by YES if the event occurred at that frame, otherwise it is set to NO.
(pos1, time1, NO) (Pos2, time2, Yes) (pos3, time3, NO) ...(posK, timeK, Yes)...
set a window length like L to train model by giving L consecutive frames and the corresponding tag is set by the tag of the last element on that window:
(pos1, Pos2, pos3, NO)
(pos2, Pos3, pos4, NO)
(pos3, Pos4, pos5, NO)
...
(posK-2, PosK-1, posK, YES)
...
Finally, I trained my model by this set of that.
For Testing, I concatenate L consecutive frames and ask the model to find the corresponding tag for this set of data (e.g. YES or NO).
I realize that occurrence of "NO" is a lot more frequent that "YES". Simply because the system is mostly on idle state and I have no event. So it affects on the training.
Could you give me some hints:
1) what type of machine learning model is the best fit for this problem.
2) At the moment I am classifying the output either "YES" or "NO" but I would like to have the probability of occurrence of the event at anytime. What kind of model is do you suggest?
Thanks
I think there are actually two questions, here: how to build the dataset, and which predictor to use.
For building the dataset, at some time point i, make sure to choose the &ell; instances happening before i (the phrasing in your question made it seem that you're choosing the one including i). The label of the outcome should be the one at i, though. After all, you're attempting to predict the future based on the present, no? Predicting the present based on the present is rather easy.
Another point is how to choose &ell;, or even whether to choose a single &ell;. Note that if you choose a number of different values of &ell;, then you get a multivariate model.
Finally, the question you directly asked is which predictor to use. This is too wide to answer without knowing your dataset (and playing with it). You might want to read about the bias-variance tradeoff to see why there is no "best" predictor for some problem.
Having said that, I'd suggest that you start with logistic regression which is a simple and robust classifier that also outputs probabilities (as you asked).

How do you actually apply a trained model?

I've been slowly going through the tensorflow tutorials, and I assume I will have to again. I don't have a background in ML but am slowly pushing my way up.
Anyway, after reading through the RNN tutorial and running the training code, I am confused.
How does one actually apply the trained model so that it can be used to make language predictions?
I know this is a terrible noobish and simple question, but I believe it will be of use to others, as I have seen it asked and not answered in a satisfactory way.
In general, when you train a model, you first do a forward pass, and then a backward pass. The forward pass makes a prediction based on your input data, and the backward pass adjust your model based on how correct your prediction was.
So when you want to apply your model, you just do a forward pass with your new data as input.
In your particular example, using this code, you can see how it's done by looking at how they run the test set, starting line 286.
# They instantiate the model with is_training=False
mtest = PTBModel(is_training=False, config=eval_config)
# Then they can do a forward pass
test_perplexity = run_epoch(session, mtest, test_data, tf.no_op())
print("Test Perplexity: %.3f" % test_perplexity)
And if you want the actual prediction and not the perplexity, it is the state in the run_epoch function :
cost, state, _ = session.run([m.cost, m.final_state, eval_op],
{m.input_data: x,
m.targets: y,
m.initial_state: state})

How to decide numClasses parameter to be passed to Random Forest algorithm in SPark MLlib with pySpark

I am working on Classification using Random Forest algorithm in Spark have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row will serve as a label and rest serve as features. But I want to treat label as a category and not a number. So 165.22352099 will denote a category and so will -552.8234. For this I have encoded my features as well as label into categorical data. Now what I am having difficulty in is deciding what should I pass for numClasses parameter in Random Forest algorithm in Spark MlLib? I mean should it be equal to number of unique values in my label? My label has like 10000 unique values so if I put 10000 as value of numClasses then wouldn't it decrease the performance dramatically?
Here is the typical signature of building a model for Random Forest in MlLib:
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
numTrees=3, featureSubsetStrategy="auto",
impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something that you should not do. You problem is clearly a regression/ranking, not a classification. Why would you think about it as a classification? Try to answer these two questions:
Do you have at least 100 samples per each value (100,000 * 100 = 1,000,000)?
Is there completely no structure in the classes, so for example - are objects with value "200" not more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered twice yes, then the answer is: "yes, you should encode each distinct value as a different class" thus leading to 10000 unique classes, which leads to:
extremely imbalanced classification (RF, without balancing meta-learner will nearly always fail in such scenario)
extreme number of classes (there are no models able to solve it, for sure RF will not solve it)
extremely small dimension of the problem- looking at as small is your number of features I would be surprised if you could predict from that binary classifiaction. As you can see how irregular are these values, you have 3 points which only diverge in first value and you get completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up, with nearly 100% certainty this is not a classification problem, you should either:
regress on last value (keyword: reggresion)
build a ranking (keyword: learn to rank)
bucket your values to at most 10 different values and then - classify (keywords: imbalanced classification, sparse binary representation)

Resources