Back to basics for the XOR problem - fundamentaly confused

Back to basics for the XOR problem - fundamentaly confused - machine-learning

This is something that has been bothering me for a while about XOR and MLP; it may be basic (if so, apoligies in advance), but I would like to know.
There are many approaches to solving XOR with MLP, but generally they look like this:
from sklearn.neural_network import MLPClassifier
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
model = MLPClassifier(
activation='relu', max_iter=1000, hidden_layer_sizes=(4,2))
Now to fit the model:
model.fit(X, y)
And, guess what?
print('score:', model.score(X, y))
outputs a perfect
score: 1.0
But what is being predicted and scored? In the case of XOR we have a dataset which, by definition(!) has four rows, two features and one binary label. There is no standard X_train, y_train, X_test, y_test to work with. By definition, again, there is no unseen data for the model to digest.
The prediction takes place in the form of
model.predict(X)
which is exactly the same X that training was performed on.
So doesn't the model just spit back the y it was trained on? How do we know the model "learned" anything?
EDIT: Just to try to clarify what baffles me - the features have 2 and only 2 unique values; the 2 unique values have 4 and only 4 possible combinations. The right label for each possible combination is already present in the label column. So what is there for the model to "learn" when fit() is called? And how is this "learning" performed? How can the model ever be "wrong" when it has access to the "right" answer for each possible combination of inputs?
Again, sorry for what is probably a very basic question.

The key thing is that XOR problem was proposed to demonstrate how some models can learn non-linear problems and some models can't.
So when a model gets 1.0 accuracy on the dataset you mentioned, it's notable since it has learned a non-linear problem. The fact that it has learned the training data is enough for us to know that it can [potentially] learn non-linear models. Notice that if this wasn't the case your model would get a very low accuracy like 0.25 since it divides the 2D space into two sub-spaces by a line.
To understand this better, let's see a case where a model can't learn the data under this same circumstances:
import tensorflow as tf
import numpy as np
X = np.array(X)
y = np.array(y)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(2, activation='relu'))
model.compile(optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.1), loss='categorical_crossentropy', metrics=['categorical_accuracy'])
model.fit(X, y, epochs=100)
_, acc = model.evaluate(X, y)
print('acc = ' + str(acc))
which gives:
acc = 0.5
As you can see this model can't classify the data it has already seen. The reason is, this is a non-linear data and our model can only classify linear data.(here is a link to understand the non-linearity of XOR problem better). As soon as we add another layer to our network it will be able to solve this problem:
import tensorflow as tf
import numpy as np
X = np.array(X)
y = np.array(y)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(1, activation='relu'))
model.add(tf.keras.layers.Dense(2, activation='relu'))
tb_callback = tf.keras.callbacks.TensorBoard(log_dir='./test/', write_graph=True)
model.compile(optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.1), loss='categorical_crossentropy', metrics=['categorical_accuracy'])
model.fit(X, y, epochs=5, callbacks=[tb_callback, ])
acc = model.evaluate(X, y)
print('acc = ' + str(acc))
which gives:
acc = 1.0
By adding only one neuron our model learned to do what it couldn't learn in 100 epochs with 1 layer (even though it had already seen the data).
So to sum up, it is correct that our dataset is so small that the network can easily memorize it but the XOR problem is important because it means there are networks that can't memorize this data no matter what.
Having said that however, there are varsities of XOR problems with proper train and test sets. here is one (the plot is slightly different):
import numpy as np
import matplotlib.pyplot as plt
x1 =np.concatenate([np.random.uniform(0, 100, 100), np.random.uniform(-100, 0, 100)])
y1 =np.concatenate([np.random.uniform(-100, 0, 100), np.random.uniform(0, 100, 100)])
x2 =np.concatenate([np.random.uniform(0, 100, 100), np.random.uniform(-100, 0, 100)])
y2 =np.concatenate([np.random.uniform(0, 100, 100), np.random.uniform(-100, 0, 100)])
plt.scatter(x1, y1, c='red')
plt.scatter(x2, y2, c='blue')
plt.show()
hope that helped ;))

Related

Why doesn't UMAP map similar data to same region in learned latent space?

My question concerns the ability of UMAP to transform new data in a way that intuitively makes sense. There is a nice example of this in the documentation: https://umap-learn.readthedocs.io/en/latest/transform.html. Here MNIST data is used to train a model, and new (withheld) MNIST data is passed to the model, with the result that the new data is mapped into the expected regions of learned space (i.e. same as training data). However, when I try this on a synthetic dataset I'm unable to reproduce this behavior.
I first created a set of 500 training examples, X1, each with 32 features. These are generated randomly using np.random.rand and have values between 0 and 1. I then created a small test set of 10 examples, X2, where all examples are identical to one of the training samples, except it is multiplied by 10. It is therefore much different than training data. See plot comparing a single example from X1 and X2.
import numpy as np
import random
from sklearn.decomposition import PCA
import umap
from matplotlib import pyplot as plt
X1 = np.random.rand(500, 32)
X2 = np.copy(X1[0:10])
X2[0:10] = X1[0]*10
plt.plot(X1[0], 'go-', label='train')
plt.plot(X2[0], 'bo-', label='test')
plt.legend();
Next I compare the performance of PCA and UMAP trained on X1 and used to transform X2. The results are shown below.
pca = PCA(n_components=2, random_state=11)
X1_pca = pca.fit_transform(X1)
X2_pca = pca.transform(X2)
umap = umap.UMAP(n_components=2, random_state=11)
X1_umap = umap.fit_transform(X1)
X2_umap = umap.transform(X2)
plt.subplot(1, 2, 1)
plt.scatter(X1_pca[:, 0], X1_pca[:, 1], s=40, c='blue', alpha=0.25, label='train')
plt.scatter(X2_pca[:, 0], X2_pca[:, 1], s=40, c='green', alpha=0.25, label='test')
plt.legend()
plt.title('PCA')
plt.subplot(1, 2, 2)
plt.scatter(X1_umap[:, 0], X1_umap[:, 1], s=40, c='blue', alpha=0.25, label='train')
plt.scatter(X2_umap[:, 0], X2_umap[:, 1], s=40, c='green', alpha=0.25, label='test')
plt.legend()
plt.title('UMAP')
plt.show()
PCA generates a result that one would naively expect. The training data is clustered together, the test data is clearly separated, and all the points within the test data are mapped to the same point in 2D space. UMAP, on the other hand, does neither. That is, the test data is not clearly differentiated from the training data, and even though the test data consists of identical features, they are not even mapped to the same region in the 2D space.
Am I doing something wrong, or is my understanding of how UMAP should work incorrect? Any help is greatly appreciated.

Model validation with separate dataset

Apologies, quite new to sklearn. I'm trying to validate a model using an external dataset for binary classification of text strings. I've trained the model but want to use it against another dataset of a different size for prediction rather than include the data in the initial dataset split. Is this even possible?
Initial split
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(3, 3))
Xprod = vectorizer.fit_transform(prod_good)
X = vectorizer.fit_transform(total_requests)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=21)
Test the model
linear_svm=LinearSVC(C=1)
linear_svm.fit(X_train, y_train)
y_pred = linear_svm.predict(X_test)
score_test = metrics.accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
New prediction
newpred = linear_svm.predict(Xprod)
...
Error:
ValueError: X has 4553 features per sample; expecting 24422
Think I'm misunderstanding some basic concepts here

The function fit_transform makes a fit and then a transform. So this line fits your vectorizer and then transforms total_requests to X:
X = vectorizer.fit_transform(total_requests)
As your vectorizer must be fitted only one time (in order to have the same matrix of features each time you use your vectorizer), to compute Xprod, you just need to use transform:
Xprod = vectorizer.transform(prod_good)
Also, you need to compute Xprod after the vectorizer is fitted, so compute Xprod after X.

How to interpret MSE in Keras Regressor

I am trying to build a model to predict house prices.
I have some features X (no. of bathrooms , etc.) and target Y (ranging around $300,000 to $800,000)
I have used sklearn's Standard Scaler to standardize Y before fitting it to the model.
Here is my Keras model:
def build_model():
model = Sequential()
model.add(Dense(36, input_dim=36, activation='relu'))
model.add(Dense(18, input_dim=36, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='sgd', metrics=['mae','mse'])
return model
I am having trouble trying to interpret the results -- what does a MSE of 0.617454319755 mean?
Do I have to inverse transform this number, and square root the results, getting an error rate of 741.55 in dollars?
math.sqrt(sc.inverse_transform([mse]))
I apologise for sounding silly as I am starting out!

I apologise for sounding silly as I am starting out!
Do not; this is a subtle issue of great importance, which is usually (and regrettably) omitted in tutorials and introductory expositions.
Unfortunately, it is not as simple as taking the square root of the inverse-transformed MSE, but it is not that complicated either; essentially what you have to do is:
Transform back your predictions to the initial scale of the original data
Get the MSE between these invert-transformed predictions and the original data
Take the square root of the result
in order to get a performance indicator of your model that will be meaningful in the business context of your problem (e.g. US dollars here).
Let's see a quick example with toy data, omitting the model itself (which is irrelevant here, and in fact can be any regression model - not only a Keras one):
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import numpy as np
# toy data
X = np.array([[1,2], [3,4], [5,6], [7,8], [9,10]])
Y = np.array([3, 4, 5, 6, 7])
# feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X)
# outcome scaling:
sc_Y = StandardScaler()
Y_train = sc_Y.fit_transform(Y.reshape(-1, 1))
Y_train
# array([[-1.41421356],
# [-0.70710678],
# [ 0. ],
# [ 0.70710678],
# [ 1.41421356]])
Now, let's say that we fit our Keras model (not shown here) using the scaled sets X_train and Y_train, and get predictions on the training set:
prediction = model.predict(X_train) # scaled inputs here
print(prediction)
# [-1.4687586 -0.6596055 0.14954728 0.95870024 1.001172 ]
The MSE reported by Keras is actually the scaled MSE, i.e.:
MSE_scaled = mean_squared_error(Y_train, prediction)
MSE_scaled
# 0.052299712818541934
while the 3 steps I have described above are simply:
MSE = mean_squared_error(Y, sc_Y.inverse_transform(prediction)) # first 2 steps, combined
MSE
# 0.10459946572909758
np.sqrt(MSE) # 3rd step
# 0.323418406602187
So, in our case, if our initial Y were US dollars, the actual error in the same units (dollars) would be 0.32 (dollars).
Notice how the naive approach of inverse-transforming the scaled MSE would give a very different (and incorrect) result:
np.sqrt(sc_Y.inverse_transform([MSE_scaled]))
# array([2.25254588])

MSE is mean square error, here is the formula.
Basically it is a mean of square of different of expected output and prediction. Making square root of this will not give you the difference between error and output. This is useful for training.
Currently you have build a model.
If you want to train the model use these function.
mode.fit(x=input_x_array, y=input_y_array, batch_size=None, epochs=1, verbose=1, callbacks=None, validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None, initial_epoch=0, steps_per_epoch=None, validation_steps=None)
If you want to do prediction of the output you should use following code.
prediction = model.predict(np.array(input_x_array))
print(prediction)
You can find more details here.
https://keras.io/models/about-keras-models/
https://keras.io/models/sequential/

How to calculate prediction uncertainty using Keras?

I would like to calculate NN model certainty/confidence (see What my deep model doesn't know) - when NN tells me an image represents "8", I would like to know how certain it is. Is my model 99% certain it is "8" or is it 51% it is "8", but it could also be "6"? Some digits are quite ambiguous and I would like to know for which images the model is just "flipping a coin".
I have found some theoretical writings about this but I have trouble putting this in code. If I understand correctly, I should evaluate a testing image multiple times while "killing off" different neurons (using dropout) and then...?
Working on MNIST dataset, I am running the following model:
from keras.models import Sequential
from keras.layers import Dense, Activation, Conv2D, Flatten, Dropout
model = Sequential()
model.add(Conv2D(128, kernel_size=(7, 7),
activation='relu',
input_shape=(28, 28, 1,)))
model.add(Dropout(0.20))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Dropout(0.20))
model.add(Flatten())
model.add(Dense(units=64, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(units=10, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
optimizer='sgd',
metrics=['accuracy'])
model.fit(train_data, train_labels, batch_size=100, epochs=30, validation_data=(test_data, test_labels,))
How should I predict with this model so that I get its certainty about predictions too? I would appreciate some practical examples (preferably in Keras, but any will do).
To clarify, I am looking for an example of how to get certainty using the method outlined by Yurin Gal (or an explanation of why some other method yields better results).

If you want to implement dropout approach to measure uncertainty you should do the following:
Implement function which applies dropout also during the test time:
import keras.backend as K
f = K.function([model.layers[0].input, K.learning_phase()],
[model.layers[-1].output])
Use this function as uncertainty predictor e.g. in a following manner:
def predict_with_uncertainty(f, x, n_iter=10):
result = numpy.zeros((n_iter,) + x.shape)
for iter in range(n_iter):
result[iter] = f(x, 1)
prediction = result.mean(axis=0)
uncertainty = result.var(axis=0)
return prediction, uncertainty
Of course you may use any different function to compute uncertainty.

Made a few changes to the top voted answer. Now it works for me.
It's a way to estimate model uncertainty. For other source of uncertainty, I found https://eng.uber.com/neural-networks-uncertainty-estimation/ helpful.
f = K.function([model.layers[0].input, K.learning_phase()],
[model.layers[-1].output])
def predict_with_uncertainty(f, x, n_iter=10):
result = []
for i in range(n_iter):
result.append(f([x, 1]))
result = np.array(result)
prediction = result.mean(axis=0)
uncertainty = result.var(axis=0)
return prediction, uncertainty

Your model uses a softmax activation, so the simplest way to obtain some kind of uncertainty measure is to look at the output softmax probabilities:
probs = model.predict(some input data)[0]
The probs array will then be a 10-element vector of numbers in the [0, 1] range that sum to 1.0, so they can be interpreted as probabilities. For example the probability for digit 7 is just probs[7].
Then with this information you can do some post-processing, typically the predicted class is the one with highest probability, but you can also look at the class with second highest probability, etc.

A simpler way is to set training=True on any dropout layers you want to run during inference as well (essentially tells the layer to operate as if it's always in training mode - so it is always present for both training and inference).
import keras
inputs = keras.Input(shape=(10,))
x = keras.layers.Dense(3)(inputs)
outputs = keras.layers.Dropout(0.5)(x, training=True)
model = keras.Model(inputs, outputs)
Code above is from this issue.

Ensemble of different kinds of regressors using scikit-learn (or any other python framework)

I am trying to solve the regression task. I found out that 3 models are working nicely for different subsets of data: LassoLARS, SVR and Gradient Tree Boosting. I noticed that when I make predictions using all these 3 models and then make a table of 'true output' and outputs of my 3 models I see that each time at least one of the models is really close to the true output, though 2 others could be relatively far away.
When I compute minimal possible error (if I take prediction from 'best' predictor for each test example) I get a error which is much smaller than error of any model alone. So I thought about trying to combine predictions from these 3 diffent models into some kind of ensemble. Question is, how to do this properly? All my 3 models are build and tuned using scikit-learn, does it provide some kind of a method which could be used to pack models into ensemble? The problem here is that I don't want to just average predictions from all three models, I want to do this with weighting, where weighting should be determined based on properties of specific example.
Even if scikit-learn not provides such functionality, it would be nice if someone knows how to property address this task - of figuring out the weighting of each model for each example in data. I think that it might be done by a separate regressor built on top of all these 3 models, which will try output optimal weights for each of 3 models, but I am not sure if this is the best way of doing this.

This is a known interesting (and often painful!) problem with hierarchical predictions. A problem with training a number of predictors over the train data, then training a higher predictor over them, again using the train data - has to do with the bias-variance decomposition.
Suppose you have two predictors, one essentially an overfitting version of the other, then the former will appear over the train set to be better than latter. The combining predictor will favor the former for no true reason, just because it cannot distinguish overfitting from true high-quality prediction.
The known way of dealing with this is to prepare, for each row in the train data, for each of the predictors, a prediction for the row, based on a model not fit for this row. For the overfitting version, e.g., this won't produce a good result for the row, on average. The combining predictor will then be able to better assess a fair model for combining the lower-level predictors.
Shahar Azulay & I wrote a transformer stage for dealing with this:
class Stacker(object):
"""
A transformer applying fitting a predictor `pred` to data in a way
that will allow a higher-up predictor to build a model utilizing both this
and other predictors correctly.
The fit_transform(self, x, y) of this class will create a column matrix, whose
each row contains the prediction of `pred` fitted on other rows than this one.
This allows a higher-level predictor to correctly fit a model on this, and other
column matrices obtained from other lower-level predictors.
The fit(self, x, y) and transform(self, x_) methods, will fit `pred` on all
of `x`, and transform the output of `x_` (which is either `x` or not) using the fitted
`pred`.
Arguments:
pred: A lower-level predictor to stack.
cv_fn: Function taking `x`, and returning a cross-validation object. In `fit_transform`
th train and test indices of the object will be iterated over. For each iteration, `pred` will
be fitted to the `x` and `y` with rows corresponding to the
train indices, and the test indices of the output will be obtained
by predicting on the corresponding indices of `x`.
"""
def __init__(self, pred, cv_fn=lambda x: sklearn.cross_validation.LeaveOneOut(x.shape[0])):
self._pred, self._cv_fn = pred, cv_fn
def fit_transform(self, x, y):
x_trans = self._train_transform(x, y)
self.fit(x, y)
return x_trans
def fit(self, x, y):
"""
Same signature as any sklearn transformer.
"""
self._pred.fit(x, y)
return self
def transform(self, x):
"""
Same signature as any sklearn transformer.
"""
return self._test_transform(x)
def _train_transform(self, x, y):
x_trans = np.nan * np.ones((x.shape[0], 1))
all_te = set()
for tr, te in self._cv_fn(x):
all_te = all_te | set(te)
x_trans[te, 0] = self._pred.fit(x[tr, :], y[tr]).predict(x[te, :])
if all_te != set(range(x.shape[0])):
warnings.warn('Not all indices covered by Stacker', sklearn.exceptions.FitFailedWarning)
return x_trans
def _test_transform(self, x):
return self._pred.predict(x)
Here is an example of the improvement for the setting described in #MaximHaytovich's answer.
First, some setup:
from sklearn import linear_model
from sklearn import cross_validation
from sklearn import ensemble
from sklearn import metrics
y = np.random.randn(100)
x0 = (y + 0.1 * np.random.randn(100)).reshape((100, 1))
x1 = (y + 0.1 * np.random.randn(100)).reshape((100, 1))
x = np.zeros((100, 2))
Note that x0 and x1 are just noisy versions of y. We'll use the first 80 rows for train, and the last 20 for test.
These are the two predictors: a higher-variance gradient booster, and a linear predictor:
g = ensemble.GradientBoostingRegressor()
l = linear_model.LinearRegression()
Here is the methodology suggested in the answer:
g.fit(x0[: 80, :], y[: 80])
l.fit(x1[: 80, :], y[: 80])
x[:, 0] = g.predict(x0)
x[:, 1] = l.predict(x1)
>>> metrics.r2_score(
y[80: ],
linear_model.LinearRegression().fit(x[: 80, :], y[: 80]).predict(x[80: , :]))
0.940017788444
Now, using stacking:
x[: 80, 0] = Stacker(g).fit_transform(x0[: 80, :], y[: 80])[:, 0]
x[: 80, 1] = Stacker(l).fit_transform(x1[: 80, :], y[: 80])[:, 0]
u = linear_model.LinearRegression().fit(x[: 80, :], y[: 80])
x[80: , 0] = Stacker(g).fit(x0[: 80, :], y[: 80]).transform(x0[80:, :])
x[80: , 1] = Stacker(l).fit(x1[: 80, :], y[: 80]).transform(x1[80:, :])
>>> metrics.r2_score(
y[80: ],
u.predict(x[80:, :]))
0.992196564279
The stacking prediction does better. It realizes that the gradient booster is not that great.

Ok, after spending some time on googling 'stacking' (as mentioned by #andreas earlier) I found out how I could do the weighting in python even with scikit-learn. Consider the below:
I train a set of my regression models (as mentioned SVR, LassoLars and GradientBoostingRegressor). Then I run all of them on training data (same data which was used for training of each of these 3 regressors). I get predictions for examples with each of my algorithms and save these 3 results into pandas dataframe with columns 'predictedSVR', 'predictedLASSO' and 'predictedGBR'. And I add the final column into this datafrane which I call 'predicted' which is a real prediction value.
Then I just train a linear regression on this new dataframe:
#df - dataframe with results of 3 regressors and true output
from sklearn linear_model
stacker= linear_model.LinearRegression()
stacker.fit(df[['predictedSVR', 'predictedLASSO', 'predictedGBR']], df['predicted'])
So when I want to make a prediction for new example I just run each of my 3 regressors separately and then I do:
stacker.predict()
on outputs of my 3 regressors. And get a result.
The problem here is that I am finding optimal weights for regressors 'on average, the weights will be same for each example on which I will try to make prediction.

What you describe is called "stacking" which is not implemented in scikit-learn yet, but I think contributions would be welcome. An ensemble that just averages will be in pretty soon: https://github.com/scikit-learn/scikit-learn/pull/4161

Late response, but I wanted to add one practical point for this sort of stacked regression approach (which I use this frequently in my work).
You may want to choose an algorithm for the stacker which allows positive=True (for example, ElasticNet). I have found that, when you have one relatively stronger model, the unconstrained LinearRegression() model will often fit a larger positive coefficient to the stronger and a negative coefficient to the weaker model.
Unless you actually believe that your weaker model has negative predictive power, this is not a helpful outcome. Very similar to having high multi-colinearity between features of a regular regression model. Causes all sorts of edge effects.
This comment applies most significantly to noisy data situations. If you're aiming to get RSQ of 0.9-0.95-0.99, you'd probably want to throw out the model which was getting a negative weighting.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Back to basics for the XOR problem - fundamentaly confused - machine-learning

Related

Why doesn't UMAP map similar data to same region in learned latent space?

Model validation with separate dataset

How to interpret MSE in Keras Regressor

How to calculate prediction uncertainty using Keras?

Ensemble of different kinds of regressors using scikit-learn (or any other python framework)

Categories

Resources