Inverse transform when using SHAP values - normalization

So I scaled my data using MinMaxScaler and used my existing model to compute SHAP values. However, when I printed the SHAP summary plot, I realised the scale on the x-axis showed the transformed values.
I tried to inverse transform both the SHAP values and the actual values, but the result looked wrong. Does anyone know how/where I can inverse transform the data so that the x-axis of the SHAP summary plot shows values from the original dataset?
import numpy as np
import shap
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
train = scaler.fit_transform(train)
shap_model = build_model(train)
e = shap.DeepExplainer((shap_model.layers[0].input,
                        shap_model.layers[-1].output), train)
shap_val = e.shap_values(train)
shap_val = np.array(shap_val)
# inverse transform the SHAP values and the training data
shap_val = scaler.inverse_transform(shap_val)
train_1 = scaler.inverse_transform(train)
shap.summary_plot(shap_val, train_1)
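For what it's worth, a minimal sketch of one common fix (an assumption on my part, not part of the original post): leave the SHAP values untouched, since they live in the model's output space rather than the feature space, and pass only the inverse-transformed feature matrix to summary_plot for display:

shap_val = np.array(e.shap_values(train))          # keep the SHAP values as computed
train_original = scaler.inverse_transform(train)   # map the features back to the original scale
shap.summary_plot(shap_val[0], train_original)     # shap_val[0] assumes a single model output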

Related

Normalizing and de-normalizing data in a prediction model

I have developed a Random Forest model that takes two inputs as X and one output as Y. I normalized both the X and Y values for the training process.
After the model was trained, I selected a dataset of unseen data, coming from another source, as input for the model. I normalized the X values, fed them to the trained model, and got the normalized Y value as output. I wonder what the de-normalizing process would be. By which value do I have to multiply the output to get the de-normalized value?
I'd appreciate it if someone could help me in this regard.
You need to apply the preprocessing in reverse, and for that you need the mean and standard deviation values that were used for normalization.
With scikit-learn you can do it easily, in one line of code:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data = ...
scaled_data = scaler.fit_transform(data)         # learns the mean/std and scales the data
inverse = scaler.inverse_transform(scaled_data)  # recovers the original values
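For the X/Y setup described in the question, a minimal sketch (the names X_train, y_train, and X_new are illustrative, not from the original post) is to fit one scaler per array and use the Y scaler to de-normalize the predictions:

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

x_scaler = StandardScaler()
y_scaler = StandardScaler()

X_train_scaled = x_scaler.fit_transform(X_train)                 # learn the X statistics on the training data
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1))  # learn the Y statistics separately

model = RandomForestRegressor().fit(X_train_scaled, y_train_scaled.ravel())

X_new_scaled = x_scaler.transform(X_new)                         # transform (not fit_transform) the unseen data
y_pred_scaled = model.predict(X_new_scaled)
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))  # back to the original Y units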

Unable to inverse_transform the value of feature because of different dimensionality

I'm designing a multivariate time series model. I'm feeding 5 features into an LSTM model and trying to predict the output of 1 variable (i.e., one whose value depends on itself and the other 4 features).
For that I'm doing the feature scaling as follows:
# Feature scaling
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range=(0, 1))
training_set_scaled = sc.fit_transform(training_set)
print(training_set_scaled)
At the output of the model I got the predicted value (a (65, 1) array). However, when I tried to inverse transform it as:
predicted_stock_price = sc.inverse_transform(predicted_stock_price)
I got the following error:
non-broadcastable output operand with shape (65,1) doesn't match the broadcast shape (65,5)
Please help. Thank you in advance :)
The problem is that you used sc to min-max scale the five features. Therefore, sc can also only be used to inverse transform the scaled version of those features (shown by you as the output), which gives you back the original feature values.
The label (model output) is independent of that. You can scale your dependent variable as well, but you do not have to, and certainly not with the same scaler object.
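A minimal sketch of that idea (the names y_train, X_test, and model are illustrative, not from the original post): fit a second scaler on the target column only and use it to invert the predictions:

from sklearn.preprocessing import MinMaxScaler

sc_y = MinMaxScaler(feature_range=(0, 1))
y_train_scaled = sc_y.fit_transform(y_train.reshape(-1, 1))  # scale only the target column

# ... build and train the LSTM on training_set_scaled and y_train_scaled ...

predicted_scaled = model.predict(X_test)                         # shape (65, 1) in the question
predicted_stock_price = sc_y.inverse_transform(predicted_scaled) # now the shapes match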

Difference between ordinal and categorical data as labels in scikit-learn

I know that, as features, ordinal data can be assigned arbitrary numbers and categorical data can be one-hot encoded. But I am a bit confused about how these two types of data should be handled when they are the feature to be predicted. For instance, in the iris dataset in scikit-learn:
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target
While y represents three types of flowers, which is categorical data (if I'm not wrong?!), it is encoded as ordinal values of 0, 1, 2 (type=int32). My dataset also includes 3 independent categories ('sick', 'carrier', 'healthy'), and scikit-learn accepts them as strings without any encoding.
I was wondering whether it is correct to keep them as they are for scikit-learn, or whether an encoding similar to the one used for the iris dataset is required.
You don't need to encode your labels; scikit-learn takes care of it. Here the same table is used to build a classifier:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0).fit(X, y)
clf.predict(X[:2, :])
clf.predict_proba(X[:2, :])
clf.score(X, y)
Then I make a smaller table and change the labels from integers to strings:
X1 = X[:5]
y1 = y[:5]
y1 = ['a', 'a', 'a','b', 'a']
clf = LogisticRegression(random_state=0).fit(X1, y1)
clf.predict(X1[:2, :])
clf.predict_proba(X1[:2, :])
clf.score(X1, y1)
and it all works.
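To see what scikit-learn does internally, you can inspect the fitted classifier from the snippet above; the learned classes and the predictions both come back as the original strings:

print(clf.classes_)            # e.g. ['a' 'b'] - the original string labels
print(clf.predict(X1[:2, :]))  # predictions are returned as strings, e.g. ['a' 'a']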
It seems that in ML we either work with continuous targets, which are handled by regression models, or categorical targets, which are handled by classification models; there is no separate category for ordinal targets.

Error while predicting a single value using a linear regression model

I'm a beginner making a linear regression model. When I make predictions on the test set, it works fine, but when I try to predict for a specific value, it gives an error. The tutorial I'm watching doesn't have any errors.
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Visualising the Linear Regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg.predict(X), color = 'blue')
plt.title('Truth or Bluff (Linear Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
According to the Scikit-learn documentation, the input array should have shape (n_samples, n_features). As such, if you want a single example with a single value, you should expect the shape of your input to be (1,1).
This can be done by doing:
import numpy as np
test_X = np.array(6.5).reshape(-1, 1)
lin_reg.predict(test_X)
You can check the shape by doing:
test_X.shape
The reason for this is that the input can contain many samples (i.e., you may want to predict for multiple data points at once) and/or each sample can have many features.
Note: NumPy is a Python library for working with large arrays and matrices. When scikit-learn is installed, NumPy is installed as well.
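Equivalently (just an alternative to the reshape, not from the original answer), you can build the 2D input directly with a nested list:

lin_reg.predict([[6.5]])  # one sample with one feature, i.e. shape (1, 1)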

How to apply ZCA on a huge image dataset with limited memory?

What Google told me is:
For Keras, the ImageDataGenerator class has a zca_whitening option that can be used out of the box. But if this option is set, it requires calling ImageDataGenerator.fit on the whole dataset X, so this is not an option.
For sklearn, IncrementalPCA seems to work with a huge dataset, but I don't know how to turn the PCA into a ZCA in a generator style.
Thanks for the help!
I have defined a function that might be helpful for applying the ZCA transformation:
import numpy as np

def ZCAtransform(X, IPCA_model):
    # get the eigenvectors and eigenvalues from the fitted IncrementalPCA model
    U = IPCA_model.components_.transpose()
    S = np.sqrt(IPCA_model.explained_variance_)
    X_demeaned = (X - np.mean(X, 0)).transpose()
    # get the transformed data
    # Xproj' = U * diag(1 / (S + epsilon)) * U' * X_demeaned
    return (U.dot(np.diag(1 / (S + IPCA_model.noise_variance_)))
             .dot(U.transpose())
             .dot(X_demeaned)).transpose()

Xproj = ZCAtransform(X, ipca)
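To keep memory bounded, the IncrementalPCA model itself can be fitted batch by batch before calling the function above. A sketch (the batch_generator and n_components names are assumptions, not from the original answer):

from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=n_components)

# first pass over the data: fit the PCA incrementally, one batch at a time
for X_batch in batch_generator():       # yields arrays of shape (batch_size, n_features)
    ipca.partial_fit(X_batch)

# second pass: whiten each batch with the fitted model
for X_batch in batch_generator():
    X_batch_zca = ZCAtransform(X_batch, ipca)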
Following the example given in the scikit-learn documentation, I was able to generate the ZCA of the Iris dataset as shown below:
[Figure: ZCA-whitened PCA of the Iris dataset]
