Error while predicting a single value using a linear regression model - machine-learning

I'm a beginner building a linear regression model. When I make predictions on the test set it works fine, but when I try to predict for a single specific value, it gives an error. The tutorial I'm following doesn't run into this error.
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Visualising the Linear Regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg.predict(X), color = 'blue')
plt.title('Truth or Bluff (Linear Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

According to the Scikit-learn documentation, the input array should have shape (n_samples, n_features). As such, if you want a single example with a single value, you should expect the shape of your input to be (1,1).
This can be done by doing:
import numpy as np
test_X = np.array(6.5).reshape(-1, 1)
lin_reg.predict(test_X)
You can check the shape by doing:
test_X.shape
The reason for this is that the input can contain many samples (i.e. you may want to predict for multiple data points at once), and/or each sample can have many features.
Note: NumPy is a Python library for working with large arrays and matrices. It is installed automatically as a dependency of scikit-learn.
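As a minimal illustrative sketch (reusing the fitted lin_reg from above), you can also predict several position levels at once by passing one row per sample:
import numpy as np
test_X = np.array([[6.5], [7.0], [8.5]])  # shape (3, 1): 3 samples, 1 feature
print(test_X.shape)
print(lin_reg.predict(test_X))  # one prediction per row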

Related

LightGBM predicts negative values

My LightGBM regressor model returns negative values.
For XGBoost there is the objective='count:poisson' hyperparameter to prevent it from returning negative predictions.
Is there any way to do this with LightGBM?
GitHub issue => https://github.com/microsoft/LightGBM/issues/5629
LightGBM also supports Poisson regression. For example, consider the following Python code.
import lightgbm as lgb
import numpy as np
from matplotlib import pyplot
# random Poisson-distributed target and one informative feature
y = np.random.poisson(lam=15.0, size=1_000)
X = y + np.random.normal(loc=10.0, scale=2.0, size=(y.shape[0], ))
X = X.reshape(-1, 1)
# fit a Poisson regression model
reg = lgb.LGBMRegressor(
    objective="poisson",
    n_estimators=150,
    min_data=1
)
reg.fit(X, y)
# get predictions
preds = reg.predict(X)
print("summary of predicted values")
print(f" * min: {round(np.min(preds), 3)}")
print(f" * max: {round(np.max(preds), 3)}")
# compare predicted distribution to the empirical one
bins = np.linspace(0, 30, 50)
pyplot.hist(y, bins, alpha=0.5, label='actual')
pyplot.hist(preds, bins, alpha=0.5, label='predicted')
pyplot.legend(loc='upper right')
pyplot.show()
This example uses Python 3.10 and lightgbm==3.3.3.
However... I don't recommend using Poisson regression just to achieve "no negative predictions". The Poisson loss function is intended to be used for cases where you believe your target is Poisson-distributed, e.g. it looks like counts of events observed over some regular interval like time or space.
Other options you might consider to try to achieve the behavior "never predict a negative number from LightGBM regression":
write a custom objective function in one of the interfaces that support it, like the R or Python package
post-process LightGBM's predictions, recoding negative values to 0 (see the sketch after this list)
pre-process the target variable such that there are no negative values (e.g. dropping such observations, re-scaling, taking the absolute value)
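If you go with the second option (recoding negatives to 0), a minimal sketch, reusing reg and X from the example above:
import numpy as np
preds = reg.predict(X)
preds_nonneg = np.clip(preds, a_min=0, a_max=None)  # any negative prediction becomes 0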
LightGBM also provides an objective parameter which can be set to 'poisson'. Follow this link for more information.
An example for LGBMRegressor (scikit-learn API):
from lightgbm import LGBMRegressor
regressor = LGBMRegressor(objective='poisson')

Normalize and de-normalizing data in prediction model

I have developed a random forest model which takes two inputs as X and one output as Y. I normalized both the X and Y values for the training process.
After the model was trained, I selected an unseen dataset, coming from another source, as input for the model. I normalized its X values, fed them to the trained model, and got a normalized Y value as output. I wonder what the de-normalizing process should look like. Which value do I have to multiply the output by to get the de-normalized value?
I'd appreciate it if someone could help me with this.
You need to apply the preprocessing in reverse, but you need the mean and sd (standard deviation) values that were used for normalization.
For example, with scikit-learn you can do it easily, in one line of code:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data= ...
scaled_data = scaler.fit_transform(data)
inverse = scaler.inverse_transform(scaled_data)
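Applied to the question's setup, a minimal sketch (assuming StandardScaler was used; X_train, y_train, and X_new are hypothetical names for the training data and the new unseen inputs): fit one scaler per variable, so the Y scaler can later undo the normalization of the predictions.
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

x_scaler = StandardScaler()
y_scaler = StandardScaler()
X_scaled = x_scaler.fit_transform(X_train)                 # the two input columns
y_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1))  # the single output

model = RandomForestRegressor().fit(X_scaled, y_scaled.ravel())

# New, unseen data: scale with the already-fitted x_scaler, then map the
# normalized prediction back with the y_scaler.
X_new_scaled = x_scaler.transform(X_new)
y_pred_scaled = model.predict(X_new_scaled).reshape(-1, 1)
y_pred = y_scaler.inverse_transform(y_pred_scaled)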

Why does Naive Bayes give results on training and test data but throw a negative-values error when used with GridSearchCV?

I have studied some related questions regarding Naive Bayes; here are the links: link1, link2, link3. I am using TF-IDF for feature selection and Naive Bayes for classification. After fitting the model, it made predictions successfully, and here is the output:
accuracy = train_model(model, xtrain, train_y, xtest)
print("NB, CharLevel Vectors: ", accuracy)
NB, accuracy: 0.5152523571824736
I don't understand why Naive Bayes did not give any error during training and testing, but the following GridSearchCV code does:
from sklearn.preprocessing import PowerTransformer
params_NB = {'alpha':[1.0], 'class_prior':[None], 'fit_prior':[True]}
gs_NB = GridSearchCV(estimator=model,
                     param_grid=params_NB,
                     cv=cv_method,
                     verbose=1,
                     scoring='accuracy')
Data_transformed = PowerTransformer().fit_transform(xtest.toarray())
gs_NB.fit(Data_transformed, test_y);
It gave this error
Negative values in data passed to MultinomialNB (input X)
TL;DR: PowerTransformer, which you apply only in the GridSearchCV case, produces negative data, which expectedly makes MultinomialNB fail, as explained in detail below. If your initial xtrain and ytrain are indeed TF-IDF features, and you do not transform them similarly with PowerTransformer (you don't show anything like that), the fact that they work fine is also unsurprising and expected.
Although not terribly clear from the documentation:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
Reading closely, you realize it implies that all the features should be non-negative.
This has a statistical basis indeed; from the Cross Validated thread Naive Bayes questions: continus data, negative data, and MultinomialNB in scikit-learn:
MultinomialNB assumes that features have multinomial distribution which is a generalization of the binomial distribution. Neither binomial nor multinomial distributions can contain negative values.
See also the (open) Github issue MultinomialNB fails when features have negative values (it is for a different library, not scikit-learn, but the underlying mathematical rationale is the same).
It is not actually difficult to demonstrate this; using the example available in the documentation:
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100)) # random integer data
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X, y) # works OK
# inspect X
X # only 0's and positive integers
Now, changing a single element of X to a negative number and trying to fit again:
X[1][0] = -1
clf.fit(X, y)
gives indeed:
ValueError: Negative values in data passed to MultinomialNB (input X)
What can you do? As the Github thread linked above suggests:
Either use MinMaxScaler(), which will bring all the features to [0, 1]
Or use GaussianNB instead, which does not suffer from this limitation
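A minimal sketch of the first suggestion, reusing xtest and test_y from the question: rescale the transformed features to [0, 1] with MinMaxScaler before fitting MultinomialNB.
from sklearn.preprocessing import MinMaxScaler, PowerTransformer
from sklearn.naive_bayes import MultinomialNB

Data_transformed = PowerTransformer().fit_transform(xtest.toarray())
Data_nonneg = MinMaxScaler().fit_transform(Data_transformed)  # all values now in [0, 1]
MultinomialNB().fit(Data_nonneg, test_y)  # no ValueError this time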

How to L2 normalize an array with Swift

I am trying to normalize the input of my CoreML model as shown below. It does do something to the array, but the result is quite different from what sklearn produces (I give the same input and watch the output in both environments), so apparently I am doing something wrong.
My model is trained with Keras and sklearn, and it must perform the same normalization I did with the sklearn Normalizer, whose default is the L2 norm. What I am doing below is apparently not equivalent to sklearn. Any ideas?
vDSP_normalizeD(vec, 1, &normalizedVec, 1, &mean, &std, vDSP_Length(count))
let (normalizedXVec, _, _) = normalize(vec: doubleArray)
Then I convert normalizedXVec to an MLMultiArray and use it as input to my predictor.
Note: I also tried to convert the normalizer from sklearn using coremltools, but I got errors, as seen here:
vDSP_normalizeD uses the mean and standard deviation. That is not the same as L2.
The L2 normalization first computes the L2-norm of the vector, which is the same as sqrt(v[0]*v[0] + v[1]*v[1] + ... + v[n]*v[n]) and then it divides each element of the vector by that number.
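A reference sketch in Python (assuming the goal is simply to reproduce sklearn's computation in Swift): Normalizer(norm='l2') divides each row by the square root of its sum of squares.
import numpy as np
from sklearn.preprocessing import Normalizer

v = np.array([[3.0, 4.0, 12.0]])
manual = v / np.sqrt((v ** 2).sum())              # divide by the L2 norm
sklearn_out = Normalizer(norm='l2').transform(v)
print(manual, sklearn_out)                        # both give the same result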

scikit learn: How to include others features after performed fit and transform of TFIDFVectorizer?

Just a brief idea of my situation:
I have 4 columns of input: id, text, category, label.
I used TfidfVectorizer on the text, which gives me, for each instance, the TF-IDF scores of its word tokens.
Now I'd like to include the category (no need to pass it through TF-IDF) as another feature alongside the data output by the vectorizer.
Also note that prior to vectorization, the data has already passed through train_test_split.
How could I achieve this?
Initial code:
#initialization
import pandas as pd
path = r'data\data.csv'  # raw string so the backslash is not treated as an escape
rappler= pd.read_csv(path)
X = rappler.text
y = rappler.label
#rappler.category - contains category for each instance
#split train test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
#feature extraction
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
#after or even prior to perform fit_transform, how can I properly add category as a feature?
X_test_dtm = vect.transform(X_test)
#actual classfication
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
#display result
from sklearn import metrics
print(metrics.accuracy_score(y_test,y_pred_class))
I would suggest doing your train test split after feature extraction.
Once you have the TF-IDF feature lists just add the other feature for each sample.
You will have to encode the category feature; a good choice would be sklearn's LabelEncoder. Then you should have two sets of numpy arrays that can be joined.
Here is a toy example:
import numpy as np

X_tfidf = np.array([[0.1, 0.4, 0.2], [0.5, 0.4, 0.6]])
X_category = np.array([[1], [2]])
X = np.concatenate((X_tfidf, X_category), axis=1)
At this point you would continue as you were, starting with the train test split.
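One practical caveat (an addition to the above, not part of the original answer): the vectorizer actually returns a sparse matrix, which np.concatenate will not accept directly; scipy.sparse.hstack handles the mixed case. A toy sparse version of the same join:
import numpy as np
from scipy.sparse import csr_matrix, hstack

X_tfidf_sparse = csr_matrix(np.array([[0.1, 0.4, 0.2], [0.5, 0.4, 0.6]]))
X_category = np.array([[1], [2]])
X_joined = hstack([X_tfidf_sparse, csr_matrix(X_category)])  # shape (2, 4), still sparse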
You should use FeatureUnion, as explained in the documentation:
FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.
Another good example on how to use FeatureUnions can be found here: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
Just concatenating different matrices as #AlexG suggests is probably the easier option, but FeatureUnion is the scikit-learn way to do these things.
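A minimal sketch of the FeatureUnion approach (ColumnSelector is a hypothetical helper written for this illustration, not part of scikit-learn; the column names follow the question: text, category, label):
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select one column (Series) or a list of columns (DataFrame)."""
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.columns]

features = FeatureUnion([
    ('tfidf', Pipeline([('select', ColumnSelector('text')),
                        ('vect', TfidfVectorizer())])),
    ('cat', Pipeline([('select', ColumnSelector(['category'])),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))])),
])
clf = Pipeline([('features', features), ('nb', MultinomialNB())])

X = rappler[['text', 'category']]  # reuse the rappler DataFrame from the question
y = rappler.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))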
