Linear Regression script not working in Python - machine-learning

I tried running my Machine Learning LinearRegression code, but it is not working. Here is the code:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import pandas as pd
df = pd.read_csv(r'C:\Users\SVISHWANATH\Downloads\datasets\GGP_data.csv')
df["OHLC"] = (df.open+df.high+df.low+df.close)/4
df['HLC'] = (df.high+df.low+df.close)/3
df.index = df.index+1
reg = LinearRegression()
reg.fit(df.index, df.OHLC)
Basically, I just imported a few libraries, used the read_csv function, and called the LinearRegression() function, and this is the error:
ValueError: Expected 2D array, got 1D array instead:
array=[ 1 2 3 ... 1257 1258 1259].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or
array.reshape(1, -1) if it contains a single sample
Thanks!

As mentioned in the error message, you need to give the fit method a 2D array.
df.index is a 1D array. You can do it this way:
reg.fit(df.index.values.reshape(-1, 1), df.OHLC)
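As a minimal sketch of that fix, the snippet below uses made-up numbers in place of GGP_data.csv (the arrays x and y are illustrative stand-ins for df.index and df.OHLC, not the real data). It shows the shape change that reshape(-1, 1) makes before fit is called:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for df.index and df.OHLC (illustrative values only)
x = np.arange(1, 11)       # shape (10,): 1D, which fit() rejects as features
y = 2.0 * x + 1.0          # shape (10,): a 1D target is accepted

# One column per feature, one row per sample
X = x.reshape(-1, 1)       # shape (10, 1)

reg = LinearRegression()
reg.fit(X, y)

slope = reg.coef_[0]
intercept = reg.intercept_
print(X.shape, slope, intercept)
```

Only the feature array needs the extra dimension; the target can stay 1D.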

Related

create error bars for random forest regression

I'm new to the world of machine learning and more generally to AI.
I am analyzing a dataset containing characteristics of different houses and their prices using Python and JupyterLab.
Here is the dataset in use:
https://www.kaggle.com/datasets/harlfoxem/housesalesprediction
I applied random forest (scikit-learn) on this dataset and now I would like to plot the error bars of the model.
Specifically, I'm using the ForestCI package and applying exactly this code to my case:
http://contrib.scikit-learn.org/forest-confidence-interval/auto_examples/plot_mpg.html
This is my code:
# Regression Forest Example
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import metrics
from sklearn.metrics import r2_score
import numpy as np
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor
import sklearn.model_selection as xval
import forestci as fci
#import dataset
mpg_data = pd.read_csv(path_to_dataset)
#drop some useless features
mpg_data=mpg_data.drop('date', axis=1)
mpg_data=mpg_data.drop('yr_built', axis=1)
mpg_data = mpg_data.drop(["id"],axis=1)
#separate mpg data into predictors and outcome variable
mpg_X = mpg_data.drop(labels='price', axis=1)
mpg_y = mpg_data['price']
# remove rows where the data is nan
not_null_sel = np.where(mpg_X.isna().sum(axis=1).values == 0)
mpg_X = mpg_X.values[not_null_sel]
mpg_y = mpg_y.values[not_null_sel]
# split mpg data into training and test set
mpg_X_train, mpg_X_test, mpg_y_train, mpg_y_test = xval.train_test_split(
mpg_X,
mpg_y,
test_size=0.25,
random_state=42)
# Create RandomForestRegressor
mpg_forest = RandomForestRegressor(random_state=42)
mpg_forest.fit(mpg_X_train, mpg_y_train)
mpg_y_hat = mpg_forest.predict(mpg_X_test)
# Plot predicted MPG without error bars
plt.scatter(mpg_y_test, mpg_y_hat)
plt.xlabel('Reported MPG')
plt.ylabel('Predicted MPG')
plt.show()
print(r2_score(mpg_y_test, mpg_y_hat))
# Calculate the variance
mpg_V_IJ_unbiased = fci.random_forest_error(mpg_forest, mpg_X_train,
mpg_X_test)
# Plot error bars for predicted MPG using unbiased variance
plt.errorbar(mpg_y_test, mpg_y_hat, yerr=np.sqrt(mpg_V_IJ_unbiased), fmt='o')
plt.xlabel('Reported MPG')
plt.ylabel('Predicted MPG')
plt.show()
It seems to work, but when the graphs are plotted, neither the error bars nor the prediction line appear.
According to the documentation, the plot should look like the one shown here: http://contrib.scikit-learn.org/forest-confidence-interval/auto_examples/plot_mpg.html
You forgot to add this line:
plt.plot([5, 45], [5, 45], 'k--')
Your code should look like this:
plt.errorbar(mpg_y_test, mpg_y_hat, yerr=np.sqrt(mpg_V_IJ_unbiased), fmt='o')
plt.plot([5, 45], [5, 45], 'k--')
plt.xlabel('Reported MPG')
plt.ylabel('Predicted MPG')
plt.show()
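Note that the endpoints [5, 45] come from the MPG example in the documentation, while house prices are on a very different scale. One possible variation, sketched here under that assumption, takes the endpoints of the diagonal from the data itself. The arrays y_test, y_hat and variance are made-up stand-ins for mpg_y_test, mpg_y_hat and mpg_V_IJ_unbiased, and the axis labels are changed for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Made-up stand-ins for the test targets, predictions and variance estimates
y_test = np.array([221900.0, 538000.0, 180000.0, 604000.0, 510000.0])
y_hat = np.array([250000.0, 500000.0, 200000.0, 580000.0, 530000.0])
variance = np.array([4.0e8, 9.0e8, 1.0e8, 1.6e9, 2.5e9])

# Take the diagonal's endpoints from the data instead of hard-coding [5, 45]
lo = min(y_test.min(), y_hat.min())
hi = max(y_test.max(), y_hat.max())

fig, ax = plt.subplots()
ax.errorbar(y_test, y_hat, yerr=np.sqrt(variance), fmt='o')
ax.plot([lo, hi], [lo, hi], 'k--')   # line where prediction equals target
ax.set_xlabel('Reported price')
ax.set_ylabel('Predicted price')
```

In an interactive session, plt.show() would follow as in the original code.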

Value Error - Error when checking target - LSTM

About the dataset
The following Reuters dataset contains 11228 texts that correspond to news classified in 46 categories. The texts are encoded in the sense that each word corresponds to an integer. I specify that we want to work with 2000 words.
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
num_words = 2000
(reuters_train_x, reuters_train_y), (reuters_test_x, reuters_test_y) = tf.keras.datasets.reuters.load_data(num_words=num_words)
n_labels = np.unique(reuters_train_y).shape[0]
print("labels: {}".format(n_labels))
# This is the first news item
print(reuters_train_x[0])
Implementing the LSTM
I need to implement a network with a single LSTM with 10 units. The input needs an embedding with 10 dimensions before entering the LSTM cell. Finally, a dense layer needs to be added to adjust the number of outputs with the number of categories.
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.utils import to_categorical
reuters_train_y = to_categorical(reuters_train_y, 46)
reuters_test_y = to_categorical(reuters_test_y, 46)
model = Sequential()
model.add(Embedding(input_dim=num_words, output_dim=10))
model.add(LSTM(10))
model.add(Dense(46,activation='softmax'))
Training
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
history = model.fit(reuters_train_x,reuters_train_y,epochs=20,validation_data=(reuters_test_x,reuters_test_y))
The error message that I get is:
ValueError: Error when checking target: expected dense_2 to have shape (46,) but got array with shape (1,)
You need to one-hot-encode your y labels.
from tensorflow.keras.utils import to_categorical
reuters_train_y = to_categorical(reuters_train_y, 46)
reuters_test_y = to_categorical(reuters_test_y, 46)
There is another bug in the fit call: you are passing validation_data=(reuters_test_x,reuters_train_y), but it should be validation_data=(reuters_test_x,reuters_test_y).
Your x is a NumPy array of lists with different lengths. You need to pad the sequences to get a fixed-shape NumPy array.
reuters_train_x = tf.keras.preprocessing.sequence.pad_sequences(
reuters_train_x, maxlen=50
)
reuters_test_x = tf.keras.preprocessing.sequence.pad_sequences(
reuters_test_x, maxlen=50
)
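To illustrate what these two preprocessing steps do to the arrays, here is a NumPy-only sketch on toy data. The helpers pad_to_length and one_hot are hypothetical names written for this example, not Keras functions; pad_to_length mimics the default behaviour of pad_sequences (zeros added at the front, overlong sequences cut from the front).

```python
import numpy as np

def pad_to_length(sequences, maxlen):
    """Hypothetical helper: left-pad with 0 and keep the last maxlen items."""
    out = np.zeros((len(sequences), maxlen), dtype=np.int64)
    for i, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]                 # keep the tail if too long
        if len(trimmed) > 0:
            out[i, -len(trimmed):] = trimmed    # right-align the tokens
    return out

def one_hot(labels, n_classes):
    """Hypothetical helper: one row per label, with a single 1 per row."""
    return np.eye(n_classes, dtype=np.float32)[labels]

# Toy stand-ins for reuters_train_x and reuters_train_y
sequences = [[1, 2, 3], [4, 5, 6, 7, 8, 9], [10]]
labels = np.array([0, 2, 1])

padded = pad_to_length(sequences, maxlen=4)
encoded = one_hot(labels, n_classes=3)

print(padded.shape)    # every row now has the same length
print(encoded.shape)   # one column per class
```

After both steps, the inputs have shape (samples, maxlen) and the targets have shape (samples, classes), which is what the final Dense layer with 46 units expects in the question.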

Keras: Model Compilation Giving "Index 200005 is out of bounds for axis 0 with size 200000" Error

I'm using the Jena Climate Data that my book links to:
https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
I tried messing with it but I have no clue why the index is surpassing 200000. I'm not sure why it gets to 200005 since my training data is 200001 observations long.
I've also gotten an error that said, "Index 200000 is out of bounds for axis 0 with size 200000."
The data is 420551x14 of weather data. My code is as follows:
import pandas as pd
import numpy as np
import keras
data = pd.read_csv("D:\\School\\Spring_2019\\GraduateProject\\jena_climate_2009_2016_Data\\jena_climate_2009_2016.csv")
data = data.iloc[:,data.columns!='Date Time']
data
# Standardize the Data
from sklearn import preprocessing
data = preprocessing.scale(data[:200000])
# Build Generators
from keras.preprocessing.sequence import TimeseriesGenerator
target = data[:,1] # Should target be scaled?
# ? Do I need to remove targets from the data variable?
trainGen = TimeseriesGenerator(data,targets=target,length=1440,
sampling_rate=6,
batch_size=190,
start_index=0,
end_index=200000)
valGen = TimeseriesGenerator(data,targets=target,length=1440,
sampling_rate=6,
batch_size=190,
start_index=199999,
end_index=300000)
testGen = TimeseriesGenerator(data,targets=target,length=6,
batch_size=128,
start_index=300000,
end_index=420550)
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
from keras.layers import LSTM
#Flatten part is: 240 = lookback//step. This is 1440/6 because we are looking at
model = Sequential()
model.add(layers.Flatten(input_shape=(240,data.shape[-1])))
model.add(layers.Dense(32,activation='relu'))
model.add(layers.Dense(1))
val_steps = 300000-200001-1440
model.compile(optimizer=RMSprop(),loss='mae')
history = model.fit_generator(trainGen,
steps_per_epoch=250,
epochs=20,
validation_data=valGen,
validation_steps=val_steps)
Let me know if you need anything else and thank you greatly in advance.
Well, you've only selected the first 200000 rows of your data (data = preprocessing.scale(data[:200000])), so the validation and test generators are out of bounds (index > 200000). The training generator's end_index=200000 is also one past the last valid index, 199999.
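One way to avoid the truncation, sketched below as an assumption rather than the book's method, is to compute the mean and standard deviation on the training rows only and apply them to the whole array, so every generator can index into the full data. The array is synthetic, n_train plays the role of 200000, and check_range is a hypothetical helper that treats end_index as inclusive.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 14))   # synthetic stand-in for the weather table
n_train = 400                        # plays the role of 200000

# Standardize all rows using statistics from the training rows only
mean = data[:n_train].mean(axis=0)
std = data[:n_train].std(axis=0)
scaled = (data - mean) / std

def check_range(start_index, end_index, n_rows):
    """Hypothetical helper: True if the inclusive index range fits the array."""
    return 0 <= start_index <= end_index < n_rows

# With the full scaled array, training and validation ranges both fit
train_ok = check_range(0, n_train - 1, len(scaled))
val_ok = check_range(n_train, 700, len(scaled))

# With an array cut to the training rows, the validation range does not fit
truncated_ok = check_range(n_train, 700, len(scaled[:n_train]))

print(train_ok, val_ok, truncated_ok)
```

Scaling with training statistics also keeps information from the validation and test rows out of the preprocessing step.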

Keras Regressor giving different prediction for my input every time

I built a Keras regressor using the following code:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as ny
import pandas
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(2)
X = ny.array([[1,2], [3,4], [5,6], [7,8], [9,10]])
sc_X=StandardScaler()
X_train = sc_X.fit_transform(X)
Y = ny.array([3, 4, 5, 6, 7])
Y=ny.reshape(Y,(-1,1))
sc_Y=StandardScaler()
Y_train = sc_Y.fit_transform(Y)
N = 5
def brain():
    #Create the brain
    br_model=Sequential()
    br_model.add(Dense(3, input_dim=2, kernel_initializer='normal',activation='relu'))
    br_model.add(Dense(2, kernel_initializer='normal',activation='relu'))
    br_model.add(Dense(1,kernel_initializer='normal'))
    #Compile the brain
    br_model.compile(loss='mean_squared_error',optimizer='adam')
    return br_model
def predict(X,sc_X,sc_Y,estimator):
    prediction = estimator.predict(sc_X.fit_transform(X))
    return sc_Y.inverse_transform(prediction)
estimator = KerasRegressor(build_fn=brain, epochs=1000, batch_size=5,verbose=0)
# print "Done"
estimator.fit(X_train,Y_train)
prediction = estimator.predict(X_train)
print(predict(X,sc_X,sc_Y,estimator))
X_test = ny.array([[1.5,4.5], [7,8], [9,10]])
print(predict(X_test,sc_X,sc_Y,estimator))
The issue I face is that the code does not predict the same value for the same input. For example, it predicts 6.64 for [9,10] in the first prediction (X) and 6.49 for [9,10] in the second prediction (X_test).
The full output is this:
[2.9929883 4.0016675 5.0103474 6.0190268 6.6434317]
[3.096634 5.422326 6.4955378]
Why do I get different values, and how do I resolve this?
The problem lies in this line of code:
prediction = estimator.predict(sc_X.fit_transform(X))
You are refitting the scaler every time you predict values for new data, so the same input is scaled with a different mean and standard deviation each time. This is where the differences come from. Try:
prediction = estimator.predict(sc_X.transform(X))
In this case, you reuse the scaler that was already fitted on the training data.
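The difference can be seen without any neural network. The sketch below reuses the X and X_test arrays from the question and compares the scaled value of [9, 10] under both approaches; the variable names reused and refit are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], dtype=float)
X_test = np.array([[1.5, 4.5], [7, 8], [9, 10]], dtype=float)

# Fit once on the training data
sc_X = StandardScaler().fit(X)
train_scaled = sc_X.transform(X)

# transform() applies the training mean and standard deviation
reused = sc_X.transform(X_test)

# fit_transform() recomputes mean and standard deviation from X_test itself
refit = StandardScaler().fit_transform(X_test)

# [9, 10] is row 4 of X and row 2 of X_test
same_as_training = np.allclose(reused[2], train_scaled[4])
differs_when_refit = not np.allclose(refit[2], train_scaled[4])

print(same_as_training, differs_when_refit)
```

Because the model only ever sees scaled inputs, a different scaling of [9, 10] amounts to a different input, hence the different prediction.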

KerasRegressor giving different output every time I run (despite inputs and training set being the same)

Whenever I run the following code, I keep getting different outputs. Please could someone help me out with this? Code:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.preprocessing import StandardScaler
import numpy as ny
X = ny.array([[1,2], [3,4], [5,6], [7,8], [9,10]])
sc_X=StandardScaler()
X_train = sc_X.fit_transform(X)
Y = ny.array([3, 4, 5, 6, 7])
Y=ny.reshape(Y,(-1,1))
sc_Y=StandardScaler()
Y_train = sc_Y.fit_transform(Y)
N = 5
def brain():
    #Create the brain
    br_model=Sequential()
    br_model.add(Dense(3, input_dim=2, kernel_initializer='normal',activation='relu'))
    br_model.add(Dense(2, kernel_initializer='normal',activation='relu'))
    br_model.add(Dense(1,kernel_initializer='normal'))
    #Compile the brain
    br_model.compile(loss='mean_squared_error',optimizer='adam')
    return br_model
estimator = KerasRegressor(build_fn=brain, epochs=1000, batch_size=5,verbose=0)
estimator.fit(X_train,Y_train)
prediction = estimator.predict(X_train)
print(Y)
print(sc_Y.inverse_transform(prediction))
Basically, I have declared a dataset and am training a neural network to do regression on it and predict the values. Given that everything is hardcoded, I would expect to get the same output every time I run it. However, this is not the case. Could someone please help me out?
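One likely contributor is that kernel_initializer='normal' draws the starting weights at random, so each run begins training from a different point unless the random generators are seeded. The NumPy-only sketch below illustrates that effect with a hypothetical init_weights helper; it does not touch Keras, whose backend keeps its own random state that would also need to be seeded.

```python
import numpy as np

def init_weights(shape, seed=None):
    """Hypothetical stand-in for a 'normal' weight initializer."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=0.05, size=shape)

# Without a seed, two "runs" start from different weights
unseeded_a = init_weights((2, 3))
unseeded_b = init_weights((2, 3))

# With a fixed seed, the starting weights repeat exactly
seeded_a = init_weights((2, 3), seed=1)
seeded_b = init_weights((2, 3), seed=1)

unseeded_match = np.array_equal(unseeded_a, unseeded_b)
seeded_match = np.array_equal(seeded_a, seeded_b)

print(unseeded_match, seeded_match)
```

Even with seeds fixed, some backends may still introduce small run-to-run differences, so this is a sketch of the main idea rather than a guarantee of identical results.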
