Creating a random forest function

I am trying to create a function that takes a 2-D NumPy array (i.e. the data) and data_indices (a list of (train_indices, test_indices) tuples) as input. For each (train_indices, test_indices) tuple in data_indices, the function should:
Train a new RandomForestRegressor model on the portion of data indexed by train_indices.
Evaluate the trained RandomForestRegressor model on the portion of data indexed by test_indices using the mean squared error.
After training and evaluating the RandomForestRegressor models, the function should return the model that obtained the highest testing-set mean squared error over its allocated data split, across all trained models.
Each RandomForestRegressor model should be trained with random_state equal to 42; all other parameters should be left as defaults.
This is how I tried to do it in pandas:
def best_k_model(data, data_indices):
    model = RandomForestRegressor(random_state=42)
    for train_indices, test_indices in data:
        X_train, y_train = data[train_indices, 0], data[train_indices, 1]
        X_test, y_test = data[test_indices, 0], data[test_indices, 1]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        MSE = mean_squared_error(y_test, y_pred)
    return MSE.max()
After this, given the inputs:
best_model = best_k_model(data,data_indices)
best_model.predict([[1960]])
this is what I am supposed to get:
array([8.85170916e+08])
But I'm getting the error that says:
index 1960 is out of bounds for axis 0 with size 58
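The error comes from the loop for train_indices, test_indices in data: it iterates over the 58 rows of data itself rather than over data_indices, so a data value such as 1960 ends up being used as a row index. The function also returns a single MSE number rather than a model. Below is a minimal sketch of how the function might look, assuming (as the indexing above suggests) that column 0 of data holds the feature and column 1 the target, and following the "highest test MSE" wording of the task:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def best_k_model(data, data_indices):
    best_mse, best_model = None, None
    for train_indices, test_indices in data_indices:  # iterate over the splits, not the data
        # column 0 is the feature, column 1 the target; sklearn expects X as a 2-D array
        X_train, y_train = data[train_indices, 0].reshape(-1, 1), data[train_indices, 1]
        X_test, y_test = data[test_indices, 0].reshape(-1, 1), data[test_indices, 1]
        model = RandomForestRegressor(random_state=42)  # a fresh model for every split
        model.fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        # keep the model with the highest test MSE, as the task is worded;
        # change > to < if the lowest MSE is what is actually intended
        if best_mse is None or mse > best_mse:
            best_mse, best_model = mse, model
    return best_model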

Related

How to extract model after using sklearn.RepeatedStratifiedKFold()

I have the following code:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
model = XGBClassifier()
scores = cross_val_score(model, X_train, y_train, scoring='f1', cv=cv, n_jobs=2)
I want to test my model on the test set:
model.predict(x_test)
which raises an error: XGBoostError: need to call fit or load_model beforehand.
How do I get the trained model after k-fold validation? Which function should I use? cross_val_score returns only the scores on each split, not the trained model, which I need for further evaluation.
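cross_val_score clones the estimator internally, so model itself is never fitted; that is why predict raises the error. A sketch of two common options, reusing model, cv, X_train, y_train and x_test from the question: let cross_validate hand back the fitted estimator from each split, or refit one final model on the whole training set:

from sklearn.model_selection import cross_validate

# Option 1: keep the fitted estimator from each split
results = cross_validate(model, X_train, y_train, scoring='f1',
                         cv=cv, n_jobs=2, return_estimator=True)
fold_models = results['estimator']          # one fitted model per split
best_fold = results['test_score'].argmax()  # e.g. pick the best-scoring split
y_pred = fold_models[best_fold].predict(x_test)

# Option 2: refit a single final model on all the training data
model.fit(X_train, y_train)
y_pred = model.predict(x_test)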

Model validation with separate dataset

Apologies, quite new to sklearn. I'm trying to validate a model using an external dataset for binary classification of text strings. I've trained the model but want to use it against another dataset of a different size for prediction rather than include the data in the initial dataset split. Is this even possible?
Initial split
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(3, 3))
Xprod = vectorizer.fit_transform(prod_good)
X = vectorizer.fit_transform(total_requests)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=21)
Test the model
linear_svm=LinearSVC(C=1)
linear_svm.fit(X_train, y_train)
y_pred = linear_svm.predict(X_test)
score_test = metrics.accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
New prediction
newpred = linear_svm.predict(Xprod)
...
Error:
ValueError: X has 4553 features per sample; expecting 24422
Think I'm misunderstanding some basic concepts here.
The fit_transform function does a fit and then a transform. So this line fits your vectorizer and then transforms total_requests into X:
X = vectorizer.fit_transform(total_requests)
Your vectorizer must be fitted only once, so that every dataset you pass through it is mapped into the same matrix of features. To compute Xprod, you therefore just need transform:
Xprod = vectorizer.transform(prod_good)
Also, Xprod has to be computed after the vectorizer is fitted, so compute Xprod after X.
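Putting it together, the corrected flow might look like this sketch (total_requests, y and prod_good are the variables from the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# fit the vectorizer exactly once, on the data used for training
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True,
                             ngram_range=(3, 3))
X = vectorizer.fit_transform(total_requests)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,
                                                    random_state=21)

linear_svm = LinearSVC(C=1)
linear_svm.fit(X_train, y_train)

# only transform the external dataset, so it lands in the same feature space
Xprod = vectorizer.transform(prod_good)
newpred = linear_svm.predict(Xprod)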

How to Build a Decision tree Regressor model

I am learning ML and was doing a simple hands-on exercise, as below:
//
Split boston.data into two sets named x_train and x_test. Also, split boston.target into two sets y_train and y_test.
Build a Decision tree Regressor model from x_train set, with default parameters.
//
I wrote the following code for this:
from sklearn import datasets, model_selection, tree
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = model_selection.train_test_split(boston.data,boston.target, random_state=30)
dt = tree.DecisionTreeRegressor()
dt_reg = dt.fit(x_train)
When I run the above, it gives:
TypeError: fit() missing 1 required positional argument: 'y'
Can I fit a model for one training dataset?
What should I give here as 'y'?
As the error states, the fit() method takes 2 parameters for a regression problem, the predictors and the outcome:
dt_reg = dt.fit(x_train, y_train)
Supervised learning models such as the regression tree you are using require a set of observations composed of features (each row of x_train can be understood as a vector containing the features for one observation) and a target outcome (each element in the vector y_train).
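For completeness, a sketch of the whole exercise with the fix applied (note that load_boston was removed in scikit-learn 1.2, so this mirrors the question's older environment):

from sklearn import datasets, model_selection, tree
from sklearn.metrics import mean_squared_error

# load_boston is only available in scikit-learn < 1.2
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    boston.data, boston.target, random_state=30)

dt = tree.DecisionTreeRegressor()   # default parameters
dt_reg = dt.fit(x_train, y_train)   # features and target together

y_pred = dt_reg.predict(x_test)
print(mean_squared_error(y_test, y_pred))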

What does "ValueError: When feeding symbolic tensors to a model, we expect the tensors to have a static batch size" mean?

I am using Keras to build a neural network for predicting diabetes. However, I encountered a ValueError: When feeding symbolic tensors to a model, we expect the tensors to have a static batch size.
I tried changing the input shapes but I am still stuck.
num_classes = 2
from keras.layers import Input, Dense
from keras.models import Model
# This returns a tensor
inputs = Input(shape=(784,))
# a layer instance is callable on a tensor, and returns a tensor
x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='sigmoid')(x)
# This creates a model that includes
# the Input layer and three Dense layers
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x,y) # starts training
After running it, I get:
ValueError: When feeding symbolic tensors to a model, we expect the tensors to have a static batch size.
Because of these lines, x is a symbolic tensor (the output of a Dense layer), not data:
x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)
The model should be fitted on actual data, but instead you pass in that symbolic tensor:
model.fit(x, y)  # starts training
Simply put, Keras tries to treat your symbolic x as a data tensor and fails. To fix this, make sure the x you pass in is your actual training data:
model.fit(x_train, y_train)  # starts training
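As a sketch, here are made-up placeholder arrays standing in for the real diabetes data. The shapes must match the model: 784 input features and, with Dense(10) plus binary_crossentropy as written, a 10-dimensional 0/1 target per sample (for a genuine two-class problem the last layer would more typically be Dense(1, activation='sigmoid')):

import numpy as np

# placeholder data tensors, not symbolic tensors
x_train = np.random.random((32, 784))          # 32 samples, 784 features each
y_train = np.random.randint(2, size=(32, 10))  # matches the Dense(10) output

model.fit(x_train, y_train)  # starts training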

How to get support vector number after cross validation

Here is my code for digit classification using a non-linear SVM. I apply a cross-validation scheme to select the hyperparameters C and gamma. But the model returned by GridSearchCV does not have an n_support_ attribute from which to get the number of support vectors.
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.cross_validation import ShuffleSplit

# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Initialize an SVM estimator
estimator = SVC(kernel='rbf', C=1, gamma=1)

# Choose a cross-validation iterator
cv = ShuffleSplit(X_train.shape[0], n_iter=5, test_size=0.2, random_state=0)

# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4, 1, 2, 10],
                     'C': [1, 10, 50, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# Begin the cross-validation task to get the best model with the best
# parameters. After this task, clf holds the best model with the best
# parameters C and gamma.
clf = GridSearchCV(estimator=estimator, cv=cv, param_grid=tuned_parameters)
clf.fit(X_train, y_train)

print()
print("Best parameters: ")
print(clf.get_params)
print("error test set with clf1", clf.score(X_test, y_test))
print("error training set with clf1", clf.score(X_train, y_train))

# It does not work. So, how can I recover the number of support vectors?
print("Number of support vectors by class", clf.n_support_)
Here is my method: I train a new SVM object with the best parameters, and I note that it gets the same test and train errors as clf:
clf2 = SVC(C=10, gamma=0.001)
clf2.fit(X_train, y_train)
print("error test set with clf2", clf2.score(X_test, y_test))
print("error training set with clf1", clf.score(X_train, y_train))
print(clf2.n_support_)
Any comments on whether my proposed method is right?
GridSearchCV fits a number of models. You can get the best one with clf.best_estimator_. On it, clf.best_estimator_.n_support_ gives the number of support vectors per class (so clf.best_estimator_.n_support_.sum() is the total), and clf.best_estimator_.support_ gives the indices of the support vectors in your training set.
You can also get the parameters and the score of the best model with clf.best_params_ and clf.best_score_ respectively.
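In code, using the clf from the question (by default GridSearchCV refits the best estimator on the whole training set, so these attributes are available right after clf.fit):

best_svm = clf.best_estimator_
print("Best parameters:", clf.best_params_)
print("Best CV score:", clf.best_score_)
print("Support vectors per class:", best_svm.n_support_)
print("Total number of support vectors:", best_svm.n_support_.sum())
print("Indices of the support vectors:", best_svm.support_)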
