I am trying to find an optimal value using scipy.optimize.brute.
One of the features of the trained model takes a value between 0 and 55.
I need to find the value to assign to this feature so that the predicted value is as close as possible to 900. Can someone help me with the Python code?
from scipy import optimize
import numpy as np

target_temper = 900  # The optimal temperature
x_range = (0, 55)

def predictor(x):
    a = xg_reg.predict(x) - target_temper
    return np.abs(a)

resbrute = optimize.brute(predictor, x_range, full_output=True, finish=optimize.fmin)
...
Finally figured out the solution!
target_temper = 953  # The optimal value
x_col_name = 'Навеска фторида, кг(t)'  # The variable whose values I need to iterate over from 0 to 66
x_range = (0, 66)
x_step = 0.5
rrange = (slice(x_range[0], x_range[1], x_step),)

def predictor(n):
    data_line = X.tail(1)
    data_line[x_col_name] = n
    a = xg_reg.predict(data_line) - target_temper
    return np.abs(a)

resbrute = optimize.brute(predictor, rrange, full_output=True, finish=optimize.fmin)
optimal_value = resbrute[0]
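For reference, with full_output=True, optimize.brute returns a 4-tuple: the minimizing parameter (resbrute[0] above), the objective value at that point, the evaluation grid, and the objective values over the grid. A small sketch of unpacking it, assuming the predictor and rrange defined above:
x_best, f_best, grid, f_grid = optimize.brute(
    predictor, rrange, full_output=True, finish=optimize.fmin)

print(x_best)   # feature value whose prediction is closest to target_temper
print(f_best)   # |prediction - target_temper| at that value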
I am trying to optimize the expected improvement function for Bayesian optimization applications. For this, I am using the scikit-learn Gaussian process model embedded in the GEKKO optimization suite. When solving the optimization model, the following error is shown:
#error: Model Expression
*** Error in syntax of function string: Missing operator
Position: 128
((0.5)((1+(((2/pi))(atan(((((2)((((v1-i320))/(((2)(sqrt(2))))))))((1+(((((v1-i320))/(((2)(sqrt(2))))))^(4))))))))-(0.0)))=0)
The code is below
import numpy as np
import pandas as pd
from gekko import GEKKO
from gekko.ML import Gekko_GPR
from gekko.ML import CustomMinMaxGekkoScaler
import sklearn.gaussian_process as gpr
# Training data
x_train = np.array([0.6, 0.9, 0.3, 0.45, 1.05, 0.75, 0.15,
0.225, 0.825, 1.125]).reshape(-1,1)
y_train = np.array([-0.809016994, 0.809016994, -0.309016994, -0.951056516,
0.951056516, -1.83772E-16, 0.587785252, 0.156434465,
0.4539905, 0.707106781]).reshape(-1,1)
# Additional information
lb = [0.0] # lower bound
ub = [1.2] # upper bound
n_dim = len(lb) # number of dimension
n_train = x_train.shape[0] # size of the training set
# Function to fit the Gaussian process
def gp_fit(data_s, gp_reg):
    d_array = data_s.to_numpy()
    x_tr = d_array[:,1].reshape(-1,1)
    y_tr = d_array[:,-1].reshape(-1,1)
    gp_model = gp_reg.fit(x_tr, y_tr)
    return gp_model  # it delivers the gp model object
# gekko scaler definition
data = pd.DataFrame(np.hstack((x_train, y_train)), columns=['x', 'y'])
features = ['x']
label = ['y']
scaler = CustomMinMaxGekkoScaler(data,features,label)
data_s = scaler.scaledData() # data scaled
# kernel and gp regressor definition
bounds_m = (1e-4, 3000) # bounds for the hyperparameters
kernel_main = gpr.kernels.Matern(length_scale=np.ones(n_dim),
length_scale_bounds=bounds_m,
nu=2.5)
constant_kernel = gpr.kernels.ConstantKernel(1.0, constant_value_bounds=bounds_m)
white_kernel = gpr.kernels.WhiteKernel(1.0, noise_level_bounds=(1.13e-07, 1.83e-02))
K_cov = constant_kernel*kernel_main + white_kernel
gp_regressor = gpr.GaussianProcessRegressor(kernel=K_cov, alpha=1e-8,
optimizer='fmin_l_bfgs_b',
n_restarts_optimizer=50,
random_state=20)
# gp_model creation
gp_model = gp_fit(data_s, gp_regressor) # training the model with the scaled data
# gekko model definition and solution
m = GEKKO(remote=False) # model definition
x = m.Var(0.4, lb=0, ub=1) # definition of variables scaled
y, std = Gekko_GPR(gp_model, m).predict(x, return_std=True) # gp prediction with std
# constants
epsilon = m.Const(0.01, 'epsilon')
best_y = m.Const(1.0, 'best_y')
pi_m = m.Const(np.pi, 'pi')
# equations
Z = (y - best_y - epsilon)/std == 0.0
pdf = 1/(std*m.sqrt(2*pi_m))*m.exp(-0.5*((x-y)/std)**2) == 0.0
erf = 2/pi_m*m.atan(2*((x-y)/(2*m.sqrt(2)))*(1+((x-y)/(2*m.sqrt(2)))**4)) == 0.0
cdf = 0.5*(1+erf) == 0
m.Equations([Z, pdf, erf, cdf])
# objective function
ei = Z*std*cdf + std*pdf
m.Maximize(ei)
m.options.IMODE = 3 # steady state optimization
m.solve(disp=True)
I was able to fix your error, but I am unable to get it fully working. Here is what I suggest:
For your objective function and cdf function, you are using Gekko equations as if they were variables (e.g. erf). I suggest reformulating some of that with Gekko Intermediate values, like so:
# equations
tZ = m.Intermediate((y - best_y - epsilon)/std)
Z = tZ == 0.0
tpdf = m.Intermediate(1/(std*m.sqrt(2*pi_m))*m.exp(-0.5*((x-y)/std)**2))
pdf = tpdf == 0.0
terf = m.Intermediate(2/pi_m*m.atan(2*((x-y)/(2*m.sqrt(2)))*(1+((x-y)/(2*m.sqrt(2)))**4)))
erf = terf == 0.0
tcdf = m.Intermediate(0.5*(1+terf))
cdf = tcdf == 0.0
m.Equations([Z, pdf, erf, cdf])
# objective function
ei = tZ*std*tcdf + std*tpdf
Changing this causes Gekko to throw a "TOO_FEW_DEGREES_OF_FREEDOM" error, as you are trying to solve 4 equations with 1 variable. I suggest making these equations a soft constraint (minimizing them rather than setting them to 0) or adding additional variables to the problem statement.
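If you go the soft-constraint route, a minimal sketch (an assumed reformulation on my part, not a tested fix) would be to drop the hard ==0 equations and penalize the residuals in the objective instead, so x keeps a degree of freedom:
# Sketch only: reuses the Intermediates tZ, tpdf, tcdf from above and assumes
# a hand-picked penalty weight w; tune w for your problem.
w = 100.0
ei = tZ*std*tcdf + std*tpdf                       # expected improvement expression
m.Maximize(ei - w*(tZ**2 + tpdf**2 + tcdf**2))    # soft version of the ==0 equations
m.options.IMODE = 3
m.solve(disp=True)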
I am trying to train a basic SVM model for multiclass text classification in Julia. My dataset has around 75K rows and 2 columns (text and label); it consists of abstracts of scientific papers gathered from PubMed, with 10 labels.
I keep receiving two different MethodErrors. The first one is:
ERROR: MethodError: no method matching DocumentTermMatrix(::Vector{String})
I have tried:
convert(Array,data[:,:text])
and also:
convert(Matrix,data[:,:text])
Array conversion gives the same error, and matrix conversion gives:
ERROR: MethodError: no method matching (Matrix)(::Vector{String})
My code is:
using DataFrames, CSV, StatsBase, Printf, LIBSVM, TextAnalysis, Random

function ReadData(data)
    df = CSV.read(data, DataFrame)
    return df
end

function splitdf(df, pct)
    @assert 0 <= pct <= 1
    ids = collect(axes(df, 1))
    shuffle!(ids)
    sel = ids .<= nrow(df) .* pct
    return view(df, sel, :), view(df, .!sel, :)
end

function Feature_Extract(data)
    Text = convert(Array, data[:, :text])
    m = DocumentTermMatrix(Text)
    X = tfidf(m)
    return X
end

function Classify(data)
    data = ReadData(data)
    train, test = splitdf(data, 0.5)
    ytrain = train.label
    ytest = test.label
    Xtrain = Feature_Extract(train)
    Xtest = Feature_Extract(test)
    model = svmtrain(Xtrain, ytrain)
    ŷ, decision_values = svmpredict(model, Xtest)
    @printf "Accuracy: %.2f%%\n" mean(ŷ .== ytest) * 100
end

data = "data/composite_data.csv"
@time Classify(data)
I would appreciate your help with solving this problem.
EDIT:
I have managed to create the corpus, but I am now facing a DimensionMismatch error:
using DataFrames, CSV, StatsBase, Printf, LIBSVM, TextAnalysis, Random

function ReadData(data)
    df = CSV.read(data, DataFrame)
    # count = countmap(df.label)
    # println(count)
    # amt, lesslabel = findmin(count)
    # println(amt, lesslabel)
    # println(first(df, 5))
    return df
end

function splitdf(df, pct)
    @assert 0 <= pct <= 1
    ids = collect(axes(df, 1))
    shuffle!(ids)
    sel = ids .<= nrow(df) .* pct
    return view(df, sel, :), view(df, .!sel, :)
end

function Feature_Extract(data)
    crps = Corpus(StringDocument.(data.text))
    update_lexicon!(crps)
    m = DocumentTermMatrix(crps)
    X = tf_idf(m)
    return X
end

function Classify(data)
    data = ReadData(data)
    # println(labels)
    # println(first(instances))
    train, test = splitdf(data, 0.5)
    ytrain = train.label
    ytest = test.label
    Xtrain = Feature_Extract(train)
    Xtest = Feature_Extract(test)
    model = svmtrain(Xtrain, ytrain)
    ŷ, decision_values = svmpredict(model, Xtest)
    @printf "Accuracy: %.2f%%\n" mean(ŷ .== ytest) * 100
end

data = "data/composite_data.csv"
@time Classify(data)
Error:
ERROR: DimensionMismatch("Size of second dimension of training instance\n matrix (247317) does not match length of\n labels (38263)")
(Copying Bogumił Kamiński's solution from the comments, as a community wiki answer, for better visibility.)
The argument to DocumentTermMatrix should be of type Corpus, as in this example.
A Corpus can be created with:
Corpus(StringDocument.(data.text))
There's a DimensionMismatch error after that, which is due to the mismatch between what tf_idf returns and what svmtrain expects: tf_idf's return value has one row per document, whereas svmtrain expects one column per document, i.e. it expects each column to be an X value. So performing a permutedims on the result before passing it to svmtrain resolves the mismatch.
I am trying to hyperparameter-tune XGBoostClassifier using Hyperopt, but I am facing an error. Please find below the code that I am using, as well as the error:
Step_1: Objective Function
import csv
import numpy as np
import xgboost as xgb
from hyperopt import STATUS_OK
from timeit import default_timer as timer

MAX_EVALS = 200
N_FOLDS = 10

def objective(params, n_folds = N_FOLDS):
    """Objective function for XGBoost Hyperparameter Optimization"""

    # Keep track of evals
    global ITERATION
    ITERATION += 1

    # # Retrieve the subsample if present otherwise set to 1.0
    # subsample = params['boosting_type'].get('subsample', 1.0)
    # # Extract the boosting type
    # params['boosting_type'] = params['boosting_type']['boosting_type']
    # params['subsample'] = subsample

    # Make sure parameters that need to be integers are integers
    for parameter_name in ['max_depth', 'colsample_bytree', 'min_child_weight']:
        params[parameter_name] = int(params[parameter_name])

    start = timer()

    # Perform n_folds cross validation
    cv_results = xgb.cv(params, train_set, num_boost_round = 10000,
                        nfold = n_folds, early_stopping_rounds = 100,
                        metrics = 'auc', seed = 50)

    run_time = timer() - start

    # Extract the best score
    best_score = np.max(cv_results['auc-mean'])

    # Loss must be minimized
    loss = 1 - best_score

    # Boosting rounds that returned the highest cv score
    n_estimators = int(np.argmax(cv_results['auc-mean']) + 1)

    # Write to the csv file ('a' means append)
    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([loss, params, ITERATION, n_estimators, run_time])

    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'iteration': ITERATION,
            'estimators': n_estimators, 'train_time': run_time,
            'status': STATUS_OK}
I have defined the sample space and the optimization algorithm as well. While running Hyperopt, I am encountering the error below, which occurs inside the objective function.
Error: KeyError: 'auc-mean'
<ipython-input-62-8d4e97f16929> in objective(params, n_folds)
25 run_time = timer() - start
26 # Extract the best score
---> 27 best_score = np.max(cv_results['auc-mean'])
28 # Loss must be minimized
29 loss = 1 - best_score
First, print cv_results and see which key exists.
In the example notebook below, the keys were: 'test-auc-mean' and 'train-auc-mean'.
See cell 5 here:
https://www.kaggle.com/tilii7/bayesian-optimization-of-xgboost-parameters
@avvinci is correct. Let me explain it further.
cv_results = xgb.cv(params, train_set, num_boost_round = 10000,
nfold = n_folds, early_stopping_rounds = 100,
metrics = 'auc', seed = 50)
This is XGBoost cross validation and it returns the evaluation history. The history is essentially a pandas DataFrame. The column names in the DataFrame depend upon what is being passed in for train, test, and eval.
best_score = np.max(cv_results['auc-mean'])
Here you are looking for the best AUC in the evaluation history, where the relevant columns are called
'test-auc-mean' and 'train-auc-mean'
as @avvinci suggested. The column name 'auc-mean' does not exist, so it throws a KeyError. Use train-auc-mean for the best AUC on the training set, or test-auc-mean for the best AUC on the test set.
If you are in doubt, just run that cross validation outside the objective function and call head on cv_results.
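A minimal sketch of the resulting fix inside the objective function (assuming the default column naming that xgb.cv produces when metrics='auc'):
print(cv_results.head())  # inspect the actual column names, e.g. 'test-auc-mean'

best_score = np.max(cv_results['test-auc-mean'])               # best cross-validated AUC
loss = 1 - best_score                                          # hyperopt minimizes this
n_estimators = int(np.argmax(cv_results['test-auc-mean']) + 1) # best boosting round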
I need to implement a multi-label image classification model in PyTorch. However, my data is not balanced, so I used the WeightedRandomSampler in PyTorch to create a custom dataloader. But when I iterate through the custom dataloader, I get the error: IndexError: list index out of range
I implemented the following code using this link: https://discuss.pytorch.org/t/balanced-sampling-between-classes-with-torchvision-dataloader/2703/3?u=surajsubramanian
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_weights_for_balanced_classes(images, nclasses):
    count = [0] * nclasses
    for item in images:
        count[item[1]] += 1
    weight_per_class = [0.] * nclasses
    N = float(sum(count))
    for i in range(nclasses):
        weight_per_class[i] = N/float(count[i])
    weight = [0] * len(images)
    for idx, val in enumerate(images):
        weight[idx] = weight_per_class[val[1]]
    return weight

weights = make_weights_for_balanced_classes(train_dataset.imgs, len(full_dataset.classes))
weights = torch.DoubleTensor(weights)
sampler = WeightedRandomSampler(weights, len(weights))

train_loader = DataLoader(train_dataset, batch_size=4, sampler=sampler, pin_memory=True)
Based on the answer in https://stackoverflow.com/a/60813495/10077354, the following is my updated code. But even then, when I create a dataloader with loader = DataLoader(full_dataset, batch_size=4, sampler=sampler), len(loader) returns 1.
class_counts = [1691, 743, 2278, 1271]
num_samples = np.sum(class_counts)
labels = [tag for _, tag in full_dataset.imgs]

class_weights = [num_samples/class_counts[i] for i in range(len(class_counts))]
weights = [class_weights[labels[i]] for i in range(num_samples)]
sampler = WeightedRandomSampler(torch.DoubleTensor(weights), num_samples)
Thanks a lot in advance!
I included a utility function based on the accepted answer below:
def sampler_(dataset):
    dataset_counts = imageCount(dataset)
    num_samples = sum(dataset_counts)
    labels = [tag for _, tag in dataset]

    class_weights = [num_samples/dataset_counts[i] for i in range(n_classes)]
    weights = [class_weights[labels[i]] for i in range(num_samples)]
    sampler = WeightedRandomSampler(torch.DoubleTensor(weights), int(num_samples))
    return sampler
The imageCount function finds the number of images of each class in the dataset. Each row in the dataset contains the image and the class, so we take the second element of the tuple into consideration.
def imageCount(dataset):
    image_count = [0]*(n_classes)
    for img in dataset:
        image_count[img[1]] += 1
    return image_count
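A hypothetical usage sketch for the utility above (dataset name and batch size assumed; note that sampler and shuffle are mutually exclusive in DataLoader):
from torch.utils.data import DataLoader

train_sampler = sampler_(train_dataset)                 # weighted sampler for the train set
train_loader = DataLoader(train_dataset, batch_size=4,
                          sampler=train_sampler,        # don't also pass shuffle=True
                          pin_memory=True)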
That code looks a bit complex... You can try the following:
#Let there be 9 samples and 1 sample in class 0 and 1 respectively
class_counts = [9.0, 1.0]
num_samples = sum(class_counts)
labels = [0, 0,..., 0, 1] #corresponding labels of samples
class_weights = [num_samples/class_counts[i] for i in range(len(class_counts))]
weights = [class_weights[labels[i]] for i in range(int(num_samples))]
sampler = WeightedRandomSampler(torch.DoubleTensor(weights), int(num_samples))
Here is an alternative solution:
import numpy as np
from torch.utils.data.sampler import WeightedRandomSampler
counts = np.bincount(y)
labels_weights = 1. / counts
weights = labels_weights[y]
WeightedRandomSampler(weights, len(weights))
where y is the list of labels corresponding to each sample; it has shape (n_samples,) and its values are encoded as integers in [0, ..., n_classes - 1].
The weights won't add up to 1, which is fine according to the official docs.
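A quick sanity check for this approach (a toy sketch with made-up labels): draw indices from the sampler and confirm the sampled class counts come out roughly balanced.
import numpy as np
from torch.utils.data.sampler import WeightedRandomSampler

y = np.array([0]*9 + [1])              # toy labels: 9 samples of class 0, 1 of class 1
counts = np.bincount(y)
weights = (1. / counts)[y]             # per-sample weights as above

sampler = WeightedRandomSampler(weights, 1000)   # draw 1000 indices with replacement
sampled = np.array(list(sampler))
print(np.bincount(y[sampled]))                   # roughly [500, 500]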
I want to calculate the information gain on the 20_newsgroup data set.
I am using the code here (I also put a copy of the code below the question).
As you can see, the input to the algorithm is X, y.
My confusion is that X is going to be a matrix with documents as rows and features as columns (for 20_newsgroup it is 11314 x 1000, if I only consider 1000 features).
But according to the concept of information gain, it should calculate the information gain for each feature.
(So I was expecting the code to loop through each feature, with the input to the function being a matrix where rows are features and columns are classes.)
But here the rows of X are not features, they are documents, and I cannot see the part of the code that takes care of this (I mean considering each document and then going through each feature of that document; i.e. looping through rows while at the same time looping through columns, since the features are stored in the columns).
I have read this and this and many similar questions, but they are not clear in terms of the input matrix shape.
This is the code for reading 20_newsgroup:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

newsgroup_train = fetch_20newsgroups(subset='train')
X, y = newsgroup_train.data, newsgroup_train.target

cv = CountVectorizer(max_df=0.99, min_df=0.001, max_features=1000,
                     stop_words='english', lowercase=True, analyzer='word')
X_vec = cv.fit_transform(X)
X_vec.shape is (11314, 1000), so its rows are documents, not the features of the 20_newsgroup data set. Am I calculating information gain in an incorrect way?
This is the code for information gain:
import numpy as np

def information_gain(X, y):
    def _calIg():
        entropy_x_set = 0
        entropy_x_not_set = 0
        for c in classCnt:
            probs = classCnt[c] / float(featureTot)
            entropy_x_set = entropy_x_set - probs * np.log(probs)
            probs = (classTotCnt[c] - classCnt[c]) / float(tot - featureTot)
            entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        for c in classTotCnt:
            if c not in classCnt:
                probs = classTotCnt[c] / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                                 + ((tot - featureTot) / float(tot)) * entropy_x_not_set)

    tot = X.shape[0]
    classTotCnt = {}
    entropy_before = 0
    for i in y:
        if i not in classTotCnt:
            classTotCnt[i] = 1
        else:
            classTotCnt[i] = classTotCnt[i] + 1
    for c in classTotCnt:
        probs = classTotCnt[c] / float(tot)
        entropy_before = entropy_before - probs * np.log(probs)

    nz = X.T.nonzero()
    pre = 0
    classCnt = {}
    featureTot = 0
    information_gain = []
    for i in range(0, len(nz[0])):
        if (i != 0 and nz[0][i] != pre):
            for notappear in range(pre+1, nz[0][i]):
                information_gain.append(0)
            ig = _calIg()
            information_gain.append(ig)
            pre = nz[0][i]
            classCnt = {}
            featureTot = 0
        featureTot = featureTot + 1
        yclass = y[nz[1][i]]
        if yclass not in classCnt:
            classCnt[yclass] = 1
        else:
            classCnt[yclass] = classCnt[yclass] + 1
    ig = _calIg()
    information_gain.append(ig)

    return np.asarray(information_gain)
Well, after going through the code in detail, I learned more about X.T.nonzero().
It is actually correct that information gain needs to loop through features, and it is also correct that the matrix scikit-learn gives us here is document-by-feature.
But:
The code uses X.T.nonzero(), which returns the indices of all the nonzero values as a pair of arrays, and the next line then loops through the length of the first array with range(0, len(X.T.nonzero()[0])).
Overall, X.T.nonzero()[0] returns the feature index of every nonzero entry, so the loop is in fact iterating over the features :)
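A small illustration of that (a hypothetical 3-document x 4-feature count matrix, using a dense NumPy array for clarity): after transposing, the first array returned by nonzero() holds feature indices and the second holds the documents in which each feature occurs, which is exactly what the loop above walks over.
import numpy as np

X = np.array([[1, 0, 2, 0],    # document 0
              [0, 3, 0, 0],    # document 1
              [4, 0, 0, 5]])   # document 2

rows, cols = X.T.nonzero()
print(rows)  # [0 0 1 2 3] -> feature index of each nonzero entry
print(cols)  # [0 2 1 0 2] -> document index where that feature occurs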