To improve on this tutorial and test other things, I pretrained the network in a centralized way on the EMNIST database. I would now like to fine-tune the pretrained network with the federated code above.
So, I only added:
def create_keras_model():
    return tf.keras.models.Sequential([
        # load the centrally pretrained network (placeholder path)
        tf.keras.models.load_model('path/to/model', compile=False),
        tf.keras.layers.Dense(10, kernel_initializer='zeros'),
        tf.keras.layers.Softmax(),
    ])
The problem is that I get the same test accuracy values as I do without fine-tuning a pretrained network.
Can you please suggest a solution?
I would like to use LightGBM to train a machine learning model.
I checked the example at https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/advanced_example.py
I have some questions about the correctness of the code.
(1) What kinds of models can be created with lightgbm.train()?
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html
Is it a regressor or a classifier?
(2) Why is the test dataset also used for training? How can we be sure that the test results are still valid?
# line 31
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,
                       weight=W_test, free_raw_data=False)
# line 52
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_train,  # eval training data with test data !!!
                feature_name=feature_name,
                categorical_feature=[21])
# line 84
y_pred = bst.predict(X_test)  # why x_test is also used to predict y? X_test has been involved in training the model !!!
Thanks
You can train both regression and classification models with lgb.train. Which one you get depends on the parameters you define, namely objective.
The validation set (valid_sets) is used only for evaluation; it is not used for training.
I am trying to figure out how to train a GBDT classifier with LightGBM in Python, but I am getting confused by the example provided on the official website.
Following the steps listed, I find that validation_data comes from nowhere, and there is no clue about the format of valid_data or about the benefit of training the model with or without it.
Another question that comes with it: the documentation says that "the validation data should be aligned with training data", while in the Dataset details I find another statement saying "If this is Dataset for validation, training data should be used as reference".
My final questions are: why should validation data be aligned with training data? What is the meaning of reference in Dataset, and how is it used during training? Is the alignment goal accomplished by setting reference to the training data? What is the difference between this "reference" strategy and cross-validation?
I hope someone can help me out of this maze, thanks!
The idea of "validation data should be aligned with training data" is simple:
every preprocessing step you apply to the training data should be applied in the same way to the validation data, and of course in production. This applies to every ML algorithm.
For example, for a neural network, you will often normalize your training inputs (subtract the mean and divide by the standard deviation).
Suppose you have a variable "age" with a mean of 26 years in training. It will be mapped to "0" for the training of your neural network. For validation data, you want to normalize in the same way as for training data (using the training mean and the training std) so that an age of 26 in validation is still mapped to 0 (same value -> same prediction).
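A small sketch of that "age" example, with made-up numbers chosen so that the training mean is 26:

```python
import numpy as np

# Made-up ages (assumption); the training mean is exactly 26
age_train = np.array([20.0, 24.0, 26.0, 30.0, 30.0])
age_valid = np.array([26.0, 40.0])

# Statistics come from the training data only
mean, std = age_train.mean(), age_train.std()

# The same mean and std are reused for validation
train_scaled = (age_train - mean) / std
valid_scaled = (age_valid - mean) / std
```

Here an age of 26 is mapped to 0 in both arrays, because the validation data is scaled with the training statistics rather than its own.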
This is the same for LightGBM. The data will be "bucketed" (in short, every continuous value will be discretized), and you want continuous values to map to the same bins in training and in validation. Those bins are calculated from the "reference" dataset.
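To make the bucketing idea concrete, here is a sketch with plain NumPy. It uses simple quantile bins as a stand-in; that is an assumption for illustration and not LightGBM's actual binning algorithm. The point is only that the bin edges are computed once from the training values and then reused for validation.

```python
import numpy as np

rng = np.random.default_rng(0)
train_values = rng.normal(size=1000)  # synthetic feature column
valid_values = rng.normal(size=200)

# 9 interior edges computed from the training data only -> 10 bins
edges = np.quantile(train_values, np.linspace(0, 1, 11)[1:-1])

# Both sets are discretized with the same edges
train_bins = np.digitize(train_values, edges)
valid_bins = np.digitize(valid_values, edges)
```

If the validation set computed its own edges, the same raw value could land in a different bin than it did in training.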
Regarding training without validation, this is something you don't want to do most of the time! It is very easy to overfit the training data with boosted trees if you don't have a validation set to adjust parameters such as "num_boost_round".
Everything is still tricky.
Can you share a full example with and without this "reference="?
For example,
will this be different
import lightgbm as lgbm
importance_type_LGB = 'gain'
d_train = lgbm.Dataset(train_data_with_NANs, label= target_train)
d_valid = lgbm.Dataset(train_data_with_NANs, reference= target_train)
lgb_clf = lgbm.LGBMClassifier(class_weight = 'balanced' ,importance_type = importance_type_LGB)
lgb_clf.fit(test_data_with_NANs,target_train)
test_data_predict_proba_lgb = lgb_clf.predict_proba(test_data_with_NANs)
from
import lightgbm as lgbm
importance_type_LGB = 'gain'
lgb_clf = lgbm.LGBMClassifier(class_weight = 'balanced' ,importance_type = importance_type_LGB)
lgb_clf.fit(test_data_with_NANs,target_train)
test_data_predict_proba_lgb = lgb_clf.predict_proba(test_data_with_NANs)
Hi, I am solving a regression problem. My data set consists of 13 features and 550068 rows. I tried several different models and found that boosting algorithms (i.e. xgboost, catboost, lightgbm) perform well on that big data set. Here is the code:
import numpy as np
from sklearn.metrics import mean_squared_error

import lightgbm as lgb
gbm = lgb.LGBMRegressor(objective='regression', num_leaves=100, learning_rate=0.2, n_estimators=1500)
gbm.fit(x_train, y_train,
        eval_set=[(x_test, y_test)],
        eval_metric='l2_root',
        early_stopping_rounds=10)
y_pred = gbm.predict(x_test, num_iteration=gbm.best_iteration_)
accuracy = round(gbm.score(x_train, y_train)*100, 2)  # score() returns R^2, here on the training set
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

import xgboost as xgb
boost_params = {'eval_metric': 'rmse'}
xgb0 = xgb.XGBRegressor(
    max_depth=8,
    learning_rate=0.1,
    n_estimators=1500,
    objective='reg:linear',
    gamma=0,
    min_child_weight=1,
    subsample=1,
    colsample_bytree=1,
    scale_pos_weight=1,
    seed=27,
    **boost_params)
xgb0.fit(x_train, y_train)
accuracyxgboost = round(xgb0.score(x_train, y_train)*100, 2)
predict_xgboost = xgb0.predict(x_test)
msexgboost = mean_squared_error(y_test, predict_xgboost)
rmsexgboost = np.sqrt(msexgboost)

from catboost import Pool, CatBoostRegressor
train_pool = Pool(x_train, y_train)
cbm0 = CatBoostRegressor(rsm=0.8, depth=7, learning_rate=0.1,
                         eval_metric='RMSE')
cbm0.fit(train_pool)
test_pool = Pool(x_test)
predict_cat = cbm0.predict(test_pool)
acc_cat = round(cbm0.score(x_train, y_train)*100, 2)
msecat = mean_squared_error(y_test, predict_cat)
rmsecat = np.sqrt(msecat)
With the above models I am getting RMSE values of about 2850. Now I want to improve my model performance by reducing the root mean square error. How can I do that? As I am new to boosting algorithms, which parameters affect the models? And how can I do hyperparameter tuning for these algorithms (xgboost, catboost, lightgbm)? I am using Windows 10 and a 7th-generation Intel i5.
Out of the 3 tools that you have tried, CatBoost provides an edge in categorical feature processing (it could also be faster, but I have not seen a benchmark demonstrating it, and it does not seem to dominate on Kaggle, so most likely it is not as quick as LightGBM, but I might be wrong in that hypothesis). So I would use it if I had many categorical features in my sample. The other two (LightGBM and XGBoost) provide very similar functionality, and I would suggest choosing one of them and sticking to it. At the moment it seems that LightGBM outperforms XGBoost in training time on CPU while providing very comparable prediction accuracy. See for example the GBM-perf benchmark on GitHub or this in-depth analysis. If you have GPUs available, then XGBoost in fact seems to be preferable, judging by the benchmark above.
In general, you can improve your model performance in several ways:
train longer (if early stopping was not triggered, there is still room to improve generalisation; if it was, then training the chosen model longer with the chosen hyper-parameters will not improve it further)
optimise hyper-parameters (see below)
choose a different model. There is no single silver bullet for all problems. Typically GBMs work very well on large samples of structured data, but for some classes of problems (e.g. linear dependence) it is hard for a GBM to learn how to generalise, as it might require very many splits. So it might be that for your problem a linear model, an SVM or something else will do better out of the box.
Since we narrowed this down to 2 options, I cannot advise on CatBoost hyper-parameter optimisation, as I have no hands-on experience with it yet. But for LightGBM tuning you can read this official LightGBM doc and these instructions in one of the issues. There are very many good examples of hyper-parameter tuning for LightGBM. I can quickly dig out my kernel on Kaggle: see here. I do not claim it to be perfect, but it is something that is easy for me to find :)
If you are using an Intel CPU, then try Intel XGBoost. Intel has contributed several optimizations to XGBoost that accelerate gradient boosting models and improve training and inference performance. Also, please check out the article https://www.intel.com/content/www/us/en/developer/articles/technical/easy-introduction-xgboost-for-intel-architecture.html#gs.q4c6p6 on how to use XGBoost with Intel optimizations.
You can use either lasso or ridge; these methods could improve the performance.
For hyper-parameter tuning, you can use loops: iterate over the values and check where you get the lowest RMSE.
You can also try stacked ensemble techniques.
If you use R, try the h2o package from H2O.ai; it gives good results.
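The loop-based tuning suggested above could look roughly like this. It is a sketch on synthetic data with scikit-learn's Ridge as the model; the data, the alpha grid and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with 13 features (assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))
y = X @ rng.normal(size=13) + rng.normal(scale=0.5, size=500)

# Hold out a validation split so the choice is not made on training error
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

results = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_valid, model.predict(X_valid)))
    results[alpha] = rmse  # validation RMSE for this setting

# Setting with the lowest validation RMSE
best_alpha = min(results, key=results.get)
```

The same pattern should carry over to boosting parameters such as learning_rate or num_leaves by swapping the estimator and the grid, ideally with a final check on data that was not used for the selection.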
I've been trying to use batch normalization in TensorFlow for a while with no success.
The training loss converges nicely (better than without BN), but the test loss remains high throughout training. I'm using a batch size of 1, but the problem still happens with bigger batch sizes.
What I'm currently doing is:
inputs = tf.layers.batch_normalization(
    inputs=inputs, axis=1 if data_format == 'channels_first' else 3,
    momentum=0.997, epsilon=1e-5, center=True,
    scale=True, training=is_training, fused=True)
inputs = tf.nn.relu(inputs)
is_training is a tf.placeholder that I set to True during training and False during testing, and for the training op I do this:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss, global_step=batch)
I've tried "tf.contrib.layers.batch_norm" too, and several other implementations of BN that I found online, but nothing works. I always get the same problem.
I know that the beta and gamma variables are being updated during training.
But I also noticed that tf.get_collection(tf.GraphKeys.MOVING_AVERAGE_VARIABLES) is an empty collection, which is weird.
Has anyone seen and solved this problem before?
I can't think of any more things to try.
Note: I know the problem is with BN because without it the test loss converges along with the training loss, as expected.
I trained a logistic regression classifier in sklearn. My base feature file has 65 features; I expanded them to about 1000 by also considering quadratic combinations (using PolynomialFeatures()), and then reduced them back to 100 with SelectKBest.
However, once my model is trained and I get a new test file, it only has the 65 base features, while my model expects 100 of them.
So, how can I apply SelectKBest to my test set when I do not know the labels, which are required by SelectKBest.fit()?
You shouldn't fit SelectKBest again on test data; use the same (already fitted) SelectKBest instance as in training instead. That is, on test data you should only call the .transform method, not the .fit method.
scikit-learn provides a utility that makes managing multiple steps like this easier; it is called Pipeline. In your case it should look something like this (using the make_pipeline helper):
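from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(
    PolynomialFeatures(2),
    SelectKBest(k=100),  # k must be passed by keyword; the first positional argument is score_func
    LogisticRegression()
)
pipe.fit(X_train, y_train)      # fits every step on the training data
y_pred = pipe.predict(X_test)   # only transforms the test data, then predicts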