I am trying to run XGBoost for time series analysis. This is my code, which is also used elsewhere:
xgb1 = xgb.XGBRegressor(learning_rate=0.1,n_estimators=n_estimators,max_depth=max_depth,min_child_weight=min_child_weight,gamma=0,subsample=0.8,colsample_bytree=0.8,
reg_alpha=reg_alpha,objective='reg:squarederror', nthread=4, scale_pos_weight=1, seed=27)
xgb_param = xgb1.get_xgb_params()
dmatrix = xgb.DMatrix(data=X_train, label=y_train)
cv_folds = 5
early_stopping_rounds = 50
cvresults = xgb.cv(dtrain=dmatrix, params = xgb_param,num_boost_round=xgb1.get_params()['n_estimators'], nfold=cv_folds,
metrics='rmse', early_stopping_rounds=early_stopping_rounds)
The obvious issue here is that I want to cross-validate time series data, so I can't use ordinary 5-fold splitting (nfold=cv_folds).
(How) can I use scikit-learn's TimeSeriesSplit within xgb.cv?
Thanks,
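One possible approach, as a sketch rather than a definitive answer: xgb.cv has a folds argument that can take a list of (train_indices, test_indices) tuples, so time-ordered splits can be built separately and passed in. The helper below (expanding_window_folds is a made-up name) builds such splits in plain Python, in the same spirit as TimeSeriesSplit. The output of list(TimeSeriesSplit(n_splits=5).split(X_train)) should be usable in the same way.

```python
def expanding_window_folds(n_samples, n_splits=5):
    """Return [(train_idx, test_idx), ...]; each test block follows its training block in time."""
    if n_splits < 1 or n_samples < n_splits + 1:
        raise ValueError("need n_splits >= 1 and at least n_splits + 1 samples")
    test_size = n_samples // (n_splits + 1)
    folds = []
    for i in range(n_splits):
        # The test block slides forward; the training block grows to cover everything before it.
        test_start = n_samples - (n_splits - i) * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        folds.append((train_idx, test_idx))
    return folds


folds = expanding_window_folds(n_samples=12, n_splits=3)
for train_idx, test_idx in folds:
    print(train_idx, test_idx)

# Hypothetical use with the variables from the question (not run here):
# cvresults = xgb.cv(params=xgb_param, dtrain=dmatrix,
#                    num_boost_round=xgb1.get_params()['n_estimators'],
#                    folds=expanding_window_folds(len(X_train), n_splits=5),
#                    metrics='rmse', early_stopping_rounds=early_stopping_rounds)
```

This assumes the rows of X_train are already sorted by time, since the indices are positional.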
Related
I'm running a machine learning model that requires multiple transformations. I applied polynomial transformations, interactions, and also a feature selection using SelectKBest:
transformer = ColumnTransformer(
transformers=[("cat", ce.cat_boost.CatBoostEncoder(), cat_features),]
)
X_train_transformed = transformer.fit_transform(X_train, y_train)
X_test_transformed = transformer.transform(X_test)
poly = PolynomialFeatures(2)
X_train_polynomial = poly.fit_transform(X_train_transformed)
X_test_polynomial = poly.transform(X_test_transformed)
interaction = PolynomialFeatures(2, interaction_only=True)
X_train_interaction = interaction.fit_transform(X_train_polynomial)
X_test_interaction = interaction.transform(X_test_polynomial)
feature_selection = SelectKBest(chi2, k=55)
train_features = feature_selection.fit_transform(X_train_interaction, y_train)
test_features = feature_selection.transform(X_test_interaction)
model = lgb.LGBMClassifier()
model.fit(train_features, y_train)
However, I want to get the names of the selected features, and I have no idea how to get them.
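A minimal sketch of one way to recover the names, assuming scikit-learn 1.0 or later: PolynomialFeatures provides get_feature_names_out(input_features), and SelectKBest provides get_support(), which returns a boolean mask over its input columns. Carrying the names through each step and then filtering them by that mask gives the names of the surviving columns. The helper below shows only the masking step in plain Python; the names and the mask are invented stand-ins, and names_after_selection is a hypothetical helper.

```python
def names_after_selection(feature_names, support_mask):
    """Keep the names whose position is marked True in the selector's support mask."""
    if len(feature_names) != len(support_mask):
        raise ValueError("feature_names and support_mask must have the same length")
    return [name for name, keep in zip(feature_names, support_mask) if keep]


# Invented stand-ins for the expanded names and for feature_selection.get_support()
names = ["1", "x0", "x1", "x0^2", "x0 x1", "x1^2"]
mask = [False, True, False, False, True, True]
print(names_after_selection(names, mask))

# With the objects from the question this would look roughly like (untested on that data):
# names = poly.get_feature_names_out(cat_features)
# names = interaction.get_feature_names_out(names)
# selected = names_after_selection(list(names), list(feature_selection.get_support()))
```

Passing cat_features to the first call assumes the encoder outputs exactly one column per categorical feature.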
My problem is time series anomaly detection, and I use the Facebook Prophet library. I have a function called "fit_predict_model" and 90 different dataframes that I keep in a dictionary, so I have 90 different models. Training them takes a long time, so I wanted to use multiprocessing to train faster, but I am getting a memory error. How can I solve this problem?
def fit_predict_model(dataframe, model_name, interval_width=0.95, changepoint_range=0.88):
    model = Prophet(yearly_seasonality=False, daily_seasonality=True,
                    seasonality_mode="multiplicative",
                    interval_width=interval_width,
                    changepoint_range=changepoint_range)
    model = model.fit(dataframe)
    # Assumption: forecast over the same dates the model was fitted on
    forecast = model.predict(dataframe)
    return forecast

pred = {}

def run(key):
    pred[key] = fit_predict_model(train[key], model_name=key)

pool = Pool(cpu_count())
pool.map(run, list(train.keys()))
pool.close()
pool.join()
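A sketch of one pattern that may help, not a guaranteed fix: each worker process gets its own copy of a module-level dict such as pred, so results written there never reach the parent process. Returning a value from the worker and collecting what pool.map returns avoids that. Memory pressure can sometimes be reduced by starting fewer workers than cpu_count() and by passing maxtasksperchild=1, so that each worker is replaced after one task. Here fit_one is a stand-in for fit_predict_model, and the data is a toy example.

```python
from multiprocessing import Pool


def fit_one(item):
    """Stand-in for fit_predict_model: takes (key, data) and returns (key, result)."""
    key, data = item
    result = sum(data) / len(data)  # placeholder for fitting a model and forecasting
    return key, result


def run_all(train, processes=2):
    """Fit every entry of `train` in a small pool and return {key: result}."""
    # maxtasksperchild=1 replaces each worker after a single task, releasing its memory.
    with Pool(processes=processes, maxtasksperchild=1) as pool:
        results = pool.map(fit_one, list(train.items()), chunksize=1)
    return dict(results)


if __name__ == "__main__":
    train = {"a": [1, 2, 3], "b": [10, 20]}  # toy data, not real dataframes
    pred = run_all(train)
    print(pred)
```

The number of processes is the main knob here: with 90 models, each worker holds a fitted model and its data at the same time, so fewer workers means a lower peak memory at the cost of a longer run.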
I am working on a binary classification problem and I am trying to balance the training set, because the target class is imbalanced. I am using PySpark to build the model.
Below is the code I am currently using to balance the data:
train_initial, test = new_data.randomSplit([0.7, 0.3], seed = 2018)
train_initial.groupby('label').count().toPandas()
label count
0 0.0 712980
1 1.0 2926
train_new = train_initial.sampleBy('label', fractions={0: 2926./712980, 1: 1.0}).cache()
The code above performs undersampling, but I think this might lead to loss of information. However, I am not sure how to perform upsampling. I also tried to use the sample function as below:
train_up = train_initial.sample(True, 10.0, seed = 2018)
Although it increases the count of 1s in my data set, it also increases the count of 0s and gives the result below.
label count
0 0.0 7128722
1 1.0 29024
Can someone please help me achieve upsampling in PySpark?
Thanks a lot in advance!
The problem is that you are oversampling the whole data frame. You should split the data into the two classes first and oversample only the minority class. In pandas this looks like:
df_class_0 = df_train[df_train['label'] == 0]
df_class_1 = df_train[df_train['label'] == 1]
count_class_0 = len(df_class_0)
df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)
The example comes from: https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Please note that there are better ways to perform oversampling (e.g. SMOTE).
For anyone trying to do random oversampling on an imbalanced dataset in PySpark, the following code will get you started (in this snippet 0 is the majority class and 1 is the class to be oversampled):
df_a = df.filter(df['label'] == 0)
df_b = df.filter(df['label'] == 1)
a_count = df_a.count()
b_count = df_b.count()
ratio = a_count / b_count
df_b_oversampled = df_b.sample(withReplacement=True, fraction=ratio, seed=1)
df = df_a.union(df_b_oversampled)
I might be quite late to the rescue here. But this is what I would recommend:
Step 1. Sample only for label = 1
train_1= train_initial.where(col('label')==1).sample(True, 10.0, seed = 2018)
Step 2. Merge this data with label = 0 data
train_0=train_initial.where(col('label')==0)
train_final = train_0.union(train_1)
PS: please import the col with
from pyspark.sql.functions import col
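As a small worked example of the arithmetic behind the sampling fraction: with the counts from the question, a fraction of 10.0 only multiplies the minority class by roughly 10, while roughly balancing the classes needs the majority/minority ratio. The helper name oversample_fraction is made up for illustration.

```python
def oversample_fraction(majority_count, minority_count):
    """Fraction to pass to a with-replacement sample so the minority class roughly matches the majority."""
    if minority_count <= 0:
        raise ValueError("minority_count must be positive")
    return majority_count / minority_count


# Counts taken from the groupby output in the question
fraction = oversample_fraction(712980, 2926)
print(round(fraction, 2))

# In PySpark the value would then be used along these lines (not run here):
# train_1 = train_initial.where(col('label') == 1).sample(True, fraction, seed=2018)
```

Sampling with replacement is random, so the resulting count is only approximately equal to the majority count.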
I have an acceptable model, but I would like to improve it by adjusting its parameters in Spark ML Pipeline with CrossValidator and ParamGridBuilder.
As an Estimator I will place the existing pipeline.
I do not know what to put in ParamMaps; I do not understand that argument.
As Evaluator I will use the RegressionEvaluator already created previously.
I'm going to do it for 5 folds, with a list of 10 different depth values in the tree.
How can I select and show the best model for the lowest RMSE?
ACTUAL example:
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
dt = DecisionTreeRegressor()
dt.setPredictionCol("Predicted_PE")
dt.setMaxBins(100)
dt.setFeaturesCol("features")
dt.setLabelCol("PE")
dt.setMaxDepth(8)
pipeline = Pipeline(stages=[vectorizer, dt])
model = pipeline.fit(trainingSetDF)
regEval = RegressionEvaluator(predictionCol = "Predicted_PE", labelCol = "PE", metricName = "rmse")
rmse = regEval.evaluate(predictions)
print("Root Mean Squared Error: %.2f" % rmse)
(1) Spark Jobs
(2) Root Mean Squared Error: 3.60
NEED:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
dt2 = DecisionTreeRegressor()
dt2.setPredictionCol("Predicted_PE")
dt2.setMaxBins(100)
dt2.setFeaturesCol("features")
dt2.setLabelCol("PE")
dt2.setMaxDepth(10)
pipeline2 = Pipeline(stages=[vectorizer, dt2])
model2 = pipeline2.fit(trainingSetDF)
regEval2 = RegressionEvaluator(predictionCol = "Predicted_PE", labelCol = "PE", metricName = "rmse")
paramGrid = ParamGridBuilder().build() # ??????
crossval = CrossValidator(estimator = pipeline2, estimatorParamMaps = paramGrid, evaluator=regEval2, numFolds = 5) # ?????
rmse2 = regEval2.evaluate(predictions)
#bestPipeline = ????
#bestLRModel = ????
#bestParams = ????
print("Root Mean Squared Error: %.2f" % rmse2)
(1) Spark Jobs
(2) Root Mean Squared Error: 3.60 # the same ¿?
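You need to call .fit() with your training data on the crossval object to create the cross-validated model; that call is what actually runs the cross-validation. For it to search over anything, the grid must also be non-empty, for example ParamGridBuilder().addGrid(dt2.maxDepth, [...]).build() with your 10 depth values. Then you get the best model (according to your evaluator metric) from the result. For example:
cvModel = crossval.fit(trainingSetDF)
myBestModel = cvModel.bestModel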
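To see which depth won and with what score, the fitted CrossValidatorModel exposes avgMetrics, one averaged metric per entry of the parameter grid, in the same order as the grid. For RMSE, lower is better. The sketch below shows that selection in plain Python with invented numbers; best_param_map is a hypothetical helper, not part of Spark.

```python
def best_param_map(param_maps, avg_metrics, larger_is_better=False):
    """Return (param_map, metric) for the best average metric; for RMSE smaller is better."""
    if not param_maps or len(param_maps) != len(avg_metrics):
        raise ValueError("param_maps and avg_metrics must be non-empty and the same length")
    pick = max if larger_is_better else min
    best_index = pick(range(len(avg_metrics)), key=lambda i: avg_metrics[i])
    return param_maps[best_index], avg_metrics[best_index]


# Invented values standing in for the grid and for cvModel.avgMetrics
param_maps = [{"maxDepth": d} for d in (2, 4, 6, 8, 10)]
avg_rmse = [5.1, 4.2, 3.7, 3.6, 3.9]
best_params, best_rmse = best_param_map(param_maps, avg_rmse)
print(best_params, best_rmse)

# With the Spark objects this would be roughly (not run here):
# best_params, best_rmse = best_param_map(crossval.getEstimatorParamMaps(), cvModel.avgMetrics)
```

Note that the RMSE printed in the question comes from the old predictions; to evaluate the tuned model, the predictions would need to be regenerated with the best model first.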
General question on the speed of (rCharts) highcharts rendering.
Given the following code
rm(list = ls())
require(rCharts)
set.seed(2)
time_stamp<-seq(from=as.POSIXct("2014-05-20 01:00",tz=""),to=as.POSIXct("2014-05-22 20:00",tz=""),by="1 min")
Data1<-abs(rnorm(length(time_stamp))*50)
Data2<-rnorm(length(time_stamp))
time<-as.numeric(time_stamp)*1000
CombData=data.frame(time,Data1,Data2)
CombData$Data1=round(CombData$Data1,2);CombData$Data2=round(CombData$Data2,2);
HCGraph <- Highcharts$new()
HCGraph$yAxis(list(list(title = list(text = 'Data1')),
list(title = list(text = 'Data2'),
opposite =TRUE)))
HCGraph$series(data = toJSONArray2(CombData[,c('time','Data1')], json = F, names = F),enableMouseTracking=FALSE,shadow=FALSE,name = "Data1",type = "line")
HCGraph$series(data = toJSONArray2(CombData[,c('time','Data2')], json = F, names = F),enableMouseTracking=FALSE,shadow=FALSE,name = "Data2",type = "line",yAxis=1)
HCGraph$xAxis(type = "datetime"); HCGraph$chart(zoomType = "x")
HCGraph$plotOptions(column=list(animation=FALSE),shadow=FALSE,line=list(marker=list(enabled=FALSE)));
HCGraph
This produces a Highcharts graph of 2 series, each 4021 points long, and it renders immediately.
However, if I increase the timespan to, say, 10 days (8341 points), the resulting plot can take several minutes to generate.
I'm aware that there are several modifications that can be made to Highcharts itself for better performance (see "Highcharts Performance Enhancement Method?").
However, are there any changes I can make from an R / rCharts perspective to speed up rendering?
Cheers