Does an ML pipeline really accelerate the training process? - machine-learning

The following is the code for an ML pipeline that I created. When I performed the same task without a pipeline, the training time was shorter than with the pipeline. Why? I assumed that using an ML pipeline would accelerate the training process.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import (
    Imputer, StandardScaler, StringIndexer, VarianceThresholdSelector, VectorAssembler,
)
spark = SparkSession.builder.config("spark.driver.memory", "15g").appName("SDN Data").getOrCreate()
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("D:/PHD Project/Paper_3/Datasets_Download/IP Network Traffic Flows Labeled with 75 Apps/Dataset-Unicauca-Version2-87Atts.csv")
)
# Every column of the CSV, including the label column L7Protocol
allFeatures = df.columns
df_1 = df.distinct()
trainDF, testDF = df_1.randomSplit([0.7, 0.3])
# Preprocessing stages: impute, assemble, drop zero-variance features, scale
imputer = Imputer(inputCols=allFeatures, outputCols=allFeatures)
vec_assembler = VectorAssembler(inputCols=allFeatures, outputCol="features", handleInvalid="skip")
selector = VarianceThresholdSelector(varianceThreshold=0.0, featuresCol="features", outputCol="selected_Features")
Scalerizer = StandardScaler().setInputCol("selected_Features").setOutputCol("Scaled_features")
indexer = StringIndexer(inputCol="L7Protocol", outputCol="label")
rf = RandomForestClassifier(labelCol="label", featuresCol="Scaled_features", numTrees=200, maxDepth=8, maxBins=32)
MyStages = [imputer, vec_assembler, selector, Scalerizer, indexer, rf]
pipeline = Pipeline(stages=MyStages)
pModel = pipeline.fit(trainDF)
#Wall time: 28min 47s
#Wall time: 27min 47s, 18/8/2022, 13:25
rf_predictions = pModel.transform(testDF)
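For context on what a pipeline does when it is fitted: as far as the Spark documentation describes it, Pipeline.fit runs the stages in order, fitting each estimator and passing the transformed data to the next stage. The toy sketch below imitates that behaviour in plain Python, without Spark. The classes MeanCenter and Scale and the helper fit_pipeline are hypothetical names made up for this illustration. It only shows that the pipelined and the manual version perform the same work in the same order; it does not by itself explain the timings measured above.

```python
class MeanCenter:
    """Toy estimator: learns the mean in fit(), subtracts it in transform()."""

    def fit(self, data):
        self.mean_ = sum(data) / len(data)
        return self

    def transform(self, data):
        return [x - self.mean_ for x in data]


class Scale:
    """Toy estimator: learns the largest absolute value, divides by it."""

    def fit(self, data):
        # Fall back to 1.0 so an all-zero column does not divide by zero
        self.max_abs_ = max(abs(x) for x in data) or 1.0
        return self

    def transform(self, data):
        return [x / self.max_abs_ for x in data]


def fit_pipeline(stages, data):
    """Fit each stage in order, feeding its output to the next stage."""
    for stage in stages:
        data = stage.fit(data).transform(data)
    return stages, data


raw = [1.0, 2.0, 3.0]

# Pipelined version
_, via_pipeline = fit_pipeline([MeanCenter(), Scale()], raw)

# The same steps written out by hand
centered = MeanCenter().fit(raw).transform(raw)
manual = Scale().fit(centered).transform(centered)

print(via_pipeline)  # [-1.0, 0.0, 1.0]
print(manual)        # [-1.0, 0.0, 1.0]
```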

Related

How can I get the feature names after several fit_transform's from sklearn?

I'm running a machine learning model that requires multiple transformations. I applied polynomial transformations, interactions, and also a feature selection using SelectKBest:
transformer = ColumnTransformer(
transformers=[("cat", ce.cat_boost.CatBoostEncoder(y_train), cat_features),]
)
X_train_transformed = transformer.fit_transform(X_train, y_train)
X_test_transformed = transformer.transform(X_test)
poly = PolynomialFeatures(2)
X_train_polynomial = poly.fit_transform(X_train_transformed)
X_test_polynomial = poly.transform(X_test_transformed)
interaction = PolynomialFeatures(2, interaction_only=True)
X_train_interaction = interaction.fit_transform(X_train_polynomial)
X_test_interaction = interaction.transform(X_test_polynomial)
feature_selection = SelectKBest(chi2, k=55)
train_features = feature_selection.fit_transform(X_train_interaction, y_train)
test_features = feature_selection.transform(X_test_interaction)
model = lgb.LGBMClassifier()
model.fit(train_features, y_train)
However, I want to get the feature names of the final selected columns, and I have no idea how to get them.
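One way to think about the problem is as bookkeeping: carry a list of names through each step, expanding it where a step creates new columns and filtering it where a step drops columns. In scikit-learn 1.0 and later, transformers such as PolynomialFeatures expose get_feature_names_out(input_features), and selectors such as SelectKBest expose get_support(), which returns the boolean mask of kept columns. The sketch below is library-free and only illustrates that bookkeeping. The helpers expand_degree2 and apply_mask, the column names, and the mask are all made up for the example, and the naming style merely imitates scikit-learn's.

```python
from itertools import combinations_with_replacement


def expand_degree2(names):
    """Names for a bias term, the original features, and all degree-2 products."""
    out = ["1"] + list(names)
    for a, b in combinations_with_replacement(names, 2):
        out.append(f"{a}^2" if a == b else f"{a} {b}")
    return out


def apply_mask(names, mask):
    """Keep only the names whose mask entry is True (what a selector would keep)."""
    return [name for name, keep in zip(names, mask) if keep]


# Hypothetical names of the columns entering the polynomial step
names = ["cat_a", "cat_b"]

poly_names = expand_degree2(names)
print(poly_names)
# ['1', 'cat_a', 'cat_b', 'cat_a^2', 'cat_a cat_b', 'cat_b^2']

# Hypothetical mask, standing in for what a fitted selector reports
mask = [False, True, False, False, True, True]
selected = apply_mask(poly_names, mask)
print(selected)
# ['cat_a', 'cat_a cat_b', 'cat_b^2']
```

With the real objects, the same chain would pass the output names of one step as the input names of the next, and index the final name list with the selector's mask.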

Do we need to save and load pipeline and model separately in Pyspark ML?

I am doing some data engineering and building some columns. Below I build the pipeline for our Spark ML model:
stages = []
for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(
        inputCol=categoricalCol, outputCol=categoricalCol + "Index"
    )
    encoder = OneHotEncoderEstimator(
        inputCols=[stringIndexer.getOutputCol()],
        outputCols=[categoricalCol + "classVec"],
    )
    stages += [stringIndexer, encoder]
label_stringIdx = StringIndexer(inputCol="BSConfirmBuy", outputCol="label")
stages += [label_stringIdx]
numericCols = new_col_array
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(new_df1)
new_df1 = pipelineModel.transform(new_df1)
##Model
gbt = GBTClassifier(maxIter=10)
gbtModel = gbt.fit(train)
predictions = gbtModel.transform(test)
In the steps above I was able to fit the pipeline and predict using the gradient-boosted tree classifier.
Now I am saving both:
#Save pipeline
pipelineModel.write().overwrite().save("s3://data-production/pipelineModel_v1")
#Save Model
gbtModel.save("s3://data-production/first_trade.model_v0")
Now, for production/future datasets, do I need to load both the pipeline and the model?
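from pyspark.ml import PipelineModel
from pyspark.ml.classification import GBTClassificationModel
pipelineModel = PipelineModel.load("s3://data-production/pipelineModel_v1")
new_test = pipelineModel.transform(new_df1)
# A fitted GBT model is loaded through GBTClassificationModel, not GBTClassifier
model = GBTClassificationModel.load("s3://data-production/first_trade.model_v0")
pred = model.transform(new_test)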
Here new_test is the future/production dataset.

How do I use TimeSeriesSplit in xgb.cv?

I have been trying to run XGBoost for time series analysis. This is my code; the undefined variables are set elsewhere:
xgb1 = xgb.XGBRegressor(
    learning_rate=0.1, n_estimators=n_estimators, max_depth=max_depth,
    min_child_weight=min_child_weight, gamma=0, subsample=0.8, colsample_bytree=0.8,
    reg_alpha=reg_alpha, objective='reg:squarederror', nthread=4, scale_pos_weight=1, seed=27,
)
xgb_param = xgb1.get_xgb_params()
dmatrix = xgb.DMatrix(data=X_train, label=y_train)
cv_folds = 5
early_stopping_rounds = 50
cvresults = xgb.cv(
    dtrain=dmatrix, params=xgb_param, num_boost_round=xgb1.get_params()['n_estimators'],
    nfold=cv_folds, metrics='rmse', early_stopping_rounds=early_stopping_rounds,
)
The obvious issue here is that I want to cross-validate time series data, so I can't use ordinary k-fold splitting with cv_folds = 5.
(How) can I use the TimeSeriesSplit function within xgb.cv?
Thanks.
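According to the xgboost documentation, xgb.cv has a folds parameter that accepts, among other things, a list of (train indices, test indices) tuples, one per fold. scikit-learn's TimeSeriesSplit(n_splits=...).split(X) yields pairs of that kind, so passing list(TimeSeriesSplit(n_splits=5).split(X_train)) as folds is one option worth trying. The sketch below is a library-free illustration of what such expanding-window folds look like. The helper name expanding_window_folds is made up for this example, and it assumes the rows are already sorted by time.

```python
def expanding_window_folds(n_samples, n_splits):
    """Return (train_idx, test_idx) pairs where each test block follows its training block in time."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    folds = []
    for i in range(n_splits):
        # The test block slides forward; the training block is everything before it
        test_start = n_samples - (n_splits - i) * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        folds.append((train_idx, test_idx))
    return folds


folds = expanding_window_folds(n_samples=12, n_splits=3)
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
# [0, 1, 2] [3, 4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7, 8]
# [0, 1, 2, 3, 4, 5, 6, 7, 8] [9, 10, 11]
```

A list like this would then be passed as folds=folds in the xgb.cv call instead of relying on nfold, so that no fold trains on data from after its test block.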

Error on tuning parameters using classif.svm in mlr3

I'm using mlr3 to build a machine learning workflow with an SVM classifier. When I try to tune the cost parameter with the following code, I get an error:
library(mlr3)
library(mlr3learners)
library(paradox)
library(mlr3tuning)
task = tsk("pima")
learner = lrn("classif.svm")
learner$param_set
tune_ps = ParamSet$new(list(
ParamDbl$new("cost", lower = 0.001, upper = 0.1)
))
tune_ps
hout = rsmp("holdout")
measure = msr("classif.ce")
evals20 = term("evals", n_evals = 20)
instance = TuningInstance$new(
task = task,
learner = learner,
resampling = hout,
measures = measure,
param_set = tune_ps,
terminator = evals20
)
tuner = tnr("grid_search", resolution = 10)
result<-tuner$tune(instance)
It outputs the error
Error in (function (xs) :
Assertion on 'xs' failed: Condition for 'cost' not ok: type equal C-classification; instead: type=
I can't figure out what is happening there.
We decided to address this with a more descriptive error message, while still requiring that parameters with dependencies be set explicitly in the ParamSet rather than falling back to ParamSet defaults. Here, as the error message says, cost depends on type being C-classification, so type has to be set explicitly on the learner.
See https://github.com/mlr-org/paradox/pull/262 and related issues/PRs for more information.

Find the best pipeline model using CrossValidator and ParamGridBuilder

I have an acceptable model, but I would like to improve it by adjusting its parameters in Spark ML Pipeline with CrossValidator and ParamGridBuilder.
As the Estimator I will use the existing pipeline.
I do not understand what to put in the ParamMaps.
As Evaluator I will use the RegressionEvaluator already created previously.
I'm going to do it for 5 folds, with a list of 10 different depth values in the tree.
How can I select and show the best model for the lowest RMSE?
ACTUAL example:
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
dt = DecisionTreeRegressor()
dt.setPredictionCol("Predicted_PE")
dt.setMaxBins(100)
dt.setFeaturesCol("features")
dt.setLabelCol("PE")
dt.setMaxDepth(8)
pipeline = Pipeline(stages=[vectorizer, dt])
model = pipeline.fit(trainingSetDF)
regEval = RegressionEvaluator(predictionCol = "Predicted_XX", labelCol = "XX", metricName = "rmse")
rmse = regEval.evaluate(predictions)
print("Root Mean Squared Error: %.2f" % rmse)
(1) Spark Jobs
(2) Root Mean Squared Error: 3.60
NEED:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
dt2 = DecisionTreeRegressor()
dt2.setPredictionCol("Predicted_PE")
dt2.setMaxBins(100)
dt2.setFeaturesCol("features")
dt2.setLabelCol("PE")
dt2.setMaxDepth(10)
pipeline2 = Pipeline(stages=[vectorizer, dt2])
model2 = pipeline2.fit(trainingSetDF)
regEval2 = RegressionEvaluator(predictionCol = "Predicted_PE", labelCol = "PE", metricName = "rmse")
paramGrid = ParamGridBuilder().build() # ??????
crossval = CrossValidator(estimator = pipeline2, estimatorParamMaps = paramGrid, evaluator=regEval2, numFolds = 5) # ?????
rmse2 = regEval2.evaluate(predictions)
#bestPipeline = ????
#bestLRModel = ????
#bestParams = ????
print("Root Mean Squared Error: %.2f" % rmse2)
(1) Spark Jobs
(2) Root Mean Squared Error: 3.60 # the same ¿?
You need to call .fit() on the crossval object with your training data to create the CV model; that call is what performs the cross-validation. You can then get the best model (according to your evaluator metric) from it, e.g.:
cvModel = crossval.fit(trainingData)
myBestModel = cvModel.bestModel
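On the ParamMaps part of the question: in PySpark the grid is normally built with ParamGridBuilder().addGrid(dt2.maxDepth, [...]).build(), which yields one param map per combination of values. After fitting, cvModel.avgMetrics holds the average metric for each param map, in the same order, and with the rmse metric the lowest value wins. The sketch below reproduces only that logic in plain Python, without Spark. The helpers build_param_grid and pick_best are hypothetical, and the per-fold RMSE numbers are made up purely for illustration.

```python
from itertools import product
from statistics import mean


def build_param_grid(grid):
    """Expand {'name': [values]} into a list of {name: value} dicts (cartesian product)."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def pick_best(param_maps, fold_rmse):
    """Return (params, avg_rmse) for the param map with the lowest mean RMSE across folds."""
    avg = [mean(scores) for scores in fold_rmse]
    best = min(range(len(param_maps)), key=avg.__getitem__)
    return param_maps[best], avg[best]


param_maps = build_param_grid({"maxDepth": [2, 4, 8]})
print(param_maps)
# [{'maxDepth': 2}, {'maxDepth': 4}, {'maxDepth': 8}]

# Made-up RMSE per fold for each param map (two folds here, for brevity)
fold_rmse = [[5.0, 5.5], [4.0, 4.5], [4.5, 4.75]]

best_params, best_rmse = pick_best(param_maps, fold_rmse)
print(best_params, best_rmse)
# {'maxDepth': 4} 4.25
```

For the setup described in the question, the grid would list the 10 depth values to try, and numFolds = 5 would give five scores per param map instead of two.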
