How to merge results from functions executed using foreach in PySpark?

I am fairly new to PySpark and am trying to merge the output values of a function executed through foreach.
Here is the pseudo-code:
files_rdd = sc.parallelize(files)
files_rdd.foreach(lambda x: training_cart(x, min_leaf=10, pruning=True))
# get each CART model and dump it as a pickle
# build a list/dictionary holding each trained CART
where,
def training_cart(file, min_leaf=10, pruning=True):
    # read the file
    # model = train a classification tree
    return model
The idea is then to take each CART model and insert it into a list/dictionary to be used after the training phase, as well as to dump them as independent pickle files to be used later on. Can anyone give me a hand with how to do it?
Thanks!
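One thing to note: foreach runs purely for its side effects on the executors and returns nothing to the driver, so to gather the trained models you would typically use map followed by collect instead. Below is a minimal sketch under that assumption; training_cart returns a placeholder dict standing in for a real CART model, and the Spark call is shown in a comment with a plain-Python local stand-in so the shape of the result is clear:

```python
import pickle

def training_cart(path, min_leaf=10, pruning=True):
    # placeholder: read the file at `path` and train a CART model here;
    # a dict stands in for the trained model object in this sketch
    return {"file": path, "min_leaf": min_leaf, "pruning": pruning}

files = ["a.csv", "b.csv"]

# With Spark, foreach returns nothing to the driver, so use map + collect:
#   models = sc.parallelize(files).map(training_cart).collect()
models = [training_cart(f) for f in files]  # local stand-in for map/collect

# keep the models in a dict keyed by file, and dump each as its own pickle
models_by_file = dict(zip(files, models))
pickles = {f: pickle.dumps(m) for f, m in models_by_file.items()}
```

After collect, the models live on the driver, so writing each one out with pickle.dump to a separate file is straightforward.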

Copy variable between SPSS Datasets using Syntax

EDITED based on feedback...
Is there a way to copy a variable from one open dataset to another in SPSS? What I have tried is to create a scratch variable that captures the value of the variable, and then use that scratch variable in a COMPUTE command in the next dataset:
DATASET ACTIVATE DataSet1.
COMPUTE #IDScratch = ID.
DATASET ACTIVATE DataSet2.
COMPUTE ID = #IDScratch.
This fails because activating Dataset2 causes the scratch variable to be dropped from memory.
MATCH FILES and/or STAR JOIN syntax will work for most scenarios, but in my case, because Dataset1 has many more records than Dataset2 and there are no matching keys in the two datasets, this yields extra records.
My original question was "Is there a simple, direct way of copying a variable between datasets?", and the answer still appears to be that merging the files is the best/only method if using syntax.
Since SPSS version 21.0, the STAR JOIN command (see the documentation here) allows you to use SQL syntax to join datasets. So, basically, you can pull only the variables you want from each dataset.
Assume your first dataset is called data_1 and has id and var_1a. Your second dataset is called data_2 and has the same id plus var_2a, and you just want to pull var_2a into the first dataset. If both datasets are open, you can run:
DATASET ACTIVATE data_1.
STAR JOIN
  /SELECT t0.var_1a, t1.var_2a
  /FROM * AS t0
  /JOIN 'data_2' AS t1
    ON t0.id = t1.id
  /OUTFILE FILE=*.
The link I provided above has plenty of examples of how to join variables from files that are saved on your computer.

Compare the performance of the models and add annotations to the results

I'm preparing for the Azure Machine Learning exam and have a question, shown below:
You are working on an Azure Machine Learning Experiment.
You have the dataset configured as shown in the following table:
You need to ensure that you can compare the performance of the models and add annotations to the results.
A. You consolidate the output of the Score Model modules by using the Add Rows module and then use the Execute R Script module.
B. You connect the Score Model modules from each trained model as inputs for the Evaluate Model module and then use the Execute R Script Module.
C. You save the output of the Score Model modules as a combined set, and then use the Project Columns modules to select the MAE.
D. You connect the Score Model modules from each trained model as inputs for the Evaluate Model module and then save the results as a dataset.
I think all of the above are correct, but what confuses me is that there are different answers on the internet. Some agree with mine, but others do not. I need someone to confirm my answer or explain the correct one to me.

Ignite: how to update a trained decision tree model with new data points

In the following way I try to update a pre-trained decision tree model with new data points, but I get a new model that seems to be built entirely on the new data points instead of a combined version of the trained model plus the new data points.
Is there anything I missed?
// set up the trainer
DecisionTreeClassificationTrainer trainer =
    new DecisionTreeClassificationTrainer(maxDepth, minImpurity);

DatasetBuilder<Integer, double[]> datasetBuilder =
    new CacheBasedDatasetBuilder<>(ignite, dataCache);

// update the previously trained model with the new data
Model mdl = trainer.updateModel(
    (DecisionTreeNode) prevMdl,
    datasetBuilder,
    featureExtractor,
    labelExtractor
);
return mdl;
For now, the ML module doesn't support updates for decision trees. The problem is the tree structure: we haven't come up with a good approach for deleting branches during a model update.
Model updates work well for other, non-tree-based algorithms.

Update a trained model in ML.NET

This example shows how to use matrix factorization to build a recommendation system. It is particularly suitable for a dataset with only two related IDs, such as a user ID and the ID of a product that the corresponding user has purchased.
Based on this example, I prepared an input data like below.
[UserId] [ProductId]
3        1
3        15
3        23
5        9
5        1
8        2
8        1
...
Then I changed the column names and created a TextLoader:
var reader = ctx.Data.TextReader(new TextLoader.Arguments()
{
    Separator = "tab",
    HasHeader = true,
    Column = new[]
    {
        new TextLoader.Column("Label", DataKind.R4, 0),
        new TextLoader.Column("UserId", DataKind.U4, new [] { new TextLoader.Range(0) }, new KeyRange(0, 100000)),
        new TextLoader.Column("ProductId", DataKind.U4, new [] { new TextLoader.Range(1) }, new KeyRange(0, 300))
    }
});
It works great: it recommends a list of products that the target user may purchase, with individual scores. However, it doesn't work with new customer data that didn't exist in the initial input data; for, say, UserId 1, it gives a score of NaN as the result of the prediction.
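The NaN score is the classic cold-start problem: matrix factorization learns one latent vector per user ID and per product ID seen during training, so an unseen ID simply has no vector to score with. A minimal language-agnostic sketch of why this happens (plain Python, with hypothetical hand-picked factors rather than anything learned by ML.NET):

```python
import math

# hypothetical latent factors, one vector per id seen during training
user_factors = {3: [0.9, 0.1], 5: [0.2, 0.8], 8: [0.5, 0.5]}
item_factors = {1: [0.7, 0.3], 15: [0.1, 0.9]}

def predict(user_id, product_id):
    u = user_factors.get(user_id)
    v = item_factors.get(product_id)
    if u is None or v is None:
        # an unseen id has no learned vector, so there is nothing to score with
        return float("nan")
    # the predicted score is the dot product of the two latent vectors
    return sum(a * b for a, b in zip(u, v))
```

Here predict(3, 1) yields a real score, while predict(1, 1) is NaN because UserId 1 never appeared in the training data, which mirrors the behavior described above.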
Retraining the model could be an obvious answer, but it seems futile to retrain the model every time new data comes in. I think there's definitely a way to update the existing model, but I cannot find the relevant documentation, APIs, or a sample anywhere. I ended up leaving a question on the official GitHub repo of ML.NET, but I've got no answers so far.
Question would be very simple, in a nutshell, how can I update a trained model in ML.NET? Linking a relevant source of information would be greatly appreciated too.
In this particular example, because of the task being performed, you are limited to the scope of observations the model was trained on and can make predictions only on that set. As you mentioned, a good way to go about it would be to re-train. I haven't tried this myself, but you might want to try one of the following:
Run the Fit function again, using the new data you want to train with as your input. Not only should the model persist its previous training, but it should also re-train using the additional data you have provided.
Save the model to a file, load the persisted model, then run the Fit function as above.
As of 2021, the re-training process is described in detail here: https://learn.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/retrain-model-ml-net

Evaluation error while trying to build model in DSX when dataset has a feature column with unique values

I'm getting an evaluation error while building a binary classification model in IBM Data Science Experience (DSX) using IBM Watson Machine Learning when one of the feature columns has unique categorical values.
The dataset I'm using looks like this:
Customer,Cust_No,Alerts,Churn
Ford,1000,8,0
GM,2000,50,1
Chrysler,3000,10,0
Tesla,4000,48,1
Toyota,5000,15,0
Honda,6000,55,1
Subaru,7000,12,0
BMW,8000,52,1
MBZ,9000,13,0
Porsche,10000,54,1
Ferrari,11000,9,0
Nissan,12000,49,1
Lexus,13000,10,0
Kia,14000,50,1
Saab,15000,12,0
Faraday,16000,47,1
Acura,17000,13,0
Infinity,18000,53,1
Eco,19000,16,0
Mazda,20000,52,1
In DSX, upload the above CSV data, then create a model using the automatic model builder. Select Churn as the label column and Customer and Alerts as the feature columns. Select the Binary Classification model and use the default settings for the training/test split. Train the model. The model building fails with an evaluation error. If instead we select Cust_No and Alerts as the feature columns, the model is created successfully. Why is that?
When a model is built in DSX, the data is split into training, test, and holdout sets. These datasets are disjoint.
If the Customer field is chosen, which is a string field, it must be converted to numeric values to be meaningful to ML algorithms (linear regression / logistic regression / decision trees, etc.).
How this is done:
The algorithm iterates over each value of the Customer field and creates a dictionary mapping each string value to a numeric value (see Spark's StringIndexer - https://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer).
When the model is evaluated or scored, the string fields from the test subset are converted to numeric values based on the dictionary built at training time. If a value is not found, there are two options: skip the entire record or throw an error (DSX chooses the first).
Given that all values of the Customer field are unique, none of the records from the test dataset survives to the evaluation phase, hence the error that the model cannot be evaluated.
In the case of Cust_No, the field is already numeric and does not require a category encoding operation. Even if the values seen at evaluation were not seen in training, they are used as-is.
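The mechanism described above can be illustrated with a small plain-Python stand-in for what StringIndexer does; the customer names below are sample values from the question's dataset, and the skip-vs-error behavior is modeled with a hypothetical handle_invalid flag:

```python
# build the string -> index dictionary on the training split
# (this is what fitting a StringIndexer produces)
train_values = ["Ford", "GM", "Chrysler"]
mapping = {v: i for i, v in enumerate(sorted(set(train_values)))}

def index(value, handle_invalid="skip"):
    if value in mapping:
        return mapping[value]
    if handle_invalid == "skip":
        return None  # the whole record is dropped
    raise ValueError("Unseen label: " + value)

# every Customer in the test split is unseen, so every record is skipped
test_values = ["Tesla", "Toyota"]
surviving = [i for i in (index(v) for v in test_values) if i is not None]
# surviving is empty: nothing reaches the evaluation phase
```

With all-unique customer names, the test split shares no values with the training dictionary, so the surviving list is empty, which is exactly the condition that makes evaluation fail.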
Taking a step back, it seems to me that your data doesn't really contain predictive information other than in Alerts.
The Customer and Cust_No fields are basically ID columns and do not seem to contain predictive information.
Can you post a screenshot of your evaluation error? I can try to help; I work on DSX.
