Accelerated Failure Time model in PySpark to model recurring events - machine-learning

I am trying to predict the probability that a customer will reorder an order cart, based on their order history, using the Accelerated Failure Time (AFT) model in PySpark.
My input data contains:
- various features of the customer and the respective order cart as predictors,
- days between two consecutive orders as the label, and
- previously observed orders marked as uncensored and future orders as censored.
PySpark is the choice here because of restrictions on the environment; I don't have another alternative for processing the huge volume of order history (~40 GB). Here's my sample implementation:
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors

training = spark.createDataFrame([
    (1, 1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (1, 2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (2, 3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (2, 0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (3, 4.199, 0.0, Vectors.dense(0.795, -0.226))],
    ["customer_id", "label", "censor", "features"])

aft = AFTSurvivalRegression()
model = aft.fit(training)
Questions:
Does the AFTSurvivalRegression method in pyspark.ml.regression have the ability to cluster the records in my dataset based on the customer id? If so, how can I implement that?
The desired output would contain the probabilities of a particular customer reordering different order carts. How can I obtain these values by extending my implementation?
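As a side note on the second question: Spark's AFTSurvivalRegression fits a Weibull AFT model, so a fitted model's intercept, coefficients and scale can be turned into a "probability of reordering within t days" outside Spark. The sketch below is illustrative only (the function name and all numbers are hypothetical, not from the question):

```python
import math

def reorder_probability(days, intercept, coefficients, features, scale):
    # Spark's AFT model: log(T) = intercept + coefficients . features + scale * eps,
    # which implies T ~ Weibull with scale lam and shape k = 1 / scale.
    lam = math.exp(intercept + sum(c * f for c, f in zip(coefficients, features)))
    k = 1.0 / scale
    # Survival function S(t) = exp(-(t / lam)^k); reorder probability = 1 - S(t)
    return 1.0 - math.exp(-((days / lam) ** k))
```

Note that aft.fit trains one global model over all rows; customer_id is not used for grouping unless you fit separate models per customer or encode it as a feature, so per-customer behaviour enters only through the feature vector.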

Related

Neo4j Link prediction ML Pipeline

I am working on a use case to predict relations between nodes of different types.
I have a graph something like this.
(:customer)-[:has]->(:session)
(:session)-[:contains]->(:order)
(:order)-[:has]->(:product)
(:order)-[:to]->(:relation)
There are many customers who have placed orders. Some orders specify whom the order was intended for (the relation), e.g. mother/father, and some do not. For those orders, my intention is to predict whom the order was likely intended for.
I have prepared a Link Prediction ML pipeline in Neo4j. The gds.beta.pipeline.linkPrediction.predict.mutate procedure has two prediction modes: exhaustive search and approximate search. The first predicts for all pairs of unconnected nodes, and the second applies kNN. I do not want either; rather, I want the model to predict links only between two specific node types, the 'order' node and the 'relation' node. How do I specify this in the predict procedure?
You can also frame this problem as node classification and get what you are looking for. Treat Relation as the target variable and it will become a multi class classification problem. Let's say that Relation is a categorical variable with a few types (Mother/Father/Sibling/Friend etc.) and the hypothesis is that based on the properties on the Customer and the Order nodes, we can predict which relation a certain order is intended to.
Some of the examples of the properties of Customer nodes are age, location, billing address etc., and the properties of the Order nodes are category, description, shipped address, billing address, location etc. Properties of Session nodes are probably not useful in predicting anything about the order or the relation that order is intended to.
For running any algorithm in Neo4j, we have to project a graph into memory. Some of the properties on Customer and Order nodes are strings and graph projection procs do not support projecting strings into memory. Hence, the strings have to be converted into numerical values.
For example, Customer age can be used as is but the order description has to be converted into a word/phrase embedding using some NLP methodology etc. Some creative feature engineering also helps - instead of encoding billing/shipping addresses, a simple flag to identify if they are the same or different makes it easier to differentiate if the customer is shipping the order to his/her own address or to somewhere else.
Since we are using Relation as a target variable, let's label-encode the relation type and add it as a class-label property on Order nodes where a relationship to a Relation node exists (labelled examples). For all other orders, add a class-label property of 0 (or any other number outside the label-encoded relation types).
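The label-encoding step described above can be sketched in plain Python (the relation type names are hypothetical; 0 is reserved for orders with no known relation, as suggested):

```python
# Map each relation type to a positive integer class label; 0 means "unknown".
relation_types = ["Mother", "Father", "Sibling", "Friend"]
encoding = {rel: i + 1 for i, rel in enumerate(relation_types)}

def class_label(relation):
    # Orders without a relationship to a Relation node get class label 0
    return encoding.get(relation, 0)
```

These integer labels would then be written back as a node property before projecting the graph into memory.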
Now, project a graph with Customer, Session and Order nodes along with the properties of interest into memory. Since we are not using Session nodes in our prediction task, we can collapse the path between Customer and Order nodes. One customer can connect to multiple orders via multiple session nodes and orders are unique. Collapse path procedure will not result in multiple relationships between a customer and an order node and hence, aggregation is not needed.
You can now use Node classification ML pipeline in Neo4j GDS library to generate embeddings and use embedding property on Order node as a feature vector and class label property as target and train a multi class classification model to predict the class that particular order belongs to or the likelihood that particular order is intended to some relation type.
This use case is not supported by the latest stable release of GDS (2.1.11, at the time of writing). In GDS pipelines, we assume a homogeneous graph, where the training algorithm will consider each node as the same type as any other node, and similarly for relationships.
However, we are currently building features to support heterogeneous use cases. In 2.2 we will add so-called context configuration, where you can direct your training algorithm to attempt to learn only a specific relationship type between specific source and target node labels, while still allowing the feature-producing node property steps to use the richer graph.
How effective this is depends on the node features you are using -- if you are using an embedding, be aware that embeddings are still homogeneous and will likely not be able to tell the various relationship types apart (except for GraphSAGE). Even if you do use them, you will only get predictions for the relevant label-type-label triple you specified for training. But I would recommend thinking carefully about which features to use and how to tune your models effectively.
You can already try out the 2.2 features by using our alpha releases -- find our latest alpha through this download link. Preview documentation is available here. Note that this is preview software and that the API may change a lot before the final 2.2.0 release.

Choosing Content based recommendation model for prediction

I am trying to build a content-based recommendation model, but I am stuck on which algorithm to choose. My features are user_id, user_age, gender, location, show_id, show_type (e.g. series, movies, etc.), show_duration, user_watched_duration, genre, and rating. My model has to predict the top 5 recommended shows for certain users.
The ratio of users to shows is huge: there are only around 10k users, but each user has a history of about 150 shows mapped to them on average, so the total is 10k x 150 = 1,500,000 records.
Now, I am confused about which algorithm to proceed with in this scenario. I read that a content-based method is ideal for my case. But when I checked SVD from the surprise library, it only takes 3 features as its input Dataset - "user_id", "item_id", "rating" - and fits the model on those. I have to consider other features, like user_watched_duration out of show_duration, and give preference if he/she has fully watched the show. Similarly, I want the model to consider gender and age when recommending. For example, if young men (male, under 20 years old) have watched a show and given it a high rating, then that show should be recommended to other users in the same young-men category.
Can I train a normal classification model like KNN for this? I thought of building a sparse matrix using csr_matrix with rows as user_id and columns as show_id, then computing (user_show_matrix.T * user_show_matrix) so that I can get counts of shows watched together. But the problem with this approach is that I cannot map the other features onto it, right?
So please suggest how to proceed. I already did data cleaning, label encoded categories etc. Will I be able to use any classification algorithms for this? Appreciate any references on similar approaches. Thank you!
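The co-occurrence counting described in the question (the off-diagonal entries of user_show_matrix.T * user_show_matrix) can be sketched in plain Python without a sparse matrix; the data here is made up for illustration:

```python
from collections import defaultdict
from itertools import combinations

# Each user's watch history (hypothetical data).
history = {
    "u1": ["show_a", "show_b", "show_c"],
    "u2": ["show_a", "show_b"],
}

# Count how often each pair of shows appears in the same user's history.
co_watch = defaultdict(int)
for shows in history.values():
    for a, b in combinations(sorted(set(shows)), 2):
        co_watch[(a, b)] += 1
```

This reproduces the counts the matrix product would give, but as the question notes, it only captures co-watching; side features like age or gender would need a different model (e.g. a feature-aware recommender) on top.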

Excluding columns from training data in BigQuery ML

I am training a model using BigQuery ML, my input has several fields, one of which is a customer number, this number is not useful as a prediction feature, but I do need it in the final output so that I can reference which users scored high vs. low. How can I exclude this column from the model training without removing it completely?
Reading the docs, the only ways I can see to exclude columns are adding them to input_label_cols, which this clearly is not, or data_split_col, which is not desirable.
You do not need to include fields in the model that are not meant to be part of it - not at all.
Rather, you include them during prediction.
For example, the model below has only 6 fields as input (carrier, origin, dest, dep_delay, taxi_out, distance):
#standardsql
CREATE OR REPLACE MODEL flights.ontime
OPTIONS
(model_type='logistic_reg', input_label_cols=['on_time']) AS
SELECT
IF(arr_delay < 15, 1, 0) AS on_time,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM `cloud-training-demos.flights.tzcorr`
WHERE arr_delay IS NOT NULL
In prediction, however, you can have all the extra fields available, like below (and you can put them in any position of the SELECT - but note that the predicted columns will go first):
#standardsql
SELECT * FROM ml.PREDICT(MODEL `cloud-training-demos.flights.ontime`, (
SELECT
UNIQUE_CARRIER, -- extra column
ORIGIN_AIRPORT_ID, -- extra column
IF(arr_delay < 15, 1, 0) AS on_time,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM `cloud-training-demos.flights.tzcorr`
WHERE arr_delay IS NOT NULL
LIMIT 5
))
Obviously, input_label_cols and data_split_col serve different purposes:
input_label_cols STRING The label column name(s) in the training data.
data_split_col STRING This option identifies the column used to split the data [into training and evaluation sets]. This column cannot be used as a feature or label, and will be excluded from features automatically.
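Outside BigQuery, the same pattern (drop the ID column before training, keep it aside to join back onto the predictions) can be sketched in plain Python; the column names and data below are made up:

```python
# Hypothetical scored rows: an ID column, two features, and a label.
rows = [
    {"customer_number": 101, "dep_delay": 5.0, "taxi_out": 12.0, "on_time": 1},
    {"customer_number": 102, "dep_delay": 30.0, "taxi_out": 25.0, "on_time": 0},
]
feature_cols = ["dep_delay", "taxi_out"]

X = [[r[c] for c in feature_cols] for r in rows]   # features the model sees
y = [r["on_time"] for r in rows]                   # label
ids = [r["customer_number"] for r in rows]         # kept only for reporting
```

The model is fit on X and y only; ids is re-attached to the predictions afterwards, which is exactly what carrying the extra columns through ml.PREDICT achieves in BigQuery ML.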

Evaluation error while trying to build model in DSX when dataset has a feature column with unique values

I'm getting an evaluation error while building a binary classification model in IBM Data Science Experience (DSX) using IBM Watson Machine Learning when one of the feature columns has unique categorical values.
The dataset I'm using looks like this:
Customer,Cust_No,Alerts,Churn
Ford,1000,8,0
GM,2000,50,1
Chrysler,3000,10,0
Tesla,4000,48,1
Toyota,5000,15,0
Honda,6000,55,1
Subaru,7000,12,0
BMW,8000,52,1
MBZ,9000,13,0
Porsche,10000,54,1
Ferrari,11000,9,0
Nissan,12000,49,1
Lexus,13000,10,0
Kia,14000,50,1
Saab,15000,12,0
Faraday,16000,47,1
Acura,17000,13,0
Infinity,18000,53,1
Eco,19000,16,0
Mazda,20000,52,1
In DSX, upload the above CSV data, then create a model using the automatic model builder. Select Churn as the label column and Customer and Alerts as feature columns. Select the Binary Classification model and use the default settings for the training/test split. Train the model. The model building fails with an evaluation error. If we instead select Cust_No and Alerts as feature columns, the model is created successfully. Why is that?
When a model is built in DSX, the data is split into training, test and holdout sets. These datasets are disjoint.
If the Customer field is chosen, which is a string field, it must be converted to numeric values to be meaningful to ML algorithms (linear regression / logistic regression / decision trees, etc.).
How this is done: the algorithm iterates over each value of the Customer field and builds a dictionary mapping each string value to a numeric value (see Spark StringIndexer - https://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer).
When the model is evaluated or scored, the string fields in the test subset are converted to numeric values using the dictionary built at training time. If a value is not found, there are two options: skip the entire record or throw an error (DSX chooses the first).
Since all values in the Customer field are unique, none of the records in the test dataset survive to the evaluation phase, hence the error that the model cannot be evaluated.
In the case of Cust_No, the field is already numeric and does not require a category-encoding operation. Even if values seen at evaluation were not seen during training, they are used as-is.
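The skip behaviour described above can be sketched in plain Python (values taken from the question's dataset; the "skip" option mirrors Spark StringIndexer's handleInvalid="skip" setting):

```python
# A disjoint train/test split of an all-unique string column.
train_values = ["Ford", "GM", "Chrysler"]
test_values = ["Tesla", "Toyota"]

# StringIndexer-style dictionary, built at training time only.
index = {v: i for i, v in enumerate(train_values)}

# "Skip" behaviour: drop any record whose category was never seen in training.
evaluable = [v for v in test_values if v in index]
# Because every Customer value is unique, evaluable ends up empty,
# leaving nothing for the evaluation phase to score.
```

With Cust_No the values pass through as plain numbers, so no records are dropped and evaluation succeeds.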
Taking a step back, it seems to me that your data doesn't really contain predictive information other than in Alerts. The Customer and Cust_No fields are basically ID columns.
Can you post a screenshot of your Evaluation error? I can try to help, I work on DSX.

Recommendations without ratings (Azure ML)

I'm trying to build an experiment to create recommendations (using the Movie Ratings sample database), but without using the ratings. I simply assume that if a user has rated certain movies, then he would be interested in other movies rated by users who have also rated his movies.
I can consider, for instance, that ratings are 1 (exists in the database) or 0 (does not exist), but in that case, how do I transform the initial data to reflect this?
I couldn't find any kind of examples or tutorials about this kind of scenario, and I don't really know how to proceed. Should I transform the data before injecting it into an algorithm? And/or is there any kind of specific algorithm that I should use?
If you're hoping to use the Matchbox Recommender in AML, you're correct that you need to identify some user-movie pairs that are not present in the raw dataset, and add these in with a rating of zero. (I'll assume that you have already set all of the real user-movie pairs to have a rating of one, as you described above.)
I would recommend generating some random candidate pairs and confirming their absence from the training data in an Execute R (or Python) Script module. I don't know the names of your dataset's features, but here is some pseudocode in R to do that:
library(dplyr)

df <- maml.mapInputPort(1)  # input dataset of observed user-movie pairs

all_movies <- unique(df[['movie']])
all_users <- unique(df[['user']])

n <- 30  # number of random pairs to start with
negative_observations <- data.frame(movie = sample(all_movies, n, replace=TRUE),
                                    user = sample(all_users, n, replace=TRUE),
                                    rating = rep(0, n))
# keep only the random pairs that do not occur in the observed data
acceptable_negative_observations <- anti_join(unique(negative_observations), df,
                                              by=c('movie', 'user'))
df <- rbind(df, acceptable_negative_observations)

maml.mapOutputPort("df");
Alternatively, you could try a method like association rule learning which would not require you to add in the fake zero ratings. Martin Machac has posted a nice example of how to do this in R/AML in the Cortana Intelligence Gallery.
