Which classification algorithms to use with nominal data? - machine-learning

I work with a dataset that stores business tasks and, for each task, mainly tracks nominal data such as the Department, Employee assigned, Product, etc.
Past completed tasks have a completion time.
My goal is to use the completion times of past tasks to predict the completion time of a newly arrived task from its attributes (Department, Employee assigned, Product, etc.).
Given that basically all of the attributes of a task are nominal and will be encoded/one-hot-encoded, which would be the best algorithm to use in this case?

Related

How to fine-tune the number of clusters in k-means clustering and build the model incrementally using BigQuery ML

I used a k-means clustering model to detect anomalies with BigQuery ML.
Dataset information:
date DATE
trade_id INT
trade_name STRING
agent_id INT
agent_name STRING
total_item INT
Mapping: one trade has multiple agents, depending on the date.
The model was trained with the information below, using sum(total_item):
trade_id
trade_name
agent_id
agent_name
Number of clusters: 4
I need to find anomalies for each trade and agent by date.
The model is trained on the given data, and distance_from_closest_centroid is calculated for each trade and agent by date. The highest distance is considered an anomaly.
Questions
1. How do I find the number of clusters to use for the model (e.g. the elbow method for selecting the minimal number of clusters)?
2. How do I build the model when trade data is added on a daily basis? Is it possible to build the model incrementally each day?
As the question was updated, I will sum up our discussion as an answer to further contribute to the community.
Regarding your first question, about how to find the number of clusters to use for the model (e.g. with the elbow method):
According to the documentation, if you omit the num_clusters option, BigQuery ML will choose a reasonable default based on the total number of rows in the training data. However, if you want to select the optimal number, you can perform hyperparameter tuning, which is the process of selecting one (or a set) of optimal hyperparameters for a learning algorithm, in your case k-means within BigQuery ML. To determine the ideal number of clusters, run the CREATE MODEL query for different values of num_clusters, then compare the error measure across the resulting models and select the value at which it is lowest. You can find the error measures in the Evaluation tab of the trained model, which shows the Davies–Bouldin index and the mean squared distance.
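A minimal sketch of that loop, assuming hypothetical names (`my_dataset.trade_kmeans`, `my_dataset.trade_features`) and made-up Davies-Bouldin values standing in for what the Evaluation tab would report. It only builds the CREATE MODEL statements and picks the candidate with the lowest index; running the queries is left out. The selection uses the Davies-Bouldin index because mean squared distance generally keeps shrinking as clusters are added, so with that measure you would look for an elbow rather than a minimum.

```python
def create_model_sql(model_name, source_table, num_clusters):
    """Build a BigQuery ML CREATE MODEL statement for one candidate k."""
    return (
        f"CREATE OR REPLACE MODEL `{model_name}_k{num_clusters}`\n"
        f"OPTIONS(model_type='kmeans', num_clusters={num_clusters}) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


def pick_num_clusters(davies_bouldin_by_k):
    """Return the k with the lowest Davies-Bouldin index (lower is better)."""
    return min(davies_bouldin_by_k, key=davies_bouldin_by_k.get)


# Hypothetical model and table names; replace with your own.
candidate_ks = range(2, 8)
statements = [
    create_model_sql("my_dataset.trade_kmeans", "my_dataset.trade_features", k)
    for k in candidate_ks
]

# Made-up index values, as one might copy them from the Evaluation tab
# after training each candidate model.
example_scores = {2: 1.91, 3: 1.42, 4: 1.08, 5: 1.17, 6: 1.25, 7: 1.33}
best_k = pick_num_clusters(example_scores)

print(statements[0])
print("chosen num_clusters:", best_k)
```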
Your second question was about how to build the model when trade data is added daily, and whether it can be built incrementally.
K-means is an unsupervised learning algorithm, so you train your model with your current data and store it in a dataset. The trained model can then be applied to new data with ML.PREDICT, which predicts the cluster each new row belongs to.
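To illustrate what applying an already trained k-means model to new rows amounts to, here is a conceptual sketch in plain Python. It is not BigQuery ML's implementation: the centroids and the daily feature values are invented, and no feature standardization is applied. Each new row is assigned to its nearest centroid, and the row with the largest distance is flagged as the anomaly candidate.

```python
import math


def nearest_centroid(point, centroids):
    """Return (centroid_index, distance) for the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


# Hypothetical centroids of an already trained model (two features each).
centroids = [(10.0, 1.0), (200.0, 5.0)]

# Hypothetical new daily rows that arrived after training.
new_rows = {
    "2021-01-01": (12.0, 1.0),
    "2021-01-02": (198.0, 5.0),
    "2021-01-03": (90.0, 3.0),
}

# Score each new row without retraining the model.
scored = {day: nearest_centroid(p, centroids) for day, p in new_rows.items()}

# The row farthest from its closest centroid is the anomaly candidate.
most_anomalous = max(scored, key=lambda day: scored[day][1])
print(most_anomalous, scored[most_anomalous])
```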
As a bonus, the BigQuery ML documentation explains how k-means can be used to detect data anomalies.
UPDATE:
Regarding your question about retraining the model:
Question: "I want to rebuild the model because new trade information has to be added to my existing model. In this case, is it possible to update the model with only two months of data, or do we need to rebuild the entire model?"
Answer: You would have to retrain the whole model if new relevant data arrives. It is not possible to append only two months of new data to the model. That said, you can and should use warm_start to retrain your already existing model.
As per Alexandre Moraes:
If you omit num_clusters for k-means, BigQuery ML will choose a reasonable number based on the number of rows in the training data. In addition, you can use hyperparameter tuning to determine an optimal number of clusters: run the CREATE MODEL query for different values of num_clusters, find the error measure, and pick the point at which the error is lowest.

How to include variable attributes in Machine Learning models?

What machine learning techniques can be used to build a model if some attributes change over time? For example, predicting the price of a hotel depends on the number of tourists in the city, which is time dependent, i.e. it changes from time to time.
Also, if we have a well-trained model on some static data, what are the ways to update the model when some data changes, other than retraining the model on the complete data again?
Regarding the first question, I would just add a feature indicating time. For instance, hotel X will appear in several data records, each differing in the value of its "Month" feature (the data point for August might have a higher price than the one for December). This way the model will take the time of year into consideration.
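As a rough sketch of the "Month" feature idea, assuming invented hotel records with a `date` field: the month is extracted and also one-hot encoded, since many models handle a nominal month better than a raw number from 1 to 12.

```python
from datetime import date


def month_features(day):
    """Return the month number and its one-hot encoding (January first)."""
    one_hot = [1 if m == day.month else 0 for m in range(1, 13)]
    return {"month": day.month, "month_one_hot": one_hot}


# Hypothetical records: the same hotel appears once per observed date.
records = [
    {"hotel": "X", "date": date(2023, 8, 15), "price": 180.0},
    {"hotel": "X", "date": date(2023, 12, 5), "price": 95.0},
]

# Add the time feature to every record before training.
for record in records:
    record.update(month_features(record["date"]))

print(records[0]["month"], records[0]["month_one_hot"])
```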
Regarding the second question, unless you're using reinforcement learning or online learning, which train models from an incoming sequence of samples, I don't see a way to change the data without having to train the model again.
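For a feel of what online learning looks like, here is a small sketch of a linear regression updated one sample at a time with stochastic gradient descent. The class name, learning rate and toy data stream (y = 2x + 1) are all assumptions for illustration; real libraries offer more robust versions of the same idea.

```python
class OnlineLinearRegression:
    """Linear model whose weights are updated one sample at a time."""

    def __init__(self, n_features, learning_rate=0.01):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, y):
        """One gradient step on the squared error of a single sample."""
        error = self.predict(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


model = OnlineLinearRegression(n_features=1, learning_rate=0.05)

# Simulated stream of samples from y = 2x + 1; each sample updates the
# model as it arrives, so the full data set is never retrained at once.
for _ in range(2000):
    for x in (0.0, 1.0, 2.0, 3.0):
        model.update([x], 2 * x + 1)

print(round(model.predict([4.0]), 3))
```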

Incorporating prior knowledge to machine learning models

Say I have a data set of students with features such as income level, gender, parents' education levels, school, etc., and the target variable is, say, passing or failing a national exam. We can train a machine learning model to predict, given these values, whether a student is likely to pass or fail (say, in sklearn, using predict_proba we can get the probability of passing).
Now say I have a different set of information which has nothing to do with the previous data set, and which includes the schools and the percentage of students from each school who passed that national exam last year and in the years before, say schoolA: 10%, schoolB: 15%, etc.
How can I use this additional knowledge to improve my model? This data is surely valuable (students from certain schools have a higher chance of passing the exam due to their educational facilities, qualified staff, etc.).
Do I somehow add this information as a new feature to the data set? If so, what is the recommended way? Or do I use this information after the model prediction and somehow combine the two to get a final probability? Obviously an average or a weighted average doesn't work, because the second data set has probabilities below 20%, which drags the total probability very low. How do data scientists usually incorporate this kind of prior knowledge? Thank you.
You can try different ways to add this data and see whether your model is able to learn from it. More likely, you'll see right away that this additional data just confuses the model, mostly because you're already providing more precise data on each student of the school, and the model has more freedom to use that information.
But artificial neural network training is all about continuous trial and error, so you should definitely try training with all the data you can think of to see whether it reaches a decent error in the end.
Using the average pass percentage of the student's school as a new feature for each student is worth trying.
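A minimal sketch of adding the school pass rate as a per-student feature, using the example rates from the question and invented student records. The fallback for a school missing from the second data set (here, the mean of the known rates) is an assumption; pick whatever suits your data.

```python
# Pass rates from the second data set (values taken from the question).
school_pass_rate = {"schoolA": 0.10, "schoolB": 0.15}

# Hypothetical student records from the first data set.
students = [
    {"income_level": "low", "school": "schoolA"},
    {"income_level": "high", "school": "schoolB"},
    {"income_level": "mid", "school": "schoolC"},  # school with no known rate
]


def add_school_pass_rate(students, rates):
    """Return copies of the records with a 'school_pass_rate' feature."""
    # Assumption: unknown schools get the mean of the known pass rates.
    fallback = sum(rates.values()) / len(rates)
    return [
        {**s, "school_pass_rate": rates.get(s["school"], fallback)}
        for s in students
    ]


enriched = add_school_pass_rate(students, school_pass_rate)
for row in enriched:
    print(row)
```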

Any Statistical or Machine Learning Method to Predict Salary

I work at a FinTech company. We provide loans to our customers. Customers who want to apply for a loan must fill in some information in our app, and one of the items is their salary. Using web scraping, we are able to grab our customers' bank transaction data for the last 3-7 months.
Using any statistical or machine learning technique, how can I easily check whether the stated salary amount (or something close to it) appears in the customer's bank transaction data? Should I build one model (logic) per customer, or a single model that applies to all customers?
Please advise.
I don't think you need machine learning for this.
1. Out of the list of all transactions, keep only those that add money to the account, rather than subtract money from it.
2. Round all numbers to a certain accuracy (e.g. 2510 USD -> 2500 USD).
3. Build a dataset that contains the total amount added to the account for each day. In other words, group transactions by day, and add 0's wherever needed.
4. Apply a discrete Fourier transform to find the periodic components in this time series.
5. There should be only 1 periodic item, repeating every 30ish days.
6. Set the values of all other periodically repeating items to 0.
7. Apply the inverse discrete Fourier transform to keep only the information that repeats every 28/30 days.
For more information on the Fourier transform, check out https://en.wikipedia.org/wiki/Fourier_transform
For a practical example (using MATLAB), check out https://nl.mathworks.com/help/matlab/examples/fft-for-spectral-analysis.html?requestedDomain=www.mathworks.com
It shows how to give a frequency decomposition of a time-signal. If you apply the same logic, you can use this frequency decomposition to figure out which frequencies are dominant (typically the salary will be one of them).
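The first part of this approach (filter, round, daily totals, Fourier transform, find the dominant period) can be sketched in plain Python with a naive DFT. The transactions below are invented (a salary of 2510 every 30 days over 180 days, plus one unrelated deposit and two withdrawals), and the 0.9 tolerance used to pick the lowest strong frequency is an assumption. A periodic payment shows up at its base frequency and at multiples of it, so the sketch takes the lowest strong one. Zeroing the other components and applying the inverse transform is left out.

```python
import cmath


def daily_incoming_totals(transactions, n_days, round_to=100):
    """Sum rounded incoming amounts per day; days without deposits stay 0."""
    totals = [0.0] * n_days
    for day, amount in transactions:
        if amount > 0:  # keep only money added to the account
            totals[day] += round(amount / round_to) * round_to
    return totals


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform for k = 0 .. n // 2."""
    n = len(signal)
    return [
        abs(sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, v in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def dominant_period(signal, tolerance=0.9):
    """Period (in days) of the lowest frequency that is close to the peak."""
    mags = dft_magnitudes(signal)
    peak = max(mags[1:])  # ignore k = 0, which is just the overall total
    for k in range(1, len(mags)):
        if mags[k] >= tolerance * peak:
            return len(signal) / k


# Hypothetical transactions as (day_index, amount) pairs.
transactions = [(d, 2510.0) for d in range(0, 180, 30)]
transactions += [(47, 400.0), (12, -60.0), (75, -1200.0)]

daily = daily_incoming_totals(transactions, n_days=180)
print("dominant period in days:", dominant_period(daily))
```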

Splitting data set into training and testing sets on recommender systems

I have implemented a recommender system based upon matrix factorization techniques. I want to evaluate it.
I want to use 10-fold-cross validation with All-but-one protocol (https://ai2-s2-pdfs.s3.amazonaws.com/0fcc/45600283abca12ea2f422e3fb2575f4c7fc0.pdf).
My data set has the following structure:
user_id,item_id,rating
1,1,2
1,2,5
1,3,0
2,1,5
...
I find it confusing to work out how the data should be split, because some triples (user, item, rating) can't go into the testing set. For example, if I put the triple (2,1,5) in the testing set and this is the only rating user 2 has made, there won't be any other information about this user and the trained model won't be able to predict any values for them.
Considering this scenario, how should I do the splitting?
You didn't specify a language or toolset so I cannot give you a concise answer that is 100% applicable to you, but here's the approach I took to solve this same exact problem.
I'm working on a recommender system using Treasure Data (i.e. Presto) and implicit observations, and ran into a problem with my matrix where some users and items were not present. I had to re-write the algorithm to split the observations into train and test so that every user and every item would be represented in the training data. For the description of my algorithm I assume there are more users than items. If this is not true for you then just swap the two. Here's my algorithm.
1. Select one observation for each user.
2. For each item that has only one observation and has not already been selected in the previous step, select one observation.
3. Merge the results of the previous two steps together. This should produce a set of observations that covers all of the users and all of the items.
4. Calculate how many observations you need to fill your training set (generally 80% of the total number of observations).
5. Calculate how many observations are in the merged set from step 3. The difference between steps 4 and 5 is the number of remaining observations necessary to fill the training set.
6. Randomly select enough of the remaining observations to fill the training set.
7. Merge the sets from steps 3 and 6: this is your training set.
8. The remaining observations are your testing set.
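The same steps can be sketched in Python rather than SQL. This is an illustration, not the Presto implementation described above: the function name and sample ratings are made up, and step 2 is slightly generalized to pick one observation for every item not yet covered, which also includes items with a single observation.

```python
import random


def coverage_split(observations, train_fraction=0.8, seed=0):
    """Split (user, item, rating) triples so train covers all users and items."""
    rng = random.Random(seed)
    shuffled = observations[:]
    rng.shuffle(shuffled)

    train_idx, seen_users, seen_items = set(), set(), set()

    # Step 1: one observation per user.
    for i, (user, item, _) in enumerate(shuffled):
        if user not in seen_users:
            seen_users.add(user)
            seen_items.add(item)
            train_idx.add(i)

    # Steps 2-3: one observation for every item that is still uncovered.
    for i, (_, item, _) in enumerate(shuffled):
        if item not in seen_items:
            seen_items.add(item)
            train_idx.add(i)

    # Steps 4-6: top up with random remaining observations to reach the target.
    target = int(train_fraction * len(shuffled))
    remaining = [i for i in range(len(shuffled)) if i not in train_idx]
    extra = max(0, target - len(train_idx))
    train_idx.update(remaining[:extra])  # already in random order

    # Steps 7-8: everything selected is train, the rest is test.
    train = [shuffled[i] for i in sorted(train_idx)]
    test = [shuffled[i] for i in range(len(shuffled)) if i not in train_idx]
    return train, test


# Hypothetical ratings; user 2 has rated only one item.
observations = [
    (1, 1, 2), (1, 2, 5), (1, 3, 0), (2, 1, 5), (3, 2, 4),
    (3, 3, 1), (3, 4, 3), (4, 1, 2), (4, 4, 5), (4, 2, 1),
]
train, test = coverage_split(observations)
print(len(train), "train /", len(test), "test")
```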
As I mentioned, I'm doing this using Treasure Data and Presto, so the only tools I have at my disposal are SQL, common table expressions, temporary tables, and Treasure Data workflow.
You're quite correct in your basic logic: if you have only one observation in a class, you must include that in the training set for the model to have any validity in that class.
However, dividing the input into these classes depends on the interactions among various observations. Can you identify classes of data, such as the "only rating" issue you mentioned? As you find other small classes, you'll also need to ensure that you have enough of those observations in your training data.
Unfortunately, this is a process that's tricky to automate. Most one-time applications simply have to hand-pick those observations from the data, and then distribute the others per normal divisions. This does have a problem that the special cases are over-represented in the training set, which can detract somewhat from the normal cases in training the model.
Do you have the capability of tuning the model as you encounter later data? This is generally the best way to handle sparse classes of input.
Collaborative filtering (matrix factorization) can't produce a good recommendation for an unseen user with no feedback. Nevertheless, an evaluation should take this case into account.
One thing you can do is report performance separately for all test users, for test users with some feedback, and for unseen users with no feedback.
So I'd say keep the train/test split random, but evaluate separately for unseen users.
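A small sketch of that separate reporting, assuming RMSE as the metric (the answer does not name one) and invented test predictions. A test row counts as "seen" if its user appears in the training set.

```python
import math


def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs; None if empty."""
    if not pairs:
        return None
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))


def rmse_by_user_group(test_rows, train_users):
    """Report RMSE for all test users, seen users, and unseen users."""
    groups = {"all": [], "seen": [], "unseen": []}
    for user, predicted, actual in test_rows:
        groups["all"].append((predicted, actual))
        key = "seen" if user in train_users else "unseen"
        groups[key].append((predicted, actual))
    return {name: rmse(pairs) for name, pairs in groups.items()}


# Hypothetical (user, predicted_rating, actual_rating) rows from a test set.
test_rows = [(1, 4.0, 5.0), (1, 3.0, 3.0), (9, 2.0, 4.0)]
train_users = {1, 2}  # user 9 never appears in the training data

report = rmse_by_user_group(test_rows, train_users)
print(report)
```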
