Method to cluster data that also has classification labels? - machine-learning

I have a dataset in which each line represents a person and their payment behavior during a full year. For each person I have 3 possible classification labels (age, gender, nationality). Payment behavior is defined by over 30 metrics such as number of payments and value of payments. Resulting dataset example looks something like this (I included a few random payment behavior metrics on the right):
My goal is to create classes (based on a combination of age/gender/nationality) that represent homogenous groups of people with similar payment behavior. For example: we find that 50-60 year old males from the US all have similar payment behavior. For each class I can then for example determine averages, standard deviations, percentiles etc. Since this seems to be an overlap between clustering and classification, I am stuck in what to research and where to look. Are there any methodologies I can look in to?
An option I'm thinking of would be to first create all possible classes (e.g. 50-M-US, 50-F-US, 51-M-US, etc.) and then merge them based on Euclidian distances (using all payment behavior metrics means) until a desired number of classes is left. Let me know what you think.

Related

Can / should I use past (e.g. monthly) label columns from a database as features in an ML prediction (no time-series!)?

The question: Is it normal / usual / professional to use the past of the labels as features?
I could not find anything reliable on this, although it is a basic question.
Edited: Please mind, this is not a time-series question, I have deleted the time-series tag now and I changed the question. This question is about features that change regularly over time, yes! But we do not create a time-series from this, as there are many other features as well which are not like the label and are also important features in the model. Now please think of using past labels as normal features without a time-series approach.
I try to predict a certain month of data that is available monthly, thus a time-series, but I am not using it as a time-series, it is just monthly avaiable data of various different features.
It is a classification model, and now I want to predict a label column of a selected month of that time-series. The previous months before the selected label month are now the point of the question.
I do not want to just drop the past months of the label just because they are "almost" a label (or in other words: they were just the label columns of the preceding models in time). I know the past of the label, why not considering it as features as well?
My predictions are of course much better when adding the past labels of the time-series of labels to the features. This is logical as the labels usually do not change so much from one month to the other and thus can be predicted very well if you have fed the data with the past of the label. It would be strange not to use such "past labels" as features, as any simple time-series regression would then be better than the ml model.
Example: Let's say I predict the IQ test result of a person, and I use her past IQ test results as features in addition to other normal "non-label" features like age, education aso. I use the first 11 months of "past labels" of a year as features in addition to my normal "non-label" features. I predict the label of the 12th month.
Predicting the label of the 12th month works much better if you add the past of the labels to the features - obviously. This is because the historical labels, if there are any, are of course better indicators of the final outcome than normal columns like age and education.
Possibly related p.s.:
p.s.1: In auto-regressive models, the past of the dependent variable can well be used as independent variable, see: https://de.wikipedia.org/wiki/Regressionsanalyse
p.s.2: In ML you can perhaps just try any features and take what gives you the best results, a bit like >Good question, try them [feature selection methods] all and see what works best< in https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/ >If the features are relevant to the outcome, the model will figure out how to use them. Or most models will.< The same is said in Does the feature selection matter for learning algorithm with regularization?
p.s.3: Also probably relevant is the problem of multicollinearity: https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/ though multicollinearity is said to be no issue for the prediction: >Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, precision of the predictions, and the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.
It is perfectly possible and also good practice to include past label columns as features, though it depends on your question: do you want to explain the label only with other features (on purpose), or do you want to consider other and your past label columns to get the next label predicted, as a sort of adding a time-series character to the model without using a time-series?
The sequence in time is not even important, as long as all of such monthly columns are shifted in time consistently by the same time when going over to the predicting set. The model does not care if it is just January and February of the same column type, for the model, every feature is isolated.
Example: You can perfectly run a random forest model on various features, including their past label columns that repeat the same column type again and again, only representing different months. Any month's column can be dealt with as an independent new feature in the ml model, the only importance is to shift all of those monthly columns by the exactly same period to reach a consistent predicting set. In other words, obviously you should avoid replacing January with March column when you go from a training set January-June to a predicting set February-July, instead you must replace January with February of course.
Update 202301: model name is "walk-forward"
This model setup is called "walk-forward", see Why isn’t out-of-time validation more ubiquitous? --> option 3 almost at the bottom of the page.
I got this from a comment at Splitting Time Series Data into Train/Test/Validation Sets.
In the following, it shows only training and testing set. It writes "validation set", but it is known that this gets mixed up all over the place, see What is the Difference Between Test and Validation Datasets?, and it must be meant as the testing set in the default understanding of it.
Thus, with the right wording, it is:
This should be the best model for labels that become features in time.
validation set in a "walk-forward" model?
As you can see in the model, no validation set is needed since the test data must be biased "forward" in time, that is the whole idea of predicting the "step forward in time", and any validation set would have to be in that same biased artificial future - which is already the past at the time of training, but the model does not know this.
The validation happens by default, without a needed dataset split, during the walk-forward, when the model learns again and again to predict the future and the output metrics can be put against each other. As the model is to predict the time-biased future, there is no need to prove that or how the artificial future is biased and sort of "overtrained by time". It is the aim of the model to have the validation in the artificial future and predict the real future as a last step only.
But then, why not still having a validation set on top of this, at least if it is just a small k-fold validation? It could play a role if the testing set has a few strong changes that happen in small time windows but which are still important to be predicted, or at least hinted at, but should also not be overtrained within each training step. The validation set would hit some of these time windows and might show whether the model can handle them well enough. Any other method than k-fold would shrink the power of the model too much. The more you take away from the testing set during training, the less it can predict the future.
Wrap up:
Try it out, and in doubt, leave the validation aside and judge upon the model by checking its metrics over time, during the "walk-forward". This model is not like the others.
Thus, in the end, you can, but you do not have to, split a k-fold validation from the testing set. That would look like:
After predicting a lot of known futures, the very last step in time is then the prediction of the unknown future.
This also answers Does the training+testing set have to be different from the predicting set (so that you need to apply a time-shift to ALL columns)?.

AutoML NL - Training model based on ICD10-CM - Amount of text required

We are currently working on integrating ICD10-CM for our medical company, to be used for patient diagnosis. ICD10-CM is a coding system for diagnoses.
I tried to import ICD10-CM data in description-code pairs but obviously, it didn't work since AutoML needed more text for that code(label). I found a dataset on Kaggle but it only contained hrefs to an ICD10 website. I did find out that the website contains multiple texts and descriptions associated with codes that can be used to train our desired model.
Kaggle Dataset:
https://www.kaggle.com/shamssam/icd10datacom
Sample of a page from ICD10data.com:
https://www.icd10data.com/ICD10CM/Codes/A00-B99/A15-A19/A17-/A17.0
Most notable fields are:
- Approximate Synonyms
- Clinical Information
- Diagnosis Index
If I made a dataset from the sentences found in these pages and assigned them to their code(labels), will it be enough for AutoML dataset training? since each label will have 2 or more texts finally instead of just one, but definitely still a lot less than a 100 for each code unlike those in demos/tutorials.
From what I can see here, the disease code has a tree-like structure where, for instance, all L00-L99 codes refer to "Diseases of the skin and subcutaneous tissue". At the same time L00-L08 codes refer to "Infections of the skin and subcutaneous tissue", and so on.
What I mean is that the problem is not 90000 examples for 90000 different independent labels, but a decision tree (you take several decisions in function of the previous decision: the first step would be choosing which of the about 15 most general categories fits best, then choosing which of the subcategories etc.)
In this sense, probably autoML is not the best product, given that you cannot implement a specially designed decision tree model that takes into account all of this.
Another way of using autoML would be training separately for each of the decisions and then combine the different models. This would easily work for the first layer of decision but would be exponentially time consuming (the number of models to train in order to be able to predict more accurately grows exponentially with the level of accuracy, by accurate I mean afirminng it is L00-L08 instad of L00-L99).
I hope this helps you understand better the problem and the different approaches you can give to it!

Incorporating prior knowledge to machine learning models

Say I have a data set of students with features such as income level, gender, parents' education levels, school, etc. And the target variable is say, passing or failing a national exam. We can train a machine learning model to predict, given these values whether a student is likely to pass or fail (say in sklearn, using predict_prob we can say the probability of passing)
Now say I have a different set of information which has nothing to do with the previous data set, which includes the schools and percentage of students from that particular school who has passed that national exam last year and years before. say, schoolA: 10%, schoolB: 15%, etc.
How can I use this additional knowledge to improve my model. For sure this data is valuable. (Students from certain schools have a higher chance of passing the exam due to their educational facilities, qualified staff, etc.).
Do i some how add this information as a new feature to the data set? If so what is the recommend way. Or do I use this information after the model prediction and somehow combine these to get a final probability ? Obviously an average or a weighted average doesn't work due to the second data set having probabilities in the range below 20% which then drags the total probability very low. How do data scientist usually incorporate this kind of prior knowledge? Thank you
You can try different ways to add this data and see if your model will be able to learn on this set. More likely you'll see right away, that this additional data will just confuse the model. Mostly because you're already providing more precise data on each student of the school and the model has more freedom to use this information.
But artificial neural network training is all about continuous trials and errors, so you definitely should try to train it with all possible data you can imagine to see if it'll be able to get a descent error in the end.
Use the average pass percentage of the students' school as a new feature of each student is worth to try.

Which classification to choose?

I have huge amount of yelp data and I have to classify the reviews into 8 different categories.
Categories
Cleanliness
Customer Service
Parking
Billing
Food Pricing
Food Quality
Waiting time
Unspecified
Reviews contains multiple categories so I have used multilable classification. But I am confuse how I can handle the positive/negative . Example review may be for positive for food quality but negative for customer service. Ex- food taste was very good but staff behaviour was very bad. so review contains positive food quality but negative Customer service How can I handle this case? Should I do sentiment analysis before classification? Please help me
I think your data is very similar to Restaurants reviews. It contains around 100 reviews, with varied number of aspect terms in each (More information). So you can use Aspect-Based Sentiment Analysis like this:
1-Aspect term Extraction
Extracting the aspect terms from the reviews.
2-Aspect Polarity Detection
For a given set of aspect terms within a sentence, determine whether the polarity of each aspect term is positive, negative.
3-Identify the aspect categories
Given a predefined set of aspect categories (e.g., food quality, Customer service), identify the aspect categories discussed in a given sentence.
4-Determine the polarity
Given a set of pre-identified aspect categories (e.g., food quality, Customer service), determine the polarity (positive, negative) of each aspect category.
Please see this for more information about similar project.
I hope this can help you.
Yes you would need a sentiment analysis. Why don't you create tokens of your data, that is find the required words out of the sentence, now the most possible approach for you is to find the related words along with their sentiment. i.e. food was good but the cleanliness was not appropriate
In this case you have [ food, good, cleanliness, not, appropriate ] now food links with its next term and cleanliness to its next terms "not appropriate"
again you can classify either into two classes i.e. 1,0 for good and bad .. or you can add classes based upon your case.
Then you would have data as such:
--------------------
FEATURE | VAL
--------------------
Cleanliness 0
Customer -1
Service -1
Parking -1
Billing -1
Food Pricing -1
Food Quality 1
Waiting time -1
Unspecified -1
I have given this just as an example where -1,1,0 are for no review, good and bad respectively. You can add more categories as 0,1,2 bad fair good
I may not be so good in answering this, but this is what i feel about it.
Note : You need to understand that you model cannot be perfect because that's what Machine Learning is all about, you have to be wrong. Your model cannot give a perfect classification it has to be wrong for certain inputs which it will learn with time and improve over.
There are many ways of doing multi label classification.
The simplest one would be having a model for each class, and if the review achieves a certain threshold score for that label, you would apply that label to the review.
This would treat the classes independently, but it seems like a good solution to your problem.

Using decision tree in Recommender Systems

I have a decision tree that is trained on the columns (Age, Sex, Time, Day, Views,Clicks) which gets classified into two classes - Yes or No - which represents buying decision for an item X.
Using these values,
I'm trying to predict the probability of 1000 samples(customers) which look like ('12','Male','9:30','Monday','10','3'),
('50','Female','10:40','Sunday','50','6')
........
I want to get the individual probability or a score which will help me recognize which customers are most likely to buy the item X. So i want to be able to sort the predictions and show a particular item to only 5 customers who will want to buy the item X.
How can I achieve this ?
Will a decision tree serve the purpose?
Is there any other method?
I'm new to ML so please forgive me for any vocabulary errors.
Using decision tree with a small sample set, you will definitely run into overfitting problem. Specially at the lower levels of the decision, where tree you will have exponentially less data to train your decision boundaries. Your data set should have a lot more samples than the number of categories, and enough samples for each categories.
Speaking of decision boundaries, make sure you understand how you are handling data type for each dimension. For example, 'sex' is a categorical data, where 'age', 'time of day', etc. are real valued inputs (discrete/continuous). So, different part of your tree will need different formulation. Otherwise, your model might end up handling 9:30, 9:31, 9:32... as separate classes.
Try some other algorithms, starting with simple ones like k-nearest neighbour (KNN). Have a validation set to test each algorithm. Use Matlab (or similar software) where you can use libraries to quickly try different methods and see which one works best. There is not enough information here to recommend you something very specific. Plus,
I suggest you try KNN too. KNN is able to capture affinity in data. Say, a product X is bought by people around age 20, during evenings, after about 5 clicks on the product page. KNN will be able to tell you how close each new customer is to the customers who bought the item. Based on this you can just pick the top 5. Very easy to implement and works great as a benchmark for more complex methods.
(Assuming views and clicks means the number of clicks and views by each customer for product X)
A decision tree is a classifier, and in general it is not suitable as a basis for a recommender system. But, given that you are only predicting the likelihood of buying one item, not tens of thousands, it kind of makes sense to use a classifier.
You simply score all of your customers and retain the 5 whose probability of buying X is highest, yes. Is there any more to the question?

Resources