Imagine you own a postal service and you want to optimize your business processes. You have a history of orders in the following form (sorted by date):
# date user_id from to weight-in-grams
Jan-2014 "Alice" "London" "New York" 50
Jan-2014 "Bob" "Madrid" "Beijing" 100
...
Oct-2017 "Zoya" "Moscow" "St.Petersburg" 30
Most of the records (about 95%) contain positive numbers in the "weight-in-grams" field, but there are a few that have zero weight (perhaps, these messages were cancelled or lost).
Is it possible to predict whether the users from the history file (Alice, Bob etc.) will use the service in Nov., 2017? What machine learning methods should I use?
I tried simple logistic regression and decision trees, but they evidently give a positive outcome for any user, as there are very few negative examples in the training set. I also tried the Pareto/NBD model (BTYD library in R), but it seems to be extremely slow for large datasets, and my data set contains more than 500 000 records.
I have another problem: if I impute negative examples (treating a user who didn't send a letter in a given month as a negative example for that month), the dataset grows from 30 Mb to 10 Gb.
The answer is yes: you can try to predict this.
You can approach this as a time series and run an RNN:
Train your RNN on your set pivoted so each user is one sample.
You can also pivot your set so each user is a row (observation) by aggregating each user's data. Then run multivariate logistic regression. You will lose information this way, but it might be simpler. You can add time-related columns such as 'average delay between orders', 'average orders per year' etc.
You can use Bayesian methods to estimate the probability with which the user will return.
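The per-user aggregation suggested above could look roughly like the sketch below. It uses only the standard library; the record layout, the helper names `months_between` and `user_features`, and the feature names are assumptions made for illustration, not part of the original data format. The resulting one-row-per-user table is what a logistic regression would then be trained on.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order records: (order_month, user_id). The real history also
# carries from/to/weight, which could be aggregated the same way.
orders = [
    (date(2014, 1, 1), "Alice"),
    (date(2014, 3, 1), "Alice"),
    (date(2014, 7, 1), "Alice"),
    (date(2014, 1, 1), "Bob"),
    (date(2017, 10, 1), "Zoya"),
]


def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later`."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def user_features(orders, as_of):
    """Collapse the order history into one feature row per user."""
    by_user = defaultdict(list)
    for when, user in orders:
        by_user[user].append(when)

    features = {}
    for user, dates in by_user.items():
        dates.sort()
        gaps = [months_between(a, b) for a, b in zip(dates, dates[1:])]
        features[user] = {
            "n_orders": len(dates),
            "months_since_last": months_between(dates[-1], as_of),
            # Undefined for users with a single order.
            "avg_gap_months": sum(gaps) / len(gaps) if gaps else None,
        }
    return features


features = user_features(orders, as_of=date(2017, 11, 1))
for user, row in features.items():
    print(user, row)
```

Because each user becomes a single row, the table stays small no matter how many months of inactivity a user has, which sidesteps the blow-up from imputing one negative example per user per month.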
As the title says, I tried to use AutoML in Google Cloud Platform to predict some rare results.
For example, suppose I have 5 types of independent variables: age, living area, income, family size, and gender. I want to predict a rare event called "purchase".
Purchases are very rare: out of 10,000 data points, I will only get 3-4 purchases. Fortunately, I have far more than 10,000 data points (about 100 million).
I have tried to use AutoML to model the best combination, but since this is a rare result, the model only predicts for me that the number of purchases for all types of combinations in these 5 categories is 0. May I know how to solve this problem in AutoML?
In Cloud AutoML, the model predictions and the model evaluation metrics depend on the confidence threshold that is set. By default, in Cloud AutoML, the confidence threshold is 0.5. This value can be changed in the “Evaluate” tab of the “Models” section. To evaluate your model, change the confidence threshold to see how precision and recall are affected. The best confidence threshold depends on your use case. Here are some example scenarios to learn how evaluation metrics can be used. In your case, the recall metric has to be maximized (which would result in fewer false negatives) in order to correctly predict the purchase column.
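To see why the threshold matters so much for a rare class, here is a small sketch in plain Python. It is not the AutoML API; the labels and scores are made-up numbers, and `precision_recall` is a helper written for this illustration.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting 1 for scores >= threshold."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: two purchases among ten rows, all with low scores,
# which is typical when the positive class is rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
scores = [0.01, 0.02, 0.05, 0.03, 0.10, 0.02, 0.30, 0.20, 0.15, 0.04]

for threshold in (0.5, 0.25, 0.12):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

At the default 0.5 nothing is predicted positive, so recall is zero; lowering the threshold recovers the purchases at the cost of some false positives.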
Also, the training data has to be composed of a comparable number of examples from each class in the target variable so that the model can predict values with a higher confidence. Since your training data is highly skewed, preprocessing of the data such as resampling has to be performed to handle the skewness.
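One simple form of resampling is to downsample the majority class before training. The sketch below is a minimal illustration in plain Python; `downsample_majority`, the `ratio` parameter, and the toy rows are assumptions for the example, and other schemes (oversampling, class weights) may suit your data better.

```python
import random


def downsample_majority(rows, label_of, ratio=1.0, seed=0):
    """Keep all positives and at most `ratio` negatives per positive."""
    positives = [r for r in rows if label_of(r) == 1]
    negatives = [r for r in rows if label_of(r) == 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    keep = min(len(negatives), int(len(positives) * ratio))
    balanced = positives + rng.sample(negatives, keep)
    rng.shuffle(balanced)
    return balanced


# Toy data: 4 purchases in 10,000 rows, similar to the rate in the question.
rows = [
    {"age": 20 + i % 50, "purchase": 1 if i % 2500 == 0 else 0}
    for i in range(10_000)
]
balanced = downsample_majority(rows, lambda r: r["purchase"], ratio=10)
print(len(balanced), sum(r["purchase"] for r in balanced))
```

Only the training data should be resampled; evaluating on data with the original class ratio gives a more honest picture of precision and recall.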
We know that Prediction and Classification problems can break data according to a training ratio (generally 70-30 or 80-20 split), where the training data is passed to a model to be fit and its output is tested against the test data.
Let's say I have data with 2 columns:
First column: Employee Age
Second Column: Employee Salary Type
With 100 records similar to this:
Employee Age Employee Salary Type
25 low
35 medium
26 low
37 medium
44 high
45 high
Suppose the data is split by the ratio 70:30.
Let the target variable be Employee Salary Type and the predictor variable be Employee Age.
The model is trained on 70 records and tested against the remaining 30 records while hiding their target variables.
Let's say, 25 out of 30 records have accurate prediction.
Accuracy of the model = (25/30)*100 = 83.33%
This suggests the model is good.
Let's apply the same thing to unsupervised learning such as clustering.
Here there is no target variable; only cluster variables are present.
Let's consider both Employee Age and Employee Salary as cluster variables.
Then data will be automatically clustered according to
Employees with low age and low salary
Employees with medium age and medium salary
Employees with high age and high salary
If the Training ratio is applied here, We can cluster 70 random records and use rest of the 30 records for testing/validating the above model instead of testing with some other data (and their records).
Here we need to fit the model on the 70% of records and fit it again on the remaining 30%, and then compare the characteristics of cluster 1 from the 70% data with those of cluster 1 from the remaining 30%. If the characteristics are similar, we can infer that the clustering model is good.
Hence accuracy can be accurately measured here.
Why don't people prefer a train/test split for unsupervised analysis like clustering, association rules, forecasting, etc.?
I believe you have a few misconceptions. Here is a quick review:
Review
Unsupervised learning
This is when you have data inputs but no labels, and learn something about the inputs
Semi-supervised learning
This is when you have data inputs and labels for only some of them, and learn something about the inputs and their relationship to the labels
Supervised learning
This is when you have data inputs and labels, and learn what input maps to which label
Questions
Now, a few things you mention don't seem right:
Then data will be automatically clustered according to
Employees with low age and low salary
Employees with medium age and medium salary
Employees with high age and high salary
This is only guaranteed if your features represent employees by age and salary, and your clustering algorithm uses a distance metric under which employees with similar age and salary are close to one another.
You also mention:
If the Training ratio is applied here,
We can cluster 70 random records and use rest of the
30 records for testing/validating
the above model instead of testing with
some other data (and their records).
Hence accuracy can be accurately measured here.
How do you know the labels? If you are clustering, you would not know what each cluster means as they are assigned only by your distance metric. A cluster usually only signifies distances being either closer or farther away.
You can never know what a correct label is unless you know that a cluster represents a certain label, but if you are using features to cluster and check distance on, they could not also be used for validation.
This is because you would always get 100% accuracy, since a feature is also a label.
A semi-supervised example
I think your misconception comes as you may be confusing learning types, so let's make an example using some fake data.
Let's say you have a table of data with Employee entries like the following:
Employee
Name
Age
Salary
University degree
University graduation date
Address
Now let's say some employees don't want to state their age, since it is not mandatory, but some do. Then you can use a semi-supervised learning approach to cluster employees and get information about their age.
Since we want to get the age, we can approximate by clustering.
Let's make features that represent the Employee age to help us cluster them together:
employee_vector = [salary, graduation, address]
With our input, we are making the claim that age can be determined by salary, graduation date and address, which might be true.
Let's say we have represented all these values numerically, then we can cluster items together.
What would these clusters mean with a standard distance metric such as Euclidean distance?
People who have less distant salaries, graduation dates and addresses would be clustered together.
Then we could look at the clusters they are in and look at information about the ages we do know.
for cluster_id, employees in clusters:
    ages = get_known_ages(employees)
Now we could use the ages to do lots of operations to guess missing employee ages, like fitting a normal distribution or just showing a min/max range.
We could never know what the exact age is, since the clustering does not know that.
We could never test for age, since it is not always known, and is not used in the feature vectors for the employees.
This is why you could not use purely unsupervised approaches since you have no labels.
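The loop sketched earlier could be fleshed out along the following lines. This assumes the clustering step has already produced a mapping from cluster id to employees; the `clusters` dictionary, the employee records, and the `age_estimates` helper are all hypothetical and only illustrate the min/max-range idea.

```python
from statistics import mean

# Hypothetical result of clustering on [salary, graduation, address]:
# cluster id -> employees in that cluster. Age is None when not disclosed.
clusters = {
    0: [
        {"name": "Ann", "age": 24},
        {"name": "Ben", "age": None},
        {"name": "Cy", "age": 28},
    ],
    1: [
        {"name": "Dee", "age": 51},
        {"name": "Eli", "age": 47},
        {"name": "Flo", "age": None},
    ],
}


def get_known_ages(employees):
    """Ages of the employees in a cluster who did disclose them."""
    return [e["age"] for e in employees if e["age"] is not None]


def age_estimates(clusters):
    """Give each employee with unknown age a summary of the cluster's known ages."""
    estimates = {}
    for cluster_id, employees in clusters.items():
        ages = get_known_ages(employees)
        if not ages:
            continue  # nothing to infer from in this cluster
        summary = {"min": min(ages), "max": max(ages), "mean": mean(ages)}
        for employee in employees:
            if employee["age"] is None:
                estimates[employee["name"]] = summary
    return estimates


print(age_estimates(clusters))
```

The output is a range per employee rather than a single age, which matches the point above: the clustering itself never sees the age, so it can only narrow it down.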
I do not know whom you are referring to with "why don't people prefer ...", but usually if you are doing an unsupervised analysis you do not have label data and therefore you cannot measure accuracy. In this case, you can use methods like silhouette or l-curve to estimate the performance of the model.
On the other hand, if you have a supervised task with label data (this example) you can compute the accuracy with cross-validation (test-train split).
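As an illustration of a label-free quality measure, here is a minimal from-scratch sketch of the mean silhouette coefficient using Euclidean distance. It is not a library implementation; the toy points encode salary type as 1/2/3, which is an assumption made only for this example, and the function expects at least two clusters.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# (age, salary type encoded as low=1, medium=2, high=3)
points = [(25, 1), (26, 1), (35, 2), (37, 2), (44, 3), (45, 3)]
sensible = [0, 0, 1, 1, 2, 2]   # groups neighbouring employees
scrambled = [0, 1, 2, 0, 1, 2]  # mixes distant employees together

print(silhouette_score(points, sensible))
print(silhouette_score(points, scrambled))
```

Values close to 1 mean points sit well inside their own cluster; values near or below 0 mean the assignment is no better than arbitrary. Note that age dominates the distance here because the two features are on different scales.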
Because most unsupervised algorithms are not optimization based. (K-means is an exception!)
Examples: Apriori, DBSCAN, Local Outlier Factor.
And if you do not optimize, how are you going to overfit? (And if you do not use labels, you in particular cannot overfit to these labels).
I'm trying to build a classifier to predict stock prices. I generated extra features using some of the well-known technical indicators and feed these values, as well as values at past points to the machine learning algorithm. I have about 45k samples, each representing an hour of ohlcv data.
The problem is actually a 3-class classification problem: with buy, sell and hold signals. I've built these 3 classes as my targets based on the (%) change at each time point. That is: I've classified only the largest positive (%) changes as buy signals, the opposite for sell signals and the rest as hold signals.
However, presenting this 3-class target to the algorithm has resulted in poor accuracy for the buy & sell classifiers. To improve this, I chose to manually assign classes based on the probabilities of each sample. That is, I set the targets as 1 or 0 based on whether there was a price increase or decrease.
The algorithm then returns a probability between 0 and 1 (usually between 0.45 and 0.55) for its confidence on which class each sample belongs to. I then select some probability bound for each class within those probabilities. For example: I select p > 0.53 to be classified as a buy signal, p < 0.48 to be classified as a sell signal and anything in between as a hold signal.
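For concreteness, the thresholding described above amounts to something like the sketch below. The function name `to_signal` and its parameter names are made up for illustration, and the bounds are just the example values from the text.

```python
def to_signal(p_up, buy_above=0.53, sell_below=0.48):
    """Map the predicted probability of a price increase to a trading signal."""
    if p_up > buy_above:
        return "buy"
    if p_up < sell_below:
        return "sell"
    return "hold"


signals = [to_signal(p) for p in (0.55, 0.50, 0.46, 0.53)]
print(signals)
```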
This method has drastically improved the classification accuracy, at some points to above 65%. However, I'm failing to come up with a method to select these probability bounds without a large validation set. I've tried finding the best probability values within a validation set of 3000 and this has improved the classification accuracy, yet the larger the validation set becomes, it is clear that the prediction accuracy in the test set is decreasing.
So, what I'm looking for is any method by which I could discern what the specific decision probabilities for each training set should be, without large validation sets. I would also welcome any other ideas as to how to improve this process. Thanks for the help!
What you are experiencing is called a non-stationary process. Market movement depends on when the event occurs.
One way I used to deal with it is to build your model with data in different time chunks.
For example, use data from day 1 to day 10 for training, and day 11 for testing/validation, then move up one day, day 2 to day 11 for training, and day 12 for testing/validation.
You can pool all your test results to compute an overall score for your model. This way you have lots of test data and a model that adapts over time.
You also get 3 more parameters to tune: #1 how much data to use for training, #2 how much data to use for testing, and #3 how often (every how many days/hours/data points) you retrain your model.
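The rolling scheme described above can be sketched as a small index generator. The function name `walk_forward_splits` and its parameters are assumptions for this example; the three parameters correspond to the three tuning knobs listed above, and indices are 0-based, so "day 1" is index 0.

```python
def walk_forward_splits(n_samples, train_size, test_size, step):
    """Yield (train_indices, test_indices) for a sliding training window."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step  # how often the model is retrained


# 13 days of data: train on 10 days, test on the next day, slide by one day.
splits = list(walk_forward_splits(n_samples=13, train_size=10, test_size=1, step=1))
for train, test in splits:
    print(f"train {train.start}..{train.stop - 1} -> test {list(test)}")
```

The test window always lies strictly after the training window, so no future data leaks into training, and the predictions from all test windows can be pooled into one overall score.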
I am building an RNN for a time series model which has a categorical output.
For example, if the previous pattern is "A", "B", "A", "B", the model predicts that the next is "A".
There's also a numerical level associated with each category.
For example A is 100, B is 50,
so A(100), B(50), A(100), B(50),
I have the model framework to predict that the next is "A"; it would be nice to predict the (100) at the same time.
For real life examples, you have national weather data.
You are predicting the weather type for the next few days (sunny, windy, raining, etc.); it would be nice if the model also predicted the temperature at the same time.
Or for Amazon, analyzing a customer's transaction pattern.
Customer A shopped category
electronic($100), household($10), ... ...
Predict which category this customer is likely to shop in next and, at the same time, what the amount of that transaction would be.
Researched a bit, have not found any relevant research on similar topics.
What is stopping you from adding an extra output to your model? You could have one categorical output and one numerical output next to each other. Every neural network library out there supports multiple outputs.
You will need to normalise your output data though. Categories should be normalised with one-hot encoding and numerical values should be normalised by dividing by some maximal value.
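Preparing the two targets could look like the following sketch, which uses the A(100)/B(50) sequence from the question. The helper names `one_hot` and `make_targets` are assumptions for illustration; the model itself (one categorical output and one numerical output) is left to whichever neural network library you use.

```python
def one_hot(category, categories):
    """Encode a category as a vector with a single 1.0."""
    return [1.0 if c == category else 0.0 for c in categories]


def make_targets(sequence):
    """Split (category, level) pairs into the two normalised targets."""
    categories = sorted({category for category, _ in sequence})
    max_level = max(level for _, level in sequence)
    class_targets = [one_hot(category, categories) for category, _ in sequence]
    level_targets = [level / max_level for _, level in sequence]
    return categories, class_targets, level_targets, max_level


sequence = [("A", 100), ("B", 50), ("A", 100), ("B", 50)]
categories, class_targets, level_targets, max_level = make_targets(sequence)

print(categories)
print(class_targets)
print(level_targets)
```

A predicted level is mapped back to the original scale by multiplying by `max_level`, which should be computed on the training data only.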
Researched a bit, have not found any relevant research on similar topics.
Because this is not really a 'topic'. This is something completely normal, and it does not require some special kind of network.
There is a stream of short texts. Each one has the size of a tweet, or let us just assume they are all tweets.
The user can vote on any tweet. So, each tweet has one of the following three states:
relevant (positive vote)
default (neutral i.e. no vote)
irrelevant (negative vote)
Whenever a new set of tweets comes in, they will be displayed in a specific order. This order is determined by the votes of the user on all previous tweets. The aim is to assign a score to each new tweet. This score is calculated based on the word similarity or match between the text of this tweet and all the previous tweets voted on by the user. In other words, the tweet with the highest score is going to be the one which contains the maximum number of words previously voted positive and the minimum number of words previously voted negative. Also, new tweets with a high score will trigger a notification to the user, as they are considered very relevant.
One last thing, a minimum of semantic consideration (natural language processing) would be great.
I have read about Term Frequency–Inverse Document Frequency and come up with this very simple and basic solution:
Reminder: a high weight in tf–idf is reached by a high word frequency and a low total frequency of the word in the whole collection.
If the user votes positive on a Tweet, all the words of this tweet will receive a positive point (same thing for the negative case). This means that we will have a large set of words where each word has the total number of positive points and negative points.
If (Tweet score > 0) then this tweet will trigger a notification.
Tweet score = sum of all individual words’ scores of this tweet
word score = word frequency * inverse total frequency
word frequency in all previous votes = ( total positive votes for this word - total negative votes for this word) / total votes for this word
Inverse total frequency = log ( total votes of all words / total votes for this word)
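Written out as code, the formulas above might look like this sketch. The whitespace tokenisation, counting each word once per tweet, and giving unseen words a score of 0 are assumptions that the description leaves open; the function names are made up for the example.

```python
import math
from collections import Counter

pos_votes = Counter()  # word -> number of positively voted tweets containing it
neg_votes = Counter()  # word -> number of negatively voted tweets containing it


def vote(text, positive):
    """Record a user vote: every distinct word in the tweet gets one point."""
    target = pos_votes if positive else neg_votes
    target.update(set(text.lower().split()))


def word_score(word):
    """word frequency * inverse total frequency, as defined above."""
    total_word = pos_votes[word] + neg_votes[word]
    if total_word == 0:
        return 0.0  # word never voted on: assumed neutral
    total_all = sum(pos_votes.values()) + sum(neg_votes.values())
    frequency = (pos_votes[word] - neg_votes[word]) / total_word
    inverse_total = math.log(total_all / total_word)
    return frequency * inverse_total


def tweet_score(text):
    """Sum of the scores of the distinct words in the tweet."""
    return sum(word_score(w) for w in set(text.lower().split()))


vote("python machine learning", positive=True)
vote("machine learning tips", positive=True)
vote("celebrity gossip news", positive=False)

for tweet in ("learning python", "gossip news", "unseen words"):
    score = tweet_score(tweet)
    print(f"{tweet!r}: score={score:.2f} notify={score > 0}")
```

One property worth noting: a word that makes up all votes so far gets an inverse total frequency of log(1) = 0, so with very few votes some words contribute nothing regardless of their sign.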
Is this method enough? I am open to any better methods and any ready API or algorithm.
One possible solution would be to train a classifier such as Naive Bayes on the tweets that a user has voted on. You can take a look at the documentation of scikit-learn, a Python library, which explains how you can easily preprocess your text and train such a classifier.
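To show what such a classifier does under the hood, here is a tiny from-scratch multinomial Naive Bayes with add-one smoothing. It is only a sketch and not scikit-learn's API; the class name `TinyNaiveBayes`, the whitespace tokenisation, and the training tweets are assumptions for the example. In practice the scikit-learn implementation would be the more robust choice.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.class_counts = Counter()            # label -> number of tweets
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        # Words never seen in training carry no information and are skipped.
        words = [w for w in text.lower().split() if w in self.vocab]
        n_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            logp = math.log(n / n_docs)  # class prior
            for w in words:
                # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                count = self.word_counts[label][w]
                logp += math.log((count + 1) / (total + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


model = TinyNaiveBayes().fit(
    [
        "python machine learning",
        "machine learning tips",
        "celebrity gossip news",
        "football gossip",
    ],
    ["relevant", "relevant", "irrelevant", "irrelevant"],
)

print(model.predict("new machine learning tutorial"))
print(model.predict("gossip about a celebrity"))
```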
I would look at Naive Bayes; however, I would also look at the K-Nearest Neighbours algorithm when performing a simple classification. It is included in the scikit-learn library and is well documented.
RE: "running SKLearn on GAE is not possible" - you will either need to use the Google Predict API, or, run a VPS which would serve as a worker to process your classification tasks; this would obviously have to live on a different system though.
I would say though, if you are only hoping to perform simple classification on a suitably small dataset, you could actually implement a classifier in JavaScript, like
`http://jsfiddle.net/bkanber/hevFK/light/`
With a JS implementation, the processing time will become unacceptably slow if the dataset is too large, but it's nice to have as an option, even preferable in many cases.
Ultimately, GAE is not the platform I would use when building anything which may require anything but the most basic of ML techniques. I would look at Heroku or a VPS from a provider such as Digital Ocean, AWS et al.