Prediction based on large texts using Vowpal Webbit - machine-learning

I want to use the resolution time in minutes and the client description of the tickets on Zendesk to predict the resolution time of next tickets based on their description. I will use only this two values, but the description is a large text. I searched about hashing the feature values instead of hash the feature name on Vowpal Wabbit but with no success. Wich is the better approach to use feature values that is large texts to predict using Vowpal?

Values of features in Vowpal Wabbit can only be real numbers. If you have a categorical feature with n possible values you simply represent it as n binary features (so e.g. color=red is a name of a binary feature and its value is 1 by default).
If you have a text description you can use the individual words of the text as features (i.e. feature names). You only need to escape ":", "|" and whitespace characters in feature names, all other characters are allowed (including "="). So an example can look like
9 |USER avg_time:11 |SUMMARY words:5 sentences:1 |TEXT I have a big problem
So this ticket with text "I have a big problem" took 9 minutes to resolve and previous tickets from the same user took on average 11 minutes to resolve.
If you have enough training examples, I would recommend to add many more features (any details about the user, more summary features about the text etc). Also the time of day (morning, afternoon, evening) and day of week when the ticket was reported may be a good predictor (tickets reported on Friday evening tend to take longer), but maybe you intentionally don't want to model this and focus only on the "difficulty" of the ticket irrelevant of reporting time.
You can also try using word bigrams as features with --ngram T2, which means that 2-grams features will be created for all namespaces beginning with T (only TEXT namespace in my example). Maybe the individual words "big" and "problem" are not strong predictors, but the bigram "big problem" will get a high positive weight (indicating it is a good predictor of long resolution time).
I will use only this two values
You mean resolution time and text of the ticket, am I right? But the resolution time is the (dependent) variable you want to predict, so this does not count as a feature (aka independent variable). Of course, if you know the identity of the user and have enough training examples for each user, you can include the average time of previous tickets (excluding the current one, of course) of the user as a feature as I tried to show in the example.

Related

Features from consecutive measurements for classification

I'm currently working on a small machine learning project.
The task deals with medical data of a couple of thousands of patients. For each patient there where taken 12 of measurements of the same bunch of vital signs each one hour apart.
These measurements must note been taken immediately after the patient has entered the hospital but could start with some offset. However the patient will stay 24h in the hospital in total, so they can't start later than after 11 hours after the entrance.
Now the task is to predict for each patient whether none, one or multiple of 10 possible tests will be ordered during the remainder of the stay, and also to predict the future mean value of some of the vital signs for the remainder of the stay.
I have a training set that comes together with the labels that I should predict.
My question is mainly about how I can process the features, I thought about turning the measurement results for a patient into one long vector and use it as training example for a classifier.
However I'm not quite shure how I should include the Time information of each measurement into the features (should I even consider time at all?).
If I understood correctly, you want to include time information of each measurement into features. One way I thought is to make an empty vector of length 24, as the patient stays for 24 hours in the hospital. Then you can use one-hot representation, for example, if the measurement was taken in 12th, 15th and 20th hours of his stay, your time feature vector will have 1 at 12th, 15th and 20th position and all others are zero. You can append this time vector with other features and make a single vector for each patient of length = length(other vector) + length(time vector). Or you can use different approaches to combine these features.
Please let me know if you think this approach makes sense for you. Thanks.

Does the training+testing set have to be different from the predicting set (so that you need to apply a time-shift to ALL columns)?

I know the general rule that we should test a trained classifier only on the testing set.
But now comes the question: When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set? Or do I have to apply it to a new predicting set that is different from the training+testing set?
And what if I predict a label column of a time series (edited later: I do not mean to create a classical time series analysis here, but just a broad selection of columns from a typical database, weekly, monthly or randomly stored data that I convert into separate feature columns, each for one week / month / year ...), do I have to shift all of the features (not just the past columns of the time series label column, but also all other normal features) of the training+testing set back to a point in time where the data has no "knowledge" interception with the predicting set?
I would then train and test the classifier on features shifted to the past by n months, scoring against a label column that is unshifted and most recent, and then predicting from most recent, unshifted features. Shifted and unshifted features have the same number of columns, I align shifted and unshifted features by assigning the column names of the shifted features to the unshifted features.
p.s.:
p.s.1: The general approach on https://en.wikipedia.org/wiki/Dependent_and_independent_variables
In data mining tools (for multivariate statistics and machine learning), the dependent variable is assigned a role as target variable (or in some tools as label attribute), while an independent variable may be assigned a role as regular variable.[8] Known values for the target variable are provided for the training data set and test data set, but should be predicted for other data.
p.s.2: In this basic tutorial we can see that the predicting set is made different: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
We select the training set with the [:-1] Python syntax, which produces a new array that contains all > but the last item from digits.data: […] Now you can predict new values. In this case, you’ll predict using the last image from digits.data [-1:]. By predicting, you’ll determine the image from the training set that best matches the last image.
I think you are mixing up some concepts, so I will try to give a general explanation for Supervised Learning.
The training set is what your algorithm LEARNS on. You split it in X (features) and Y (target variable).
The test set is a set that you use to SCORE your model, and it must contain data that was not in the training set. This means that a test set also has X and Y (meaning that you know the value of the target). What happens is that you PREDICT f(Y) based on X, and compare it with the Y you have, and see how good your predictions are
A prediction set is simply new data! This means that usually you DO NOT have a target, since the whole point of supervised learning is predicting it. You will only have your X (features) and you will predict f(X) (your estimate of the target Y) and use it for whatever you need.
So, in the end a test set is simply a prediction set for which you have a target to compare your estimation to.
For time series, it is a bit more complicated, because often the features (X) are transformations on past data of the target variable (Y). For example, if you want to predict today's SP500 price, you might want to use the average of the last 30 days as a feature. This means that for every new day, you need to recompute this feature over the past days.
In general though, I would suggest starting with NON time series data if you're new to ML, as Time Series is much harder in terms of feature engineering and data management and it is easy to make mistakes.
The question above When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set? has the simple answer: No.
The question above Do I have to shift all of the features has the simple answer: Yes.
In short, if I predict a month's class column: I have to shift all of the non-class columns also back in time in addition to the previous class months I converted to features, all data must have been known before the month in that the class is predicted.
This also means: the predicting set has to be different from the dataset that contains the testing set. If you included the testing set, the training set loses valuable up-to-date data of the latest month(s) available! The term of a final "predicting set" is meant to be the "most current input to be used without a testing set" to get the "most current results" for the prediction.
This is confirmed by the following overview offered by this user who seems to have made the image, using days instead of months here, but the idea is the same:
Source: Answer on "Cross Validated" - Splitting Time Series Data into Train/Test/Validation Sets; the whole Q/A is recommended (!).
See the last line of the image and the valuable comments of that answer on "Cross Validated" to understand this.
230106:
The image shows that the last step is a training on the whole dataset, this is the "predicting set" that is the newest and that does not have a testing set.
On that image, there is one "mistake" which shows that this seemingly easy question of taking former labels as features for upcoming labels seems to be hard to be understood. I myself did not see this and posted the image without this remark: The "T&V" is in the past of the "Test". And that would be a wrong validation for a model that shall predict the future, the V must be in the "future" test block (unless you have a dataset that is not dynamically changing over time, like in physics).
You would have to change it to a "walk-forward" model, with the validation set - if at all - split k-fold from the testing set, not from the training set. That would look like this:
See also:
Can / should I use past (e.g. monthly) label columns from a database as features in an ML prediction (no time-series!)? with the "walk-forward" main image,
Splitting Time Series Data into Train/Test/Validation Sets with more insight into this and the comment that brought up the model name "walk-forward".

Can / should I use past (e.g. monthly) label columns from a database as features in an ML prediction (no time-series!)?

The question: Is it normal / usual / professional to use the past of the labels as features?
I could not find anything reliable on this, although it is a basic question.
Edited: Please mind, this is not a time-series question, I have deleted the time-series tag now and I changed the question. This question is about features that change regularly over time, yes! But we do not create a time-series from this, as there are many other features as well which are not like the label and are also important features in the model. Now please think of using past labels as normal features without a time-series approach.
I try to predict a certain month of data that is available monthly, thus a time-series, but I am not using it as a time-series, it is just monthly avaiable data of various different features.
It is a classification model, and now I want to predict a label column of a selected month of that time-series. The previous months before the selected label month are now the point of the question.
I do not want to just drop the past months of the label just because they are "almost" a label (or in other words: they were just the label columns of the preceding models in time). I know the past of the label, why not considering it as features as well?
My predictions are of course much better when adding the past labels of the time-series of labels to the features. This is logical as the labels usually do not change so much from one month to the other and thus can be predicted very well if you have fed the data with the past of the label. It would be strange not to use such "past labels" as features, as any simple time-series regression would then be better than the ml model.
Example: Let's say I predict the IQ test result of a person, and I use her past IQ test results as features in addition to other normal "non-label" features like age, education aso. I use the first 11 months of "past labels" of a year as features in addition to my normal "non-label" features. I predict the label of the 12th month.
Predicting the label of the 12th month works much better if you add the past of the labels to the features - obviously. This is because the historical labels, if there are any, are of course better indicators of the final outcome than normal columns like age and education.
Possibly related p.s.:
p.s.1: In auto-regressive models, the past of the dependent variable can well be used as independent variable, see: https://de.wikipedia.org/wiki/Regressionsanalyse
p.s.2: In ML you can perhaps just try any features and take what gives you the best results, a bit like >Good question, try them [feature selection methods] all and see what works best< in https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/ >If the features are relevant to the outcome, the model will figure out how to use them. Or most models will.< The same is said in Does the feature selection matter for learning algorithm with regularization?
p.s.3: Also probably relevant is the problem of multicollinearity: https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/ though multicollinearity is said to be no issue for the prediction: >Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, precision of the predictions, and the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.
It is perfectly possible and also good practice to include past label columns as features, though it depends on your question: do you want to explain the label only with other features (on purpose), or do you want to consider other and your past label columns to get the next label predicted, as a sort of adding a time-series character to the model without using a time-series?
The sequence in time is not even important, as long as all of such monthly columns are shifted in time consistently by the same time when going over to the predicting set. The model does not care if it is just January and February of the same column type, for the model, every feature is isolated.
Example: You can perfectly run a random forest model on various features, including their past label columns that repeat the same column type again and again, only representing different months. Any month's column can be dealt with as an independent new feature in the ml model, the only importance is to shift all of those monthly columns by the exactly same period to reach a consistent predicting set. In other words, obviously you should avoid replacing January with March column when you go from a training set January-June to a predicting set February-July, instead you must replace January with February of course.
Update 202301: model name is "walk-forward"
This model setup is called "walk-forward", see Why isn’t out-of-time validation more ubiquitous? --> option 3 almost at the bottom of the page.
I got this from a comment at Splitting Time Series Data into Train/Test/Validation Sets.
In the following, it shows only training and testing set. It writes "validation set", but it is known that this gets mixed up all over the place, see What is the Difference Between Test and Validation Datasets?, and it must be meant as the testing set in the default understanding of it.
Thus, with the right wording, it is:
This should be the best model for labels that become features in time.
validation set in a "walk-forward" model?
As you can see in the model, no validation set is needed since the test data must be biased "forward" in time, that is the whole idea of predicting the "step forward in time", and any validation set would have to be in that same biased artificial future - which is already the past at the time of training, but the model does not know this.
The validation happens by default, without a needed dataset split, during the walk-forward, when the model learns again and again to predict the future and the output metrics can be put against each other. As the model is to predict the time-biased future, there is no need to prove that or how the artificial future is biased and sort of "overtrained by time". It is the aim of the model to have the validation in the artificial future and predict the real future as a last step only.
But then, why not still having a validation set on top of this, at least if it is just a small k-fold validation? It could play a role if the testing set has a few strong changes that happen in small time windows but which are still important to be predicted, or at least hinted at, but should also not be overtrained within each training step. The validation set would hit some of these time windows and might show whether the model can handle them well enough. Any other method than k-fold would shrink the power of the model too much. The more you take away from the testing set during training, the less it can predict the future.
Wrap up:
Try it out, and in doubt, leave the validation aside and judge upon the model by checking its metrics over time, during the "walk-forward". This model is not like the others.
Thus, in the end, you can, but you do not have to, split a k-fold validation from the testing set. That would look like:
After predicting a lot of known futures, the very last step in time is then the prediction of the unknown future.
This also answers Does the training+testing set have to be different from the predicting set (so that you need to apply a time-shift to ALL columns)?.

Assign a short text to one of two categories according to previous assignments (votes)

There is a stream of short texts. Each one has the size of a tweet, or let us just assume they are all tweets.
The user can vote on any tweet. So, each tweet has one of the following three states:
relevant (positive vote)
default (neutral i.e. no vote)
irrelevant (negative vote)
Whenever a new set of tweets come, they will be displayed in a specific order. This order is determined by the votes of the user on all previous tweets. The aim is to assign a score to each new tweet. This score is calculated based on the word similarity or match between the text of this tweet and all the previous tweets voted by the user. In other words, the tweet with the highest score is going to be the one which contains the maximum number of words voted previously positive and the minimum of words voted previously as negative. Also, the new tweets having a high score will trigger a notification to the user as they are considered very relevant.
One last thing, a minimum of semantic consideration (natural language processing) would be great.
I have read about Term Frequency–Inverse Document Frequency and come up with this very simple and basic solution:
Reminder: a high weight in tf–idf is reached by a high word frequency and a low total frequency of the word in the whole collection.
If the user votes positive on a Tweet, all the words of this tweet will receive a positive point (same thing for the negative case). This means that we will have a large set of words where each word has the total number of positive points and negative points.
If (Tweet score > 0) then this tweet will trigger a notification.
Tweet score = sum of all individual words’ scores of this tweet
word score = word frequency * inverse total frequency
word frequency in all previous votes = ( total positive votes for this word - total negative votes for this word) / total votes for this word
Inverse total frequency = log ( total votes of all words / total votes for this word)
Is this method enough? I am open to any better methods and any ready API or algorithm.
One possible solution would be to train a classifier such as Naive Bayes on the tweets that a user has voted on. You can take a look at the documentation of scikit-learn, a Python library, which explains how you can easily preprocess your text and train such a classifier.
I would look at Naive Bayes, however I would also look at the K-Nearest Neighbours algorithm when performing a simple classification - this is contained within the Sci-kit Learn library and documented well.
RE: "running SKLearn on GAE is not possible" - you will either need to use the Google Predict API, or, run a VPS which would serve as a worker to process your classification tasks; this would obviously have to live on a different system though.
I would say though, if you are only hoping to perform simple classification on a suitably small dataset, you could actually implement a classifier in JavaScript, like
`http://jsfiddle.net/bkanber/hevFK/light/`
With a JS implementation, the processing time will become unacceptably slow if the dataset is too large, but it's nice to have as an option, even preferable in many cases.
Ultimately, GAE is not the platform I would use when building anything which may require all but the most basic of ML techniques. I would look at Heroku or a VPS in such a place as Digital Ocean, AWS et al.

modeling feature set with text documents

Example:
I have m sets of ~1000 text documents, ~10 are predictive of a binary result, roughly 990 aren't.
I want to train a classifier to take a set of documents and predict the binary result.
Assume for discussion that the documents each map the text to 100 features.
How is this modeled in terms of training examples and features? Do I merge all the text together and map it to a fixed set of features? Do I have 100 features per document * ~1000 documents (100,000 features) and one training example per set of documents? Do I classify each document separately and analyze the resulting set of confidences as they relate to the final binary prediction?
The most common way to handle text documents is with a bag of words model. The class proportions are irrelevant. Each word gets mapped to a unique index. Make the value at that index equal to the number of times that token occurs (there are smarter things to do). The number of features/dimension is then the number of unique tokens/words in your corpus. There are manny issues with this, and some of them are discussed here. But it works well enough for many things.
I would want to approach it as a two stage problem.
Stage 1: predict the relevancy of a document from the set of 1000. For best combination with stage 2, use something probabilistic (logistic regression is a good start).
Stage 2: Define features on the output of stage 1 to determine the answer to the ultimate question. These could be things like the counts of words for the n most relevant docs from stage 1, the probability of the most probable document, the 99th percentile of those probabilities, variances in probabilities, etc. Whatever you think will get you the correct answer (experiment!)
The reason for this is as follows: concatenating documents together will drown you in irrelevant information. You'll spend ages trying to figure out which words/features allow actual separation between the classes.
On the other hand, if you concatenate feature vectors together, you'll run into an exchangeability problem. By that I mean, word 1 in document 1 will be in position 1, word 1 in document 2 will be in position 1001, in document 3 it will be in position 2001, etc. and there will be no way to know that the features are all related. Furthermore, an alternate presentation of the order of the documents would lead to the positions in the feature vector changing its order, and your learning algorithm won't be smart to this. Equally valid presentations of the document orders will lead to completely different results in an entirely non-deterministic and unsatisfying way (unless you spend a long time designing a custom classifier that's not afficted with this problem, which might ultimately be necessary but it's not the thing I'd start with).

Resources