So I have a classification problem in which I have to classify crimes into categories based on different features (the SF Crime competition on Kaggle, for those who are familiar). An interesting aspect of this data set is that there are two extra features, "Descript" and "Resolution", both containing short text, which are present ONLY in the training set and not in the test set. They have short pieces of text as values, such as "UNDER INFLUENCE OF ALCOHOL IN A PUBLIC PLACE", "VIOLATION OF STAY AWAY ORDER", etc.
My question is: how can I use these fields even though they appear only in the training set? Currently I am discarding them, but I would like to extract some information from them.
Related
I am working on an NLP project where I need to predict the correct class of short sentences, which are the instances in my case. I am using root words as features. My dataset is not too large (about 6000 instances/sentences). Since there are too many features, I used an MI-based feature-selection method to reduce them to about 1000.
My problem is: if I split the dataset and then do feature selection on the training set only, then the model/classifier is built on the features available in the training set only, most of which (i.e. the features in the trained model) are absent from the testing set. As a result the model may perform very badly.
What should I do to resolve this issue?
I am currently selecting features first and then doing CV. I know that this approach may cause data leakage from test set to train set. But I'm still doing that because of the aforementioned issue.
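For concreteness, here is roughly what I mean, sketched with scikit-learn (X and y stand for my root-word feature matrix and class labels, which are not shown here, and the LogisticRegression is just a stand-in classifier):

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# X, y: assumed root-word feature matrix and labels (not shown).

# What I do now (leakage-prone): the MI-based selector sees ALL the data
# before cross-validation.
# selected = SelectKBest(mutual_info_classif, k=1000).fit_transform(X, y)
# scores = cross_val_score(LogisticRegression(max_iter=1000), selected, y, cv=5)

# Alternative: put the selector inside a Pipeline, so it is re-fit on each
# CV training fold and the held-out fold never influences which features are kept.
pipe = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=1000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
```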
I know the general rule that we should test a trained classifier only on the testing set.
But now comes the question: When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set? Or do I have to apply it to a new predicting set that is different from the training+testing set?
And what if I predict a label column of a time series? (Edited later: I do not mean a classical time-series analysis here, but just a broad selection of columns from a typical database, weekly, monthly or randomly stored data that I convert into separate feature columns, one for each week / month / year ...) Do I have to shift all of the features (not just the past columns of the time-series label column, but also all other normal features) of the training+testing set back to a point in time where the data has no "knowledge" overlap with the predicting set?
I would then train and test the classifier on features shifted into the past by n months, scoring against a label column that is unshifted and most recent, and then predict from the most recent, unshifted features. The shifted and unshifted features have the same number of columns; I align them by assigning the column names of the shifted features to the unshifted features.
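To make the shifting and alignment concrete, here is a rough pandas sketch of what I have in mind (the frame `df`, the label column "y" and the lag of 3 months are just made-up illustrations):

```python
import pandas as pd

# Made-up illustration: df is a monthly-indexed DataFrame with a label
# column "y"; every other column is a feature.
n = 3  # shift the features n months into the past

features = df.drop(columns=["y"])
shifted = features.shift(n).add_suffix(f"_lag{n}")   # past features
train_test = shifted.join(df["y"]).dropna()          # past X against the unshifted, most recent y

# ... train and test a classifier on train_test here ...

# For the actual prediction, take the most recent UNshifted features and
# give them the column names the model was fitted on.
predict_X = features.tail(1).copy()
predict_X.columns = shifted.columns
```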
p.s.:
p.s.1: The general approach on https://en.wikipedia.org/wiki/Dependent_and_independent_variables
In data mining tools (for multivariate statistics and machine learning), the dependent variable is assigned a role as target variable (or in some tools as label attribute), while an independent variable may be assigned a role as regular variable.[8] Known values for the target variable are provided for the training data set and test data set, but should be predicted for other data.
p.s.2: In this basic tutorial we can see that the predicting set is made different: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
We select the training set with the [:-1] Python syntax, which produces a new array that contains all but the last item from digits.data: […] Now you can predict new values. In this case, you'll predict using the last image from digits.data [-1:]. By predicting, you'll determine the image from the training set that best matches the last image.
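In code, the split quoted from that tutorial looks roughly like this (the classifier settings are the ones used in that tutorial):

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.)

# Train on everything except the last image ...
clf.fit(digits.data[:-1], digits.target[:-1])

# ... then predict on the last image only, which plays the role of new,
# unseen data (the "predicting set").
clf.predict(digits.data[-1:])
```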
I think you are mixing up some concepts, so I will try to give a general explanation for Supervised Learning.
The training set is what your algorithm LEARNS on. You split it in X (features) and Y (target variable).
The test set is a set that you use to SCORE your model, and it must contain data that was not in the training set. This means that a test set also has X and Y (meaning that you know the value of the target). What happens is that you PREDICT f(X) (your estimate of Y) based on X, compare it with the Y you actually have, and see how good your predictions are.
A prediction set is simply new data! This means that usually you DO NOT have a target, since the whole point of supervised learning is predicting it. You will only have your X (features) and you will predict f(X) (your estimate of the target Y) and use it for whatever you need.
So, in the end a test set is simply a prediction set for which you have a target to compare your estimation to.
For time series, it is a bit more complicated, because often the features (X) are transformations on past data of the target variable (Y). For example, if you want to predict today's SP500 price, you might want to use the average of the last 30 days as a feature. This means that for every new day, you need to recompute this feature over the past days.
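For instance, such a feature could be built along these lines (a minimal pandas sketch with toy data; the series name and dates are made up):

```python
import pandas as pd
import numpy as np

# Toy stand-in for a daily closing-price series, indexed by date.
prices = pd.Series(np.random.default_rng(0).normal(4000, 50, 100),
                   index=pd.date_range("2023-01-01", periods=100, freq="D"))

# 30-day rolling average, shifted by one day so the feature for day t only
# uses data up to day t-1 and never includes the value being predicted.
mean_30d = prices.rolling(window=30).mean().shift(1)
```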
In general though, I would suggest starting with NON time series data if you're new to ML, as Time Series is much harder in terms of feature engineering and data management and it is easy to make mistakes.
The question above, "When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set?", has the simple answer: No.
The question above, "Do I have to shift all of the features?", has the simple answer: Yes.
In short, if I predict a month's class column, I have to shift all of the non-class columns back in time as well, in addition to the previous class months that I converted to features; all data must have been known before the month in which the class is predicted.
This also means: the predicting set has to be different from the dataset that contains the testing set. If you kept a testing set there, the training set would lose valuable up-to-date data from the latest month(s) available! The term "predicting set" here means the most current input, used without a testing set, to get the most current prediction results.
This is confirmed by the following overview offered by this user, who seems to have made the image; it uses days instead of months, but the idea is the same:
Source: Answer on "Cross Validated" - Splitting Time Series Data into Train/Test/Validation Sets; the whole Q/A is recommended (!).
See the last line of the image and the valuable comments of that answer on "Cross Validated" to understand this.
Edit 230106:
The image shows that the last step is a training on the whole dataset; this is the "predicting set", which is the newest and does not have a testing set.
On that image, there is one "mistake" which shows that this seemingly easy question of taking former labels as features for upcoming labels seems hard to understand. I myself did not see it at first and posted the image without this remark: the "T&V" lies in the past of the "Test". That would be a wrong validation for a model that is supposed to predict the future; the V must be in the "future" test block (unless you have a dataset that does not change dynamically over time, as in physics).
You would have to change it to a "walk-forward" model, with the validation set - if at all - split k-fold from the testing set, not from the training set. That would look like this:
See also:
Can / should I use past (e.g. monthly) label columns from a database as features in an ML prediction (no time-series!)? with the "walk-forward" main image,
Splitting Time Series Data into Train/Test/Validation Sets with more insight into this and the comment that brought up the model name "walk-forward".
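As a rough code analogue of that walk-forward idea, scikit-learn's TimeSeriesSplit always validates on a block that comes after its training block (sketch only; X, y and model are placeholders for time-ordered features, labels and some classifier):

```python
from sklearn.model_selection import TimeSeriesSplit

# X, y: assumed time-ordered feature matrix and labels; model: any classifier.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Every fold trains on an initial block and scores on the block that
    # immediately FOLLOWS it in time, never on earlier data.
    model.fit(X[train_idx], y[train_idx])
    print(model.score(X[test_idx], y[test_idx]))
```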
I have implemented a recommender system based upon matrix factorization techniques. I want to evaluate it.
I want to use 10-fold cross-validation with the All-but-one protocol (https://ai2-s2-pdfs.s3.amazonaws.com/0fcc/45600283abca12ea2f422e3fb2575f4c7fc0.pdf).
My data set has the following structure:
user_id,item_id,rating
1,1,2
1,2,5
1,3,0
2,1,5
...
It's confusing for me to think about how the data is going to be split, because I can't put some triples (user, item, rating) in the testing set. For example, if I select the triple (2,1,5) for the testing set and this is the only rating user 2 has made, there won't be any other information about this user and the trained model won't be able to predict any values for him.
Considering this scenario, how should I do the splitting?
You didn't specify a language or toolset so I cannot give you a concise answer that is 100% applicable to you, but here's the approach I took to solve this same exact problem.
I'm working on a recommender system using Treasure Data (i.e. Presto) and implicit observations, and ran into a problem where some users and items were not present in my matrix. I had to rewrite the algorithm to split the observations into train and test so that every user and every item would be represented in the training data. For the description of my algorithm I assume there are more users than items; if that is not true for you, just swap the two. Here's my algorithm:
1. Select one observation for each user.
2. For each item that has only one observation and has not already been selected in the previous step, select that observation.
3. Merge the results of the previous two steps together. This should produce a set of observations that covers all of the users and all of the items.
4. Calculate how many observations you need to fill your training set (generally 80% of the total number of observations).
5. Calculate how many observations are in the merged set from step 3.
6. The difference between steps 4 and 5 is the number of remaining observations needed to fill the training set; randomly select that many of the remaining observations.
7. Merge the sets from steps 3 and 6: this is your training set.
8. The remaining observations are your testing set.
As I mentioned, I'm doing this using Treasure Data and Presto, so the only tools I have at my disposal are SQL, common table expressions, temporary tables, and Treasure Data workflow.
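If you're in Python rather than SQL, a rough pandas sketch of the same steps might look like this (column names follow the user_id, item_id, rating layout from the question; the 80% is the usual figure from step 4):

```python
import pandas as pd

def coverage_split(ratings, train_frac=0.8, seed=42):
    """Train/test split that keeps every user (and every single-observation
    item) in the training data. `ratings` has user_id, item_id, rating columns."""
    # Step 1: one observation per user.
    per_user = ratings.groupby("user_id").sample(n=1, random_state=seed)

    # Step 2: items with a single observation that step 1 didn't pick up.
    counts = ratings["item_id"].value_counts()
    single_items = counts[counts == 1].index
    singles = ratings[ratings["item_id"].isin(single_items)
                      & ~ratings.index.isin(per_user.index)]

    # Step 3: merge the two selections.
    seed_set = pd.concat([per_user, singles])

    # Steps 4-6: work out how many more observations are needed to reach
    # train_frac of the data, and select them at random from the rest.
    n_needed = int(train_frac * len(ratings)) - len(seed_set)
    remaining = ratings.drop(seed_set.index)
    filler = remaining.sample(n=max(n_needed, 0), random_state=seed)

    # Steps 7-8: the training set; everything left over is the testing set.
    train = pd.concat([seed_set, filler])
    test = ratings.drop(train.index)
    return train, test
```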
You're quite correct in your basic logic: if you have only one observation in a class, you must include that in the training set for the model to have any validity in that class.
However, dividing the input into these classes depends on the interactions among various observations. Can you identify classes of data, such as the "only rating" issue you mentioned? As you find other small classes, you'll also need to ensure that you have enough of those observations in your training data.
Unfortunately, this is a process that's tricky to automate. Most one-time applications simply have to hand-pick those observations from the data and then distribute the others per the normal splits. This does have the problem that the special cases are over-represented in the training set, which can detract somewhat from the normal cases when training the model.
Do you have the capability of tuning the model as you encounter later data? This is generally the best way to handle sparse classes of input.
Collaborative filtering (matrix factorization) can't produce a good recommendation for an unseen user with no feedback. Nevertheless, an evaluation should consider this case and take it into account.
One thing you can do is report performance separately for all test users, for just the test users with some feedback, and for just the unseen users with no feedback.
So I'd say keep the train/test split random but evaluate separately for unseen users.
More info here.
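For instance, the separate reporting could be sketched like this (pandas; `train`, `test` and the `rmse()` helper are placeholders for whatever you already have):

```python
# Placeholders: train/test are DataFrames with a user_id column,
# and rmse(df) is whatever error function you already compute.
seen = test["user_id"].isin(set(train["user_id"]))

print("RMSE, all test users:       ", rmse(test))
print("RMSE, users with feedback:  ", rmse(test[seen]))
print("RMSE, unseen users (if any):", rmse(test[~seen]))
```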
I'm currently working with the CHILDES corpus, trying to create a classifier that distinguishes children who suffer from specific language impairment (SLI) from those who are typically developing (TD).
In my reading I noticed that there really isn't a convincing set of features discovered yet to distinguish the two, so I came upon the crazy idea of trying to create a feature-learning algorithm that could potentially produce better ones.
Is this possible? If so, how do you suggest I approach it? From the reading I have done, most feature learning is done on image processing. Another problem is that my dataset is potentially too small to make it work (in the hundreds) unless I find a way to get more transcripts from children.
Create a dataset consisting of the children's text with three labels:
1- Normal
2- SLI
3- TD
So you'll have 3 labels.
You put aside 40% of your dataset: 20% for development and 20% for test.
Then you run a LogisticRegression classifier (e.g. using scikit-learn) with bag-of-character-n-grams features. You can easily do this with TfidfVectorizer in scikit-learn.
Then you train the model on the 60% training set and tune the hyper-parameters (e.g. the regularization strength) by choosing the model that performs best on the development set.
Then, you train again using the chosen hyper-parameters and you get the top important features as in this example.
For each class, it gives you the weights of the features associated with that label, so you'll have your top linguistic symptoms for each of the two conditions.
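A minimal sketch of that setup (scikit-learn; `texts` and `labels` are placeholders for the transcripts and their classes, and the n-gram range and C value are just example settings to tune):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholders: texts = list of transcript strings, labels = their classes.
vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))  # bag of character n-grams
X = vec.fit_transform(texts)

clf = LogisticRegression(C=1.0, max_iter=1000)  # C = regularization strength to tune
clf.fit(X, labels)

# Top-weighted character n-grams per class, similar to the linked example
# (with 3 labels, coef_ has one row per class).
names = np.array(vec.get_feature_names_out())
for i, label in enumerate(clf.classes_):
    top = np.argsort(clf.coef_[i])[-10:]
    print(label, names[top])
```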
I'm using a CF algorithm (SVD) on a real-world data set, and I've run into a data sparsity problem: the sparsity of the user/item rating matrix is around 0.01%. I split the data into train/test sets 80/20 and find that only a few of the users and items in the testing set also appear in the training set, so I can only use a few ratings from the testing set to calculate RMSE. Could you give me some advice on how to fix this?
In the case of recommender systems, one usually splits each user's history into train and test. In more detail:
For each user, write out the items he interacted with.
Preferably, order them by (increasing) time to avoid the "time-traveling" issue (a user can revisit already-known items, so you don't want to test on the early part of the data).
As usual, use the first (1-k) fraction of each user's history as the train set and the rest as the test set.
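A rough pandas version of that per-user, time-ordered split (assuming the ratings frame also has a timestamp column; k is the held-out fraction):

```python
import pandas as pd

# Assumed: ratings has user_id, item_id, rating and timestamp columns.
k = 0.2  # fraction of each user's most recent interactions held out for testing

ratings = ratings.sort_values(["user_id", "timestamp"])

def split_user(history):
    n_test = int(len(history) * k)
    if n_test == 0:                       # users with very short histories stay in train
        return history, history.iloc[0:0]
    return history.iloc[:-n_test], history.iloc[-n_test:]

parts = [split_user(h) for _, h in ratings.groupby("user_id")]
train = pd.concat([tr for tr, _ in parts])
test = pd.concat([te for _, te in parts])
```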