I made a classifier to classify search queries into one of the following classes: {Artist, Actor, Politician, Athlete, Facility, Geo, Definition, QA}. I have two CSV files: one for training the classifier (300 queries) and one for testing it (currently about 200 queries). When I use the training set and test set to train and evaluate the classifier in the Weka KnowledgeFlow, most classes reach a pretty good accuracy.
After training, I saved the MultilayerPerceptron classifier from the KnowledgeFlow to classifier.model and used it in Java code to classify queries.
When I deserialize this model in the Java code and use it to classify all the queries of the test-set CSV file (using the distributionForInstance() method on the deserialized classifier), it classifies all 'Geo' queries as 'Facility' queries and all 'QA' queries as 'Definition' queries. This surprised me, because the ClassifierPerformanceEvaluator in the KnowledgeFlow showed a confusion matrix in which 'Geo' and 'QA' queries scored really well, and the test queries are the same (the same CSV file was used). All other classifications made with distributionForInstance() behave as expected from that confusion matrix. Does anyone know possible causes for the difference between the distributionForInstance() results in the Java code and the KnowledgeFlow evaluation results?
One thing that I can think of is the following:
The test CSV file contains, among other attributes, a lot of nominal attributes whose values are in all-capital casing. When I print the attribute values of the instances before classification in the Java code, these values appear to have been converted to lowercase (the DataSource.getDataSet() method seems to behave like this). Could the casing of these values be the reason that some instances of my test CSV file get classified differently? I read in the Weka documentation that nominal attribute values are case sensitive. I can't simply change these values back to uppercase in the Java code though, as Weka then throws an exception saying the values are not predefined for the nominal attribute.
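For reference, a minimal sketch of the Java side (the file name, the class-attribute position, and loading via DataSource are assumptions based on the description above):

```java
import java.util.Arrays;

import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class QueryClassification {
    public static void main(String[] args) throws Exception {
        // Deserialize the MultilayerPerceptron saved from the KnowledgeFlow
        Classifier model = (Classifier) SerializationHelper.read("classifier.model");

        // Load the test queries from the same CSV file used in the KnowledgeFlow
        DataSource source = new DataSource("testset.csv");
        Instances test = source.getDataSet();
        test.setClassIndex(test.numAttributes() - 1); // assumes the class is the last attribute

        for (int i = 0; i < test.numInstances(); i++) {
            double[] dist = model.distributionForInstance(test.instance(i));
            int predicted = (int) model.classifyInstance(test.instance(i));
            System.out.println(test.instance(i)
                    + " -> " + test.classAttribute().value(predicted)
                    + " " + Arrays.toString(dist));
        }
    }
}
```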
Weka is likely using the same class in the KnowledgeFlow as in your Java code to interpret the CSV. This is why it works (produces data sets -- Instances objects -- that match) without tweaking and fails when you change things: the items no longer match. In other words, Weka is handling the case of the input strings consistently and does not require you to change it.
Check that you are looking at the Error on Test Data value and not the Error on Training Data value in the KnowledgeFlow output, because the error on the training data will be artificially low, given that you built the model using those exact examples. It is possible that your classifier is performing the same in both places, but you are looking at different statistics.
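If you want to double-check outside the KnowledgeFlow, you can reproduce the test-data evaluation programmatically and compare its confusion matrix with what the ClassifierPerformanceEvaluator reports (a sketch; the CSV file names are assumptions):

```java
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateSavedModel {
    public static void main(String[] args) throws Exception {
        Classifier model = (Classifier) SerializationHelper.read("classifier.model");

        Instances train = DataSource.read("trainingset.csv");
        Instances test  = DataSource.read("testset.csv");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Evaluate on the *test* data, mirroring the "Error on Test Data" output
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString("Confusion matrix on test data"));
    }
}
```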
I'm working on an ML prediction model and I have a dataset with a categorical variable (say, product ID) with 2k distinct products.
If I convert this variable to dummy variables with a one-hot encoder, the dataset grows to roughly 2k columns times the number of examples (millions of examples), which is too much to process.
How is this usually handled?
Should I just use the variable as-is, without the conversion?
Thanks.
High cardinality of categorical features is a well-known problem, and "the best" encoding typically depends on the prediction task and requires a trial-and-error approach. Whether you can even find a strategy that is clearly better than the others is case-dependent.
Addressing your first question, a good collection of different encoding strategies is provided by the category_encoders library:
A set of scikit-learn-style transformers for encoding categorical variables into numeric
They follow the scikit-learn API for transformers, and a simple example is provided as well. Again, which one will give the best results depends on your dataset and the prediction task. I suggest incorporating them in a pipeline and testing some or all of them.
In regard to your second question, you would then continue to use the encoded features for your predictions and analysis.
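To make the idea concrete, here is a minimal sketch of one such strategy, frequency encoding, where each category is replaced by its relative frequency in the training data. (The sketch is in plain Java and the product IDs are made up; category_encoders provides this and many other encoders as scikit-learn transformers.)

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Frequency encoding: replace each category by how often it occurs in the
// training data, so 2k distinct products become a single numeric column.
public class FrequencyEncoder {
    private final Map<String, Double> frequencies = new HashMap<>();

    public void fit(List<String> trainingValues) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : trainingValues) {
            counts.merge(v, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) trainingValues.size());
        }
    }

    public double transform(String value) {
        // Unseen categories get frequency 0.0
        return frequencies.getOrDefault(value, 0.0);
    }

    public static void main(String[] args) {
        FrequencyEncoder enc = new FrequencyEncoder();
        enc.fit(List.of("P123", "P123", "P456", "P789")); // hypothetical product IDs
        System.out.println(enc.transform("P123")); // 0.5
        System.out.println(enc.transform("P999")); // 0.0 (unseen)
    }
}
```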
What's your approach to solving a machine learning problem with multiple datasets that have different parameters, columns, and lengths/widths? Only one of them has a dependent variable; the rest of the files contain supporting data.
Your question is quite generic, and some of the concerns are not really relevant: the number of columns and the size of the datasets are not in themselves a problem when building an ML model. Given that only one of the datasets has a dependent variable, you will need to merge the datasets based on keys that are common across them. Typically, the process followed before modelling is:
Step 0: Identify the dependent variable and decide whether to do regression or classification (assuming you are predicting its value).
Step 1: Clean up the provided data by handling duplicates and spelling mistakes.
Step 2: Scan through the categorical variables to handle any discrepancies.
Step 3: Merge the datasets into a single dataset that contains all the independent variables and the dependent variable to be predicted (see the sketch after this list).
Step 4: Do exploratory data analysis to understand how the dependent variable behaves with respect to the independent variables.
Step 5: Create a model and refine it based on VIF (variance inflation factor) and p-values.
Step 6: Iterate, removing variables until you get a model that contains only significant variables and has a stable R^2 value. Finalize the model.
Step 7: Apply the trained model to the test dataset and compare the predicted values against the actual values of the dependent variable in the test set.
Following these steps at a high level will help you build models.
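As a rough illustration of the merge in step 3, which is essentially an inner join on a shared key column, here is a hedged plain-Java sketch (the key name, record layout, and values are all hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Join the main dataset (which holds the dependent variable) with a
// supporting dataset on a shared key ("customer_id" here is made up).
public class DatasetMerge {
    public static void main(String[] args) {
        // main dataset: customer_id -> {feature_a, target}
        Map<String, String[]> mainData = Map.of(
                "c1", new String[] {"12.5", "yes"},
                "c2", new String[] {"7.1", "no"});

        // supporting dataset: customer_id -> {feature_b}
        Map<String, String[]> supportData = Map.of(
                "c1", new String[] {"north"},
                "c2", new String[] {"south"});

        // Inner join: keep only keys present in both datasets
        List<String[]> merged = new ArrayList<>();
        for (Map.Entry<String, String[]> row : mainData.entrySet()) {
            String[] extra = supportData.get(row.getKey());
            if (extra != null) {
                merged.add(new String[] {
                        row.getKey(), row.getValue()[0], extra[0], row.getValue()[1]});
            }
        }
        merged.forEach(r -> System.out.println(String.join(",", r)));
    }
}
```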
I'm trying to use H2O's Random Forest for a multinomial classification into 71 classes with 38,000 training-set examples. I have one feature that is a string and that in many cases is predictive, so I want to use it as a categorical feature.
The hitch is that even after canonicalizing the strings (uppercasing, stripping out numbers, punctuation, etc.), I still have 7,000 different strings (some due to spelling or OCR errors, etc.). I have code to remove strings that are relatively rare, but I'm not sure what a reasonable cutoff value is. (I can't seem to find any help in the documentation.)
I'm also not sure what to do with the nbins_cats hyperparameter. Should I make it equal to the number of distinct categorical values I have? [added: the default for nbins_cats is 1024 and I'm well below that at around 300 distinct categorical values, so I guess I don't have to do anything with this parameter]
I'm also thinking that if a categorical value is associated with too many of the classes I'm trying to predict, maybe I should drop it as well.
I'm also guessing I need to increase the tree depth to handle this better.
Also, is there a special value to indicate "don't know" for the strings that I am filtering out? (I'm mapping it to a unique string but I'm wondering if there is a better value that indicates to H2O that the categorical value is unknown.)
Many thanks in advance.
High-cardinality categorical predictors can sometimes hurt model performance; specifically, in the case of tree-based models, the tree ensemble (GBM or Random Forest) ends up memorizing the training data, and the model then generalizes poorly to validation data.
A good indication of whether this is happening is if your string/categorical column has very high variable importance. This means that the trees are continuing to split on this column to memorize the training data. Another indication is if you see much smaller error on your training data than on your validation data. This means the trees are overfitting to the training data.
Some methods for handling high cardinality predictors are:
removing the predictor from the model
performing categorical encoding [pdf]
performing grid search on nbins_cats and categorical_encoding
There is a Python example in the H2O tutorials GitHub repo that showcases the effects of removing the predictor from the model and of performing a grid search.
We are running a multinomial Naive Bayes classification algorithm on a set of 15k tweets. We first break each tweet up into a vector of word features using Weka's StringToWordVector filter. We then save the results to a new ARFF file to use as our training set. We repeat this process with another set of 5k tweets and evaluate that test set using the model derived from our training set.
What we would like to do is output each sentence that Weka classified in the test set along with its classification. We can see general information about the performance and accuracy of the algorithm (precision, recall, F-score), but we cannot see the individual sentences that were classified by our classifier. Is there any way to do this?
Another problem is that ultimately our professor will give us 20k more tweets and expect us to classify this new set. We are not sure how to do this, however, as all of the data we have been working with so far has been classified manually (both the training and test sets), whereas the data we will be getting from the professor will be unclassified. How can we evaluate our model on the unclassified data if Weka requires that the attribute information be the same as in the set used to build the model and the test set we are evaluating against?
Thanks for any help!
The easiest way to accomplish these tasks is using a FilteredClassifier. This kind of classifier integrates a Filter and a Classifier, so you can connect a StringToWordVector filter with the classifier you prefer (J48, NaiveBayes, whatever). You always keep the original training set (unprocessed text) and apply the classifier to new tweets (also unprocessed) using the vocabulary derived by the StringToWordVector filter.
You can see how to do this in the command line in "Command Line Functions for Text Mining in WEKA" and via a program in "A Simple Text Classifier in Java with WEKA".
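A rough sketch of the programmatic approach (the ARFF file names are placeholders, and it assumes the tweet text is a string attribute at index 0 followed by the class attribute, with the class set to '?' for the new, unlabeled tweets):

```java
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TweetClassifier {
    public static void main(String[] args) throws Exception {
        // Training data: raw tweet text plus a manually assigned class attribute
        Instances train = DataSource.read("tweets_train.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // FilteredClassifier applies StringToWordVector before the classifier,
        // so new tweets can stay as unprocessed text
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new NaiveBayesMultinomial());
        fc.buildClassifier(train);

        // New tweets: same attribute structure, class value left as '?' (missing)
        Instances unlabeled = DataSource.read("tweets_unlabeled.arff");
        unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

        // Print each tweet together with its predicted class
        for (int i = 0; i < unlabeled.numInstances(); i++) {
            double pred = fc.classifyInstance(unlabeled.instance(i));
            System.out.println(unlabeled.instance(i).stringValue(0)
                    + " -> " + unlabeled.classAttribute().value((int) pred));
        }
    }
}
```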
I have a dataset of a particular domain (say sports - 1 class). What I want is that when I feed a web page to the classifier/clusterer, I get a result telling me whether that instance (web page) is related to sports or not.
Most of the classifiers in Weka are not capable of dealing with unary-class (one-class) datasets, except the LibSVM wrapper. I did some tests with LibSVM, but the problem is that during tests on an unrelated dataset I get all of them correctly classified, even if the instances are empty! Any suggestions?
What if I use the cosine similarity measure here?
Have you seen the thread "unary class text classification in weka" and this post: https://list.scms.waikato.ac.nz/mailman/htdig/wekalist/2007-October/011631.html ?
I'm assuming you meant that when you run the classifier against another dataset that is not "sports", the results are incorrectly classified (i.e. false positives, e.g. "this is sports").
Are you certain your dataset only contains one class? Did you make sure the dataset does not contain any empty instances? (don't mock, this has happened to me before).
In the comments of the previously mentioned thread there is a link to a PDF on tuning SVMs: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf - I would say SVMs are a bit harder to tune than other common classifiers.
As an alternative, can't you switch the problem to binary classification? It's much easier to get good results, and for most problems there are plenty of examples of things that are not in the target class, e.g. sports websites vs. funny-image websites, programming websites, etc.
PS: you can use other algorithms for outlier detection: http://en.wikipedia.org/wiki/Outlier_detection