What machine learning algorithms can be used in this scenario? - machine-learning

My data consists of objects as follows.
Obj1 - Color - shape - size - price - ranking
I want to be able to predict which combination of color/shape/size/price is a good combination for getting a high ranking. Even a partial combination would be useful, e.g. the algorithm predicts that, to get a good ranking, this color with this shape performs best. Something like that.
What are the advisable algorithms for such a prediction?
Also, if you could briefly explain how I should approach the model building, I would really appreciate it. Say, for example, my data looks like:
Blue pentagon small $50.00 #5
Red Square large $30.00 #3
So what prediction model should I look at? Which algorithm should I try if, say, the highest weight is on price, followed by color and then size? And what if I wanted to predict for combinations, e.g. that a red small shape is less likely to rank highly than a pink small shape? (In essence, combining more than one nominal column to make the prediction.)

Sounds like you want to learn models that you can interpret as a human. Depending on what type your ranking variable is, a number of different learners are possible.
If ranking is categorical (e.g. stars), a classifier is probably best. There are many in Weka. Some that produce models that are understandable by humans are the J48 decision tree learner and the OneR rule learner.
If the ranking is continuous (e.g. a score), regression might be more appropriate. Suitable algorithms are for example SimpleLogistic and LinearRegression.
Alternatively, you could try clustering your examples with any of the algorithms in Weka and then analyzing the clusters. That is, ideally examples in a cluster would all be of the same (or very similar) ranking and you can have a look at the range of values of the other attributes and draw your own conclusions.
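If you would rather experiment outside Weka, here is a minimal sketch of the same idea (a small, human-readable decision tree) using scikit-learn; the column names and the two rows are just the illustrative data from the question, not a real dataset.

    # A minimal sketch, assuming a categorical ranking; scikit-learn's
    # DecisionTreeClassifier plays the role of Weka's J48 here.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    df = pd.DataFrame({
        "color":   ["Blue", "Red"],
        "shape":   ["pentagon", "square"],
        "size":    ["small", "large"],
        "price":   [50.0, 30.0],
        "ranking": ["#5", "#3"],
    })

    # One-hot encode the nominal attributes; keep price numeric.
    X = pd.get_dummies(df[["color", "shape", "size", "price"]])
    y = df["ranking"]

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))  # human-readable rules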

Treat the combination as a linear equation and apply a Monte Carlo style algorithm (such as a Genetic Algorithm) to tune the parameters of the equation:
Encode the color/shape/size/price/ranking values as numbers.
Treat the combination as a linear equation, say a*color + b*shape + c*size + d*price = ranking.
Apply a Genetic Algorithm to tune a/b/c/d so that the calculated rankings are as close to the ground truth as possible.
Once you have the equation, you can use it to:
1) find the maximal ranking by simple linear programming;
2) predict rankings by simply plugging in the other parameters.
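A minimal sketch of this idea, using SciPy's differential_evolution as the evolutionary optimizer instead of a hand-rolled Genetic Algorithm; the integer codes for color/shape/size and the two rows are made up for illustration.

    import numpy as np
    from scipy.optimize import differential_evolution

    # Toy encoded data: columns are color, shape, size, price.
    X = np.array([[0, 0, 0, 50.0],
                  [1, 1, 1, 30.0]])
    y = np.array([5.0, 3.0])  # rankings

    def loss(params):
        # Squared error between a*color + b*shape + c*size + d*price and the rankings.
        a, b, c, d = params
        pred = a * X[:, 0] + b * X[:, 1] + c * X[:, 2] + d * X[:, 3]
        return np.mean((pred - y) ** 2)

    result = differential_evolution(loss, bounds=[(-10, 10)] * 4, seed=0)
    print(result.x)  # fitted a, b, c, d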

Related

Feature selection - how to go about it when you have way too many features?

Let's assume you have 1,400 columns/data points for 200k entries and your goal is to determine which of these columns show the most signal towards a simple classification task.
I've already removed columns above a threshold of null values, with low variance, with bad levels, or with too many levels for categoricals, and I still have 900+ columns.
I can use lasso if I only include the 500+ numerical columns, but if I try to include the categorical ones as well it keeps crashing; it's too much data to process.
How would you go about further reducing features in that case? My goal, more than the classification itself, is to identify the features that bring in the most information towards the classification task.
You could use a data-driven approach; for example, the simplest one would be to use L1 regularisation on a logistic regression (with your simple classification task) and, looking at the weights, select the ones that are not zero or close to zero.
Basically, the L1 norm on the model weights enforces sparsity of the weight vector, and in doing so the only surviving weights are the ones corresponding to the "important" features.
In any case, be careful to normalise the data before using this technique, and also be careful about categorical versus scalar features.
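Here is a minimal sketch of that L1 approach, with synthetic data standing in for the real table (the column names are placeholders):

    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Toy stand-in for the real table: 1,000 rows, 50 numeric columns.
    X_arr, y = make_classification(n_samples=1000, n_features=50,
                                   n_informative=5, random_state=0)
    X = pd.DataFrame(X_arr, columns=[f"col_{i}" for i in range(50)])

    model = make_pipeline(
        StandardScaler(),  # normalise before applying L1, as noted above
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
    ).fit(X, y)

    coefs = model.named_steps["logisticregression"].coef_.ravel()
    print(X.columns[abs(coefs) > 1e-6])  # the surviving "important" features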
You could also use a neural network and then compute the gradient w.r.t. the input to see what influences the decision the most.
Or some other technique like: https://link.springer.com/chapter/10.1007/978-3-030-33778-0_24
Alternatively you can also use a Random Forest model and do feature importance like: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
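A minimal sketch of the Random Forest route, on the same kind of synthetic stand-in data (column names are placeholders):

    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X_arr, y = make_classification(n_samples=1000, n_features=50,
                                   n_informative=5, random_state=0)
    X = pd.DataFrame(X_arr, columns=[f"col_{i}" for i in range(50)])

    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False).head(10))  # most informative columns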

Is there a way to find the most representative set of samples of the entire dataset?

I'm working on text classification and I have a set of 200,000 tweets.
The idea is to manually label a short set of tweets and train classifiers to predict the labels of the rest. Supervised learning.
What I would like to know is whether there is a method for choosing which samples to include in the training set so that the training set is a good representation of the whole dataset and, because of the high diversity included in the training set, the trained classifiers can be applied to the rest of the tweets with reasonable confidence.
This sounds like a stratification question - do you have pre-existing labels or do you plan to design the labels based on the sample you're constructing?
If it's the first scenario, I think the steps in order of importance would be:
Stratify by target class proportions (so if you have three classes, and they are 50-30-20%, train/dev/test should follow the same proportions)
Stratify by features you plan to use
Stratify by tweet length/vocabulary etc.
If it's the second scenario, and you don't have labels yet, you may want to look into using n-grams as a feature, coupled with a dimensionality reduction or clustering approach. For example:
Use something like PCA or t-SNE to maximize distance between tweets (or a large subset), then pick candidates from different regions of the projected space
Cluster them based on lexical items (unigrams or bigrams, possibly using log frequencies or TF-IDF and stop word filtering, if content words are what you're looking for) - then you can cut the tree at a height that gives you n bins, which you can then use as a source for samples (stratify by branch)
Use something like LDA to find n topics, then sample stratified by topic
Hope this helps!
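A minimal sketch of the clustering option above: TF-IDF the tweets, cluster with k-means, then pick one candidate per cluster. The tweet list and the number of clusters are placeholders.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    tweets = ["example tweet one", "another example tweet", "something else entirely",
              "more text here", "yet another tweet", "totally different topic"]

    X = TfidfVectorizer().fit_transform(tweets)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    rng = np.random.default_rng(0)
    sample_idx = [int(rng.choice(np.where(labels == k)[0]))
                  for k in range(3) if np.any(labels == k)]
    print([tweets[i] for i in sample_idx])  # one labelling candidate per cluster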
It seems that before you know anything about the classes you are going to label, a simple uniform random sample will do almost as well as any stratified sample, because you don't know in advance what to stratify on.
After labelling this first sample and building the first classifier, you can start so-called active learning: make predictions for the unlabelled dataset, and sample some tweets on which your classifier is least confident. Label them, retrain the classifier, and repeat.
Using this approach, I managed to create a good training set after several (~5) iterations, with ~100 texts in each iteration.
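A minimal sketch of one round of that uncertainty-sampling loop, using logistic regression on TF-IDF features; the texts and labels are placeholders.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    labeled_texts = ["good service", "terrible experience", "really enjoyed it", "awful support"]
    labels = [1, 0, 1, 0]
    unlabeled_texts = ["not sure how I feel", "pretty decent overall", "could be worse"]

    vec = TfidfVectorizer().fit(labeled_texts + unlabeled_texts)
    clf = LogisticRegression().fit(vec.transform(labeled_texts), labels)

    # Confidence = probability of the predicted class; the lowest ones go to the annotator next.
    proba = clf.predict_proba(vec.transform(unlabeled_texts))
    confidence = proba.max(axis=1)
    query_order = np.argsort(confidence)
    print([unlabeled_texts[i] for i in query_order[:2]])  # label these, retrain, repeat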

How to model multiple inputs to single output in classification?

Purpose:
I am trying to build a model to classify multiple inputs to a single output class, which is something like this:
{x_i1, x_i2, x_i3, ..., x_i16} (features) to y_i (class)
I am using an SVM to make the classification, but the 0/1 loss was bad (half of the data points are misclassified), which leads me to the conclusion that the data might not be linearly separable. That is why I played around with polynomial basis functions. I transformed each feature so that I get all combinations of polynomials up to degree 4, in the hope that my features are linear in the transformed space. My new transformed input looks like this:
{x_i1, ..., x_i16, x_i1^2, ..., x_i16^2, ... x_i1^4, ..., x_i16^4, x_i1^3, ..., x_i16^3, x_i1*x_i2, ...}
The loss decreased, but still not to where I want it to be. Since the chance of overfitting rises with the polynomial degree, I added regularization to counterbalance that. I also added a forward greedy algorithm to pick the features that lead to minimal cross-validation error, but with no great improvement.
Question:
Is there a systematic way to figure out which transform leads to linear behaviour in the transformed feature space? It seems a little odd that I have to try out every polynomial until it "fits". Are there better basis functions than polynomials? I understand that in a low-dimensional feature space one can simply plot the data and estimate the transform visually, but how can I do that in a high-dimensional space?
Maybe a little off topic, but I also read up on PCA in order to throw away the components that don't provide much information in the first place. Is this worth a try?
Thank you for your help.
Did you try other kernel functions, such as RBF, besides linear and polynomial? Since different datasets have different characteristics, some kernel functions may work better than others, especially in non-linear cases.
I don't know which tools you are using, but the following one also provides a beginners' guide on how to build SVM models:
https://www.csie.ntu.edu.tw/~cjlin/libsvm/
It is always a good idea to have a feature selection step first, especially for high-dimensional data. Noisy or irrelevant features should be removed, leading to better performance and higher efficiency.
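If you use scikit-learn rather than LIBSVM directly, a minimal sketch of trying an RBF kernel with a grid search over C and gamma (on synthetic stand-in data) could look like this:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=16, random_state=0)

    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = GridSearchCV(pipe,
                        {"svc__C": [0.1, 1, 10, 100],
                         "svc__gamma": ["scale", 0.01, 0.1, 1]},
                        cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)  # cross-validated choice of C and kernel width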

How can I normalize data to have same average sum of square?

In a lot of articles in my field, this sentence is repeated: "The 2 matrices have been normalized to have the same average sum-of-squares (computed across all subjects and all voxels for each modality)." Suppose we have two matrices where the rows are different subjects and the columns are features (voxels). In these articles, not much explanation of the normalization method can be found. Does anybody know how I should normalize the data to have the "same average sum-of-squares"? I don't understand it at all. Thanks
For a start, normalization in this context is also known as feature scaling, which pretty much sums it up. You scale your features, your data, to get rid of variances and ranges of values which would otherwise disturb your algorithm and your results.
https://en.wikipedia.org/wiki/Feature_scaling
In data processing, normalization is quite useful (depending on the application). E.g. in distance based machine learning algorithms you should normalize your features in order to get a proportional contribution to the outcome of your algorithm, independent of the range of value the features comprise.
To do so, you can use different statistical measurements, like the
sum of squares: Σ_i (x_i - x̄)²
Other than that you could use the variance or the standard deviation of your data.
https://www.westgard.com/lesson35.htm#4
Those statistical terms can then be used to normalize your data, to improve e.g. the clustering quality of your algorithm. Which term to use and which method highly depends on the algorithms and data you're using and what you're aiming at.
Here is a paper which compares some of the approaches you could choose from for clustering:
http://maxwellsci.com/print/rjaset/v6-3299-3303.pdf
I hope this can help you a little.
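As for the exact phrase in the question, one common reading (an assumption, since the papers do not spell it out) is to divide each matrix by the root of its mean squared entry, so that afterwards both matrices have an average sum-of-squares of 1. A minimal sketch with synthetic data:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(0, 3.0, size=(20, 100))   # subjects x voxels, modality 1
    B = rng.normal(0, 0.5, size=(20, 100))   # subjects x voxels, modality 2

    def scale_to_unit_mean_square(M):
        # Divide by the root-mean-square of all entries (across subjects and voxels).
        return M / np.sqrt(np.mean(M ** 2))

    A_n, B_n = scale_to_unit_mean_square(A), scale_to_unit_mean_square(B)
    print(np.mean(A_n ** 2), np.mean(B_n ** 2))  # both are ~1.0 now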

Ways to improve the accuracy of a Naive Bayes Classifier?

I am using a Naive Bayes Classifier to categorize several thousand documents into 30 different categories. I have implemented a Naive Bayes Classifier, and with some feature selection (mostly filtering useless words), I've gotten about a 30% test accuracy, with 45% training accuracy. This is significantly better than random, but I want it to be better.
I've tried implementing AdaBoost with NB, but it does not appear to give appreciably better results (the literature seems split on this: some papers say AdaBoost with NB doesn't give better results, others say it does). Do you know of any other extensions to NB that may possibly give better accuracy?
In my experience, properly trained Naive Bayes classifiers are usually astonishingly accurate (and very fast to train, noticeably faster than any classifier-builder I have ever used).
So when you want to improve classifier prediction, you can look in several places:
tune your classifier (adjusting the classifier's tunable parameters);
apply some sort of classifier combination technique (e.g. ensembling, boosting, bagging); or
look at the data fed to the classifier: either add more data, improve your basic parsing, or refine the features you select from the data.
With respect to naive Bayesian classifiers, parameter tuning is limited; I recommend focusing on your data, i.e. the quality of your pre-processing and the feature selection.
I. Data Parsing (pre-processing)
I assume your raw data is something like a string of raw text for each data point, which you transform, by a series of processing steps, into a structured vector (1D array) for each data point, such that each offset corresponds to one feature (usually a word) and the value at that offset corresponds to frequency.
Stemming: either manually or by using a stemming library; the popular open-source ones are Porter, Lancaster, and Snowball. For instance, if you have the terms programmer, program, programming, programmed in a given data point, a stemmer will reduce them to a single stem (probably program), so your term vector for that data point will have a value of 4 for the feature program, which is probably what you want. (A short sketch follows after this list.)
Synonym finding: same idea as stemming; fold related words into a single word. A synonym finder can identify developer, programmer, coder, and software engineer and roll them into a single term.
Neutral words: words with similar frequencies across classes make poor features.
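Here is the promised sketch of the stemming step, using NLTK's Porter stemmer (this assumes NLTK is installed; the word list is just the example above):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    terms = ["programmer", "program", "programming", "programmed"]
    print([stemmer.stem(t) for t in terms])  # most of these collapse to the stem "program"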
II. Feature Selection
Consider a prototypical use case for NBCs: filtering spam. You can quickly see how it fails, and just as quickly you can see how to improve it. For instance, above-average spam filters have nuanced features such as the frequency of words in all caps, the frequency of words in the title, and the occurrence of an exclamation point in the title. In addition, the best features are often not single words but, e.g., pairs of words or larger word groups.
III. Specific Classifier Optimizations
Instead of 30 classes, use a 'one-against-many' scheme; in other words, you begin with a two-class classifier (Class A and 'all else'), then the results in the 'all else' class are returned to the algorithm for classification into Class B and 'all else', and so on.
The Fisher Method (probably the most common way to optimize a Naive Bayes classifier). I think of Fisher as normalizing (more correctly, standardizing) the input probabilities. An NBC uses the feature probabilities to construct a 'whole-document' probability. The Fisher Method calculates the probability of a category for each feature of the document, then combines these feature probabilities and compares that combined probability with the probability of a random set of features.
I would suggest using an SGDClassifier as in this and tuning it in terms of regularization strength.
Also try to tune the TF-IDF formula you're using by tuning the parameters of TfidfVectorizer.
I usually see that for text classification problems, SVM or Logistic Regression trained one-versus-all outperforms NB. As you can see in this nice article by the Stanford people, for longer documents SVM outperforms NB. The code for the paper, which uses a combination of SVM and NB (NBSVM), is here.
Second, tune your TFIDF formula (e.g. sublinear tf, smooth_idf).
Normalize your samples with l2 or l1 normalization (the default in TfidfVectorizer), because it compensates for different document lengths.
A Multilayer Perceptron usually gets better results than NB or SVM because of the non-linearity it introduces, which is inherent to many text classification problems. I have implemented a highly parallel one using Theano/Lasagne which is easy to use and downloadable here.
Try to tune your l1/l2/elasticnet regularization. It makes a huge difference in SGDClassifier/SVM/Logistic Regression.
Try to use n-grams, which are configurable in TfidfVectorizer.
If your documents have structure (e.g. have titles) consider using different features for different parts. For example add title_word1 to your document if word1 happens in the title of the document.
Consider using the length of the document as a feature (e.g. number of words or characters).
Consider using meta information about the document (e.g. time of creation, author name, url of the document, etc.).
Recently Facebook published their FastText classification code which performs very well across many tasks, be sure to try it.
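A minimal sketch that ties several of these suggestions together: TF-IDF with sublinear tf and n-grams, an SGDClassifier with elastic-net regularization, and a grid search over the regularization strength. The four documents are placeholders.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline

    docs = ["cheap meds buy now", "meeting rescheduled to friday",
            "win a free prize today", "quarterly report attached"]
    labels = [1, 0, 1, 0]

    pipe = make_pipeline(
        TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),
        SGDClassifier(penalty="elasticnet", random_state=0),
    )
    grid = GridSearchCV(pipe,
                        {"sgdclassifier__alpha": [1e-5, 1e-4, 1e-3]},
                        cv=2)
    grid.fit(docs, labels)
    print(grid.best_params_)  # the regularization strength picked by cross-validation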
Using Laplacian Correction along with AdaBoost.
In AdaBoost, a weight is first assigned to each data tuple in the training dataset. The initial weights are set using the init_weights method, which initializes each weight to 1/d, where d is the size of the training dataset.
Then a generate_classifiers method is called, which runs k times, creating k instances of the Naive Bayes classifier. These classifiers are then weighted, and the test data is run through each classifier. The sum of the weighted "votes" of the classifiers constitutes the final classification.
Improving the Naive Bayes classifier for general cases:
Take the logarithm of your probabilities as input features.
We change the probability space to log-probability space because we calculate the overall probability by multiplying individual probabilities, and the result gets very small. When we switch to log-probability features, we avoid this underflow problem.
Remove correlated features.
Naive Bayes works on the assumption of independence; when features are correlated, meaning one feature depends on others, that assumption fails.
More about correlation can be found here.
Work with enough data, not necessarily huge data.
Naive Bayes requires less data than logistic regression, since it only needs enough data to understand the probabilistic relationship of each attribute in isolation with the output variable, not the interactions.
Check for the zero-frequency problem.
If the test dataset has a zero-frequency issue, apply a smoothing technique ("Laplace correction") to predict the class of the test dataset.
More on these points is well described in the following posts:
machinelearningmastery site post
Analytics Vidhya site post
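A minimal sketch of two of the points above, Laplace smoothing (alpha=1.0) and working in log-probability space, using scikit-learn's MultinomialNB on bag-of-words counts; the documents are placeholders.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["cheap meds buy now", "meeting rescheduled to friday",
            "win a free prize today", "quarterly report attached"]
    labels = [1, 0, 1, 0]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)

    # alpha=1.0 is Laplace (add-one) smoothing, which handles the zero-frequency case.
    clf = MultinomialNB(alpha=1.0).fit(X, labels)

    # The model works internally with log probabilities, which avoids underflow.
    print(clf.predict_log_proba(vec.transform(["free meds friday"])))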
Keeping the n-gram size small also helps NB give high-accuracy results; as the n size increases, its accuracy tends to degrade.
Select features which have little correlation between them, and try using different combinations of features at a time.

Resources