How to deal with features which are both numerical and categorical

What is the best approach for dealing with a feature which is both numerical and categorical? Take the following feature X, for example:
X
1
5
3
0
1
10
10
7
0
5
9
9
Here X represents a credit score that should range from 1 to 10; X = 0 means that the credit score doesn't exist for this instance.
How should I deal with this feature when using models like random forest or logistic regression for a classification problem? Thank you.
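A minimal sketch of one common approach, assuming a pandas workflow: keep the numeric score, but add an explicit missing-value indicator and impute the sentinel 0, so that tree ensembles and logistic regression can tell "no score" apart from a genuinely low score.
import pandas as pd

# The feature from the question; 0 is a sentinel for "no credit score"
X = pd.DataFrame({'score': [1, 5, 3, 0, 1, 10, 10, 7, 0, 5, 9, 9]}, dtype=float)

# Explicit missing-value indicator column
X['score_missing'] = (X['score'] == 0).astype(int)

# Impute the sentinel with, e.g., the median of the observed scores
median_score = X.loc[X['score'] > 0, 'score'].median()
X.loc[X['score'] == 0, 'score'] = median_score

print(X)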

Related

Machine Learning Classification

My target is in the range 1 to 5. Is there a way to force the model to predict only in this range?
Regardless of the model I use, I sometimes get negative values and values greater than 5.
You can use a model that supports multiclass classification, such as softmax regression. This algorithm is a generalization of logistic regression that can classify N classes where N > 2.
The hard prediction of your model can be:
Class:       1    2    3    4    5
Output:      0    0    0    1    0
which means that the predicted class is 4.
Or it can be a soft prediction:
Class:       1    2    3    4    5
Probability: 0.1  0.1  0.6  0.1  0.1
which is a probability distribution, so you can also tell how confident your model is.
Scikit-learn implements softmax regression within the LogisticRegression estimator itself, by specifying the parameter multi_class="multinomial".
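A minimal sketch of that setting on toy data, assuming scikit-learn (recent versions select the multinomial formulation automatically and have deprecated the multi_class flag, so it is shown here for illustration). Because the classifier only ever predicts one of the training labels, predictions stay in the 1-5 range:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 2 features, integer labels 1-5
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = rng.integers(1, 6, size=100)

# Multinomial (softmax) logistic regression
clf = LogisticRegression(multi_class="multinomial", solver="lbfgs")
clf.fit(X, y)

print(clf.predict(X[:3]))        # hard predictions, always one of 1..5
print(clf.predict_proba(X[:3]))  # soft predictions, one probability per class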

Classification with imbalanced dataset using Multi Layer Perceptrons

I am having trouble with a classification problem.
I have almost 400k vectors in the training data with two labels, and I'd like to train an MLP which classifies the data into two classes.
However, the dataset is very imbalanced: 95% of the samples have label 1, and the rest have label 0. The accuracy grows as training progresses and stops after reaching 95%. I guess this is because the network predicts label 1 for all vectors.
So far, I have tried dropout layers with 0.5 probability, but the result is the same. Are there any ways to improve the accuracy?
I think the best way to deal with unbalanced data is to use class weights. For example, you can weight your classes such that the sum of the weights for each class is equal:
import pandas as pd

df = pd.DataFrame({'x': range(7),
                   'y': [0] * 2 + [1] * 5})

# Weight each sample inversely to its class frequency, so that
# the total weight of each class comes out to len(df) / 2 = 3.5
df['weight'] = df['y'].map(len(df) / 2 / df['y'].value_counts())
print(df)
print(df.groupby('y')['weight'].agg(samples='count', weight='sum'))
output:
   x  y  weight
0  0  0    1.75
1  1  0    1.75
2  2  1    0.70
3  3  1    0.70
4  4  1    0.70
5  5  1    0.70
6  6  1    0.70

   samples  weight
y
0        2     3.5
1        5     3.5
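Once computed, these per-sample weights can be fed to most learning libraries; a minimal sketch assuming scikit-learn, whose estimators generally accept a sample_weight argument in fit:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Same toy data and weights as above
df = pd.DataFrame({'x': range(7), 'y': [0] * 2 + [1] * 5})
df['weight'] = df['y'].map(len(df) / 2 / df['y'].value_counts())

# Per-sample weights rebalance the loss between the two classes
clf = LogisticRegression()
clf.fit(df[['x']], df['y'], sample_weight=df['weight'])
print(clf.predict(df[['x']]))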
You could try another classifier on a subset of the examples. SVMs may work well with small data, so you could take, say, 10k examples only, with a 5/1 proportion between the classes.
You could also oversample the small class and undersample the other one.
You can also simply weight your classes.
Also think about a proper metric. It's good that you noticed that your model predicts only one label; that failure mode is not easily seen using accuracy, so consider metrics such as precision, recall, or F1 as well.
Some nice ideas about unbalanced dataset here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Remember not to change your test set.
That's a common situation: the network learns a constant and can't get out of this local minimum.
When the data is very unbalanced, like in your case, one possible solution is a weighted cross-entropy loss function. For instance, in TensorFlow you can apply the built-in tf.nn.weighted_cross_entropy_with_logits function. There is also a good discussion of this idea in this post.
But I should say that getting more data to balance both classes (if that's possible) will always help.
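A minimal sketch of that function, assuming TensorFlow 2.x. Note that pos_weight multiplies the loss on positive (label 1) examples; since label 1 is the majority class here, a pos_weight well below 1 effectively up-weights the rare label 0 class. The value below is purely illustrative:
import tensorflow as tf

labels = tf.constant([[1.0], [1.0], [0.0], [1.0]])
logits = tf.constant([[2.3], [0.4], [-1.2], [1.7]])

# pos_weight ~ (share of negatives) / (share of positives) = 5 / 95
loss = tf.nn.weighted_cross_entropy_with_logits(
    labels=labels, logits=logits, pos_weight=5.0 / 95.0)
print(tf.reduce_mean(loss))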

multivariate random forest with opencv

Let's say we are trying to classify a pencil as healthy or not, and we have two variables for this purpose: the height and weight of the pencil. Now, what should I give to the training method of the random forest implemented in OpenCV? I am really confused, because I have two different kinds of data; both are numeric, but their units are different. The example below will give a better sense:
Height (cm)   Weight (gr)   Healthy? (bool)
-----------   -----------   ---------------
10            34            0
4             6             0
12            14            1
8             20            1
5             18            0
If I train a univariate random forest with only the height, the vectors {10, 4, 12, 8, 5} and {0, 0, 1, 1, 0} will be the parameters. However, if I want to use both variables, what will the parameters be?
In Python, the training data can be fed in as a list of tuples, one tuple per sample containing all the variables.
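A minimal sketch, assuming OpenCV 3+ with Python bindings; each row of the sample matrix holds both variables for one pencil. The mixed units are fine, since decision trees split on one feature at a time:
import numpy as np
import cv2

# One row per pencil: [height_cm, weight_gr]
samples = np.array([[10, 34],
                    [4,  6],
                    [12, 14],
                    [8,  20],
                    [5,  18]], dtype=np.float32)
labels = np.array([0, 0, 1, 1, 0], dtype=np.int32)  # healthy? (bool)

rtrees = cv2.ml.RTrees_create()
rtrees.train(samples, cv2.ml.ROW_SAMPLE, labels)

# Predict for a hypothetical new pencil: height 9 cm, weight 25 g
_, prediction = rtrees.predict(np.array([[9, 25]], dtype=np.float32))
print(prediction)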

Learning how to map numeric values into an array

Dear all,
I am looking for an appropriate algorithm that can allow me to learn how some numeric values are mapped to an array.
Try to imagine that I have a training data set like this:
1 1 2 4 5 --> [0 1 5 7 8 7 1 2 3 7]
2 3 2 4 1 --> [9 9 5 6 6 6 2 4 3 5]
...
1 2 1 8 9 --> [1 4 5 8 7 4 1 2 3 4]
So that, given a new set of numeric values, I would like to predict the corresponding array:
5 8 7 4 2 --> [? ? ? ? ? ? ? ? ? ?]
Thank you very much in advance.
Best regards!
Some considerations:
Let us suppose that all numbers are integers and that the length of the arrays is fixed.
The quality of each predicted array can be determined by means of a distance function that tries to measure the similarity between the ideal and the predicted array.
This is a challenging task in general. Are your array lengths fixed? What's the loss function (for example, is it better to be "closer" for single digits -- is predicting 2 instead of 1 better than predicting 9, or does it not matter? Do you get credit for partial matches on the array, such as predicting the first half correctly? etc.)?
In any case, classical regression or classification techniques would likely not work very well for your scenario. I think the best bet would be to try a genetic programming approach. The fitness function would then be the loss measure I mentioned earlier. You can check this nice comparison of genetic programming libraries for different languages.
This is called a structured output problem, where the target you are trying to predict is a complex structure, rather than a simple class (classification) or number (regression).
As mentioned above, the loss function is an important thing you will have to think about. Minimum edit distance, RMS, or simple 0-1 loss could be used.
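For illustration, a minimal sketch of one such loss, the minimum edit (Levenshtein) distance between a predicted and an ideal array, assuming the fixed-length integer arrays stated above:
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

# Loss between an ideal and a predicted array from the question
print(edit_distance([0, 1, 5, 7, 8, 7, 1, 2, 3, 7],
                    [1, 4, 5, 8, 7, 4, 1, 2, 3, 4]))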
Structured support vector machines, or variations on ridge regression for structured outputs, are two known algorithms that can tackle this problem. See Wikipedia, of course.
We have a research group on this topic at Universite Laval (Canada), led by Mario Marchand and Francois Laviolette. You might want to search for their publications like "Risk Bounds and Learning Algorithms for the Regression Approach to Structured Output Prediction" by Sebastien Giguere et al.
Good luck!

How can I use the rule-based learning algorithms for this example

I have the following data, in order to do predictive learning as to what features people find attractive in a model when purchasing clothes online:
COLORofCLOTHING   MODELHAIR_COLOR   MODEL_BUILD   SELLER_CATEGORY
Red               Black             Lean          1
Blue              Brown             Lean          5
Black             Blonde            Healthy       10
I want to predict whether the clothing will sell well given a set of attributes. However, the seller category can be anything between 1 and 10 (1 being best and 10 being worst), and I am not sure how to approach this problem. I am using Weka for this purpose. Can people please give me ideas on how to approach this problem?
Basically, I want to build a model which learns features like the colour of the clothing and can predict how well the clothes will sell.
Transform and normalise your dataset into something along the lines of:
color_red  color_blue  color_black  hair_black  hair_brown  hair_blonde  ...  prediction
1          0           0            1           0           0            ...  0
0          1           0            0           1           0            ...  0.5
0          0           1            0           0           1            ...  1
Random Forests and Neural Networks should be able to give you predictions.
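A minimal sketch of that transformation, assuming pandas; the rescaling of SELLER_CATEGORY to [0, 1] is one possible normalisation choice:
import pandas as pd

# The dataset from the question
df = pd.DataFrame({
    'COLORofCLOTHING':  ['Red', 'Blue', 'Black'],
    'MODELHAIR_COLOR':  ['Black', 'Brown', 'Blonde'],
    'MODEL_BUILD':      ['Lean', 'Lean', 'Healthy'],
    'SELLER_CATEGORY':  [1, 5, 10],
})

# One-hot encode the categorical features
X = pd.get_dummies(df.drop(columns='SELLER_CATEGORY'))

# Rescale the 1-10 category to a [0, 1] target
y = (df['SELLER_CATEGORY'] - 1) / 9

print(X)
print(y)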
