Continuous or categorical data in data science - machine-learning

I am building an automated cleaning process that cleans null values from a dataset. I found a few functions, such as mode, median, and mean, that can be used to fill NaN values in the data. But which one should I select? If the data is categorical it has to be either the mode or the median, while for continuous data it has to be the mean or the median. So, to decide whether a column is categorical or continuous, I decided to build a machine learning classification model.
I took a few features:
1) standard deviation of the data
2) number of unique values in the data
3) total number of rows of data
4) ratio of unique values to total rows
5) minimum value of the data
6) maximum value of the data
7) number of values between the median and the 75th percentile
8) number of values between the median and the 25th percentile
9) number of values between the 75th percentile and the upper whisker
10) number of values between the 25th percentile and the lower whisker
11) number of values above the upper whisker
12) number of values below the lower whisker
First, with these 12 features and around 55 training examples, I fitted a logistic regression model on the normalized features to predict label 1 (continuous) or 0 (categorical).
The fun part is that it worked!!
But did I do it the right way? Is this a correct method for predicting the nature of the data? Please advise me on how I could improve it further.
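For illustration, the twelve features listed above could be computed per column roughly as follows. This is a minimal sketch, not the asker's actual code: the function name `column_features` is hypothetical, the whiskers are assumed to be Tukey fences at 1.5 times the IQR, and the standard deviation is taken as the population value.

```python
import statistics


def column_features(values):
    """Summary features for one numeric column (missing values already dropped)."""
    n = len(values)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumption: whiskers are Tukey fences, 1.5 * IQR beyond the quartiles.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_unique = len(set(values))
    return {
        "std": statistics.pstdev(values),
        "n_unique": n_unique,
        "n_rows": n,
        "unique_ratio": n_unique / n,
        "min": min(values),
        "max": max(values),
        # The six bins below partition the column, so their counts sum to n.
        "median_to_q3": sum(1 for v in values if median < v <= q3),
        "q1_to_median": sum(1 for v in values if q1 < v <= median),
        "q3_to_upper_whisker": sum(1 for v in values if q3 < v <= upper),
        "lower_whisker_to_q1": sum(1 for v in values if lower <= v <= q1),
        "above_upper_whisker": sum(1 for v in values if v > upper),
        "below_lower_whisker": sum(1 for v in values if v < lower),
    }


features = column_features([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])
```

Each column of each dataset then becomes one training row for the classifier, so 55 rows means 55 labelled columns.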

The data analysis seems awesome. For the part
But which one should I select?
The mean has always been the winner in my tests. For every dataset I try all of the options and compare the accuracy.
There is a better approach, though it is a bit time-consuming. If you want to take this system further, it can help.
For each column with missing data, find the nearest neighbor of the incomplete row and take the value from it. Suppose you have N columns excluding the target. For each column with missing values, treat it as the dependent variable and the remaining N-1 columns as independent variables. Then find the nearest neighbor of the incomplete row; that neighbor's output (the dependent variable) is the value to use for the missing attribute.

But which one should I select? If the data is categorical it has to be either the mode or the median, while for continuous data it has to be the mean or the median.
Usually the mode is used for categorical data and the mean for continuous data. But I recently saw an article where the geometric mean was used for categorical values.
If you build a model that uses columns with NaN, you can include columns with mean replacement and median replacement, plus a boolean 'is NaN' indicator column. But it is better not to use linear models in this case, because these columns will be strongly correlated with each other.
Besides, there are many other methods for replacing NaN, for example the MICE algorithm.
Regarding the features you use: they are OK, but I would advise adding some more features related to the distribution, for example:
skewness
kurtosis
similarity to Gaussian Distribution (and other distributions)
the number of 1D Gaussian components needed to fit your column (GMM; this won't perform well with 55 rows)
You can compute all of these items on the raw data and on transformed data (log, exp).
To explain: you can have a column with many categories in it. With the old approach it may simply look like a numerical column, even though it is not numerical. A distribution-matching algorithm may help here.
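As a rough illustration of the first two shape features, here is a minimal sketch that computes skewness and excess kurtosis on a raw column and on its log transform. The function names are hypothetical, the moments are the simple population versions (no bias correction), and the log transform assumes strictly positive values.

```python
import math
import statistics


def skewness_and_excess_kurtosis(values):
    """Population skewness and excess kurtosis (both 0 for a Gaussian)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # a constant column (std == 0) is not handled
    n = len(values)
    skew = sum(((v - mean) / std) ** 3 for v in values) / n
    kurt = sum(((v - mean) / std) ** 4 for v in values) / n - 3.0
    return skew, kurt


def shape_features(values):
    """Shape features on the raw column and on its log transform."""
    raw_skew, raw_kurt = skewness_and_excess_kurtosis(values)
    # Assumption: all values are > 0, otherwise math.log raises ValueError.
    log_skew, log_kurt = skewness_and_excess_kurtosis([math.log(v) for v in values])
    return {
        "skew": raw_skew,
        "kurtosis": raw_kurt,
        "log_skew": log_skew,
        "log_kurtosis": log_kurt,
    }


shape = shape_features([1, 10, 100, 1000, 10000])
```

In this toy column the raw values are strongly right-skewed while the log-transformed values are symmetric, which is the kind of contrast these extra features are meant to expose.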
You can also try a different normalization. RobustScaler from sklearn may work well (it may help in cases where categories have levels that look very similar to outlying values).
And the last piece of advice: you can use a random forest model for this and look at its feature importances. That list may give some direction for feature engineering/generation.
And, of course, looking at the confusion (misclassification) matrix and at which features the errors occur for is also a good idea!

Related

What does the ranker in Weka PCA tell us about feature selection?

I have a data set of 31000 rows with 13 attributes. Because most of them are categorical, I had to apply NominalToBinary to those attributes, so the number of attributes grew to 61.
I sampled the data down to 18000 rows and applied PCA with the Ranker in Weka. centerData is false, so it should normalise the data for me.
This is my result:
0.945 1 -0.367Marial_Status= Married-civ-spouse-0.365Relationship= Husband+0.298Marial_Status= Never-married+0.244Age=0_23+0.232Gender= Female...
I understand that the ranking is the variance, so rank 1 is 94.5%? Now the issue I have with feature selection is: how do I know which ones to keep? Most of these attributes are categorical and were changed to numeric for the PCA. So, given the original data set with both categorical and numeric attributes, what does this output say about feature selection?
PCA assumes numerical data. If you binary-encode your categorical variables, you basically take a hammer and make your data fit your model's assumptions.
Another way to deal with categorical features is to use non-linear feature transformations, which find a suitable way to represent distances between categories. A quick Google search turned up Categorical Principal Components Analysis (CATPCA) for me. Maybe have a look at this.

machine learning, nominal data normalization

I am working on k-means clustering.
I have a 3D dataset with the dimensions no. of days, frequency, and food.
-> day is normalized by mean and standard deviation (SD), or better to say standardized, which gives me a range of [-2 to 14]
-> frequency and food, which are NOMINAL data in my dataset, are normalized by DIVIDE BY MAX ( x/max(x) ), which gives me a range of [0 to 1]
The problem is that k-means only considers the day axis for grouping, since there is an obvious gap between points on this axis, and almost ignores the other two, frequency and food (I think because of the negligible gaps in the frequency and food dimensions).
If I apply k-means on the day axis alone (1D), I get exactly the same result as when I apply it in 3D (days, frequency, food).
"Before, I also did x/max(x) for days, but the result was not acceptable."
So I want to know: is there any way to normalize the other two nominal attributes, frequency and food, so that they are scaled fairly relative to the day axis?
food => 1,2,3
frequency => 1-36
The point of normalization is not just to get the values small.
The purpose is to have comparable value ranges - something which is really hard for attributes of different units, and may well be impossible for nominal data.
For your kind of data, k-means is probably the worst choice, because k-means relies on continuous values to work. If you have nominal values, it usually gets stuck easily. So my main recommendation is to not use k-means.
For k-means to work on your data, a difference of 1 must mean the same thing in every attribute. So a difference of 1 day = the difference between food 1 and food 2. And because k-means is based on squared errors, the difference between food 1 and food 3 counts 4x as much as the difference between food 1 and food 2.
Unless your data has this property, don't use k-means.
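The 4x factor can be checked with a few lines of arithmetic. A minimal sketch, using made-up points that differ only in their integer food code:

```python
def squared_distance(a, b):
    """Squared Euclidean distance, the quantity k-means sums within clusters."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Points are (day, frequency, food); only the food code differs.
food_1 = (0.0, 5.0, 1.0)
food_2 = (0.0, 5.0, 2.0)
food_3 = (0.0, 5.0, 3.0)

print(squared_distance(food_1, food_2))  # 1.0
print(squared_distance(food_1, food_3))  # 4.0
```

So coding the foods as 1, 2, 3 silently asserts that food 3 is four times as "far" from food 1 as food 2 is, which is rarely what a nominal code means.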
You can try to use the Value Difference Metric, VDM (or any variant), to convert pretty much every nominal attribute you encounter to a valid numeric representation. After that you can just apply standardisation to the whole dataset as usual.
The original definition is here:
http://axon.cs.byu.edu/~randy/jair/wilson1.html
Although it should be easy to find implementations for every common language elsewhere.
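To give an idea of what VDM computes, here is a minimal sketch of the basic form for a single nominal attribute: the distance between two values x and y is the sum over classes c of |P(c|x) - P(c|y)|^q. Note that this needs a class label per row, which is an assumption here since the clustering question has none; the function name `vdm_distance_function` and the choice q=2 are also just for illustration.

```python
from collections import Counter, defaultdict


def vdm_distance_function(values, labels, q=2):
    """Build d(x, y), the VDM distance between two levels of one nominal attribute."""
    value_counts = Counter(values)        # N_x: rows with attribute value x
    joint_counts = defaultdict(Counter)   # N_{x,c}: rows with value x and class c
    for value, label in zip(values, labels):
        joint_counts[value][label] += 1
    classes = set(labels)

    def distance(x, y):
        # Only defined for levels seen in `values` (otherwise N_x is zero).
        return sum(
            abs(joint_counts[x][c] / value_counts[x]
                - joint_counts[y][c] / value_counts[y]) ** q
            for c in classes
        )

    return distance


food = ["a", "a", "b", "b", "c", "c"]
label = [0, 0, 1, 1, 0, 1]
d = vdm_distance_function(food, label)
print(d("a", "b"))  # 2.0
print(d("a", "c"))  # 0.5
```

Levels that behave similarly with respect to the class end up close together, regardless of the integer codes they happened to be given.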
N.B. for ordered nominal attributes such as your 'frequency' most of the time it is enough to just represent them as integers.

A binary classification dataset with an 'age' feature, some of whose values are missing

This classification problem has 300000 tuples and 20 features. I want to use an SVM to solve it. The 'age' feature is between 1 and 100, but for some tuples it is missing (blank). How should I handle this?
This of course depends on the distribution of your missing variable, but I would try imputation: fill in the blanks using the mean age value and see what kind of results you get. One step further would be to create a model that predicts age from the other input variables and use that for imputation.
You might also add a variable indicating that a given row has some imputed values. In some cases this yields better training results, as you give your algorithm more information.
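Both ideas together can be sketched in a few lines. This is a minimal illustration, assuming missing ages are represented as `None`; the function name `impute_mean_with_flag` is hypothetical.

```python
import statistics


def impute_mean_with_flag(ages):
    """Replace None with the mean of the observed ages; also return a missing flag."""
    observed = [a for a in ages if a is not None]
    mean_age = statistics.fmean(observed)
    filled = [mean_age if a is None else a for a in ages]
    was_missing = [1 if a is None else 0 for a in ages]
    return filled, was_missing


filled, was_missing = impute_mean_with_flag([20, None, 40, 60, None])
print(filled)       # [20, 40.0, 40, 60, 40.0]
print(was_missing)  # [0, 1, 0, 0, 1]
```

In practice the mean should be computed on the training split only and then reused for the test split, so that no information leaks from the test data.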
In addition to simple mean imputation, as already mentioned by #dratewka, I would suggest trying:
Imputing the feature using a classic imputation mechanism such as K-nearest-neighbour imputation. With this, for a sample S whose age is missing, the K samples nearest to S are used to derive a suitable value for age (with the distance between S and its neighbours measured on all the other features).
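A minimal sketch of that idea, assuming rows are lists of numbers with `None` marking a missing value in one target column, that all other columns are complete and already on comparable scales, and that the K neighbours' values are simply averaged. The function name `knn_impute` is hypothetical.

```python
import math


def knn_impute(rows, target, k=3):
    """Fill None in column `target` with the mean of that column over the
    k complete rows that are nearest in the remaining columns."""
    others = [j for j in range(len(rows[0])) if j != target]
    donors = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        filled = list(row)
        if row[target] is None:
            point = [row[j] for j in others]
            # Euclidean distance measured on every column except the target.
            nearest = sorted(
                donors,
                key=lambda d: math.dist(point, [d[j] for j in others]),
            )[:k]
            filled[target] = sum(d[target] for d in nearest) / len(nearest)
        result.append(filled)
    return result


data = [
    [1.0, 1.0, 20],
    [1.1, 0.9, 22],
    [5.0, 5.0, 60],
    [5.1, 4.9, 62],
    [1.05, 1.0, None],  # age missing
]
completed = knn_impute(data, target=2, k=2)
print(completed[-1])  # [1.05, 1.0, 21.0]
```

With 300000 tuples, the brute-force neighbour search above would be slow; it is only meant to show the mechanism.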
After performing the previous step, try your prediction both with age and without it. If you see that your prediction performance is not influenced by age, disregarding this feature altogether might be reasonable as well.

What type of ML is this? Algorithm to repeatedly choose 1 correct candidate from a pool (or none)

I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree--I'd like to find how to combine the scores together for an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
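As a baseline, the naive product rule with the 'none' fallback described above might look like the sketch below. The function name `pick_candidate` and the thresholds `min_score` and `min_ratio` are hypothetical placeholders that would have to be tuned on labelled trials.

```python
import math


def pick_candidate(score_vectors, min_score=1.0, min_ratio=1.5):
    """Return the index of the best candidate by product of scores, or None."""
    if not score_vectors:
        return None
    meta = [math.prod(v) for v in score_vectors]
    ranked = sorted(range(len(meta)), key=meta.__getitem__, reverse=True)
    best = ranked[0]
    if meta[best] < min_score:
        return None  # highest meta-score is too low
    if len(ranked) > 1 and meta[best] < min_ratio * meta[ranked[1]]:
        return None  # two highest meta-scores are too close
    return best


pool = [[0.2, 0.45, 1.37], [5.9, 0.02, 2]]
print(pick_candidate(pool))                 # None
print(pick_candidate(pool, min_score=0.1))  # 1
```

Any learned meta-score can be dropped in place of `math.prod` and compared against this baseline on the same trials.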
So for each trial, my input is a set of these score-vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing where a pool of candidates are interviewed by a few people who might have differing opinions but in general each tend to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness. It's binary right or wrong. However, it doesn't seem like traditional binary classification because among an input set of vectors, there can be at most 1 "classified" as right, the rest are wrong.
Thanks
Your problem doesn't exactly belong in the machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
ML, and more specifically classification, problems need training data from which your network can learn any existing patterns in the data and use them to assign a particular class to an input vector.
If you really want to use classification, then I think your problem fits the category of one-vs-all classification. You will need a network (or just a single output layer) with a number of cells/sigmoid units equal to your number of candidates (each unit representing one candidate). Note that your number of candidates will be fixed here.
You can use your entire candidate vector as input to all the cells of your network. The output can be specified using one-hot encoding i.e. 00100 if your candidate no. 3 was the actual correct candidate and in case of no correct candidate output will be 00000.
For this to work, you will need a big data set containing your candidate vectors and corresponding actual correct candidate. For this data you will either need a function (again like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data in the same way as you did. This way, it will maximize the number of correct outputs but the definition of correct here will be how you classify the training data.
You can also use a different type of output where each cell of output layer corresponds to your scoring functions and 00001 means that the candidate your 5th scoring function selected was the right one. This way your candidates will not have to be fixed. But again, you will have to manually set the outputs of the training data for your network to learn it.
One-vs-all is a classification technique in which there are multiple cells in the output layer and each performs binary classification of one class against all the others. At the end, the sigmoid with the highest probability is assigned 1 and the rest zero.
Once your system has learned how you classify data through your training data, you can feed your new data in and it will give you output in the same way i.e. 01000 etc.
I hope my answer was able to help you.:)

How to pre-process dataset for maximum effectiveness with LibSVM Weka implementation

So I read a paper that said that processing your dataset correctly can increase LibSVM classification accuracy dramatically...I'm using the Weka implementation and would like some help making sure my dataset is optimal.
Here are my (example) attributes:
Power Numeric (real numbers, range is from 0 to 1.5132, 9000+ unique values)
Voltage Numeric (similar to Power)
Light Numeric (0 and 1 are the only 2 possible values)
Day Numeric (1 through 20 are the possible values, equal number of each value)
Range Nominal {1,2,3,4,5} <----these are the classes
My question is: which Weka pre-processing filters should I apply to make this dataset more effective for LibSVM?
Should I normalize and/or standardize the Power and Voltage data values?
Should I use a Discretization filter on anything?
Should I be binning the Power/Voltage values into a lot smaller number of bins?
Should I make the Light value Binary instead of numeric?
Should I normalize the Day values? Does it even make sense to do that?
Should I be using the Nominal to Binary or Nominal to some thing else filter for the classes "Range"?
Please advise on these questions and anything else you think I might have missed...
Thanks in advance!!
Normalization is very important, as it influences the concept of distance which is used by SVM. The two main approaches to normalization are:
Scale each input dimension to the same interval, for example [0, 1]. This is the most common approach by far. It is necessary to prevent some input dimensions from completely dominating others. It is recommended by the LIBSVM authors in their beginner's guide (see Appendix B for examples).
Scale each instance to a given length. This is common in text mining / computer vision.
As to handling types of inputs:
Continuous: no work needed, SVM works on these implicitly.
Ordinal: treat as continuous variables. For example cold, lukewarm, hot could be modeled as 1, 2, 3 without implicitly defining an unnatural structure.
Nominal: perform one-hot encoding, e.g. for an input with N levels, generate N new binary input dimensions. This is necessary because you must avoid implicitly defining a varying distance between nominal levels. For example, modelling cat, dog, bird as 1, 2 and 3 implies that a dog and bird are more similar than a cat and bird which is nonsense.
Normalization must be done after substituting inputs where necessary.
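To make the order of operations concrete, here is a minimal sketch that one-hot encodes a nominal input and min-max scales an ordinal one to [0, 1]. The helper names and the toy data are hypothetical; in Weka the corresponding steps would be done with filters rather than hand-written code.

```python
def one_hot(values):
    """Map a nominal column to one binary column per level (levels sorted)."""
    levels = sorted(set(values))
    return [[1.0 if v == level else 0.0 for level in levels] for v in values]


def min_max_scale(column):
    """Scale a numeric column to [0, 1]; a constant column maps to 0."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


animals = ["cat", "dog", "bird", "dog"]  # nominal
temperature = [1, 2, 3, 2]               # ordinal: cold, lukewarm, hot

encoded = one_hot(animals)               # columns: bird, cat, dog
scaled = min_max_scale(temperature)
rows = [enc + [t] for enc, t in zip(encoded, scaled)]
print(rows[0])  # [0.0, 1.0, 0.0, 0.0]
```

The scaling bounds should be taken from the training data and reused unchanged on the test data.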
To answer your questions:
Should I normalize and/or standardize the Power and Voltage data values?
Yes, scale all (final) input dimensions to the same interval (including dummies!).
Should I use a Discretization filter on anything?
No.
Should I be binning the Power/Voltage values into a lot smaller number of bins?
No. Treat them as continuous variables (e.g. one input each).
Should I make the Light value Binary instead of numeric?
No, SVM has no concept of binary variables and treats everything as numeric. So converting it will just lead to an extra type-cast internally.
Should I normalize the Day values? Does it even make sense to do that?
If you want to use 1 input dimension, you must normalize it just like all others.
Should I be using the Nominal to Binary or Nominal to some thing else filter for the classes "Range"?
Nominal to binary, using one-hot encoding.
