The Difference between One Hot Encoding and LabelEncoder? - machine-learning

I am working on an ML problem to predict house prices, and Zip Code is one feature that should be useful. I am also using a Random Forest Regressor to predict the log of the price.
However, should I use One Hot Encoding or LabelEncoder for Zip Code? I have about 2000 Zip Codes in my dataset, so One Hot Encoding will expand the columns significantly.
https://datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor
To rephrase: does it make sense to use LabelEncoder instead of One Hot Encoding on Zip Codes?

Like the link says:
LabelEncoder can turn [dog,cat,dog,mouse,cat] into [1,2,1,3,2], but
then the imposed ordinality means that the average of dog and mouse is
cat. Still there are algorithms like decision trees and random forests
that can work with categorical variables just fine and LabelEncoder
can be used to store values using less disk space.
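As a quick illustration of that quote, here is a minimal scikit-learn sketch (note that sklearn's LabelEncoder assigns codes alphabetically, so the exact integers differ from the quote's example):
```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

animals = ["dog", "cat", "dog", "mouse", "cat"]

# LabelEncoder: one integer per category (alphabetical: cat=0, dog=1, mouse=2)
le = LabelEncoder()
print(le.fit_transform(animals))                          # [1 0 1 2 0]

# OneHotEncoder: one binary column per category, no imposed order
ohe = OneHotEncoder()
print(ohe.fit_transform([[a] for a in animals]).toarray())
```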
And yes, you are right: with 2000 categories for zip codes, one-hot encoding may blow up your feature set massively. In many cases where I had such problems, I opted for binary encoding, and it worked out fine most of the time, so it is worth a shot for you.
Imagine you have 9 categories, mark them from 1 to 9, and binary encode them; you will get:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
There you go: you overcome the LabelEncoder ordinality problem, and you also get 4 feature columns instead of the 9 that one-hot encoding would produce. This is the basic intuition behind the Binary Encoder.
**PS:** Given that 2 to the power of 11 is 2048 and you have about 2000 categories for zip codes, you can reduce your feature columns to 11, instead of the roughly 2000 you would get with one hot encoding!
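For completeness, here is a minimal sketch of binary encoding in plain pandas/NumPy (the column name and data are hypothetical; the third-party category_encoders package also provides a ready-made BinaryEncoder):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"zipcode": ["90210", "10001", "60614", "90210"]})

# step 1: integer-code the categories (what LabelEncoder would do)
codes = df["zipcode"].astype("category").cat.codes.to_numpy()

# step 2: write each integer code in binary, one column per bit;
# 2000 zip codes need ceil(log2(2000)) = 11 bits
n_bits = max(1, int(np.ceil(np.log2(codes.max() + 1))))
bits = (codes[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1
encoded = pd.DataFrame(bits, columns=[f"zip_bit_{i}" for i in range(n_bits)])
print(encoded)
```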

Related

Which feature selection technique for NLP does this represent?

I have a dataset that came from NLP processing of technical documents.
My dataset has 60,000 records.
There are 30,000 features in the dataset,
and each value is the number of times that word/feature appeared.
Here is a sample of the dataset:
RowID              Microsoft  Internet  PCI  Laptop  Google  AWS  iPhone  Chrome
1                  8          2         0    0       5       1    0       0
2                  0          1         0    1       1       4    1       0
3                  0          0         0    7       1       0    5       0
4                  1          0         0    1       6       7    5       0
5                  5          1         0    0       5       0    3       1
6                  1          5         0    8       0       1    0       0
---------------------------------------------------------------------------------
Total Appearance   9,470      821       5    107     4,605   719  25      8
Some words appeared fewer than 10 times in the whole dataset.
The technique is to select only the words/features that appeared in the dataset more than a certain number of times (say 100).
What is this technique called? That is, the one that keeps only features whose total number of appearances exceeds a certain threshold.
This technique for feature selection is rather trivial, so I don't believe it has a particular name beyond something intuitive like "low-frequency feature filtering", "k-occurrence feature filtering", or "top-k-occurrence feature selection" in the machine learning sense, and "term-frequency filtering" or "rare word removal" in the Natural Language Processing (NLP) sense.
If you'd like to use more sophisticated means of feature selection, I'd recommend looking into the various supervised and unsupervised methods available. Cai et al. [1] provide a comprehensive survey; if you can't access the article, this page by JavaTPoint covers some of the supervised methods. A quick web search for supervised/unsupervised feature selection also yields many good blog posts, most of which make use of the SciPy and sklearn Python libraries.
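As an aside, the filter itself is a one-liner once you have the count matrix. A minimal sketch, assuming the counts sit in a pandas DataFrame shaped like the sample above (the question uses a threshold of 100; 10 here so the toy example keeps something):
```python
import pandas as pd

# toy version of the document-term count matrix from the question
df = pd.DataFrame({
    "Microsoft": [8, 0, 0, 1, 5, 1],
    "Internet":  [2, 1, 0, 0, 1, 5],
    "PCI":       [0, 0, 0, 0, 0, 0],
    "Laptop":    [0, 1, 7, 1, 0, 8],
})

min_total = 10
# keep only the columns whose total appearance count meets the threshold
kept = df.loc[:, df.sum(axis=0) >= min_total]
print(kept.columns.tolist())  # ['Microsoft', 'Laptop']
```
If you build the matrix with scikit-learn's CountVectorizer, its min_df parameter implements a closely related filter, though it thresholds on document frequency (the number of documents a word appears in) rather than on the total count.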
References
[1] Cai, J., Luo, J., Wang, S. and Yang, S., 2018. Feature selection in machine learning: A new perspective. Neurocomputing, 300, pp.70-79.

Multi-class classification in sparse dataset

I have a dataset of factory workstations.
There are two types of errors recorded over the same time intervals:
The user selects an error and a time interval (dependent variable, y).
The machines produce errors during production (independent variables, x).
There are 8 unique user-selected error types in total, so I tried to predict them using the machine-produced errors (188 types in total) and some other numerical features such as avg. machine speed, machine volume, etc.
Each row represents a user-selected error in a particular time interval.
For example, in the first line the user selects the time interval:
2018-01-03 12:02:00 - 2018-01-03 12:05:37
and m_er_1 (machine error 1) occurred 12 times in the same time interval.
m_er_1_dur (machine error 1 duration) is the total duration of that machine error in seconds.
So I matched those two tables, and the result looks like this:
user_error  m_er_1  m_er_2  m_er_3  ...  m_er_188  avg_m_speed  ...  m_er_1_dur
A           12      0       0            0         150               217
B           0       0       2            0         10                0
A           3       0       0            6         34                37
A           0       0       0            0         5                 0
D           0       0       0            0         3                 0
E           0       0       0            0         1000              0
In the end, I have 1900 rows and 390 columns (376 machine-error features (188 error counts + 188 error durations) plus 14 numerical features), and because of the machine errors it is a sparse dataset with lots of 0s.
There are no outliers and no NaN values. I normalized the data and tried several classification algorithms (SVM, Logistic Regression, MLPC, XGBoost, etc.).
I also tried PCA, but it didn't work well: it takes 165 components for the cumulative explained_variance_ratio to reach around 0.95.
But the accuracy metrics are very low: for logistic regression the accuracy score is 0.55 and the MCC score is around 0.1; recall, F1, and precision are also very low.
Are there any steps that I am missing? What would you suggest for multi-class classification on a sparse dataset?
Thanks in advance
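For reference, here is a minimal, hedged sketch of the pipeline the question describes (scale, PCA, logistic regression, MCC), with synthetic data standing in for the real 1900 × 390 dataset; it at least gives a reproducible baseline to compare other models against:
```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in: 1900 rows, 390 features, 8 classes (as in the question)
X, y = make_classification(n_samples=1900, n_features=390, n_informative=40,
                           n_classes=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# scale -> reduce to 165 components (~0.95 variance per the question) -> classify
clf = make_pipeline(StandardScaler(),
                    PCA(n_components=165),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("MCC:", matthews_corrcoef(y_test, pred))
```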

Feature engineering, handling missing data

Consider this data table
NumberOfAccidents  MeanDistance
1                  5
3                  0
0                  NA
0                  NA
6                  1.2
2                  0
The first feature is the number of accidents and the second is the average distance of these accidents to a certain point. Obviously, for a record with zero accidents there won't be a value for MeanDistance. However, imputing these missing values is not logical!
MY SOLUTION: I have decided to discretize MeanDistance, with the NAs being their own level (bin) and the rest of the data falling into bins like [0,1), [1,2.5), [2.5, Inf). The final table will look like this (a pandas sketch of this binning follows the table):
NumberOfAccidents  NAs  first_bin  sec_bin  third_bin
1                  0    0          0        1
3                  0    1          0        0
0                  1    0          0        0
0                  1    0          0        0
6                  0    0          1        0
2                  0    1          0        0
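A minimal pandas sketch of this binning (the toy data reproduces the table above):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"NumberOfAccidents": [1, 3, 0, 0, 6, 2],
                   "MeanDistance": [5, 0, np.nan, np.nan, 1.2, 0]})

# bins [0,1), [1,2.5), [2.5, Inf); NaN rows fall outside every bin
bins = pd.cut(df["MeanDistance"], bins=[0, 1, 2.5, np.inf], right=False)
dummies = pd.get_dummies(bins, prefix="dist")            # one column per bin
dummies["NAs"] = df["MeanDistance"].isna().astype(int)   # NA as its own level
result = pd.concat([df[["NumberOfAccidents"]], dummies], axis=1)
print(result)
```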
What is your approach to these types of missing values that cannot be imputed? What is your solution to this problem?
It really depends on the domain and what you are trying to predict. Even though your solution is fine, I wouldn't bin the rest of the data as you did. Given that the NumberOfAccidents feature already tells you which MeanDistance values are NA, I would probably just impute 0 for the NA values (for computations) and leave the rest of the data as it is.
Nevertheless, there is no need to limit yourself; just try different approaches and keep the one that boosts your KPI (Key Performance Indicator).
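A minimal sketch of that suggestion in pandas, with an optional missingness indicator so a model can still distinguish "no accidents" from "accidents at distance 0" (column names are from the question):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"NumberOfAccidents": [1, 3, 0, 0, 6, 2],
                   "MeanDistance": [5, 0, np.nan, np.nan, 1.2, 0]})

# flag the rows where MeanDistance is undefined, then impute 0 for computations
df["MeanDistanceMissing"] = df["MeanDistance"].isna().astype(int)
df["MeanDistance"] = df["MeanDistance"].fillna(0)
print(df)
```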

Weka not printing the label for prediction

I am trying to output the predictions for a test data set after loading a model into Weka. The file is in .csv format and the classifier I am using is NaiveBayes. I set the supplied test set to a test file which has about 110,000 instances, with the label given as ?. When I run the model on this test file and output the predictions into a .csv file, I get a file like this:
inst# actual predicted error prediction
1 0 0.677 0.677
2 0 0.978 0.978
3 0 1 1
4 0 1 1
5 0 0.991 0.991
6 0 0.996 0.996
7 0 1 1
8 0 0.999 0.999
9 0 0.996 0.996
10 0 0.965 0.965
Can anyone tell me why the predicted column is empty? Why isn't the label being printed, and how can I resolve this?
I am very new to Weka and was unable to solve this.

XNA curve import from Maya?

I am trying to import a movement curve from Maya into my XNA game, but I cannot figure out how. Basically I want to fetch the curve by its name and look up its values at different points in time.
Are curves exported into FBX at all? And if not, how can I get at them?
Edit: Maya can export to Maya ASCII, and I tried to parse it, but I am not sure what formula I should use to recreate the curve.
Here is a Maya ASCII segment defining a typical curve:
createNode transform -name "curve1";
createNode nurbsCurve -name "curveShape1" -parent "curve1";
setAttr -keyable off ".visibility";
setAttr ".cached" -type "nurbsCurve"
3 11 0 no 3
16 0 0 0 1 2 3 4 5 6 7 8 9 10 11 11 11
14
-4.9774564508407968 0 -6.8331005825440476
-5.5957526204336077 0 -5.5944567905896161
-6.8323449596191823 0 -3.1171692066807277
-5.6935230034445992 0 3.3047128765440847
-1.6528787527978079 0 8.8676235621397499
7.5595909161095838 0 10.325347443191644
9.2297347448508607 0 8.5586791722955731
10.0730315036276 0 0.93412333819133941
5.9770106513247976 0 3.7809964481624871
2.9006817236214149 0 -3.3327711853359037
11.373191256465434 0 -4.6672854260704906
4.5697574985247682 0 -14.178349348937205
2.4191279569332935 0 -11.415532638650156
1.3438131861375628 0 -10.034124283506653
;
I managed to find the file format reference somewhere; the important info here is the knot vector (16 0 0 0 1 2 3 4 5 6 7 8 9 10 11 11 11, where the leading 16 is the knot count) and the control-point coordinates (all lines containing three numbers).
But I still have no idea how to recreate the curve. I googled a lot for NURBS curves, B-splines, etc., but could not match the result in Maya with any code I could find.
I've achieved this in 3ds Max by exporting the curve in ASCII format and parsing the text manually; does Maya have any such exporter?
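In the meantime, the curve can be rebuilt directly from the parsed values with de Boor's algorithm. Here is a hedged sketch in Python, under two assumptions I believe hold for Maya ASCII but have not verified against the spec: the first number of the "3 11 0 no 3" line is the degree, and Maya omits one knot at each end, so duplicating the first and last stored knots restores the standard clamped knot vector of length num_cvs + degree + 1 (14 + 3 + 1 = 18 here):
```python
import numpy as np

degree = 3  # from "3 11 0 no 3" (assumed: degree, spans, form, rational, dim)
maya_knots = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11, 11]  # 16 knots
cvs = np.array([
    [-4.9774564508407968, 0, -6.8331005825440476],
    [-5.5957526204336077, 0, -5.5944567905896161],
    [-6.8323449596191823, 0, -3.1171692066807277],
    [-5.6935230034445992, 0, 3.3047128765440847],
    [-1.6528787527978079, 0, 8.8676235621397499],
    [7.5595909161095838, 0, 10.325347443191644],
    [9.2297347448508607, 0, 8.5586791722955731],
    [10.0730315036276, 0, 0.93412333819133941],
    [5.9770106513247976, 0, 3.7809964481624871],
    [2.9006817236214149, 0, -3.3327711853359037],
    [11.373191256465434, 0, -4.6672854260704906],
    [4.5697574985247682, 0, -14.178349348937205],
    [2.4191279569332935, 0, -11.415532638650156],
    [1.3438131861375628, 0, -10.034124283506653],
])  # the 14 control points from the file

# assumption: pad the stored knots back to a standard clamped knot vector
knots = np.array([maya_knots[0]] + maya_knots + [maya_knots[-1]], dtype=float)

def de_boor(x, degree, knots, cvs):
    """Evaluate the B-spline at parameter x using de Boor's algorithm."""
    # locate the knot span k with knots[k] <= x < knots[k+1]
    k = int(np.searchsorted(knots, x, side="right")) - 1
    k = min(max(k, degree), len(cvs) - 1)
    d = [cvs[j + k - degree].copy() for j in range(degree + 1)]
    for r in range(1, degree + 1):
        for j in range(degree, r - 1, -1):
            denom = knots[j + 1 + k - r] - knots[j + k - degree]
            alpha = 0.0 if denom == 0.0 else (x - knots[j + k - degree]) / denom
            d[j] = (1.0 - alpha) * d[j - 1] + alpha * d[j]
    return d[degree]

# sample the curve across its parameter range and compare against Maya
for x in np.linspace(knots[degree], knots[-degree - 1], 50):
    print(x, de_boor(x, degree, knots, cvs))
```
Sampling at enough parameter values and interpolating linearly between the samples should be sufficient to follow the path in-game.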
