I have a problem. I would like to use a classification algorithm, and I have a column materialNumber which, as the name suggests, represents the material number.
How could I use that as a feature for my machine learning algorithm?
I cannot use it e.g. as a one-hot encoding matrix, because there are too many different material numbers (~4500 unique values).
How can I use this column in a classification algorithm? Do I need to standardize/normalize it? I would like to use a RandomForest classifier.
   customerId  materialNumber
0           1          1234.0
1           1          4562.0
2           2          1234.0
3           2          4562.0
4           3          1547.0
5           3          1547.0
Here you can group the material numbers into broader categories. If you want to use a categorical variable in a machine learning algorithm, as you mentioned, you have to use the "one-hot encoding" method. But as the number of unique material numbers grows, so does the number of columns in your data.
For example, say you have material numbers like this:
material_num_list=[1,2,3,4,5,6,7,8,9,10]
Suppose some of the numbers are similar to one another, for example:
[1,5,6,7], [2,3,8], [4,9,10]
We can then assign a group label to each of these sets ourselves:
[1,5,6,7] --> A
[2,3,8] --> B
[4,9,10] --> C
As you can see, the number of distinct labels has decreased, and we can do "one-hot encoding" with far fewer of them.
But the data set needs to be examined carefully, and this grouping needs to be done in a sensible way. It can work if you can categorize the material numbers as described; see the sketch below.
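A minimal pandas sketch of this idea, using the data from the question; the material_to_group mapping here is hypothetical and would have to come from domain knowledge about the materials:

import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 1, 2, 2, 3, 3],
    "materialNumber": [1234.0, 4562.0, 1234.0, 4562.0, 1547.0, 1547.0],
})

# Hypothetical mapping from raw material numbers to a few groups
material_to_group = {1234.0: "A", 4562.0: "B", 1547.0: "C"}
df["materialGroup"] = df["materialNumber"].map(material_to_group)

# One-hot encode the (much smaller) group column instead of ~4500 raw values
features = pd.get_dummies(df.drop(columns=["materialNumber"]),
                          columns=["materialGroup"])
print(features)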
I'm trying to train a model with H2O.ai's H2O-3 AutoML algorithm on AWS SageMaker, using the console.
My model's goal is to predict if an arrest will be made based upon the year, type of crime, and location.
My data has 8 columns:
primary_type: enum
description: enum
location_description: enum
arrest: enum (true/false), this is the target column
domestic: enum (true/false)
year: number
latitude: number
longitude: number
When I create a new training job in the AWS SageMaker console using the H2O-3 AutoML algorithm, I specify the primary_type, description, location_description, and domestic columns as categorical.
However in the logs of the training job I always see the following two lines:
Converting specified columns to categorical values:
[]
This leads me to believe the categorical_columns attribute in the training hyperparameters is not being taken into account.
I have tried the following hyperparameters with the same output in the logs each time:
{'classification': 'true', 'categorical_columns':'primary_type,description,location_description,domestic', 'target': 'arrest'}
{'classification': 'true', 'categorical_columns':['primary_type','description','location_description','domestic'], 'target': 'arrest'}
I thought the list of categorical columns was supposed to be comma-delimited, which would then be split into a list.
I expected the list of categorical column names to be output in the logs instead of an empty list, like so:
Converting specified columns to categorical values:
['primary_type','description','location_description','domestic']
Can anyone help me figure out how to get these categorical columns to apply to the training of my model?
Also, I think this is the code that runs when I train my model, but I have yet to confirm that: https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L93-L151
This seems to be a bug in the h2o3-sagemaker package. The code at https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L106 shows that it reads categorical_columns directly from the hyperparameters, not nested under the training field. However, when the categorical_columns field is moved up a level, the algorithm still doesn't recognize it. So there is no solution for this at the moment.
It seems, based on the code here: https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L106
that the parameter is looking for a comma-separated string, e.g. "cat,dog,bird".
I would try "primary_type,description,location_description,domestic" as the input parameter, rather than ['primary_type', 'description', ...].
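For illustration, a minimal sketch of how such a hyperparameter would presumably be parsed; the split-on-comma step is an assumption modeled on the linked script, not a verified excerpt:

# Hyperparameters as passed to the SageMaker training job; note that
# categorical_columns is a plain comma-separated string, not a list.
hyperparameters = {
    "classification": "true",
    "target": "arrest",
    "categorical_columns": "primary_type,description,location_description,domestic",
}

# Assumed parsing: split on commas, falling back to an empty list
# when the key is missing or empty.
raw = hyperparameters.get("categorical_columns", "")
categorical_columns = raw.split(",") if raw else []
print(categorical_columns)
# ['primary_type', 'description', 'location_description', 'domestic']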
I have a blood-test dataset with four columns: blood group (string: A+, B+, B-, O), antibody values (numerical), age (numerical), and gender (string). I am feeding in the strings via a categorical conversion (A+ = 00, Male = 0) and the numerical values as plain numbers.
Once my neural network is trained, I want it to be able to predict any of the four values (blood group, antibody amount, age, gender). Below is the desired structure of my neural network.
1) Will my categorical input for the string types work?
2) What should I give as input when I want the network to predict a value? For example, if I want it to predict the 'antibodies' amount, should I be giving it the inputs (A+,50,M,?) = (0,0,50,0,?)? What should I give as input for the missing value, whether it is numerical or string (for both types of input)?
I thought of using 0 as the input, but wouldn't that collide with the mapping of the categorical inputs? (See the sketch below.)
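For concreteness, a minimal sketch of the encoding described in the question; the two-bit codes for blood groups other than A+ are assumptions, and the None placeholder marks the open question:

# Two-bit codes for the four blood groups, following "A+ = 00" from the
# question; the codes for the other three groups are assumed.
blood_codes = {"A+": (0, 0), "B+": (0, 1), "B-": (1, 0), "O": (1, 1)}
gender_codes = {"M": 0, "F": 1}

def encode(blood, age, gender, antibodies=None):
    # antibodies is None when it is the value the network should predict
    b1, b2 = blood_codes[blood]
    return [b1, b2, age, gender_codes[gender], antibodies]

print(encode("A+", 50, "M"))  # [0, 0, 50, 0, None]
# Using 0 in place of None would be indistinguishable from a genuine
# antibody reading of 0 -- exactly the collision the question raises.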
I am working on classification using the Random Forest algorithm in Spark, and I have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row will serve as the label and the rest serve as features. But I want to treat the label as a category, not a number, so 165.22352099 will denote one category and -552.8234 another. To that end I have encoded my features as well as my label as categorical data. What I am now having difficulty with is deciding what to pass for the numClasses parameter of the Random Forest algorithm in Spark MLlib. Should it be equal to the number of unique values in my label? My label has around 10000 unique values, so if I put 10000 as the value of numClasses, wouldn't that decrease performance dramatically?
Here is the typical signature of building a model for Random Forest in MlLib:
model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something that you should not do. Your problem is clearly regression/ranking, not classification. Why would you think of it as classification? Try to answer these two questions:
Do you have at least 100 samples for each value (10,000 unique values * 100 = 1,000,000 samples)?
Is there really no structure in the classes? For example, are objects with the value "200" no more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered yes twice, then the answer is: yes, you should encode each distinct value as a different class, leading to 10000 unique classes, which in turn leads to:
extremely imbalanced classification (RF without a balancing meta-learner will nearly always fail in such a scenario)
an extreme number of classes (there are no models able to handle it; RF certainly will not)
an extremely low-dimensional problem: given how few features you have, I would be surprised if you could predict even a binary classification from them. You can see how irregular these values are; you have 3 points that differ only in the first feature yet give completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up: with nearly 100% certainty this is not a classification problem. You should either:
regress on the last value (keyword: regression); see the sketch after this list
build a ranking (keyword: learning to rank)
bucket your values into at most ~10 distinct ranges and then classify (keywords: imbalanced classification, sparse binary representation)
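A minimal PySpark sketch of the regression route, using a few rows from the question with hand-picked index encodings; the category counts in categoricalFeaturesInfo are illustrative guesses, not derived from the full dataset:

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

sc = SparkContext("local", "rf-regression-sketch")

# label = raw last value, features = index-encoded [level, gender, city, state]
# e.g. "Level1,Male,New York,New York,352.888890" -> (352.888890, [0, 0, 0, 0])
trainingData = sc.parallelize([
    LabeledPoint(352.888890, [0, 0, 0, 0]),
    LabeledPoint(-495.8001345, [1, 0, 0, 0]),
    LabeledPoint(495.8, [2, 0, 0, 0]),
    LabeledPoint(652.8, [3, 0, 1, 1]),
])

# categoricalFeaturesInfo maps feature index -> number of categories
# (6 levels, 2 genders; the city/state counts are placeholders)
model = RandomForest.trainRegressor(
    trainingData,
    categoricalFeaturesInfo={0: 6, 1: 2, 2: 5, 3: 5},
    numTrees=3, featureSubsetStrategy="auto",
    impurity="variance", maxDepth=4, maxBins=32)

print(model.predict([0, 0, 0, 0]))  # a continuous prediction, not a class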
I am currently working on a project that focuses on relation extraction from a corpus of Wikipedia text, and I plan to use an SVM to extract these relations. To model this, I plan to use Word features, POS Tag features, Entity features, Mention features and so on as mentioned in the following paper - https://gate.ac.uk/sale/eswc06/eswc06-relation.pdf (Page 6 onwards)
Now, I have set up the pipeline for feature extraction and got the corpus annotated and I wish to use a package like SVM-Light for the purpose of the project. According to the input file format of the SVM-Light package, this is the requisite format -
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
Example (from the SVM-Light webpage) -
In classification mode, the target value denotes the class of the example: a target value of +1 marks a positive example, and -1 a negative one. So, for example, the line
-1 1:0.43 3:0.12 9284:0.2 # abcdef
specifies a negative example for which feature number 1 has the value 0.43, feature number 3 has the value 0.12, feature number 9284 has the value 0.2, and all the other features have value 0. In addition, the string abcdef is stored with the vector, which can serve as a way of providing additional information for user defined kernels.
Now, I wish to know how to map the features I am using, whose values include words, POS tags, and entity types and subtypes, onto the feature vector accepted by the SVM-Light package, where each feature has a real-number value associated with it. How is the mapping from my choice of features to these real values done?
It would be of great help if someone who has worked on a similar problem before could prod me in the right direction.
Thanks.
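Not from the paper or the asker's pipeline, but the standard way to feed such symbolic features to SVM-Light is to give every distinct (feature, value) pair its own integer dimension and emit binary indicators; a hedged sketch:

feature_index = {}  # maps e.g. "word=Obama" or "pos=NNP" to an integer id

def fid(name):
    # assign the next unused id on first sight (SVM-Light ids start at 1)
    if name not in feature_index:
        feature_index[name] = len(feature_index) + 1
    return feature_index[name]

def to_svmlight(label, features, comment=""):
    # features: strings such as ["word=Obama", "pos=NNP", "ent=PERSON"];
    # SVM-Light requires feature ids in ascending order
    ids = sorted(fid(f) for f in features)
    pairs = " ".join("%d:1" % i for i in ids)
    return "%+d %s # %s" % (label, pairs, comment)

print(to_svmlight(+1, ["word=Obama", "pos=NNP", "ent=PERSON"], "example"))
# "+1 1:1 2:1 3:1 # example"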