How to select attributes with respect to Information Gain in Weka?

I am working in Weka for text classification. I have a total of 113,232 attributes in the vocabulary, out of which I want to select the top 10,000 attributes. The following is the setting of my InfoGain filter:
AttributeSelection featureSelectionFilter = new AttributeSelection();
InfoGainAttributeEval informationGain = new InfoGainAttributeEval();
Ranker ranker = new Ranker();
ranker.setNumToSelect(10000);
ranker.setThreshold(0);
// wire the evaluator and search method into the filter
featureSelectionFilter.setEvaluator(informationGain);
featureSelectionFilter.setSearch(ranker);
I assumed that it would arrange the attributes in descending order of their information gain, but I am not sure whether my assumption is right or wrong. Here is an image of three attributes.
The maximum value, std dev, and mean of the first attribute are all higher than the others, which may indicate its importance, but these values for the second attribute are lower than those of the third. Is that right? How does InfoGain select attributes from the vocabulary when we set numToSelect(10000)?
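For reference, here is a minimal sketch of the filter applied end to end; the ARFF file name and the last-column class index are assumptions. Ranker scores every attribute with the evaluator, sorts them in descending order of information gain, and numToSelect then truncates that ranked list.

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class TopKByInfoGain {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vocabulary.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);        // assumes class is last

        Ranker ranker = new Ranker();
        ranker.setNumToSelect(10000); // keep the 10,000 top-ranked attributes
        ranker.setThreshold(0);

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new InfoGainAttributeEval());
        filter.setSearch(ranker);
        filter.setInputFormat(data);

        // Ranker sorts attributes by descending InfoGain; numToSelect truncates
        // the ranked list, and the filtered data keeps only those columns.
        Instances reduced = Filter.useFilter(data, filter);
        System.out.println(reduced.numAttributes() + " attributes kept");
    }
}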

Related

Excluding columns from training data in BigQuery ML

I am training a model using BigQuery ML. My input has several fields, one of which is a customer number. This number is not useful as a prediction feature, but I do need it in the final output so that I can reference which users scored high vs. low. How can I exclude this column from model training without removing it completely?
Reading the docs, the only way I can see to exclude a column is by adding it to input_label_cols, which it clearly is not, or to data_split_col, which is not desirable.
You do not need to include fields in the model that are not meant to be part of it at all.
Rather, you include them during prediction.
For example, the model below has only 6 fields as input (carrier, origin, dest, dep_delay, taxi_out, distance):
#standardsql
CREATE OR REPLACE MODEL flights.ontime
OPTIONS
(model_type='logistic_reg', input_label_cols=['on_time']) AS
SELECT
IF(arr_delay < 15, 1, 0) AS on_time,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM `cloud-training-demos.flights.tzcorr`
WHERE arr_delay IS NOT NULL
During prediction, however, you can have all the extra fields available, as below (you can put them in any position of the SELECT, but note that the predicted columns will go first):
#standardsql
SELECT * FROM ml.PREDICT(MODEL `cloud-training-demos.flights.ontime`, (
SELECT
UNIQUE_CARRIER, -- extra column
ORIGIN_AIRPORT_ID, -- extra column
IF(arr_delay < 15, 1, 0) AS on_time,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM `cloud-training-demos.flights.tzcorr`
WHERE arr_delay IS NOT NULL
LIMIT 5
))
Obviously, input_label_cols and data_split_col are for different purposes:
input_label_cols (STRING): the label column name(s) in the training data.
data_split_col (STRING): identifies the column used to split the data into training and evaluation sets. This column cannot be used as a feature or label, and will be excluded from features automatically.

Evaluation error while trying to build model in DSX when dataset has a feature column with unique values

I'm getting an evaluation error while building a binary classification model in IBM Data Science Experience (DSX) using IBM Watson Machine Learning when one of the feature columns has unique categorical values.
The dataset I'm using looks like this:
Customer,Cust_No,Alerts,Churn
Ford,1000,8,0
GM,2000,50,1
Chrysler,3000,10,0
Tesla,4000,48,1
Toyota,5000,15,0
Honda,6000,55,1
Subaru,7000,12,0
BMW,8000,52,1
MBZ,9000,13,0
Porsche,10000,54,1
Ferrari,11000,9,0
Nissan,12000,49,1
Lexus,13000,10,0
Kia,14000,50,1
Saab,15000,12,0
Faraday,16000,47,1
Acura,17000,13,0
Infinity,18000,53,1
Eco,19000,16,0
Mazda,20000,52,1
In DSX, upload the above CSV data, then create a model using the automatic model builder. Select Churn as the label column and Customer and Alerts as the feature columns. Select the Binary Classification model type and use the default settings for the training/test split. Train the model. Model building fails with an evaluation error. If we instead select Cust_No and Alerts as the feature columns, the model is created successfully. Why is that?
When a model is built in DSX, the data is split into training, test, and holdout sets. These datasets are disjoint.
If the Customer field is chosen, then, being a string field, it must be converted to numeric values to be meaningful to ML algorithms (linear regression, logistic regression, decision trees, etc.).
How this is done: the algorithm iterates over each value of the Customer field and builds a dictionary mapping each string value to a numeric value (see Spark's StringIndexer - https://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer).
When the model is evaluated or scored, the string fields in the test subset are converted to numbers based on the dictionary built at training time. If a value is not found, there are two options: skip the entire record or throw an error (DSX chooses the first).
Since all values of the Customer field are unique, none of the records in the test dataset makes it to the evaluation phase, hence the error that the model cannot be evaluated.
In the case of Cust_No, the field is already numeric and does not require a category-encoding operation. Even if the values seen at evaluation were not found in training, they are used as-is.
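A minimal Spark sketch of this skip behavior, assuming toy column names and data (this illustrates StringIndexer itself, not the actual DSX internals):

import java.util.Arrays;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UniqueValueSkipDemo {
    // simple bean so Spark can infer a schema
    public static class Rec implements java.io.Serializable {
        private String customer;
        public Rec() {}
        public Rec(String customer) { this.customer = customer; }
        public String getCustomer() { return customer; }
        public void setCustomer(String customer) { this.customer = customer; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("UniqueValueSkipDemo").master("local[*]").getOrCreate();

        // training split: every Customer value is unique
        Dataset<Row> train = spark.createDataFrame(
                Arrays.asList(new Rec("Ford"), new Rec("GM")), Rec.class);
        // test split: values the indexer never saw during fitting
        Dataset<Row> test = spark.createDataFrame(
                Arrays.asList(new Rec("Tesla"), new Rec("Kia")), Rec.class);

        StringIndexerModel indexer = new StringIndexer()
                .setInputCol("customer")
                .setOutputCol("customerIndex")
                .setHandleInvalid("skip") // unseen labels drop the whole row
                .fit(train);

        // every test row contains an unseen label, so the result is empty:
        // this is why no records reach the evaluation phase
        indexer.transform(test).show();
        spark.stop();
    }
}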
Taking a step back, it seems to me that your data doesn't really contain predictive information other than in Alerts.
The Customer and Cust_No fields are basically ID columns and do not seem to contain predictive information.
Can you post a screenshot of your Evaluation error? I can try to help, I work on DSX.

How to perform automated removal of attributes from a set of a large number of attributes

So I've got a large dataset (2304 numeric attributes plus the class attribute), and I want to perform feature selection to remove misleading and redundant attributes. This is because I will be running discretization to make them nominal and then running Naïve Bayes on the dataset.
However, the Select attributes tab in Weka only lists them in ranking order. I know there is a Remove filter in the Preprocess tab, but it only takes a range or the numbers of the attribute(s).
Is there an automated way of removing these, given such a large dataset?
In the Preprocess tab,
Choose the AttributeSelection filter (supervised attribute filter).
Configure evaluator and search as desired.
Apply.
This will only keep the ones that pass the filter (keeping the class attribute, of course).
If you like the result, save it as a new ARFF file.
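If you would rather script it, here is a minimal sketch of the same filter in Java; the file names are assumptions, and CfsSubsetEval with GreedyStepwise is just one reasonable evaluator/search pair:

import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class AutoRemoveAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("large-dataset.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);           // assumes class is last

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new CfsSubsetEval()); // scores attribute subsets
        filter.setSearch(new GreedyStepwise());   // searches the subset space
        filter.setInputFormat(data);

        // drops every attribute the evaluator/search pair rejects,
        // keeping the class attribute
        Instances reduced = Filter.useFilter(data, filter);
        DataSink.write("large-dataset-reduced.arff", reduced); // hypothetical file
    }
}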

How to compress class values in WEKA explorer

Hello, I am new to WEKA and am using Weka 3.6.10.
Sorry if the answer to this question is something obvious.
I have a dataset containing 10 attributes and one decision class. The decision class is composed of the values {1,2,3,4}. Is there a way to change the configuration so that the values are treated as {1} and {2,3,4} (binary), rather than each value separately, without modifying the other attributes?
I had a look at the WEKA filters but did not find anything useful.
Thanks guys
Use an unsupervised attribute filter, e.g. the NominalToBinary filter. In the attribute-indices field of the configuration dialog, enter the position of the "Decision class" attribute. If it is in the 8th column, enter 8.
The filter will create a "dummy variable" column for each unique value of this attribute. If there are 4 unique values, your dataset will have 4 additional columns after applying the filter. Remove the 3 you don't need and keep the one corresponding to value 1.
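A minimal sketch of this approach in Java, assuming the decision class sits in column 8 and the three unwanted dummy columns end up at positions 9-11 (verify the positions in your own data):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToBinary;
import weka.filters.unsupervised.attribute.Remove;

public class BinarizeDecisionClass {
    public static void main(String[] args) throws Exception {
        // leave the class index unset here: NominalToBinary skips the class attribute
        Instances data = DataSource.read("decision.arff"); // hypothetical file

        // expand the 4-valued attribute in column 8 into one dummy column per value
        NominalToBinary toBinary = new NominalToBinary();
        toBinary.setAttributeIndices("8");
        toBinary.setBinaryAttributesNominal(true); // nominal dummies, usable as a class
        toBinary.setInputFormat(data);
        Instances expanded = Filter.useFilter(data, toBinary);

        // keep the "=1" dummy and drop the other three; the positions below are
        // hypothetical - check where the new columns land in your dataset
        Remove remove = new Remove();
        remove.setAttributeIndices("9-11");
        remove.setInputFormat(expanded);
        Instances binary = Filter.useFilter(expanded, remove);

        binary.setClassIndex(7); // column 8, zero-based: the surviving dummy
        System.out.println(binary.toSummaryString());
    }
}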

Attribute ranking in weka ends up 0 for all attributes

I am classifying 755 classes with 1024 attributes in Weka. I suppose I need to select the best attributes for better accuracy. I tried to select attributes using InfoGainAttributeEval and the Ranker method, but all the attributes were ranked as '0'. I am not sure what is wrong. Any help is appreciated.
The "0" here is not the ranking of each attribute, but the InfoGain of each attribute.
When you apply InfoGainAttributeEval on Weka, the 3 columns corresponds to :
Information Gain | "id" of the attribute | name of the attribute.
What happened here might be that the information gain was too small and has been rounded.
But again, as stated by #CAFEBABE in the comments, without having your data it is hardly possible to be sure.
Edit : A post here indicates that the precision of the output is set to four decimal places, which seems to confirm the above hypothesis.
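One way to test this hypothesis is to print the unrounded scores programmatically; a minimal sketch, assuming your data is in an ARFF file with the class as the last attribute:

import weka.attributeSelection.InfoGainAttributeEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RawInfoGain {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("your-data.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);       // assumes class is last

        InfoGainAttributeEval eval = new InfoGainAttributeEval();
        eval.buildEvaluator(data);
        for (int i = 0; i < data.numAttributes(); i++) {
            if (i == data.classIndex()) continue;
            // full double precision, not the GUI's four decimal places
            System.out.printf("%-30s %.12f%n",
                    data.attribute(i).name(), eval.evaluateAttribute(i));
        }
    }
}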
