How do i convert 30 categories into numbers for scikit - machine-learning

I am new to statistics, Python, machine learning and Scikit-learn. However, I am trying this project where I have a CSV with 35 columns of student data. The first column is an ID which I think I can ignore. The last 3 columns are the grade 1, grade 2 and grade 3 scores. I have 400 rows. I want to see if I can learn some machine learning with it, and make some sense of the data I have. Now I understand Scikit works on Numpy arrays which do not handle categorical data like sex ('male', 'female') and so on. So I codified all the 30 categories with 1 for male, 2 for female and so on and so forth. I then did the following
X = my_data[:,1:33]
y = my_data[:,34]
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)
expected = y
predicted = model.predict(X)
mse = np.mean((predicted-expected)**2)
print(mse)
print(model.score(X,y))
I got a MSE of 6.0839840461 and a model score of 0.709407474898.
I got some result. So far so good for a first attempt. However, I realized that since I assigned increasing code values like 1 for male, 2 for female, the Linear Regression would have treated them as weights. How do I replace the Gender column with [1,0] or [0,1], which I learn is the right way to represent categorical data? Would it be a dictionary type column or a list type column? If so how will it be part of the Numpy array?

This is called indicator or dummy variables, and Pandas allows to easily encode such categorical values:
>>> import pandas as pd
>>> pd.get_dummies(['male', 'female'])
female male
0 0 1
1 1 0
Don't forget about multicollinearity, though - algorithms like linear regression rely on independence of variables, while in your case female=0 definitely means male=1. In this case simply remove one dummy variable (e.g. use only female var and not male).

There is also a LabelEncoder() in sklearn.preprocessing package:
from sklearn import preprocessing
le1 = preprocessing.LabelEncoder()
y = le1.transform(y)
You can also inverse transform back with le1.inverse_transform(y).
The encoding is done automatically though, you cannot change the order.

Related

Normalize and de-normalizing data in prediction model

I have developed a Random Forest model which is including two inputs as X and one output as Y. I have normalized both X and Y values for the training process.
After the model get trained, I selected the dataset as an unseen data for an input for the model. The data is coming from another resource. I normalized the X values and imported them to the trained model and get the Y-normalized value as an output. I wonder how the de normalizing process would be. I mean I have to multiply the output by which value to get the denormalized value?
I'd appreciate it if someone can help me in this regard.
You need to do the prepossessing inversely. But, you the mean and sd (standard deviation) values that used for normalization.
For example with scikit learn you can do it easily. You can do it with 1 line of code.
enter code here
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data= ...
scaled_data = scaler.fit_transform(data)
inverse = scaler.inverse_transform(scaled_data)

Correct labelling of confusion matrix axis

I have written the following code to generate a confusion matrix
from sklearn.naive_bayes import MultinomialNB
mnb=MultinomialNB()
mnb.fit(X_train,Y_train)
sms2="REMINDER FROM O2: To get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode"
sms="You’ve Won!"
X_test = [str (item) for item in X_test]
Y_pred = mnb.predict(vec.transform(X_test))
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(Y_test, Y_pred)
print(mat)
names =[ "non-spam", "spam"]
print(names)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
xticklabels=names, yticklabels=names)
plt.xlabel('Actual [Truth]')
plt.ylabel('Predicted')
plt.show()
It generated the following confusion matrix:
I am not sure if the axis are labelled correctly
ie. If the x-axis should be Actual and y-axis should be predicted
or the other way around
According to the scikit=learn documentation, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html, the first call parameter should be the ground truth values, and the second call parameter should be the predicted values via the classifier.
If that is the case, then the return parameter for the confusion matrix should have the rows containing the true classes and the columns should be the predicted classes.
I think your labels are backward based on that. See also the example https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_display_object_visualization.html#sphx-glr-auto-examples-miscellaneous-plot-display-object-visualization-py

Logistic Regression with Non-Integer feature value

Hi I was following the Machine Learning course by Andrew Ng.
I found that in regression problems, specially logistic regression they have used integer values for the features which could be plotted in a graph. But there are so many use cases where the feature values may not be integer.
Let's consider the follow example :
I want to build a model to predict if any particular person will take a leave today or not. From my historical data I may find the following features helpful to build the training set.
Name of the person, Day of the week, Number of Leaves left for him till now (which maybe a continuous decreasing variable), etc.
So here are the following questions based on above
How do I go about designing the training set for my logistic regression model.
In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem, because I know continuously increasing or decreasing variables are used in linear regression. Is that true ?
Any help is really appreciated. Thanks !
Well, there are a lot of missing information in your question, for example, it'll be very much clearer if you have provided all the features you have, but let me dare to throw some assumptions!
ML Modeling in classification always requires dealing with numerical inputs, and you can easily infer each of the unique input as an integer, especially the classes!
Now let me try to answer your questions:
How do I go about designing the training set for my logistic regression model.
How I see it, you have two options (not necessary both are practical, it's you who should decide according to the dataset you have and the problem), either you predict the probability of all employees in the company who will be off in a certain day according to the historical data you have (i.e. previous observations), in this case, each employee will represent a class (integer from 0 to the number of employees you want to include). Or you create a model for each employee, in this case the classes will be either off (i.e. Leave) or on (i.e. Present).
Example 1
I created a dataset example of 70 cases and 4 employees which looks like this:
Here each name is associated with the day and month they took as off with respect to how many Annual Leaves left for them!
The implementation (using Scikit-Learn) would be something like this (N.B date contains only day and month):
Now we can do something like this:
import math
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# read dataset example
df = pd.read_csv('leaves_dataset.csv')
# assign unique integer to every employee (i.e. a class label)
mapping = {'Jack': 0, 'Oliver': 1, 'Ruby': 2, 'Emily': 3}
df.replace(mapping, inplace=True)
y = np.array(df[['Name']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])
# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)
# Example: probability of all employees who have 10 days left today
# warning: date must be same format
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Jack': prob[0,0], 'Oliver': prob[0,1], 'Ruby': prob[0,2], 'Emily': prob[0,3]})
Result
{'Ruby': 0.27545, 'Oliver': 0.15032,
'Emily': 0.28201, 'Jack': 0.29219}
N.B
To make this relatively work you need a real big dataset!
Also this can be better than the second one if there are other informative features in the dataset (e.g. the health status of the employee at that day..etc).
The second option is to create a model for each employee, here the result would be more accurate and more reliable, however, it's almost a nightmare if you have too many employees!
For each employee, you collect all their leaves in the past years and concatenate them into one file, in this case you have to complete all days in the year, in other words: for every day that employee has never got it off, that day should be labeled as on (or numerically speaking 1) and for the days off they should be labeled as off (or numerically speaking 0).
Obviously, in this case, the classes will be 0 and 1 (i.e. on and off) for each employee's model!
For example, consider this dataset example for the particular employee Jack:
Example 2
Then you can do for example:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# read dataset example
df = pd.read_csv('leaves_dataset2.csv')
# assign unique integer to every on and off (i.e. a class label)
mapping = {'off': 0, 'on': 1}
df.replace(mapping, inplace=True)
y = np.array(df[['Type']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])
# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)
# Example: probability of the employee "Jack" who has 10 days left today
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Off': prob[0,0], 'On': prob[0,1]})
Result
{'On': 0.33348, 'Off': 0.66651}
N.B in this case you have to create a dataset for each employee + training especial model + filling all the days the never taken in the past years as off!
In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem,
because I know continuously increasing or decreasing variables are
used in linear regression. Is that true ?
Well, there is nothing preventing you from using contentious values as features (e.g. number of leaves) in Logistic Regression; actually it doesn't make any difference if it's used in Linear or Logistic Regression but I believe you got confused between the features and the response:
The thing is, discrete values should be used in the response of Logistic Regression and Continuous values should be used in the response of the Linear Regression (a.k.a dependent variable or y).

GridSearchCV scoring parameter: using scoring='f1' or scoring=None (by default uses accuracy) gives the same result

I'm using an example extracted from the book "Mastering Machine Learning with scikit learn".
It uses a decision tree to predict whether each of the images on a web page is an
advertisement or article content. Images that are classified as being advertisements could then be hidden using Cascading Style Sheets. The data is publicly available from the Internet Advertisements Data Set: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements, which contains data for 3,279 images.
The following is the complete code for completing the classification task:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
import sys,random
def main(argv):
df = pd.read_csv('ad-dataset/ad.data', header=None)
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df[len(df.columns.values)-1]
explanatory_variable_columns.remove(len(df.columns.values)-1)
y = [1 if e == 'ad.' else 0 for e in response_variable_column]
X = df[list(explanatory_variable_columns)]
X.replace(to_replace=' *\?', value=-1, regex=True, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=100000)
pipeline = Pipeline([('clf',DecisionTreeClassifier(criterion='entropy',random_state=20000))])
parameters = {
'clf__max_depth': (150, 155, 160),
'clf__min_samples_split': (1, 2, 3),
'clf__min_samples_leaf': (1, 2, 3)
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)
print 'Best score: %0.3f' % grid_search.best_score_
print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
print '\t%s: %r' % (param_name, best_parameters[param_name])
predictions = grid_search.predict(X_test)
print classification_report(y_test, predictions)
if __name__ == '__main__':
main(sys.argv[1:])
The RESULTS of using scoring='f1' in GridSearchCV as in the example is:
The RESULTS of using scoring=None (by default Accuracy measure) is the same as using F1 score:
If I'm not wrong optimizing the parameter search by different scoring functions should yield different results. The following case shows that different results are obtained when scoring='precision' is used.
The RESULTS of using scoring='precision' is DIFFERENT than the other two cases. The same would be true for 'recall', etc:
WHY 'F1' AND None, BY DEFAULT ACCURACY, GIVE THE SAME RESULT??
EDITED
I agree with both answers by Fabian & Sebastian. The problem should be the small param_grid. But I just wanted to clarify that the problem surged when I was working with a totally different (not the one in the example here) highly imbalance dataset 100:1 (which should affect the accuracy) and using Logistic Regression. In this case also 'F1' and accuracy gave the same result.
The param_grid that I used, in this case, was the following:
parameters = {"penalty": ("l1", "l2"),
"C": (0.001, 0.01, 0.1, 1, 10, 100),
"solver": ("newton-cg", "lbfgs", "liblinear"),
"class_weight":[{0:4}],
}
I guess that the parameter selection is also too small.
I think that the author didn't choose this example very well. I may be missing something here, but min_samples_split=1 doesn't make sense to me: Isn't it the same as setting min_samples_split=2 since you can't split 1 sample -- essentially, it's a waste of computational time.
From the documentation: min_samples_split: "The minimum number of samples required to split an internal node."
Btw. this is a very small grid and there is not much choice anyways, which may explain why accuracy and f1 give you the same parameter combinations and hence the same scoring tables.
Like mentioned above, the dataset may be well balanced which is why F1 and accuracy scores may prefer the same parameter combinations. So, looking further at your GridSearch results using (a) F1 score and (b) Accuracy, I conclude that in both cases a depth of 150 works best. Since this is the lower boundary, it gives you a slight hind that lower "depth" values may work even better. However, I suspect that the tree doesn't even go that deep on this dataset (you can end up with "pure" leaves even well before reaching the max depth).
So, let's repeat the experiment with a little bit more sensible values using the following parameter grid
parameters = {
'clf__max_depth': list(range(2, 30)),
'clf__min_samples_split': (2,),
'clf__min_samples_leaf': (1,)
}
The optimal "depth" for the best F1 score seems to be around 15.
Best score: 0.878
Best parameters set:
clf__max_depth: 15
clf__min_samples_leaf: 1
clf__min_samples_split: 2
precision recall f1-score support
0 0.98 0.99 0.99 716
1 0.92 0.89 0.91 104
avg / total 0.98 0.98 0.98 820
Next, let's try it using "accuracy" (or None) as our scoring metric:
> Best score: 0.967
Best parameters set:
clf__max_depth: 6
clf__min_samples_leaf: 1
clf__min_samples_split: 2
precision recall f1-score support
0 0.98 0.99 0.98 716
1 0.93 0.85 0.88 104
avg / total 0.97 0.97 0.97 820
As you can see, you get different results now, and the "optimal" depth is different if you use "accuracy."
I don't agree that optimizing the parameter search by different scoring functions should yield necessarily different results necessarily. If your dataset is balanced (roughly same number of samples in each class), I would expect that model selection by accuracy and F1 would yield very similar results.
Also, have in mind that GridSearchCV optimizes over a discrete grid. Maybe using a thinner grid of parameters would yield the results that you are looking for.
On an unbalanced dataset use the "labels" parameter of the f1_score scorer to use only the f1 score of the class you are interested in. Or consider using "sample_weight".

Genetic algorithms: fitness function for feature selection algorithm

I have data set n x m where there are n observations and each observation consists of m values for m attributes. Each observation has also observed result assigned to it. m is big, too big for my task. I am trying to find a best and smallest subset of m attributes that still represents the whole dataset quite well, so that I could use only these attributes for teaching a neural network.
I want to use genetic algorithm for this. The problem is the fittness function. It should tell how well the generated model (subset of attributes) still reflects the original data. And I don't know how to evaluate certain subset of attributes against the whole set.
Of course I could use the neural network(that will later use this selected data anyway) for checking how good the subset is - the smaller the error, the better the subset. BUT, this takes a looot of time in my case and I do not want to use this solution. I am looking for some other way that would preferably operate only on the data set.
What I thought about was: having subset S (found by genetic algorithm), trim data set so that it contains values only for subset S and check how many observations in this data ser are no longer distinguishable (have same values for same attributes) while having different result values. The bigger the number is, the worse subset it is. But this seems to me like a bit too computationally exhausting.
Are there any other ways to evaluate how well a subset of attributes still represents the whole data set?
This cost function should do what you want: sum the factor loadings that correspond to the features comprising each subset.
The higher that sum, the greater the share of variability in the response variable that is explained with just those features. If i understand the OP, this cost function is a faithful translation of "represents the whole set quite well" from the OP.
Reducing to code is straightforward:
Calculate the covariance matrix of your dataset (first remove the
column that holds the response variable, i.e., probably the last
one). If your dataset is m x n (columns x rows), then this
covariance matrix will be n x n, with "1"s down the main diagonal.
Next, perform an eigenvalue decomposition on this covariance
matrix; this will give you the proportion of the total variability
in the response variable, contributed by that eigenvalue (each
eigenvalue corresponds to a feature, or column). [Note,
singular-value decomposition (SVD) is often used for this step, but
it's unnecessary--an eigenvalue decomposition is much simpler, and
always does the job as long as your matrix is square, which
covariance matrices always are].
Your genetic algorithm will, at each iteration, return a set of
candidate solutions (features subsets, in your case). The next task
in GA, or any combinatorial optimization, is to rank those candiate
solutions by their cost function score. In your case, the cost
function is a simple summation of the eigenvalue proportion for each
feature in that subset. (I guess you would want to scale/normalize
that calculation so that the higher numbers are the least fit
though.)
A sample calculation (using python + NumPy):
>>> # there are many ways to do an eigenvalue decomp, this is just one way
>>> import numpy as NP
>>> import numpy.linalg as LA
>>> # calculate covariance matrix of the data set (leaving out response variable column)
>>> C = NP.corrcoef(d3, rowvar=0)
>>> C.shape
(4, 4)
>>> C
array([[ 1. , -0.11, 0.87, 0.82],
[-0.11, 1. , -0.42, -0.36],
[ 0.87, -0.42, 1. , 0.96],
[ 0.82, -0.36, 0.96, 1. ]])
>>> # now calculate eigenvalues & eivenvectors of the covariance matrix:
>>> eva, evc = LA.eig(C)
>>> # now just get value proprtions of each eigenvalue:
>>> # first, sort the eigenvalues, highest to lowest:
>>> eva1 = NP.sort(eva)[::-1]
>>> # get value proportion of each eigenvalue:
>>> eva2 = NP.cumsum(eva1/NP.sum(eva1)) # "cumsum" is just cumulative sum
>>> title1 = "ev value proportion"
>>> print( "{0}".format("-"*len(title1)) )
-------------------
>>> for row in q :
print("{0:1d} {1:3f} {2:3f}".format(int(row[0]), row[1], row[2]))
ev value proportion
1 2.91 0.727
2 0.92 0.953
3 0.14 0.995
4 0.02 1.000
so it's the third column of values just above (one for each feature) that are summed (selectively, depending on which features are present in a given subset you are evaluating with the cost function).

Resources