Classification with numerical label? - machine-learning

I know of a couple of classification algorithms such as decision trees, but I can't use any of them to the problem I have at hands.
I have a dataset in which each row contains information about a purchase. It's columns are:
- customer id
- store id where the purchase took place
- date and time of the event
- amount of money spent
I'm trying to make a prediction that, given the information of who, where and when, predicts how much money is going to be spent.
What are some possible ways of doing this? Are there any well-known algorithms?
Also, I'm currently learning RapidMiner, and I'm experimenting with some of its features. Everything that I've tried there doesn't allow me to have a real number (amount spent) as a label. Maybe I'm doing something wrong?

You could use a Decision Tree Regressor for this. Using a toolkit like scikit-learn, you could use the DecisionTreeRegressor algo where your features would be store id, date and time, and customer id, and your target would be the amount spent.
You could turn this into a supervised learning problem. This is untested code, but it could probably get you started
# Load libraries
import numpy as np
import pylab as pl
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
from sklearn import cross_validation
from sklearn import metrics
from sklearn import grid_search
def fit_predict_model(data_import):
"""Find and tune the optimal model. Make a prediction on housing data."""
# Get the features and labels from your data
X, y = data_import.data, data_import.target
# Setup a Decision Tree Regressor
regressor = DecisionTreeRegressor()
parameters = {'max_depth':(4,5,6,7), 'random_state': [1]}
scoring_function = metrics.make_scorer(metrics.mean_absolute_error, greater_is_better=False)
## fit your data to it ##
reg = grid_search.GridSearchCV(estimator = regressor, param_grid = parameters, scoring=scoring_function, cv=10, refit=True)
fitted_data = reg.fit(X, y)
print "Best Parameters: "
print fitted_data.best_params_
# Use the model to predict the output of a particular sample
x = [## input a test sample in this list ##]
y = reg.predict(x)
print "Prediction: " + str(y)
fit_predict_model(##your data in here)
I took this from a project I was working on almost directly to predict housing prices so there are probably some unnecessary libraries and without doing validation you have no clue how accurate this case would be, but this should get you started.
Check out this link:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

Yes, as comments have pointed out it's regression that you need. Linear regression does sound like a good starting point as you don't have a huge number of variables.
In RapidMiner type regression into the Operators menu and you'll see several options under Modelling-> Functions. Linear Regression, Polynomical, Vector, etc. (There's more, but as a beginner let's start here).
Right click any of these operators and press Show Operator Info and you'll see numerical labels are allowed.
Next scroll through the help documentation of the operator and you'll see a link to a tutorial process. It's really simple to use, but it's good to get you started with an example.
Let me know if you need any help.

Related

ML accuracy for a particular group/range

General terms that I used to search on google such as Localised Accuracy, custom accuracy, biased cost functions all seem wrong, and maybe I am not even asking the right questions.
Imagine I have some data, may it be the:
The famous Iris Classification Problem
Pictures of felines
The Following Dataset that I made up on predicting house prices:
In all these scenario, I am really interested in the accuracy of one set/one regression range of data.
For irises, I really need Iris "setosa" to be classified correctly, really don't care if Iris virginica and Iris versicolor are all wrong.
for Felines, I really need the model to tell me if you spotted a tiger (for obvious reason), whether it is a Persian or ragdoll or not I dont really care.
For the house prices one, i want the accuracy of higher-end houses error to be minimised. Because error in those is costly.
How do I do this? If I want Setosa to be classified correctly, removing virginica or versicolour both seem wrong. Trying different algorithm like Linear/SVM etc are all well and good, but it only improves the OVERALL accuracy. But I really need, for example, "Tigers" to be predicted correctly, even at the expense of the "overall" accuracy of the model.
Is there a way to have a custom cost-function to allow me to have a high accuracy in a localise region in a regression problem, or a specific category in a classification problem?
If this cannot be answered, if you could just point me to some terms that i can search/research that would still be greatly appreciated.
You can use weights to achieve that. If you're using the SVC class of scikit-learn, you can pass class_weight in the constructor. You could also pass sample_weight in the fit-method.
For example:
from sklearn import svm
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = svm.SVC(class_weight={0: 3, 1: 1, 2: 1})
clf.fit(X, y)
This way setosa is more important than the other classes.
Example for regression
from sklearn.linear_model import LinearRegression
X = ... # features
y = ... # house prices
weights = []
for house_price in y:
if house_price > threshold:
weights.append(3)
else:
weights.append(1)
clf = LinearRegression()
clf.fit(X, y, sample_weight=weights)

Derive the right k in k-means clustering (including k = 1) in pyspark

I want to check if a clustering would be helpful or not on my coordinates.
I'm dealing with trajectories and want to check if all of them are starting on a same area (the trajectories are different). Thus the aim here is to characterise the most frequent departure points.
However, sometimes there is no need for clustering. I'm using K-means here. I had thought of using the Silhouette Score but I don't see if it is mathematically correct for the case where there is only one cluster. DBScan will not be a good clustering as density are not similar in the clusters I wanted to build.
Would you have an idea to create a kind of check between k=1 and k=3, which would be the best split for my data? I'm dealing here with data with coordinates (latitude/longitude) where the starting point is not 100% fixed but can vary within 2km around a kind of barycentre.
Simple extract with k=2 :
from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(inputCols=["lat", "lon"], outputCol="features")
df1= vecAssembler.transform(df)
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
# Loads data.
# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(df1.select('features'))
# Make predictions
transformed = model.transform(df1)
evaluator = ClusteringEvaluator(predictionCol='prediction', featuresCol='features', \
metricName='silhouette', distanceMeasure='squaredEuclidean')
evaluator.evaluate(transformed)
Is there a way to compute in pySpark a case with k=1 ? in order to derive Elbow or gap statistics ?

Logistic Regression with Non-Integer feature value

Hi I was following the Machine Learning course by Andrew Ng.
I found that in regression problems, specially logistic regression they have used integer values for the features which could be plotted in a graph. But there are so many use cases where the feature values may not be integer.
Let's consider the follow example :
I want to build a model to predict if any particular person will take a leave today or not. From my historical data I may find the following features helpful to build the training set.
Name of the person, Day of the week, Number of Leaves left for him till now (which maybe a continuous decreasing variable), etc.
So here are the following questions based on above
How do I go about designing the training set for my logistic regression model.
In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem, because I know continuously increasing or decreasing variables are used in linear regression. Is that true ?
Any help is really appreciated. Thanks !
Well, there are a lot of missing information in your question, for example, it'll be very much clearer if you have provided all the features you have, but let me dare to throw some assumptions!
ML Modeling in classification always requires dealing with numerical inputs, and you can easily infer each of the unique input as an integer, especially the classes!
Now let me try to answer your questions:
How do I go about designing the training set for my logistic regression model.
How I see it, you have two options (not necessary both are practical, it's you who should decide according to the dataset you have and the problem), either you predict the probability of all employees in the company who will be off in a certain day according to the historical data you have (i.e. previous observations), in this case, each employee will represent a class (integer from 0 to the number of employees you want to include). Or you create a model for each employee, in this case the classes will be either off (i.e. Leave) or on (i.e. Present).
Example 1
I created a dataset example of 70 cases and 4 employees which looks like this:
Here each name is associated with the day and month they took as off with respect to how many Annual Leaves left for them!
The implementation (using Scikit-Learn) would be something like this (N.B date contains only day and month):
Now we can do something like this:
import math
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# read dataset example
df = pd.read_csv('leaves_dataset.csv')
# assign unique integer to every employee (i.e. a class label)
mapping = {'Jack': 0, 'Oliver': 1, 'Ruby': 2, 'Emily': 3}
df.replace(mapping, inplace=True)
y = np.array(df[['Name']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])
# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)
# Example: probability of all employees who have 10 days left today
# warning: date must be same format
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Jack': prob[0,0], 'Oliver': prob[0,1], 'Ruby': prob[0,2], 'Emily': prob[0,3]})
Result
{'Ruby': 0.27545, 'Oliver': 0.15032,
'Emily': 0.28201, 'Jack': 0.29219}
N.B
To make this relatively work you need a real big dataset!
Also this can be better than the second one if there are other informative features in the dataset (e.g. the health status of the employee at that day..etc).
The second option is to create a model for each employee, here the result would be more accurate and more reliable, however, it's almost a nightmare if you have too many employees!
For each employee, you collect all their leaves in the past years and concatenate them into one file, in this case you have to complete all days in the year, in other words: for every day that employee has never got it off, that day should be labeled as on (or numerically speaking 1) and for the days off they should be labeled as off (or numerically speaking 0).
Obviously, in this case, the classes will be 0 and 1 (i.e. on and off) for each employee's model!
For example, consider this dataset example for the particular employee Jack:
Example 2
Then you can do for example:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# read dataset example
df = pd.read_csv('leaves_dataset2.csv')
# assign unique integer to every on and off (i.e. a class label)
mapping = {'off': 0, 'on': 1}
df.replace(mapping, inplace=True)
y = np.array(df[['Type']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])
# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)
# Example: probability of the employee "Jack" who has 10 days left today
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Off': prob[0,0], 'On': prob[0,1]})
Result
{'On': 0.33348, 'Off': 0.66651}
N.B in this case you have to create a dataset for each employee + training especial model + filling all the days the never taken in the past years as off!
In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem,
because I know continuously increasing or decreasing variables are
used in linear regression. Is that true ?
Well, there is nothing preventing you from using contentious values as features (e.g. number of leaves) in Logistic Regression; actually it doesn't make any difference if it's used in Linear or Logistic Regression but I believe you got confused between the features and the response:
The thing is, discrete values should be used in the response of Logistic Regression and Continuous values should be used in the response of the Linear Regression (a.k.a dependent variable or y).

Feedback in NaiveBayes Text Classification

I am a newbie in machine Learning, i am building a complaint categorizer and i want to provide a feedback model so that it can improve over time
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
value=[
'drought',
'robber',
]
targets=[
'water_department',
'police_department',
]
classifier = MultinomialNB()
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(value)
classifier.partial_fit(counts[:1], targets[:1],classes=numpy.unique(targets))
for c,t in zip(counts[1:],targets[1:]):
classifier.partial_fit(c, t.split())
value.append('dogs') #new value to train
targets.append('animal_department') #new target
vectorize = CountVectorizer()
counts = vectorize.fit_transform(value)
print counts
print targets
print vectorize.vocabulary_
####problem lies here
classifier.partial_fit(counts["""dont know the index of new value"""], targets[-1])
####problem lies here
Even if i somehow find the index of newly inserted value, it is giving the error
ValueError: Number of features 3 does not match previous data 2.
even thought i made it to insert one value at a time
I will try to answer the question from a general point of view. There are two sources of problem in the Naive Bayes (NB) approach described here:
Out-of-vocabulary (OOV) problem
Incremental training of NB
OOV problem: The simplest way to tackle the OOV problem is to decompose every word into character 3 grams. How many such 3-grams are possible? Assuming lower-casing there are only 26 possible ways to fill each place and hence the total number of possible character 3-grams is 26^3=17576, which is significantly lower than the number of possible English words that you're likely to see in text.
Hence, generally speaking, while training NB, a good idea is to use probabilities of character n-grams (n=3,4,5). This will drastically reduce the OOV problem.
Incremental training: For incremental training, given a new sentence decompose it into terms (character n-grams). Update the count of of each term for its corresponding observed class label. For example, if count(t,c) denotes how many times was the term t observed in class c, simply update the count if you see t in class 0 (or class 1) during incremental training. Updating the counts will update the maximum likelihood probability estimates as well.

SciKit Learn feature selection and cross validation using RFECV

I am still very new to machine learning and trying to figure things out myself. I am using SciKit learn and have a data set of tweets with around 20,000 features (n_features=20,000). So far I achieved a precision, recall and f1 score of around 79%. I would like to use RFECV for feature selection and improve the performance of my model. I have read the SciKit learn documentation but am still a bit confused on how to use RFECV.
This is the code I have so far:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import RFECV
from sklearn import metrics
# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
docs_train, docs_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
# feature extraction
count_vect = CountVectorizer(stop_words='English', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_CV)
# Create the RFECV object
nb = MultinomialNB(alpha=0.5)
# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=nb, step=1, cv=2, scoring='accuracy')
rfecv.fit(X_tfidf, y_train)
X_rfecv=rfecv.transform(X_tfidf)
print("Optimal number of features : %d" % rfecv.n_features_)
# train classifier
clf = MultinomialNB(alpha=0.5).fit(X_rfecv, y_train)
# test clf on test data
X_test_CV = count_vect.transform(docs_test)
X_test_tfidf = tfidf_transformer.transform(X_test_CV)
X_test_rfecv = rfecv.transform(X_test_tfidf)
y_predicted = clf.predict(X_test_rfecv)
#print the mean accuracy on the given test data and labels
print ("Classifier score is: %s " % rfecv.score(X_test_rfecv,y_test))
Three questions:
1) Is this the correct way to use cross validation and RFECV? I am especially interested to know if I am running any risk of overfitting.
2) The accuracy of my model before and after I implemented RFECV with the above code are almost the same (around 78-79%), which puzzles me. I would expect performance to improve by using RFECV. Anything I might have missed here or could do differently to improve the performance of my model?
3) What other feature selection methods could you recommend me to try? I have tried RFE and SelectKBest so far, but they both haven't given me any improvement in terms of model accuracy.
To answer your questions:
There is a cross-validation built in the RFECV feature selection (hence the name), so you don't really need to have additional cross-validation for this single step. However since I understand you are running several tests, it's good to have an overall cross-validation to ensure you're not overfitting to a specific train-test split. I'd like to mention 2 points here:
I doubt the code behaves exactly like you think it does ;).
# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
docs_train, docs_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
# feature extraction
count_vect = CountVectorizer(stop_words='English', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)
Here we first go through the loop, that has 5 iterations (n_iter parameter in StratifiedShuffleSplit). Then we go out of the loop and we just run all your code with the last values of train_index, test_index. So this is equivalent to a single train-test split where you probably meant to have 5. You should move your code back into the loop if you want it to run like a 'proper' cross validation.
You are worried about overfitting: indeed when 'looking for the best method' the risk exists that we're going to pick the method that works best... only on the small sample we're testing the method on.
Here the best practice is to have a first train-test split, then to perform cross-validation only using the train set. The test set can be used 'sparingly' when you think you found something, to make sure the scores you get are consistent and you're not overfitting.
It may look like you're throwing away 30% of your data (your test set), but it's absolutely worth it.
It can be puzzling to see feature selection does not have that big an impact. To introspect a bit more you could look into the evolution of the score with the number of selected features (see the example from the docs).
That being said, I don't think this is the right use case for RFE. Basically with your code you are eliminating features one by one, which probably takes a long time to run and does not make so much sense when you have 20000 features.
Other feature selection methods: here you mention SelectKBest but you don't tell us which method you use to score your features! SelectKBest will pick the K best features according to a score function. I'm guessing you were using the default which is ok, but it's better to have an idea of what the default does ;).
I would try SelectPercentile with chi2 as a score function. SelectPercentile is probably a bit more convenient than SelectKBest because if your dataset grows a percentage probably makes more sense than a hardcoded number of features.
Another example from the docs that does just that (and more).
Additional remarks:
You could use a TfidfVectorizer instead of a CountVectorizer followed by a TfidfTransformer. This is strictly equivalent.
You could use a pipeline object to pack the different steps of your classifier into a single object you can run cross validation on (I encourage you to read the docs, it's pretty useful).
from sklearn.feature_selection import chi2_sparse
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
pipeline = Pipeline(steps=[
("vectorizer", TfidfVectorizer(stop_words='English', min_df=3, max_df=0.90, ngram_range=(1,3))),
("selector", SelectPercentile(score_func=chi2, percentile=70)),
('NB', MultinomialNB(alpha=0.5))
])
Then you'd be able to run cross validation on the pipeline object to find the best combination of alpha and percentile, which is much harder to do with separate estimators.
Hope this helps, happy learning ;).

Resources