I have made a good linear regression model with following step:
Data Integration
Data normalization/scaling(data preprocessing & feature engineering)
Model Building(using linear regression with SGD using cross validation)
Testing
My question is if we use this model in production environment then how can we do feature engineering of real time data because this model is built with feature normalization and scaling so how real time data can be normalized and scaled to get a good prediction? We don't need explicit feature engineering for cross validation and testing step because this can be done in data preprocessing step before building a model. What about real time data feature engineering?
This data could lend itself quite nicely to Featuretools. It is an open-source automated feature engineering library that explicitly deals with time to make sure you don't introduce label leakage.
For your music data, you could create two entities: "users" and "artist_plays", and then apply featuretools.dfs (Deep Feature Synthesis) to generate features. Think of an entity as being the same as a table in a relational database. Deep Feature Synthesis creates a single-table feature matrix ready for modeling from multiple different tables complete with high-level statistical features. Here is a short post explaining how it works.
This example is using plain Python, but could be adapted for use with Spark or Dask
# Create entityset
import featuretools as ft
from sklearn.preprocessing import Imputer, StandardScaler
import pandas as pd
import pickle
def load_entityset(user_df, artist_plays_df):
es = ft.EntitySet("artist plays")
es.entity_from_dataframe("users", user_df, index="user_id")
es.entity_from_dataframe("artist_plays", artist_plays_df, index="artist_id")
es.add_relationship(ft.Relationship(es['users']['user_id'], es['artist_plays']['user_id']))
return es
user_df = pd.read_csv("training_user.csv")
artist_plays_df = pd.read_csv("training_artist_plays.csv")
es = load_entityset(user_df, artist_plays_df)
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity='artist_plays',
ignore_variables={'artist_plays': ['play']})
# encode categoricals
encoded_fm, encoded_fl = ft.encode_features(feature_matrix, feature_defs)
# Impute/scale using SKLearn
imputer = Imputer()
scaler = StandardScaler()
imputed_fm = imputer.fit_transform(encoded_fm)
scaled_fm = scaler.fit_transform(imputed_fm)
# Now, save the encoded feature list, and the imputer/scaler to files to reuse in production
ft.save_features(encoded_fl, 'fl.p')
with open('imputer.p', 'wb') as f:
pickle.dump(imputer, f)
with open('scaler.p', 'wb') as f:
pickle.dump(scaler, f)
Then in production:
import featuretools as ft
import pickle
import pandas as pd
# load previous data
old_user_df = pd.read_csv("training_user.csv")
old_artist_plays_df = pd.read_csv("training_artist_plays.csv")
es_old = load_entityset(old_user_df, old_artist_plays_df)
# load new data
user_df = pd.read_csv("new_user.csv")
artist_plays_df = pd.read_csv("new_artist_plays.csv")
es_updated = load_entityset(user_df, artist_plays_df)
# merge both data sources
es = es_old.concat(es_updated)
# load back in encoded features
features = ft.load_features('fl.p', es)
fm = ft.calculate_feature_matrix(features,
entityset=es,
instance_ids=es_updated['artist_plays'].get_all_instances())
# impute and scale
with open('imputer.p', 'r') as f:
imputer = pickle.load(f)
imputed_fm = imputer.transform(fm)
with open('scaler.p', 'r') as f:
scaler = pickle.load(f)
scaled_fm = scaler.transform(imputed_fm)
We have a few demos using this workflow, check out this example to predict what a grocery shopper will buy in the future.
I have also used this workflow in a real time production environment to do prediction of delivery metrics for large scale software projects- check out this white paper we have published that goes through this method and the results in a live deployment in gory detail.
Related
Does anyone have experience of training a support vector machine (SVM) in Julia (1.4.1) ?
I tried the LIBSVM interface, but the example on the gituhub page gave an error :
# Load Fisher's classic iris data
iris = dataset("datasets", "iris")
# LIBSVM handles multi-class data automatically using a one-against-one strategy
labels = convert(Vector, iris[:Species])
# First dimension of input data is features; second is instances
instances = convert(Array, iris[:, 1:4])'
# Train SVM on half of the data using default parameters. See documentation
# of svmtrain for options
model = svmtrain(instances[:, 1:2:end], labels[1:2:end]);```
ERROR: MethodError: no method matching LIBSVM.SupportVectors(::Int32, ::Array{Int32,1}, ::CategoricalArray{String,1,UInt8,String,CategoricalValue{String,UInt8},Union{}}, ::Array{Float64,2}, ::Array{Int32,1}, ::Array{LIBSVM.SVMNode,1})
Closest candidates are:
LIBSVM.SupportVectors(::Int32, ::Array{Int32,1}, ::Array{T,1}, ::AbstractArray{U,2}, ::Array{Int32,1}, ::Array{LIBSVM.SVMNode,1}) where {T, U} at /home/benny/.julia/packages/LIBSVM/5Z99T/src/LIBSVM.jl:18
LIBSVM.SupportVectors(::LIBSVM.SVMModel, ::Any, ::Any) at /home/benny/.julia/packages/LIBSVM/5Z99T/src/LIBSVM.jl:27
It looks like LIBSVM.jl documentation is rather outdated and package was not updated appropriately, so it worth an issue (or at least pull request to update README).
Error that you see is not related to the package itself, but the fact that in current versions of DataFrames.jl and RDatasets.jl labels column is no longer Vector (as it was at the time when LIBSVM.jl was developed) but CategoricalArray. You can avoid this problem by converting CategoricalArray to usual Vector{String}. Complete example looks like this
using RDatasets, LIBSVM
using StatsBase, Printf # `mean` and `printf` are no longer in Base, and should be used explicitly
# Load Fisher's classic iris data
iris = dataset("datasets", "iris")
# LIBSVM handles multi-class data automatically using a one-against-one strategy
labels = string.(convert(Vector, iris[:Species]))
# First dimension of input data is features; second is instances
instances = convert(Array, iris[:, 1:4])'
# Train SVM on half of the data using default parameters. See documentation
# of svmtrain for options
model = svmtrain(instances[:, 1:2:end], labels[1:2:end]);
# Test model on the other half of the data.
(predicted_labels, decision_values) = svmpredict(model, instances[:, 2:2:end]);
# Compute accuracy
#printf "Accuracy: %.2f%%\n" mean((predicted_labels .== labels[2:2:end]))*100
Alternatively, you can use MLJ.jl or ScikitLearn.jl
which should correctly wrap LIBSVM.jl on their own.
Oskin's answer is for an older version.
In the current version, it should be modified as,
using RDatasets, LIBSVM
using StatsBase, Printf # `mean` and `printf` are no longer in Base, and should be used explicitly
# Load Fisher's classic iris data
iris = dataset("datasets", "iris")
# LIBSVM handles multi-class data automatically using a one-against-one strategy
labels = string.(convert(Vector, iris[:,:Species]))
# First dimension of input data is features; second is instances
instances = Matrix(iris[:, 1:4])'
# Train SVM on half of the data using default parameters. See documentation
# of svmtrain for options
model = svmtrain(instances[:, 1:2:end], labels[1:2:end]);
# Test model on the other half of the data.
(predicted_labels, decision_values) = svmpredict(model, instances[:, 2:2:end]);
# Compute accuracy
#printf "Accuracy: %.2f%%\n" mean((predicted_labels .== labels[2:2:end]))*100
sklearn contains Implementation of different feature selection methods (filter/wrapper/embedded).
All those methods designed for static systems.
Does sklearn supports feature selection on dynamic data ? (Data which vary with time)
In dynamic data, we need to improve the efficiency of feature selection, in order to be more effective.
I found some methods on IEEE (Incremental approaches for feature selection),
So is there any implementation at sklearn or other open-source library ?
Couldn't you just re-run your process on a scheduled basis and load your data dynamically? I wouldn't expect the dependent variable to change at all, but I suppose the independent variables could change somewhat.
#1) load your dataframe
#2) copy your target variable into a new dataframe
y = df[['SeriousDlqin2yrs']]
#3) drop your target variable
x = df[df.columns[df.columns!='SeriousDlqin2yrs']]
Finally, run this.
from sklearn.ensemble import RandomForestClassifier
features = np.array(x)
clf = RandomForestClassifier()
clf.fit(x, y)
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
I just tried that, and got this result.
If you expect to get non-numeric features, you will need to use one hot encoding to handle these.
import pandas as pd
pd.get_dummies(df)
http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example
I want to know is there any way in which we can partially save a Scikit-Learn Machine Learning model and reload it again to train it from the point it was saved before?
For models such as Scikitlearn applied to sentiment analysis, I would suspect you need to save two important things: 1) your model, 2) your vectorizer.
Remember that after training your model, your words are represented by a vector of length N, and that is defined according to your total number of words.
Below is a piece from my test-model and test-vectorizer saved in order to be used latter.
SAVING THE MODEL
import pickle
pickle.dump(vectorizer, open("model5vectorizer.pickle", "wb"))
pickle.dump(classifier_fitted, open("model5.pickle", "wb"))
LOADING THE MODEL IN A NEW SCRIPT (.py)
import pickle
model = pickle.load(open("model5.pickle", "rb"))
vectorizer = pickle.load(open("model5vectorizer.pickle", "rb"))
TEST YOUR MODEL
sentence_test = ["Results by Andutta et al (2013), were completely wrong and unrealistic."]
USING THE VECTORIZER (model5vectorizer.pickle) !!
sentence_test_data = vectorizer.transform(sentence_test)
print("### sentence_test ###")
print(sentence_test)
print("### sentence_test_data ###")
print(sentence_test_data)
# OBS-1: VECTOR HERE WILL HAVE SAME LENGTH AS BEFORE :)
# OBS-2: If you load the default vectorizer or a different one, then you may see the following problems
# sklearn.exceptions.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.
# # ValueError: X has 8 features per sample; expecting 11
result1 = model.predict(sentence_test_data) # using saved vectorizer from calibrated model
print("### RESULT ###")
print(result1)
Hope that helps.
Regards,
Andutta
When a data set is fitted to a Scikit-learn machine learning model, it is trained and supposedly ready to be used for prediction purposes. By training a model with let's say, 100 samples and using it and then going back to it and fitting another 50 samples to it, you will not make it better but you will rebuild it.
If your purpose is to build a model and make it more powerful as it interacts with more samples, you would be thinking of a real-time condition, such as a mobile robot for mapping an environment with a Kalman Filter.
I know of a couple of classification algorithms such as decision trees, but I can't use any of them to the problem I have at hands.
I have a dataset in which each row contains information about a purchase. It's columns are:
- customer id
- store id where the purchase took place
- date and time of the event
- amount of money spent
I'm trying to make a prediction that, given the information of who, where and when, predicts how much money is going to be spent.
What are some possible ways of doing this? Are there any well-known algorithms?
Also, I'm currently learning RapidMiner, and I'm experimenting with some of its features. Everything that I've tried there doesn't allow me to have a real number (amount spent) as a label. Maybe I'm doing something wrong?
You could use a Decision Tree Regressor for this. Using a toolkit like scikit-learn, you could use the DecisionTreeRegressor algo where your features would be store id, date and time, and customer id, and your target would be the amount spent.
You could turn this into a supervised learning problem. This is untested code, but it could probably get you started
# Load libraries
import numpy as np
import pylab as pl
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
from sklearn import cross_validation
from sklearn import metrics
from sklearn import grid_search
def fit_predict_model(data_import):
"""Find and tune the optimal model. Make a prediction on housing data."""
# Get the features and labels from your data
X, y = data_import.data, data_import.target
# Setup a Decision Tree Regressor
regressor = DecisionTreeRegressor()
parameters = {'max_depth':(4,5,6,7), 'random_state': [1]}
scoring_function = metrics.make_scorer(metrics.mean_absolute_error, greater_is_better=False)
## fit your data to it ##
reg = grid_search.GridSearchCV(estimator = regressor, param_grid = parameters, scoring=scoring_function, cv=10, refit=True)
fitted_data = reg.fit(X, y)
print "Best Parameters: "
print fitted_data.best_params_
# Use the model to predict the output of a particular sample
x = [## input a test sample in this list ##]
y = reg.predict(x)
print "Prediction: " + str(y)
fit_predict_model(##your data in here)
I took this from a project I was working on almost directly to predict housing prices so there are probably some unnecessary libraries and without doing validation you have no clue how accurate this case would be, but this should get you started.
Check out this link:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
Yes, as comments have pointed out it's regression that you need. Linear regression does sound like a good starting point as you don't have a huge number of variables.
In RapidMiner type regression into the Operators menu and you'll see several options under Modelling-> Functions. Linear Regression, Polynomical, Vector, etc. (There's more, but as a beginner let's start here).
Right click any of these operators and press Show Operator Info and you'll see numerical labels are allowed.
Next scroll through the help documentation of the operator and you'll see a link to a tutorial process. It's really simple to use, but it's good to get you started with an example.
Let me know if you need any help.
I am still very new to machine learning and trying to figure things out myself. I am using SciKit learn and have a data set of tweets with around 20,000 features (n_features=20,000). So far I achieved a precision, recall and f1 score of around 79%. I would like to use RFECV for feature selection and improve the performance of my model. I have read the SciKit learn documentation but am still a bit confused on how to use RFECV.
This is the code I have so far:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import RFECV
from sklearn import metrics
# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
docs_train, docs_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
# feature extraction
count_vect = CountVectorizer(stop_words='English', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_CV)
# Create the RFECV object
nb = MultinomialNB(alpha=0.5)
# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=nb, step=1, cv=2, scoring='accuracy')
rfecv.fit(X_tfidf, y_train)
X_rfecv=rfecv.transform(X_tfidf)
print("Optimal number of features : %d" % rfecv.n_features_)
# train classifier
clf = MultinomialNB(alpha=0.5).fit(X_rfecv, y_train)
# test clf on test data
X_test_CV = count_vect.transform(docs_test)
X_test_tfidf = tfidf_transformer.transform(X_test_CV)
X_test_rfecv = rfecv.transform(X_test_tfidf)
y_predicted = clf.predict(X_test_rfecv)
#print the mean accuracy on the given test data and labels
print ("Classifier score is: %s " % rfecv.score(X_test_rfecv,y_test))
Three questions:
1) Is this the correct way to use cross validation and RFECV? I am especially interested to know if I am running any risk of overfitting.
2) The accuracy of my model before and after I implemented RFECV with the above code are almost the same (around 78-79%), which puzzles me. I would expect performance to improve by using RFECV. Anything I might have missed here or could do differently to improve the performance of my model?
3) What other feature selection methods could you recommend me to try? I have tried RFE and SelectKBest so far, but they both haven't given me any improvement in terms of model accuracy.
To answer your questions:
There is a cross-validation built in the RFECV feature selection (hence the name), so you don't really need to have additional cross-validation for this single step. However since I understand you are running several tests, it's good to have an overall cross-validation to ensure you're not overfitting to a specific train-test split. I'd like to mention 2 points here:
I doubt the code behaves exactly like you think it does ;).
# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
docs_train, docs_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
# feature extraction
count_vect = CountVectorizer(stop_words='English', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)
Here we first go through the loop, that has 5 iterations (n_iter parameter in StratifiedShuffleSplit). Then we go out of the loop and we just run all your code with the last values of train_index, test_index. So this is equivalent to a single train-test split where you probably meant to have 5. You should move your code back into the loop if you want it to run like a 'proper' cross validation.
You are worried about overfitting: indeed when 'looking for the best method' the risk exists that we're going to pick the method that works best... only on the small sample we're testing the method on.
Here the best practice is to have a first train-test split, then to perform cross-validation only using the train set. The test set can be used 'sparingly' when you think you found something, to make sure the scores you get are consistent and you're not overfitting.
It may look like you're throwing away 30% of your data (your test set), but it's absolutely worth it.
It can be puzzling to see feature selection does not have that big an impact. To introspect a bit more you could look into the evolution of the score with the number of selected features (see the example from the docs).
That being said, I don't think this is the right use case for RFE. Basically with your code you are eliminating features one by one, which probably takes a long time to run and does not make so much sense when you have 20000 features.
Other feature selection methods: here you mention SelectKBest but you don't tell us which method you use to score your features! SelectKBest will pick the K best features according to a score function. I'm guessing you were using the default which is ok, but it's better to have an idea of what the default does ;).
I would try SelectPercentile with chi2 as a score function. SelectPercentile is probably a bit more convenient than SelectKBest because if your dataset grows a percentage probably makes more sense than a hardcoded number of features.
Another example from the docs that does just that (and more).
Additional remarks:
You could use a TfidfVectorizer instead of a CountVectorizer followed by a TfidfTransformer. This is strictly equivalent.
You could use a pipeline object to pack the different steps of your classifier into a single object you can run cross validation on (I encourage you to read the docs, it's pretty useful).
from sklearn.feature_selection import chi2_sparse
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
pipeline = Pipeline(steps=[
("vectorizer", TfidfVectorizer(stop_words='English', min_df=3, max_df=0.90, ngram_range=(1,3))),
("selector", SelectPercentile(score_func=chi2, percentile=70)),
('NB', MultinomialNB(alpha=0.5))
])
Then you'd be able to run cross validation on the pipeline object to find the best combination of alpha and percentile, which is much harder to do with separate estimators.
Hope this helps, happy learning ;).