What is the use of 'n_splits' in StratifiedShuffleSplit from scikit-learn?

I have been reading the book Hands-On Machine Learning with Scikit-Learn and TensorFlow and I found this code:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
I would like to know what the argument 'n_splits' does. I searched everywhere but couldn't find a satisfactory response.
Thanks in advance!!

As the name suggests, the n_splits parameter specifies how many separate train/test splits the splitter generates. Each split is a fresh random shuffle of the data, stratified on the labels.
For example, setting n_splits = 3 makes the loop run three times, producing a different split on each iteration, so you can validate a model on several splits rather than just one.
Setting n_splits = 1 mimics what sklearn.model_selection.train_test_split does when you pass it the stratify parameter. The documentation has detailed explanations of each parameter of this class.
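To make this concrete, here is a minimal sketch on made-up toy data (the arrays X and y are assumptions for illustration, not the housing data from the book). It shows that split() yields one (train_index, test_index) pair per requested split and that every test fold keeps the class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 20 samples, two classes in a 50/50 ratio (illustrative only).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

# split() yields one (train_index, test_index) pair per requested split.
splits = list(splitter.split(X, y))

for i, (train_index, test_index) in enumerate(splits):
    # Each test fold keeps the 50/50 class ratio of y.
    print(i, "test indices:", np.sort(test_index),
          "class counts:", np.bincount(y[test_index]))
```

With n_splits=1 the loop body would run exactly once, which is why the book's code ends up with a single train set and a single test set.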


Is it possible to use a LinearSVC model with OneVsRest in PySpark?

I'm trying to use a LinearSVC model with OneVsRest in PySpark, but it seems it's not supported yet.
My error message:
LinearSVC only supports binary classification. 1 classes detected in LinearSVC_43a50b0b70d60a8cbdb1__labelCol
What changes do I need to make to get this working in PySpark?
Does anyone know when OneVsRest in PySpark will support LinearSVC?
The error message tells you that your dataset currently contains only one class, but LinearSVC is a binary classification algorithm which requires exactly two classes.
I'm not sure whether the rest of your code will cause any issues, because you haven't posted it. Just in case you or someone else needs it, have a look below.
As already said, LinearSVC is a binary classification algorithm and by definition does not support multiclass classification, but you can always reduce a multiclass problem to a set of binary classification problems. One-vs-Rest is one approach for such a reduction. From an engineering perspective it makes sense to separate this into a dedicated class, as Spark did. OneVsRest trains one classifier for each of your classes, and a given sample is scored against this list of classifiers. The classifier with the highest score determines the predicted label for your sample.
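The scoring step of One-vs-Rest can be sketched in plain Python. This is only an illustration of the reduction, not Spark's implementation: the scorers below are hypothetical hand-written linear functions standing in for trained binary classifiers, and the names predict_one_vs_rest and make_linear_scorer are made up for this example.

```python
# One-vs-Rest sketch: one binary scorer per class, highest score wins.

def make_linear_scorer(weights, bias):
    """Build a hypothetical linear scorer: score(x) = w . x + b."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x)) + bias

def predict_one_vs_rest(binary_scorers, sample):
    """Return the label whose binary scorer gives the highest score."""
    scores = {label: scorer(sample) for label, scorer in binary_scorers.items()}
    return max(scores, key=scores.get)

# Stand-ins for three trained "class k vs rest" classifiers.
binary_scorers = {
    0: make_linear_scorer([1.0, 0.0], 0.0),
    1: make_linear_scorer([0.0, 1.0], 0.0),
    2: make_linear_scorer([-1.0, -1.0], 1.0),
}

predicted = predict_one_vs_rest(binary_scorers, [0.2, 0.9])
print(predicted)
```

For the sample [0.2, 0.9] the three scores are 0.2, 0.9 and -0.1, so the class-1 scorer wins.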
Have a look at the code below for the usage of OneVsRest with LinearSVC:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import OneVsRest, LinearSVC
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
df = spark.read.csv('/tmp/iris.data', schema='sepalLength DOUBLE, sepalWidth DOUBLE, petalLength DOUBLE, petalWidth DOUBLE, class STRING')
vecAssembler = VectorAssembler(inputCols=["sepalLength", "sepalWidth", "petalLength", 'petalWidth'], outputCol="features")
df = vecAssembler.transform(df)
stringIndexer = StringIndexer(inputCol="class", outputCol="label")
si_model = stringIndexer.fit(df)
df = si_model.transform(df)
svm = LinearSVC()
ovr = OneVsRest(classifier=svm)
ovrModel = ovr.fit(df)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
predictions = ovrModel.transform(df)
print("Accuracy: {}".format(evaluator.evaluate(predictions)))
Accuracy: 0.9533333333333334
This is a funny bug in PySpark. If you have multiple classes, they must be identified starting from zero.
I ran into this bug just now. I had a dataframe built in the same way they suggest in the LinearSVC guide.
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
    Row(label=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
    Row(label=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
Originally, it was a plain RDD, and I then transformed each RDD record into a Row. I had a three-class problem where the classes were named 1, 2 and 3. I instantiated a OneVsRest object (just as @cronoik suggested) and hit the same error as you.
So I took the df dataframe exactly as initialised in their user guide (see above) and started playing with it by adding and removing classes. I simply replaced label=0.0 with label=2.0 in the second row and the error appeared, even with their dataframe and even with just two classes.
So I renamed my classes from 1, 2, 3 to 0, 1, 2 and the error went away.
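The renaming itself can be sketched in plain Python. The raw_labels list and the label_to_index mapping are hypothetical names used only for illustration. In a Spark pipeline, StringIndexer (used in the other answer) also produces label indices that start at zero, although it orders them by frequency by default rather than by value.

```python
# Hypothetical raw labels named 1, 2 and 3, as in the case described above.
raw_labels = [1, 3, 2, 1, 2, 3]

# Map each distinct label to a consecutive index starting at 0.
label_to_index = {
    label: float(index)
    for index, label in enumerate(sorted(set(raw_labels)))
}

# Apply the mapping to every record.
indexed_labels = [label_to_index[label] for label in raw_labels]

print(label_to_index)
print(indexed_labels)
```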
Hope this helps!

How are the initial k-means points chosen in BigQuery ML?

I'm using BigQuery for machine learning, more specifically the k-means method for an unlabeled dataset where I'm trying to find clusters.
I'd like to know if someone has discovered how BQ ML initializes the centroids.
I already tried looking at the documentation but either there is nothing or I couldn't find it.
CREATE MODEL `project.dataset.model_name`
OPTIONS(
  model_type = "kmeans",
  num_clusters = 3,
  distance_type = "euclidean",
  early_stop = TRUE,
  max_iterations = 20,
  standardize_features = TRUE) AS
(SELECT * FROM `project.dataset.sample_date_to_train`)
The results differ a little each time I run it.
Does anyone have experience with this?
For anyone still looking for an answer: BigQuery ML was recently updated on this point, and two new parameters have been added to the CREATE MODEL statement.
Basically, you can supply K custom observations (belonging to the data table) that will serve as the initial centroids for your K-means algorithm. You can find the details in the BigQuery ML documentation. Maybe it's not the most exciting solution to your problem, but it's still something you can work with if you need reproducibility.
If I had to guess, it probably uses a similar logic to TensorFlow (BQML might be using TF under the hood as it is). Random partitioning seems to be the TensorFlow default, so that would be my guess.
The reason you are seeing different results each time you train the model is the random nature of the initial values assigned to the centroids. The K-means algorithm begins by randomly selecting a position for each of the k centroids. The documentation explains the exact process the K-means algorithm follows.
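A small sketch of why random initialization changes the outcome between runs. This is generic k-means seeding in plain Python, not BigQuery ML's internal code, and the function name pick_initial_centroids and the sample points are assumptions made for illustration.

```python
import random

def pick_initial_centroids(points, k, seed=None):
    """Pick k distinct data points at random as starting centroids."""
    rng = random.Random(seed)
    return rng.sample(points, k)

# Made-up 2-D points to cluster.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0),
          (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]

# With a fixed seed the starting centroids are reproducible.
first = pick_initial_centroids(points, k=3, seed=0)
second = pick_initial_centroids(points, k=3, seed=0)
print(first == second)

# Without a seed the starting centroids can differ from run to run,
# and so can the clusters the algorithm converges to.
unseeded = pick_initial_centroids(points, k=3)
print(unseeded)
```

Fixing the starting centroids, whether through a seed or by supplying them explicitly, is what makes the clustering repeatable.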

How to implement a sequence classification LSTM network in CNTK?

I'm working on implementation of LSTM Neural Network for sequence classification. I want to design a network with the following parameters:
Input : a sequence of n one-hot-vectors.
Network topology : two-layer LSTM network.
Output: the probability that a given sequence belongs to a class (binary classification). I want to take into account only the last output from the second LSTM layer.
I need to implement this in CNTK, but I'm struggling because its documentation is not very well written. Can someone help me with that?
There is a sequence classification example that follows exactly what you're looking for.
The only difference is that it uses just a single LSTM layer. You can easily change this network to use multiple layers by changing:
LSTM_function = LSTMP_component_with_self_stabilization(
    embedding_function.output, LSTM_dim, cell_dim)[0]
to:
num_layers = 2 # for example
encoder_output = embedding_function.output
for i in range(0, num_layers):
    encoder_output = LSTMP_component_with_self_stabilization(encoder_output.output, LSTM_dim, cell_dim)
However, you'd be better served by using the new layers library. Then you can simply do this:
encoder_output = Stabilizer()(input_sequence)
for i in range(0, num_layers):
    encoder_output = Recurrence(LSTM(hidden_dim)) (encoder_output.output)
Then, to get your final output that you'd put into a dense output layer, you can first do:
final_output = sequence.last(encoder_output)
and then
z = Dense(vocab_dim) (final_output)
Here you can find a straightforward approach; just add the additional layer, like:
Recurrence(LSTM(hidden_dim), go_backwards=False),
Recurrence(LSTM(hidden_dim), go_backwards=False),
Dense(label_dim, activation=sigmoid)
Then train it, test it and apply it.
CNTK published a hands-on tutorial for language understanding that has an end-to-end recipe:
This hands-on lab shows how to implement a recurrent network to process text, for the Air Travel Information Services (ATIS) task of slot tagging (tag individual words to their respective classes, where the classes are provided as labels in the training data set). We will start with a straight-forward embedding of the words followed by a recurrent LSTM. This will then be extended to include neighboring words and run bidirectionally. Lastly, we will turn this system into an intent classifier.
I'm not familiar with CNTK, but since the question has been left unanswered for so long, perhaps I can offer some advice to help you with the implementation.
I'm not sure how experienced you are with these architectures, but before moving to CNTK (which seemingly has a less active community), I'd suggest looking at other popular libraries (like Theano, TensorFlow, etc.).
For instance, a similar task in Theano is given here: kyunghyuncho tutorials. Just look for "def lstm_layer" for the definitions.
A Torch example can be found in Karpathy's very popular tutorials.
Hope this helps a bit..

Classification with numerical label?

I know of a couple of classification algorithms, such as decision trees, but I can't apply any of them to the problem I have at hand.
I have a dataset in which each row contains information about a purchase. Its columns are:
- customer id
- store id where the purchase took place
- date and time of the event
- amount of money spent
I'm trying to make a prediction that, given the information of who, where and when, predicts how much money is going to be spent.
What are some possible ways of doing this? Are there any well-known algorithms?
Also, I'm currently learning RapidMiner, and I'm experimenting with some of its features. Everything that I've tried there doesn't allow me to have a real number (amount spent) as a label. Maybe I'm doing something wrong?
You could use a Decision Tree Regressor for this. Using a toolkit like scikit-learn, you could use the DecisionTreeRegressor algo where your features would be store id, date and time, and customer id, and your target would be the amount spent.
You could turn this into a supervised learning problem. This is untested code, but it could probably get you started
# Load libraries
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
# GridSearchCV now lives in sklearn.model_selection (the old grid_search module was removed)
from sklearn.model_selection import GridSearchCV

def fit_predict_model(data_import, sample):
    """Find and tune the optimal model. Make a prediction on housing data."""
    # Get the features and labels from your data
    X, y = data_import.data, data_import.target
    # Set up a Decision Tree Regressor and the parameter grid to search
    regressor = DecisionTreeRegressor()
    parameters = {'max_depth': (4, 5, 6, 7), 'random_state': [1]}
    # Mean absolute error is a loss, so lower is better
    scoring_function = metrics.make_scorer(metrics.mean_absolute_error, greater_is_better=False)
    # Fit a 10-fold cross-validated grid search to the data
    reg = GridSearchCV(estimator=regressor, param_grid=parameters, scoring=scoring_function, cv=10, refit=True)
    fitted_data = reg.fit(X, y)
    print("Best Parameters: ")
    print(fitted_data.best_params_)
    # Use the model to predict the output of a particular sample
    # (sample is a list of feature values, in the same order as the columns of X)
    prediction = reg.predict([sample])
    print("Prediction: " + str(prediction))

# Call it with your own data and a test sample:
# fit_predict_model(your_data, your_sample)
I took this almost directly from a project I was working on to predict housing prices, so there are probably some unnecessary libraries. Without doing validation you have no idea how accurate it would be in your case, but this should get you started.
Yes, as comments have pointed out it's regression that you need. Linear regression does sound like a good starting point as you don't have a huge number of variables.
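Outside RapidMiner, the same linear-regression starting point could be sketched with scikit-learn. The purchase data below is entirely made up, and encoding the date and time as a single hour-of-day column is an assumption made to keep the example short. The customer and store ids are one-hot encoded because they are categories, not quantities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical purchases: (customer id, store id), hour of day, amount spent.
who_where = np.array([[1, 10], [1, 11], [2, 10], [2, 11], [3, 10], [3, 11]])
hour = np.array([[9], [18], [9], [18], [12], [20]])
amount = np.array([12.5, 30.0, 8.0, 22.0, 15.0, 40.0])

# One-hot encode the ids, then append the numeric hour column.
encoder = OneHotEncoder(handle_unknown="ignore")
X = np.hstack([encoder.fit_transform(who_where).toarray(), hour])

# The label (amount spent) is a real number, so this is regression.
model = LinearRegression().fit(X, amount)

# Predict the amount for customer 2 at store 11 at 18:00.
query = np.hstack([encoder.transform([[2, 11]]).toarray(), [[18]]])
prediction = model.predict(query)
print(prediction)
```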
In RapidMiner, type "regression" into the Operators menu and you'll see several options under Modelling -> Functions: Linear Regression, Polynomial, Vector, etc. (There are more, but as a beginner let's start here.)
Right-click any of these operators and press Show Operator Info, and you'll see that numerical labels are allowed.
Next, scroll through the help documentation of the operator and you'll see a link to a tutorial process. It's really simple to use, and an example is a good way to get started.
Let me know if you need any help.

SciKit Learn feature selection and cross validation using RFECV

I am still very new to machine learning and trying to figure things out myself. I am using SciKit learn and have a data set of tweets with around 20,000 features (n_features=20,000). So far I achieved a precision, recall and f1 score of around 79%. I would like to use RFECV for feature selection and improve the performance of my model. I have read the SciKit learn documentation but am still a bit confused on how to use RFECV.
This is the code I have so far:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import RFECV
from sklearn import metrics
# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# feature extraction
count_vect = CountVectorizer(stop_words='English', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_CV)
# Create the RFECV object
nb = MultinomialNB(alpha=0.5)
# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=nb, step=1, cv=2, scoring='accuracy')
rfecv.fit(X_tfidf, y_train)
print("Optimal number of features : %d" % rfecv.n_features_)
# train classifier on the features selected by RFECV
X_rfecv = rfecv.transform(X_tfidf)
clf = MultinomialNB(alpha=0.5).fit(X_rfecv, y_train)
# test clf on test data
X_test_CV = count_vect.transform(docs_test)
X_test_tfidf = tfidf_transformer.transform(X_test_CV)
X_test_rfecv = rfecv.transform(X_test_tfidf)
y_predicted = clf.predict(X_test_rfecv)
#print the mean accuracy on the given test data and labels
print ("Classifier score is: %s " % rfecv.score(X_test_rfecv,y_test))
Three questions:
1) Is this the correct way to use cross validation and RFECV? I am especially interested to know if I am running any risk of overfitting.
2) The accuracy of my model before and after I implemented RFECV with the above code are almost the same (around 78-79%), which puzzles me. I would expect performance to improve by using RFECV. Anything I might have missed here or could do differently to improve the performance of my model?
3) What other feature selection methods could you recommend me to try? I have tried RFE and SelectKBest so far, but they both haven't given me any improvement in terms of model accuracy.
To answer your questions:
There is a cross-validation built in the RFECV feature selection (hence the name), so you don't really need to have additional cross-validation for this single step. However since I understand you are running several tests, it's good to have an overall cross-validation to ensure you're not overfitting to a specific train-test split. I'd like to mention 2 points here:
I doubt the code behaves exactly like you think it does ;).
# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# feature extraction
count_vect = CountVectorizer(stop_words='English', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)
Here we first go through the loop, that has 5 iterations (n_iter parameter in StratifiedShuffleSplit). Then we go out of the loop and we just run all your code with the last values of train_index, test_index. So this is equivalent to a single train-test split where you probably meant to have 5. You should move your code back into the loop if you want it to run like a 'proper' cross validation.
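The difference between the two loop layouts can be shown with a tiny plain-Python sketch. The splits list is a hypothetical stand-in for what the splitter yields.

```python
# Hypothetical (train_index, test_index) pairs, as a splitter would yield them.
splits = [([0, 1, 2], [3]), ([0, 1, 3], [2]), ([0, 2, 3], [1])]

# Layout 1: only the assignment sits inside the loop, so the code that
# follows the loop sees nothing but the last split.
for train_index, test_index in splits:
    pass
runs_outside = [(train_index, test_index)]

# Layout 2: the work sits inside the loop, so it runs once per split.
runs_inside = []
for train_index, test_index in splits:
    runs_inside.append((train_index, test_index))

print(len(runs_outside), len(runs_inside))
```

In the question's code, the feature extraction, RFECV fit and scoring play the role of the work that needs to move inside the loop.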
You are worried about overfitting: indeed when 'looking for the best method' the risk exists that we're going to pick the method that works best... only on the small sample we're testing the method on.
Here the best practice is to have a first train-test split, then to perform cross-validation only using the train set. The test set can be used 'sparingly' when you think you found something, to make sure the scores you get are consistent and you're not overfitting.
It may look like you're throwing away 30% of your data (your test set), but it's absolutely worth it.
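A minimal sketch of that workflow, on synthetic count data rather than the tweets from the question (the arrays X and y are random and carry no signal, so the scores themselves mean nothing here):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative count features, purely for illustration.
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(200, 20))
y = rng.randint(0, 2, size=200)

# 1) Set the test set aside once.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# 2) Compare methods with cross-validation on the training part only.
clf = MultinomialNB(alpha=0.5)
cv_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")
print("CV accuracy: %.3f" % cv_scores.mean())

# 3) Use the test set sparingly, as a final consistency check.
test_score = clf.fit(X_train, y_train).score(X_test, y_test)
print("Held-out accuracy: %.3f" % test_score)
```

If the held-out score is much lower than the cross-validated one, that is a sign the method was tuned too closely to the training data.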
It can be puzzling to see feature selection does not have that big an impact. To introspect a bit more you could look into the evolution of the score with the number of selected features (see the example from the docs).
That being said, I don't think this is the right use case for RFE. Basically with your code you are eliminating features one by one, which probably takes a long time to run and does not make so much sense when you have 20000 features.
Other feature selection methods: here you mention SelectKBest but you don't tell us which method you use to score your features! SelectKBest will pick the K best features according to a score function. I'm guessing you were using the default which is ok, but it's better to have an idea of what the default does ;).
I would try SelectPercentile with chi2 as a score function. SelectPercentile is probably a bit more convenient than SelectKBest because if your dataset grows a percentage probably makes more sense than a hardcoded number of features.
Another example from the docs that does just that (and more).
Additional remarks:
You could use a TfidfVectorizer instead of a CountVectorizer followed by a TfidfTransformer. This is strictly equivalent.
You could use a pipeline object to pack the different steps of your classifier into a single object you can run cross validation on (I encourage you to read the docs, it's pretty useful).
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
pipeline = Pipeline(steps=[
    # the built-in stop word list is named 'english' (lowercase)
    ("vectorizer", TfidfVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1,3))),
    ("selector", SelectPercentile(score_func=chi2, percentile=70)),
    ('NB', MultinomialNB(alpha=0.5))
])
Then you'd be able to run cross validation on the pipeline object to find the best combination of alpha and percentile, which is much harder to do with separate estimators.
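As a rough sketch of that search, here is a grid search over both parameters on a tiny made-up corpus. The documents, labels and grid values are assumptions chosen only to keep the example self-contained and fast, and the vectorizer uses its defaults because the corpus is too small for min_df=3.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to keep the example self-contained.
docs = ["good great film", "great fun movie", "good fun plot", "great good cast",
        "bad awful film", "awful boring movie", "bad boring plot", "awful bad cast"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline(steps=[
    ("vectorizer", TfidfVectorizer()),
    ("selector", SelectPercentile(score_func=chi2)),
    ("NB", MultinomialNB()),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "selector__percentile": [50, 100],
    "NB__alpha": [0.1, 0.5, 1.0],
}

# Every combination is cross-validated with the whole pipeline refit per fold,
# so the vectorizer and selector never see the validation fold.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)
print(search.best_params_)
```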
Hope this helps, happy learning ;).
