Difference between train_test_split and StratifiedShuffleSplit - machine-learning

I came across the following statement when trying to find the difference between train_test_split and StratifiedShuffleSplit.
When stratify is not None, train_test_split uses StratifiedShuffleSplit internally.
I was just wondering why the StratifiedShuffleSplit from sklearn.model_selection is used when we can use the stratify argument available in train_test_split.

Mainly, it is done for the sake of reusability. Rather than duplicating the code already implemented for StratifiedShuffleSplit, train_test_split simply calls that class.
For the same reason, when stratify is None, it uses the model_selection.ShuffleSplit class (see source code).
Please note that duplicating code is considered bad practice: it is assumed to inflate maintenance costs, and it is also considered defect-prone, since inconsistent changes to code duplicates can lead to unexpected behavior. Here is a reference if you'd like to learn more.
Besides, although they perform the same task, they cannot always be used in the same contexts. For example, train_test_split cannot be used within a random or grid search with sklearn.model_selection.RandomizedSearchCV or sklearn.model_selection.GridSearchCV.
StratifiedShuffleSplit can: train_test_split is not "an iterable yielding (train, test) splits as arrays of indices", whereas StratifiedShuffleSplit has a split method that yields (train, test) splits as arrays of indices.
More info here (see parameter cv).
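As an illustration, here is a minimal sketch (on a made-up toy dataset, not taken from the question) of passing a StratifiedShuffleSplit instance as the cv argument of GridSearchCV, which is exactly what cannot be done with train_test_split:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

X, y = make_classification(n_samples=200, random_state=0)

# StratifiedShuffleSplit exposes split(), which yields (train, test) index arrays,
# so an instance of it can be passed as the cv argument of a search
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}, cv=cv)
search.fit(X, y)
print(search.best_params_)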

Related

PyTorch optimizer not reading parameters from my Model class dict

I'm utilizing the pyro package (pyro = combo of Python and PyTorch, https://pyro.ai/examples/normalizing_flows_i.html) for ML on Google Colab to try to do normalizing flows. When I try to set up the optimizer (using Adam), it tells me: type object 'NFModel' has no attribute 'params'.
Effectively I'm trying to use pyro's features to make a neural net with a few layers. The only failure point I have now for my model is the optimizer.
Class definition (posted as an image, class_def, in the original question).
Fails at: optimizer = torch.optim.Adam([{'params': NFModel.params.hiddenlayers.parameters()}], lr=LR)
As an aside, the reason for using pyro is normalizing flows require bijective transformations, i.e. if we consider the NN to be F, then PDF(x)=PDF(F(x))(F')^(-1). Pyro has this setup and recreating it would be otherwise way too cumbersome.
I looked up what caused previous failures for others, and it was usually that they had not subclassed nn.Module when they built out their class, but I've done that.
I've tried different combinations of things in the Adam line.
I've tried toying with the params piece in the class definition.
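For reference, the usual PyTorch pattern is to instantiate the module and pass model.parameters() (or the parameters of a specific submodule) to the optimizer, rather than accessing attributes on the class itself. A minimal sketch, assuming NFModel is an nn.Module subclass with a hiddenlayers attribute (the layer sizes below are made up):
import torch
import torch.nn as nn

class NFModel(nn.Module):
    def __init__(self):
        super().__init__()
        # hypothetical layer stack standing in for the flow layers in the question
        self.hiddenlayers = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 4))

LR = 1e-3
model = NFModel()  # instantiate the model; don't reference the class directly
optimizer = torch.optim.Adam(model.hiddenlayers.parameters(), lr=LR)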

What is the use of 'n_splits' in StratifiedShuffleSplit from scikit learn

I have been reading the book Hands-On Machine Learning with Scikit-Learn and TensorFlow and I found this code:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
I would like to know what the argument 'n_splits' does. I searched everywhere but couldn't find a satisfactory response.
Thanks in advance!!
As the name suggests, the n_splits parameter specifies how many times (basically, how many separate re-shuffled splits) you want the splitting to happen.
For example, setting n_splits = 3 would make the loop generate 3 different splits (one for each iteration) so you can perform validation more effectively.
Setting n_splits = 1 would mimic what sklearn.model_selection.train_test_split would do (along with the stratify parameter mentioned). The documentation has detailed explanations of each parameter for this function.
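As a small illustration on toy data (not the housing set from the book), each iteration of the loop below yields one independent stratified split:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# n_splits=3 -> the loop body runs three times, each with a freshly shuffled split
split = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
for train_index, test_index in split.split(X, y):
    print(train_index, test_index)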

Is it possible to use the LinearSVC model with OneVsRest in PySpark?

I'm trying to use the LinearSVC model with OneVsRest in PySpark, but it seems it's not supported yet.
My error message:
LinearSVC only supports binary classification. 1 classes detected in LinearSVC_43a50b0b70d60a8cbdb1__labelCol
What kind of changes do I need in order to implement it in PySpark?
Does anyone know when OneVsRest in PySpark will support LinearSVC?
The error message tells you that your dataset currently contains only one class, but LinearSVC is a binary classification algorithm which requires exactly two classes.
I'm not sure if the rest of your code will cause any issues, because you haven't posted anything. Just in case you or someone else needs it, have a look below.
As already said, LinearSVC is a binary classification algorithm and by definition will never support multi-class classification, but you can always reduce a multi-class classification problem to a binary one. One-vs-Rest is an approach for such a reduction. It trains one classifier per class, and from an engineering perspective it makes sense to separate this into a dedicated class, as Spark did. OneVsRest trains one classifier for each of your classes, and a given sample is scored against this list of classifiers; the classifier with the highest score determines the predicted label for your sample.
Have a look at the code below for the usage of OneVsRest with LinearSVC:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import OneVsRest, LinearSVC
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
df = spark.read.csv('/tmp/iris.data', schema='sepalLength DOUBLE, sepalWidth DOUBLE, petalLength DOUBLE, petalWidth DOUBLE, class STRING')
vecAssembler = VectorAssembler(inputCols=["sepalLength", "sepalWidth", "petalLength", 'petalWidth'], outputCol="features")
df = vecAssembler.transform(df)
stringIndexer = StringIndexer(inputCol="class", outputCol="label")
si_model = stringIndexer.fit(df)
df = si_model.transform(df)
svm = LinearSVC()
ovr = OneVsRest(classifier=svm)
ovrModel = ovr.fit(df)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
predictions = ovrModel.transform(df)
print("Accuracy: {}".format(evaluator.evaluate(predictions)))
Output:
Accuracy: 0.9533333333333334
This is a funny bug in PySpark. If you have multiple classes, they must be identified starting from zero.
I ran into this bug just now. I had a dataframe built the same way they suggest in the LinearSVC guide:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

df = sc.parallelize([
    Row(label=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
    Row(label=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
Originally, it was a plain RDD, then I transformed each RDD record into a Row. I had a three-class problem where the classes were named 1, 2 and 3. I instantiated a OneVsRest object (just like @cronoik suggested) and hit the same error as yours.
So I took the df dataframe exactly as initialised in their user guide (see above) and decided to start playing with it by adding and removing classes. I simply substituted label=0.0 in the second Row with label=2.0, and the error appeared. Even with their dataframe, even with just two classes.
So I changed the naming of my classes from 1, 2, 3 to 0, 1, 2 and the error went away.
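If your labels happen to be 1, 2, 3 rather than 0, 1, 2, one possible sketch for re-indexing them, mirroring the StringIndexer usage from the first answer (the column names here are assumptions), is:
from pyspark.ml.feature import StringIndexer

# StringIndexer maps the original labels (e.g. 1.0, 2.0, 3.0) to 0.0 .. k-1
indexer = StringIndexer(inputCol="label", outputCol="label_indexed")
df = indexer.fit(df).transform(df)
# then train OneVsRest (and the evaluator) with labelCol="label_indexed"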
Hope this helps!

model.predict_classes vs model.predict_generator in keras

I understand that predict_generator outputs probabilities. To get the class, I just then find the index for the greatest probability and that will be the most probable class. However I find that after doing this, I get a different output than if I were to call predict_classes. I do not understand why. Can someone explain this please?
The generator in Keras uses glob to list folders, which are sorted alphabetically. You can get the classes used during training with:
# save classes to JSON
class_json = json.dumps(train_generator.class_indices)
with open("class.json", "w") as class_file:
    class_file.write(class_json)
The samples are shuffled within the batch generator (here), so that when a batch is requested by fit_generator or evaluate_generator, random samples are given.
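To line up predict_generator output with class labels, a rough sketch (assuming an already trained model and directory-based generators like the ones above; paths and sizes are made up) is to disable shuffling on the prediction generator and invert class_indices:
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

# preprocessing should match whatever was used during training
datagen = ImageDataGenerator()
# shuffle=False keeps predictions aligned with test_generator.filenames
test_generator = datagen.flow_from_directory("data/test", target_size=(224, 224), shuffle=False)

probs = model.predict_generator(test_generator)    # class probabilities from the trained model
pred_indices = np.argmax(probs, axis=1)             # most probable class index per sample

# class_indices maps class name -> index; invert it to recover the names
index_to_class = {v: k for k, v in test_generator.class_indices.items()}
pred_labels = [index_to_class[i] for i in pred_indices]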
Another possibility, if this is being done on images, is not to use rescale=1./255 in ImageDataGenerator, as mentioned in https://github.com/fchollet/keras/issues/3477
Hope that helps!

Scikit and Pandas: Fitting Large Data

How do I use scikit-learn to train a model on a large csv data (~75MB) without running into memory problems?
I'm using IPython notebook as the programming environment, and pandas+sklearn packages to analyze data from kaggle's digit recognizer tutorial.
The data is available on the webpage, here is a link to my code, and here is the error message:
KNeighborsClassifier is used for the prediction.
Problem:
"MemoryError" occurs when loading large dataset using read_csv
function. To bypass this problem temporarily, I have to restart the
kernel, which then read_csv function successfully loads the file, but
the same error occurs when I run the same cell again.
When the read_csv function loads the file successfully, after making changes to the dataframe, I can pass the features and labels to the KNeighborsClassifier's fit() function. At this point, similar memory error occurs.
I tried the following:
Iterate through the CSV file in chunks and fit the data accordingly, but the problem is that the predictive model is overwritten for every chunk of data.
What do you think I can do to successfully train my model without running into memory problems?
Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype for all the rows, but two columns can have distinct datatypes (e.g. integer, dates, strings).
When you pass a DataFrame instance to a scikit-learn model, it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the model). At this point you will have two copies of your dataset in memory.
To avoid this you could write / reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt for instance (have a look at the docstring for the parameters).
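For instance, a rough sketch of loading the Kaggle digits CSV directly into a float32 array (the file name and column layout, label first and then pixel values, are assumptions):
import numpy as np

# skiprows=1 skips the header row; dtype=np.float32 halves memory use vs float64
data = np.loadtxt("train.csv", delimiter=",", skiprows=1, dtype=np.float32)
y = data[:, 0]    # first column: digit label
X = data[:, 1:]   # remaining columns: pixel values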
Also, if your data is very sparse (many zero values) it will be better to use a scipy.sparse datastructure and a scikit-learn model that can deal with such an input format (check the docstrings to know). However, the CSV format itself is not very well suited for sparse data and I am not sure there exists a direct CSV-to-scipy.sparse parser.
Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here:
https://github.com/scikit-learn/scikit-learn/issues/325
