Applying a custom cost function in TensorFlow's SKFlow model training - machine-learning

I'm trying to make a regression model with TensorFlow while using the sklearn implementation so it plays nicely with all the other models I've made. However, I cannot seem to find a way to train the model with a custom score function (cost function or objective function).
Is this simply impossible with skflow?
Thanks loads!

Many of the examples use learn.models.logistic_regression, which is basically a built-in high-level model that returns predictions and losses. For example, models.logistic_regression uses ops.losses_ops.softmax_classifier, which means you can look at how ops.losses_ops.softmax_classifier is implemented and implement your own loss function, perhaps using TensorFlow's low-level APIs.
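As a rough sketch only (this assumes the skflow-era API in which TensorFlowEstimator accepts a model_fn returning a (predictions, loss) pair; exact module paths and signatures varied between skflow releases), a custom cost function could be plugged in roughly like this:
import tensorflow as tf
import skflow

def my_regression_model(X, y):
    # One linear layer producing a single regression output.
    n_features = X.get_shape()[1].value
    weights = tf.Variable(tf.zeros([n_features, 1]))
    bias = tf.Variable(tf.zeros([1]))
    predictions = tf.matmul(X, weights) + bias
    # Custom cost: mean absolute error instead of the default squared loss.
    loss = tf.reduce_mean(tf.abs(tf.reshape(predictions, [-1]) - tf.reshape(y, [-1])))
    return predictions, loss

# n_classes=0 tells the estimator this is a regression problem.
estimator = skflow.TensorFlowEstimator(model_fn=my_regression_model, n_classes=0)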

Related

How to access cluster labels from a fit method in AWS Sagemaker

Background information:
AWS SageMaker offers the possibility to use external sklearn clustering methods, like DBSCAN, as well as built-in clustering methods like KMeans, for fitting and deploying/predicting. By default you have access to the cluster labels after deploying your method as a predictor object:
Example:
kmeans_customers_3 = KMeans(role=role,
                            instance_count=1,
                            instance_type='ml.c4.xlarge',
                            output_path=output_path_cluster,  # specified above
                            k=3,
                            epochs=20,
                            sagemaker_session=sagemaker_session)
kmeans_customers_3.fit(some_data)

kmeans_predict_3 = kmeans_customers_3.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium"
)

cluster_info = kmeans_predict_3.predict(aws_conform_data_in_record_set)
cluster_labels = [cluster.label['closest_cluster'].float32_tensor.values[0] for cluster in cluster_info]
Problem:
When using an external clustering method from sklearn, most of these methods have no predict() function. For example, AgglomerativeClustering and DBSCAN only have a fit() or fit_predict() method, which is not compatible with AWS deployment; only estimators that have a predict() method, like KMeans or affinity propagation, work well with AWS (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html).
Question:
How can I access a fitted clustering model from AWS so that I have access to the model's cluster-label attributes after fit (in the hope of not being limited to clustering methods that have a predict() method)? I know how to download the model.tar.gz, but I'm a bit confused about what to do with it, since simply opening it does not help.
It might also be possible to write my own predict() function for such a method that only returns the class labels; however, I don't know how to do that in this environment, since AWS uses an SKLearn estimator object, and I don't believe I can override it or the methods of e.g. DBSCAN itself.
Any ideas on how to retrieve the class labels of clustering methods from a .fit() method in AWS SageMaker?
Once your sklearn model is trained and saved in S3 as a model.tar.gz, you can download it to the client of your choice, untar it, and re-open it with the same library you used to save it (pickle, joblib, etc.).
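For example (a sketch with placeholder bucket/key names, assuming the model was serialized with joblib as model.joblib inside the archive, as in SageMaker's sklearn examples):
import tarfile
import boto3
import joblib

# Placeholder bucket/key; use the S3 output path of your training job.
boto3.client("s3").download_file("my-bucket", "path/to/model.tar.gz", "model.tar.gz")
with tarfile.open("model.tar.gz") as tar:
    tar.extractall(path="model")

model = joblib.load("model/model.joblib")
print(model.labels_)  # e.g. for DBSCAN, the cluster label of every training point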
If you're looking for a way to open the model.tar.gz after training the model with the built-in KMeans SageMaker algorithm, check the Analyze US census data for population segmentation SageMaker example, in particular the section Accessing the KMeans model attributes, which has this code sample:
Kmeans_model_params = mx.ndarray.load("model_algo-1")
The code sample you provided in your question is correct if you want to calculate (predict) the labels for all data points in your dataset.
In another example, Bring Your Own Model (k-means), there is code showing how to package your own KMeans model, e.g. one trained with sklearn.cluster.KMeans, for inference inside the SageMaker built-in KMeans container; this is the main part:
centroids = mx.ndarray.array(kmeans.cluster_centers_)
mx.ndarray.save("model_algo-1", [centroids])
If you're looking for a way to host another sklearn model in SageMaker, you need to create an inference.py script and define predict_fn() and model_fn(), as in the SageMaker scikit-learn Bring Your Own Model example.
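A minimal inference.py sketch, assuming the model was saved as model.joblib and that you are happy to run fit_predict() on each incoming batch for estimators such as DBSCAN that lack predict() (model_fn/predict_fn are the hook names the SageMaker scikit-learn container looks for; everything else here is illustrative):
import os
import joblib

def model_fn(model_dir):
    # Load the estimator that was packed into model.tar.gz.
    return joblib.load(os.path.join(model_dir, "model.joblib"))

def predict_fn(input_data, model):
    # DBSCAN and friends have no predict(); fit_predict() on the incoming
    # batch returns one cluster label per row.
    return model.fit_predict(input_data)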

Classification with Keras, unbalanced classes

I have a binary classification problem I'm trying to tackle in Keras. To start, I was following the usual MNIST example, using softmax as the activation function in my output layer.
However, in my problem the two classes are highly unbalanced (one appears ~10 times more often than the other). And, what's even more critical, they are not symmetrical in the way they may be mistaken for each other.
Mistaking an A for a B is way less severe than mistaking a B for an A. Just like a caveman trying to classify animals into pets and predators: mistaking a pet for a predator is no big deal, but the other way round will be lethal.
So my question is: how would I model something like this with Keras?
thanks a lot
A non-exhaustive list of things you could do:
Generate a balanced data set using data augmentation. If the data are images, you can add image augmentations in a custom data generator that outputs balanced amounts of data from each class per batch, and save the results to a new data set. If the data are tabular, you can use a library like imbalanced-learn to perform over-/under-sampling.
As @Daniel said, you can use class_weights during training (in the fit method) so that mistakes on the important class are penalized more heavily; see the sketch after this list. See also this tutorial: Classification on imbalanced data. The same idea can be implemented with a custom loss function, with or without class_weights, during training.
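A minimal Keras sketch of the class_weight idea (the layer sizes, dummy data, and the 10:1 weight ratio are illustrative assumptions, not recommendations):
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),  # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy data: 20 features, ~10% positives to mimic the imbalance described above.
X = np.random.rand(1000, 20)
y = (np.random.rand(1000) < 0.1).astype(int)

# Mistakes on class 1 (the rare/critical class) are penalized 10x more.
model.fit(X, y, epochs=5, batch_size=32, class_weight={0: 1.0, 1: 10.0})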

Strategies to assign specific weights to training instances

I am working on a machine learning classification model in which the user can provide labeled instances that should help improve the model.
More relevance needs to be given to the latest instances provided by the user than to the instances that were previously available for training.
In particular, I am developing my machine learning models in Python using sklearn libraries.
So far the only strategy I've found as a possible solution to the problem is oversampling particular instances: I would create multiple copies of the instances to which I want to give higher relevance.
Another strategy I've found, which does not seem to help under these conditions, is:
Weighting each class. This approach is widely supported by default in libraries like sklearn. However, it works at the class level and doesn't let me focus on particular instances.
I've looked for strategies that provide specific weights for individual instances, but most focus on class-level rather than instance-level weights.
I read some suggestions to multiply the loss function by per-instance factors in TensorFlow models, but this seems to be mostly applicable to neural network models in TensorFlow.
I wonder if anyone knows of other approaches that might help with this problem.
I've looked for strategies that provide specific weights for individual instances, but most focus on class-level rather than instance-level weights.
This is not accurate; most scikit-learn classifiers provide a sample_weight argument in their fit methods, which does exactly that. For example, here is the documentation reference for Logistic Regression:
sample_weight : array-like, shape (n_samples,) optional
Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
Similar arguments exist for most scikit-learn classifiers, e.g. decision trees, random forests, etc., and even for linear regression (not a classifier). Be sure to check the SVM: Weighted samples example in the docs.
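A short sketch of the idea in the context of your question, i.e. up-weighting the newest user-provided instances (the toy data and the 5x factor are arbitrary placeholders):
import numpy as np
from sklearn.linear_model import LogisticRegression

X_old, y_old = np.random.rand(500, 4), np.random.randint(0, 2, 500)
X_new, y_new = np.random.rand(50, 4), np.random.randint(0, 2, 50)

X = np.vstack([X_old, X_new])
y = np.concatenate([y_old, y_new])
# Give the 50 newest instances five times the weight of the older ones.
weights = np.concatenate([np.ones(len(y_old)), 5.0 * np.ones(len(y_new))])

clf = LogisticRegression().fit(X, y, sample_weight=weights)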
The situation is roughly similar for other frameworks; see for example my own answer in Is there in PySpark a parameter equivalent to scikit-learn's sample_weight?
What's more, scikit-learn also provides a utility function to compute sample_weight in cases of imbalanced datasets: sklearn.utils.class_weight.compute_sample_weight
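For instance (a tiny illustration; the toy labels are mine):
from sklearn.utils.class_weight import compute_sample_weight

y = [0, 0, 0, 0, 1]
print(compute_sample_weight(class_weight="balanced", y=y))
# The single minority-class sample receives a proportionally larger weight.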

Why do we use the kmeans.fit function in the k-means clustering method?

I am using the k-means clustering technique from a video, but I do not understand why we use the .fit() method in k-means clustering.
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(X)  # why do we use this fit method here?
kmeans is your defined model. To train it, we call kmeans.fit(); the argument to fit() is the data set that needs to be clustered. After calling fit(), the model is ready, and we get the cluster label of each point using
data_labels = kmeans.labels_
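Putting this together, a minimal end-to-end example (with toy random data standing in for X) would look like:
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 3)                     # toy data, 3 features
kmeans = KMeans(n_clusters=5, random_state=0)  # define the model
kmeans.fit(X)                                  # learn 5 cluster centroids from X
data_labels = kmeans.labels_                   # cluster assignment for each row of X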
Because the sklearn people decided early on that everything should have fit(X, y) and predict(X) functions. And that is likely not going to change, because of backwards compatibility...
It does not make a whole lot of sense for clustering, which does not use y (it defaults to None and is ignored). And there is no real use case where you would want to drop-in replace a classifier with a clustering algorithm, either.
Nevertheless, you'll at some point need to run the algorithm. It is an anti-pattern to do this in a constructor (so KMeans(n_clusters=5, data=X) is a no-no), so you will have to invoke some method. You may as well call it fit then, which is at least a fitting name for optimization-based methods such as k-means.
You could, however, simply use the function k_means(X, n_clusters=5) instead of using the class. Then it would be a single line (see the source code of fit for an example).
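For illustration (toy data again), the functional form returns the centroids, labels and inertia directly:
import numpy as np
from sklearn.cluster import k_means

X = np.random.rand(100, 2)
centroids, labels, inertia = k_means(X, n_clusters=5, random_state=0)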

What is the logic behind the .fit() method in machine learning models?

I started machine learning with scikit-learn and came across various models.
In every model, there was a fit() function.
I have read many blog posts and came to know that fit() helps us find the parameters of the model.
For example, in a linear regression model, the fit() function finds the slope and intercept.
But I am still not able to understand the logic behind the fit() function.
In general, at least for predictive models, fit() takes the data that you want to use to train some model so that it can make predictions about other, related data. Each type of model has different constraints and different types of patterns it attempts to extract from the data. In one-dimensional linear regression, fit() looks for a linear (straight-line) relationship in the data and finds the linear function (slope and intercept) that minimizes the sum of squared differences between the function and the data points provided.
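A small illustration of this for linear regression (noise-free toy data, so fit() should recover the exact slope and intercept):
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape(-1, 1)   # 10 one-dimensional points
y = 3.0 * X.ravel() + 2.0          # true slope 3, intercept 2

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # approximately [3.] and 2.0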
