How should I train a SVM using julia? - machine-learning

Does anyone have experience of training a support vector machine (SVM) in Julia (1.4.1) ?
I tried the LIBSVM interface, but the example on the gituhub page gave an error :
# Load Fisher's classic iris data
iris = dataset("datasets", "iris")
# LIBSVM handles multi-class data automatically using a one-against-one strategy
labels = convert(Vector, iris[:Species])
# First dimension of input data is features; second is instances
instances = convert(Array, iris[:, 1:4])'
# Train SVM on half of the data using default parameters. See documentation
# of svmtrain for options
model = svmtrain(instances[:, 1:2:end], labels[1:2:end]);```
ERROR: MethodError: no method matching LIBSVM.SupportVectors(::Int32, ::Array{Int32,1}, ::CategoricalArray{String,1,UInt8,String,CategoricalValue{String,UInt8},Union{}}, ::Array{Float64,2}, ::Array{Int32,1}, ::Array{LIBSVM.SVMNode,1})
Closest candidates are:
LIBSVM.SupportVectors(::Int32, ::Array{Int32,1}, ::Array{T,1}, ::AbstractArray{U,2}, ::Array{Int32,1}, ::Array{LIBSVM.SVMNode,1}) where {T, U} at /home/benny/.julia/packages/LIBSVM/5Z99T/src/LIBSVM.jl:18
LIBSVM.SupportVectors(::LIBSVM.SVMModel, ::Any, ::Any) at /home/benny/.julia/packages/LIBSVM/5Z99T/src/LIBSVM.jl:27

It looks like LIBSVM.jl documentation is rather outdated and package was not updated appropriately, so it worth an issue (or at least pull request to update README).
Error that you see is not related to the package itself, but the fact that in current versions of DataFrames.jl and RDatasets.jl labels column is no longer Vector (as it was at the time when LIBSVM.jl was developed) but CategoricalArray. You can avoid this problem by converting CategoricalArray to usual Vector{String}. Complete example looks like this
using RDatasets, LIBSVM
using StatsBase, Printf # `mean` and `printf` are no longer in Base, and should be used explicitly
# Load Fisher's classic iris data
iris = dataset("datasets", "iris")
# LIBSVM handles multi-class data automatically using a one-against-one strategy
labels = string.(convert(Vector, iris[:Species]))
# First dimension of input data is features; second is instances
instances = convert(Array, iris[:, 1:4])'
# Train SVM on half of the data using default parameters. See documentation
# of svmtrain for options
model = svmtrain(instances[:, 1:2:end], labels[1:2:end]);
# Test model on the other half of the data.
(predicted_labels, decision_values) = svmpredict(model, instances[:, 2:2:end]);
# Compute accuracy
#printf "Accuracy: %.2f%%\n" mean((predicted_labels .== labels[2:2:end]))*100
Alternatively, you can use MLJ.jl or ScikitLearn.jl
which should correctly wrap LIBSVM.jl on their own.

Oskin's answer is for an older version.
In the current version, it should be modified as,
using RDatasets, LIBSVM
using StatsBase, Printf # `mean` and `printf` are no longer in Base, and should be used explicitly
# Load Fisher's classic iris data
iris = dataset("datasets", "iris")
# LIBSVM handles multi-class data automatically using a one-against-one strategy
labels = string.(convert(Vector, iris[:,:Species]))
# First dimension of input data is features; second is instances
instances = Matrix(iris[:, 1:4])'
# Train SVM on half of the data using default parameters. See documentation
# of svmtrain for options
model = svmtrain(instances[:, 1:2:end], labels[1:2:end]);
# Test model on the other half of the data.
(predicted_labels, decision_values) = svmpredict(model, instances[:, 2:2:end]);
# Compute accuracy
#printf "Accuracy: %.2f%%\n" mean((predicted_labels .== labels[2:2:end]))*100

Related

Specifying class or sample weights in Keras for one-hot encoded labels in a TF Dataset

I am trying to train an image classifier on an unbalanced training set. In order to cope with the class imbalance, I want either to weight the classes or the individual samples. Weighting the classes does not seem to work. And somehow for my setup I was not able to find a way to specify the samples weights. Below you can read how I load and encode the training data and the two approaches that I tried.
Training data loading and encoding
My training data is stored in a directory structure where each image is place in the subfolder corresponding to its class (I have 32 classes in total). Since the training data is too big too all load at once into memory I make use of image_dataset_from_directory and by that describe the data in a TF Dataset:
train_ds = keras.preprocessing.image_dataset_from_directory (training_data_dir,
batch_size=batch_size,
image_size=img_size,
label_mode='categorical')
I use label_mode 'categorical', so that the labels are described as a one-hot encoded vector.
I then prefetch the data:
train_ds = train_ds.prefetch(buffer_size=buffer_size)
Approach 1: specifying class weights
In this approach I try to specify the class weights of the classes via the class_weight argument of fit:
model.fit(
train_ds, epochs=epochs, callbacks=callbacks, validation_data=val_ds,
class_weight=class_weights
)
For each class we compute weight which are inversely proportional to the number of training samples for that class. This is done as follows (this is done before the train_ds.prefetch() call described above):
class_num_training_samples = {}
for f in train_ds.file_paths:
class_name = f.split('/')[-2]
if class_name in class_num_training_samples:
class_num_training_samples[class_name] += 1
else:
class_num_training_samples[class_name] = 1
max_class_samples = max(class_num_training_samples.values())
class_weights = {}
for i in range(0, len(train_ds.class_names)):
class_weights[i] = max_class_samples/class_num_training_samples[train_ds.class_names[i]]
What I am not sure about is whether this solution works, because the keras documentation does not specify the keys for the class_weights dictionary in case the labels are one-hot encoded.
I tried training the network this way but found out that the weights did not have a real influence on the resulting network: when I looked at the distribution of predicted classes for each individual class then I could recognize the distribution of the overall training set, where for each class the prediction of the dominant classes is most likely.
Running the same training without any class weight specified led to similar results.
So I suspect that the weights don't seem to have an influence in my case.
Is this because specifying class weights does not work for one-hot encoded labels, or is this because I am probably doing something else wrong (in the code I did not show here)?
Approach 2: specifying sample weight
As an attempt to come up with a different (in my opinion less elegant) solution I wanted to specify the individual sample weights via the sample_weight argument of the fit method. However from the documentation I find:
[...] This argument is not supported when x is a dataset, generator, or keras.utils.Sequence instance, instead provide the sample_weights as the third element of x.
Which is indeed the case in my setup where train_ds is a dataset. Now I really having trouble finding documentation from which I can derive how I can modify train_ds, such that it has a third element with the weight. I thought using the map method of a dataset can be useful, but the solution I came up with is apparently not valid:
train_ds = train_ds.map(lambda img, label: (img, label, class_weights[np.argmax(label)]))
Does anyone have a solution that may work in combination with a dataset loaded by image_dataset_from_directory?

Random Cut Forest model evaluation strategies (Confusion matrix,Accuracy and precision_recall_fscore

I am using AWS sagemker random cut forest algorithm to detect the anomalies.
import boto3
import sagemaker
containers = {
'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/randomcutforest:latest',
'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/randomcutforest:latest',
'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/randomcutforest:latest',
'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/randomcutforest:latest',
'ap-southeast-1':'475088953585.dkr.ecr.ap-southeast-1.amazonaws.com/randomcutforest:latest'
}
region_name = boto3.Session().region_name
container = containers[region_name]
session = sagemaker.Session()
rcf = sagemaker.estimator.Estimator(
container,
sagemaker.get_execution_role(),
output_path='s3://{}/{}/output'.format(bucket, prefix),
train_instance_count=1,
train_instance_type='ml.c5.xlarge',
sagemaker_session=session)
rcf.set_hyperparameters(
num_samples_per_tree=200,
num_trees=250,
feature_dim=1,
eval_metrics =["accuracy", "precision_recall_fscore"])
s3_train_input = sagemaker.session.s3_input(
s3_train_data,
distribution='ShardedByS3Key',
content_type='application/x-recordio-protobuf')
rcf.fit({'train': s3_train_input})
( referred from --> https://aws.amazon.com/blogs/machine-learning/use-the-built-in-amazon-sagemaker-random-cut-forest-algorithm-for-anomaly-detection/)
used above code to train the model, didn't find the way to evaluate model.
how to get the Accuracy and F score after deploying the model.
In order to get evaluation metrics you need to provide an extra channel called "test" during training. The test channel must contained labeled data. It is explained in the official documentation, https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html :
Amazon SageMaker Random Cut Forest supports the train and test data channels. The optional test channel is used to compute accuracy, precision, recall, and F1-score metrics on labeled data. Train and test data content types can be either application/x-recordio-protobuf or text/csv formats. For the test data, when using text/csv format, the content must be specified as text/csv;label_size=1 where the first column of each row represents the anomaly label: "1" for an anomalous data point and "0" for a normal data point. You can use either File mode or Pipe mode to train RCF models on data that is formatted as recordIO-wrapped-protobuf or as CSV
Also note ... the test channel only supports S3DataDistributionType=FullyReplicated
Thanks,
Julio

can we save a partially trained Machine Learning model, reload it again and train from the point it was saved?

I want to know is there any way in which we can partially save a Scikit-Learn Machine Learning model and reload it again to train it from the point it was saved before?
For models such as Scikitlearn applied to sentiment analysis, I would suspect you need to save two important things: 1) your model, 2) your vectorizer.
Remember that after training your model, your words are represented by a vector of length N, and that is defined according to your total number of words.
Below is a piece from my test-model and test-vectorizer saved in order to be used latter.
SAVING THE MODEL
import pickle
pickle.dump(vectorizer, open("model5vectorizer.pickle", "wb"))
pickle.dump(classifier_fitted, open("model5.pickle", "wb"))
LOADING THE MODEL IN A NEW SCRIPT (.py)
import pickle
model = pickle.load(open("model5.pickle", "rb"))
vectorizer = pickle.load(open("model5vectorizer.pickle", "rb"))
TEST YOUR MODEL
sentence_test = ["Results by Andutta et al (2013), were completely wrong and unrealistic."]
USING THE VECTORIZER (model5vectorizer.pickle) !!
sentence_test_data = vectorizer.transform(sentence_test)
print("### sentence_test ###")
print(sentence_test)
print("### sentence_test_data ###")
print(sentence_test_data)
# OBS-1: VECTOR HERE WILL HAVE SAME LENGTH AS BEFORE :)
# OBS-2: If you load the default vectorizer or a different one, then you may see the following problems
# sklearn.exceptions.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.
# # ValueError: X has 8 features per sample; expecting 11
result1 = model.predict(sentence_test_data) # using saved vectorizer from calibrated model
print("### RESULT ###")
print(result1)
Hope that helps.
Regards,
Andutta
When a data set is fitted to a Scikit-learn machine learning model, it is trained and supposedly ready to be used for prediction purposes. By training a model with let's say, 100 samples and using it and then going back to it and fitting another 50 samples to it, you will not make it better but you will rebuild it.
If your purpose is to build a model and make it more powerful as it interacts with more samples, you would be thinking of a real-time condition, such as a mobile robot for mapping an environment with a Kalman Filter.

Convert dgeMatrix for downstream tasks

I am trying to cluster sentence embeddings based on Glove model from text2vec. I generated the embeddings using the glove model like so (I create the iterator, vocab etc in the standard way).
# create document term matrix
dtm = create_dtm(it, vectorizer)
# assign the word embeddings
common_terms = intersect(colnames(dtm), rownames(word_vectors) )
# normalise
dtm_averaged <- text2vec::normalize(dtm[, common_terms], "l1")
# compute average sentence embeddings
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]
The resulting object is of dgeMatrix class, which is equivalent to matrix class as I understand. dgeMatrix class isn't used for many downstream tasks so I would like to convert the matrix. The object, however, is 6GB large, and I have problems converting the matrix to a data frame or even text file for further processing.
Ideally , I'd use this matrix in Spark for further analysis such as k-means clustering. My question what would be the best strategy to use the matrix for downstream tasks.
a) Convert to matrix class or data frame
b) write the matrix to file?
c) something completely different
I run the models on Google Cloud and have a machine with 32gb ram and 28 cpu.
Thanks for your help.

Using chi-squared test for feature selection

I always have trouble understanding the significance of chi-squared test and how to use it for feature selection. I tried reading the wiki page but I didn't get a practical understanding. Can anyone explain?
chi-squared test helps you to determine the most significant features among a list of available features by determining the correlation between feature variables and the target variable.
Example below is taken from https://chrisalbon.com/machine-learning/chi-squared_for_feature_selection.html
The below test will select two best features (since we are assigning 2 to the "k" parameter) among the 4 available features initially.
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Load iris data
iris = load_iris()
# Create features and target
X = iris.data
y = iris.target
# Convert to categorical data by converting data to integers
X = X.astype(int)
# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
X_kbest = chi2_selector.fit_transform(X, y)
type(X_kbest)
# Show results
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_kbest.shape[1])
Chi-squared feature selection is a uni-variate feature selection technique for categorical variables. It can also be used for continuous variable, but the continuous variable needs to be categorized first.
How it works?
It tests the null hypothesis that the outcome class depends on the categorical variable by calculating chi-squared statistics based on contingency table. For more details on contingency table and chi-squared test check the video: https://www.youtube.com/watch?v=misMgRRV3jQ
To categorize the continuous data, there are range of techniques available from simplistic frequency based binning to advance approaches such as Minimum Description Length and entropy based binning methods.
Advantage of using chi-squared test on continuous variable is that it can capture the non-linear relation with outcome variable.

Resources