what does lightgbm python Dataset reference parameter mean? - machine-learning

I am trying to figure out how to train a gbdt classifier with lightgbm in python, but getting confused with the example provided on the official website.
Following the steps listed, I find that the validation_data comes from nowhere and there is no clue about the format of the valid_data nor the merit or avail of training model with or without it.
Another question comes with it is that, in the documentation, it is said that "the validation data should be aligned with training data", while I look into the Dataset details, I find that there is another statement shows that "If this is Dataset for validation, training data should be used as reference".
My final questions are, why should validation data be aligned with training data? what is the meaning of reference in Dataset and how is it used during training? is the alignment goal accomplished with reference set to training data? what is the difference between this "reference" strategy and cross-validation?
Hope someone could help me out of this maze, thanks!

The idea of "validation data should be aligned with training data" is simple :
every preprocessing you do to the training data, you should do it the same way for validation data and in production of course. This apply to every ML algorithm.
For example, for neural network, you will often normalize your training inputs (substract by mean and divide by std).
Suppose you have a variable "age" with mean 26yo in training. It will be mapped to "0" for the training of your neural network. For validation data, you want to normalize in the same way as training data (using mean of training and std of training) in order that 26yo in validation is still mapped to 0 (same value -> same prediction).
This is the same for LightGBM. The data will be "bucketed" (in short, every continuous value will be discretized) and you want to map the continuous values to the same bins in training and in validation. Those bins will be calculated using the "reference" dataset.
Regarding training without validation, this is something you don't want to do most of the time! It is very easy to overfit the training data with boosted trees if you don't have a validation to adjust parameters such as "num_boost_round".

still everything is tricky
can you share full example with using and without using this "reference="
for example
will it be different
import lightgbm as lgbm
importance_type_LGB = 'gain'
d_train = lgbm.Dataset(train_data_with_NANs, label= target_train)
d_valid = lgbm.Dataset(train_data_with_NANs, reference= target_train)
lgb_clf = lgbm.LGBMClassifier(class_weight = 'balanced' ,importance_type = importance_type_LGB)
lgb_clf.fit(test_data_with_NANs,target_train)
test_data_predict_proba_lgb = lgb_clf.predict_proba(test_data_with_NANs)
from
import lightgbm as lgbm
importance_type_LGB = 'gain'
lgb_clf = lgbm.LGBMClassifier(class_weight = 'balanced' ,importance_type = importance_type_LGB)
lgb_clf.fit(test_data_with_NANs,target_train)
test_data_predict_proba_lgb = lgb_clf.predict_proba(test_data_with_NANs)

Related

Specifying class or sample weights in Keras for one-hot encoded labels in a TF Dataset

I am trying to train an image classifier on an unbalanced training set. In order to cope with the class imbalance, I want either to weight the classes or the individual samples. Weighting the classes does not seem to work. And somehow for my setup I was not able to find a way to specify the samples weights. Below you can read how I load and encode the training data and the two approaches that I tried.
Training data loading and encoding
My training data is stored in a directory structure where each image is place in the subfolder corresponding to its class (I have 32 classes in total). Since the training data is too big too all load at once into memory I make use of image_dataset_from_directory and by that describe the data in a TF Dataset:
train_ds = keras.preprocessing.image_dataset_from_directory (training_data_dir,
batch_size=batch_size,
image_size=img_size,
label_mode='categorical')
I use label_mode 'categorical', so that the labels are described as a one-hot encoded vector.
I then prefetch the data:
train_ds = train_ds.prefetch(buffer_size=buffer_size)
Approach 1: specifying class weights
In this approach I try to specify the class weights of the classes via the class_weight argument of fit:
model.fit(
train_ds, epochs=epochs, callbacks=callbacks, validation_data=val_ds,
class_weight=class_weights
)
For each class we compute weight which are inversely proportional to the number of training samples for that class. This is done as follows (this is done before the train_ds.prefetch() call described above):
class_num_training_samples = {}
for f in train_ds.file_paths:
class_name = f.split('/')[-2]
if class_name in class_num_training_samples:
class_num_training_samples[class_name] += 1
else:
class_num_training_samples[class_name] = 1
max_class_samples = max(class_num_training_samples.values())
class_weights = {}
for i in range(0, len(train_ds.class_names)):
class_weights[i] = max_class_samples/class_num_training_samples[train_ds.class_names[i]]
What I am not sure about is whether this solution works, because the keras documentation does not specify the keys for the class_weights dictionary in case the labels are one-hot encoded.
I tried training the network this way but found out that the weights did not have a real influence on the resulting network: when I looked at the distribution of predicted classes for each individual class then I could recognize the distribution of the overall training set, where for each class the prediction of the dominant classes is most likely.
Running the same training without any class weight specified led to similar results.
So I suspect that the weights don't seem to have an influence in my case.
Is this because specifying class weights does not work for one-hot encoded labels, or is this because I am probably doing something else wrong (in the code I did not show here)?
Approach 2: specifying sample weight
As an attempt to come up with a different (in my opinion less elegant) solution I wanted to specify the individual sample weights via the sample_weight argument of the fit method. However from the documentation I find:
[...] This argument is not supported when x is a dataset, generator, or keras.utils.Sequence instance, instead provide the sample_weights as the third element of x.
Which is indeed the case in my setup where train_ds is a dataset. Now I really having trouble finding documentation from which I can derive how I can modify train_ds, such that it has a third element with the weight. I thought using the map method of a dataset can be useful, but the solution I came up with is apparently not valid:
train_ds = train_ds.map(lambda img, label: (img, label, class_weights[np.argmax(label)]))
Does anyone have a solution that may work in combination with a dataset loaded by image_dataset_from_directory?

How to get jlibsvm prediction probability in multi-class classification

I am new to SVM. I am using jlibsvm for a multi-class classification problem. Basically, I am doing a sentence classification problem. There are 3 Classes. What I understood is I am doing One-against-all classification. I have a comparatively small train set. A total of 75 sentences, In which 25 sentences belongs to each class.
I am making 3 SVMs (so 3 different models), where, while training, in SVM_A, sentences belong to CLASS A will have a true label, i.e., 1 and other sentences will have a -1 label. Correspondingly done for SVM_B, and SVM_C.
While testing, to get the true label of a sentence, I am giving the sentence to 3 models and I am taking the prediction probability returned by these 3 models. Which one returns the highest will be the class the sentence belong to.
This is how I am doing. But I am getting the same prediction probability for every sentence in the test set for all models.
A predicted:0.012820514
B predicted:0.012820514
C predicted:0.012820514
These values repeat for all sentences in the training set.
The following is how I set parameters for training:
C_SVC svm = new C_SVC();
MutableBinaryClassificationProblemImpl problem;
ImmutableSvmParameterGrid.Builder builder = ImmutableSvmParameterGrid.builder();
// create training parameters ------------
HashSet<Float> cSet;
HashSet<LinearKernel> kernelSet;
cSet = new HashSet<Float>();
cSet.add(1.0f);
kernelSet = new HashSet<LinearKernel>();
kernelSet.add(new LinearKernel());
// configure finetuning parameters
builder.eps = 0.001f; // epsilon
builder.Cset = cSet; // C values used
builder.kernelSet = kernelSet; //Kernel used
builder.probability=true; // To get the prediction probability
ImmutableSvmParameter params = builder.build();
What am I doing wrong?
Is there any other better way to do multi-class classification other than this?
You are getting the same output, because you generate the same model three times.
The reason for this is, that jlibsvm is able to perform multiclass classification out of the box based on the provided data (LIBSVM itself supports this too). If it detects, that more than two class labes are provided in the given data, it automatically performs multiclass classification. So there is no need for a manually 1vsN approach. Just supply the data with class-labels for each category.
However, jlibsvm is still in beta and relies on a rather old version of LIBSVM (2.88). A lot has changed. For a more intiuitive Java binding (in comparison to the default LIBSVM version), you can take a look at zlibsvm, which is available via Maven Central and based on the latest LIBSVM version.

Transformer of a pipeline acts on the training data instead of the whole data in GridSearchCV

I am using pipeline and GridSearchCV to select features automatically. Since the data set is small, I set the parameter 'cv' in GridSearchCV to StratifiedShuffleSplit. The code looks like as follows:
selection = SelectKBest()
clf = LinearSVC()
pipeline = Pipeline([("select", selection), ("classify", clf)])
cv = StratifiedShuffleSplit(n_splits=50)
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv = cv)
grid_search.fit(X, y)
It seems that SelectKBest acts on the training data of each split instead of the whole data set (the latter is what I want) since the result becomes different if I separate the 'select' and 'classify', where the StratifiedShuffleSplit will for sure only act on the classifier.
What is the correct way of using pipeline and GridSearchCV in this case? Thanks a lot!
Cross-validating the whole pipeline (i.e. running SelectKBest only on training part of each split) is the way to go. Otherwise the model is allowed to look at the test part - it means quality estimates become wrong. Best hyperparameters found with these unfair quality estimates may also work worse on a real unseen data.
At prediction time you're not going to re-run SelectKBest on (training dataset + example to be predicted) and then re-train the classifier, why would you do that in evaluation?
You can also find the answer of this question from page 245 to page 246 in the book-"The elements of statistical learning (2rd edition)" by Hastie,etc.

Sentiment Analysis classifier using Machine Learning

How can we make a working classifier for sentiment analysis since for that we need to train our classifier on huge data sets.
I have the huge data set to train, but the classifier object (here using Python), gives memory error when using 3000 words. And I need to train for more than 100K words.
What I thought was dividing the huge data set into smaller parts and make a classifier object for each and store it in a pickle file and use all of them. But it seems using all the classifier object for testing is not possible as it takes only one of the object during testing.
The solution which is coming in my mind is either to combine all the saved classifier objects stored in the pickle file (which is just not happening) or to keep appending the same object with new training set (but again, it is being overwritten and not appended).
I don't know why, but I could not find any solution for this problem even when it is the basic of machine learning. Every machine learning project needs to be trained in huge data set and the object size for training those data set will always give a memory error.
So, how to solve this problem? I am open to any solution, but would like to hear what is followed by people who do real time machine learning projects.
Code Snippet :
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
all_words = []
for w in movie_reviews.words():
all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]
def find_features(document):
words = set(document)
features = {}
for w in word_features:
features[w] = (w in words)
return features
featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]
classifier = nltk.NaiveBayesClassifier.train(training_set)
PS : I am using the NLTK toolkit using NaiveBayes. My training dataset is being opened and stored in the documents.
There are two things you seem to be missing:
Datasets for text are usually extremely sparse, and you should store them as sparse matrices. For such representation, you should be able to store milions of documents inyour memory with vocab. of 100,000.
Many modern learning methods are trained in mini-batch scenario, meaning that you never need whole dataset in memory, instead, you feed it to the model with random subsets of data - but still training a single model. This way your dataset can be arbitrary large, memory consumption is constant (fixed by minibatch size), and only training time scales with the amount of samples.

what's meaning of function predict's returned value in OpenCV?

I use function predict in opencv to classify my gestures.
svm.load("train.xml");
float ret = svm.predict(mat);//mat is my feature vector
I defined 5 labels (1.0,2.0,3.0,4.0,5.0), but in fact the value of ret are (0.521220207,-0.247173533,-0.127723947······)
So I am confused about it. As Opencv official document, the function returns a class label (classification) in my case.
update: I don't still know why to appear this result. But I choose new features to train models and the return value of predict function is what I defined during train phase (e.g. 1 or 2 or 3 or etc).
During the training of an SVM you assign a label to each class of training data.
When you classify a sample the returned result will match up with one of these labels telling you which class the sample is predicted to fall into.
There's some more documentation here which might help:
http://docs.opencv.org/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html
With Support Vector Machines (SVM) you have a training function and a prediction one. The training function is to train your data and save those informations on an xml file (it facilitates the prediction process in case you use a huge number of training data and you must do the prediction function in another project).
Example : 20 images per class in your case : 20*5=100 training images,each image is associated with a label of its appropriate class and all these informations are stocked in train.xml)
For the prediction function , it tells you what's label to assign to your test image according to your training DATA (the hole work you did in training process). Your prediction results might be good and might be bad , it's all about your training data I think.
If you want try to calculate the error rate for your classifier to see how much it can give good results or bad ones.

Resources