Recognize Shuffled dataset - machine-learning

I have two datasets, one of them is the real dataset and one of them is a randomized dataset
where the class attribute has been randomly shuffled. How can I determine which is
which? Thanks

Train a classifier. The data set where you can get a working classifier is probably the one with the real labels. On the shuffled one, no classifier should work!
There is no guarantee you can detect it. If your data was random before, it doesn't get more random by shuffling; so you cannot decide then. But if the data set had a nice structure before, then shuffling should usually destroy this.

Related

Should I retrain the model with the whole dataset after using a train-test split to find the best hyper parameters?

I split my dataset into training and testing. At the end after finding the best hyper parameters for the training dataset, should I fit the model again using all the data? The point is to reach the highest possible score for new data.
Yes, that would help to generalize your model, as more data generally means better generalization.
I don't think so. If you do that, you will no longer have a valid test set. What happens when you come back to improve the model later? If you do this, then you will need a new test set each model improvement, which means more labeling. You won't be able to compare experiments across model versions, because the test set won't be identical.
If you consider this model finished forever, then ok.

How do I create a feature vector if I don’t have all the data?

So say for each of my ‘things’ to classify I have:
{house, flat, bungalow, electricityHeated, gasHeated, ... }
Which would be made into a feature vector:
{1,0,0,1,0,...} which would mean a house that is heated by electricity.
For my training data I would have all this data- but for the actual thing I want to classify I might only have what kind of house it is, and a couple other things- not all the data ie.
{1,0,0,?,?,...}
So how would I represent this?
I would want to find the probability that a new item would be gasHeated.
I would be using a SVM linear classifier- I don’t have any core to show because this is purely theoretical at the moment. Any help would be appreciated :)
When I read this question, it seems that you may have confused with feature and label.
You said that you want to predict whether a new item is "gasHeated", then "gasHeated" should be a label rather than a feature.
btw, one of the most-common ways to deal with missing value is to set it as "zero" (or some unused value, say -1). But normally, you should have missing value in both training data and testing data to make this trick be effective. If this only happened in your testing data but not in your training data, it means that your training data and testing data are not from the same distribution, which basically violated the basic assumption of machine learning.
Let's say you have a trained model and a testing sample {?,0,0,0}. Then you can create two new testing samples, {1,0,0,0}, {0,0,0,0}, and you will have two predictions.
I personally don't think SVM is a good approach if you have missing values in your testing dataset. Just like I have mentioned above, although you can get two new predictions, but what if each one has different predictions? It is difficult to assign a probability to results of SVM in my opinion unless you use logistic regression or Naive Bayes. I would prefer Random Forest in this situation.

How does model.fit process images?

I have a training dataset with images that looks like this:
x=[image1,image2...imageN]
and an output dataset that looks like this:
y=[output1,output2...]
I don't understand how does the model.fit works in regards to processing the images. Meaning, if I choose shuffle=False will the model take the first image first and go through the whole feedforward, backprop, etc. and compare it to output1, and then the second image, and so on?
Or does the model randomly select images from my dataset?
If you specify shuffle = True the generator will shuffle the dataset before each epoch. It will then go through the shuffled dataset one batch at a time, if it reaches then end before the next epoch it will go back to the start.
If you specify shuffle = False it will go through the dataset in the same order every epoch.
I believe a similar question is asked here.
shuffle in the model.fit of keras
As far as I know, your thought process is correct up to a certain extent. The model takes a random image from the dataset and the associated output for that index and then trains on it. Quite similar to using a random number to select an image from the batch, training it comparing with the output and then marking it as trained to avoid retraining on the same example.

What is the use of train_on_batch() in keras?

How train_on_batch() is different from fit()? What are the cases when we should use train_on_batch()?
For this question, it's a simple answer from the primary author:
With fit_generator, you can use a generator for the validation data as
well. In general, I would recommend using fit_generator, but using
train_on_batch works fine too. These methods only exist for the sake of
convenience in different use cases, there is no "correct" method.
train_on_batch allows you to expressly update weights based on a collection of samples you provide, without regard to any fixed batch size. You would use this in cases when that is what you want: to train on an explicit collection of samples. You could use that approach to maintain your own iteration over multiple batches of a traditional training set but allowing fit or fit_generator to iterate batches for you is likely simpler.
One case when it might be nice to use train_on_batch is for updating a pre-trained model on a single new batch of samples. Suppose you've already trained and deployed a model, and sometime later you've received a new set of training samples previously never used. You could use train_on_batch to directly update the existing model only on those samples. Other methods can do this too, but it is rather explicit to use train_on_batch for this case.
Apart from special cases like this (either where you have some pedagogical reason to maintain your own cursor across different training batches, or else for some type of semi-online training update on a special batch), it is probably better to just always use fit (for data that fits in memory) or fit_generator (for streaming batches of data as a generator).
train_on_batch() gives you greater control of the state of the LSTM, for example, when using a stateful LSTM and controlling calls to model.reset_states() is needed. You may have multi-series data and need to reset the state after each series, which you can do with train_on_batch(), but if you used .fit() then the network would be trained on all the series of data without resetting the state. There's no right or wrong, it depends on what data you're using, and how you want the network to behave.
Train_on_batch will also see a performance increase over fit and fit generator if youre using large datasets and don't have easily serializable data (like high rank numpy arrays), to write to tfrecords.
In this case you can save the arrays as numpy files and load up smaller subsets of them (traina.npy, trainb.npy etc) in memory, when the whole set won't fit in memory. You can then use tf.data.Dataset.from_tensor_slices and then using train_on_batch with your subdataset, then loading up another dataset and calling train on batch again, etc, now you've trained on your entire set and can control exactly how much and what of your dataset trains your model. You can then define your own epochs, batch sizes, etc with simple loops and functions to grab from your dataset.
Indeed #nbro answer helps, just to add few more scenarios, lets say you are training some seq to seq model or a large network with one or more encoders. We can create custom training loops using train_on_batch and use a part of our data to validate on the encoder directly without using callbacks. Writing callbacks for a complex validation process could be difficult. There are several cases where we wish to train on batch.
Regards,
Karthick
From Keras - Model training APIs:
fit: Trains the model for a fixed number of epochs (iterations on a dataset).
train_on_batch: Runs a single gradient update on a single batch of data.
We can use it in GAN when we update the discriminator and generator using a batch of our training data set at a time. I saw Jason Brownlee used train_on_batch in on his tutorials (How to Develop a 1D Generative Adversarial Network From Scratch in Keras)
Tip for quick search: Type Control+F and type in the search box the term that you want to search (train_on_batch, for example).

Can SVM solution change after shuffling the inputs?

When training a support vector machine (SVM) for classification with exactly the same data I obtain different results based on the order of the inputs, ie. if I shuffle the data I get different SVMs.
If I understood the theory correctly, the SVM solution should be the same regardless of the order of the inputs, so how come I get the different results? Is there any implementation "detail" in SVM why shuffling would change the solution? I have already checked my code several times, because I think this smells.
I use the SVM implementation in OpenCV.
EDIT: in this case, by shuffling I refer to changing the order of the data points not features.
I am not familiar with the OpenCV implementation. But do this: run several trials on exactly the same data set -- no shuffling, same order, same data points. See if the SVM changes. Obviously, in theory, it shouldn't. But it could be that there is some small randomization step somewhere in the implementation that produces different outputs for the same input.
Edit: As Chris A. asks, do the feature vectors correspond to their proper labels after shuffling? If not, that would obviously destroy your results.
SVM is for solving convex optimization problem, so maximum is unique. That means any random optimization algorithms will solve problem very close to unique optimal solution. And shuffling can't change result above float-point operation accuracy.

Resources