I am classifying data with a trained model and the results vary with the size of the dataset. For example, suppose I start with n rows, classify them, and get a set of results X. If I then add m rows, so the dataset has n+m rows, and classify it again, the results for the first n rows are different as well, and the change is not negligible. Can anyone provide insight into this? Please let me know if the question is not clear. I am using R and the classifier is an SVM.
If I understood you correctly, the reason is that an SVM model is a representation of all your samples as points in space.
Just from Wikipedia:
That means all your data is mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
All your examples are mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
Since all the data is mapped, a new dataset could mean a new division, affecting your final result.
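Here is a minimal sketch of that effect in Python, using scikit-learn's SVC just for illustration (you are using R, but the behaviour is the same); the synthetic data, kernel and sample sizes are assumptions, and the point is only that refitting the SVM after adding rows can move the decision boundary and change the labels predicted for the original n rows.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n, m = 200, 200

# n original rows plus m additional rows, drawn from slightly different regions
X_n = rng.normal(0.0, 1.0, size=(n, 2))
y_n = (X_n[:, 0] + X_n[:, 1] > 0).astype(int)
X_m = rng.normal(1.5, 1.0, size=(m, 2))
y_m = (X_m[:, 0] - X_m[:, 1] > 0).astype(int)

# Model trained on the first n rows only
pred_before = SVC(kernel="rbf").fit(X_n, y_n).predict(X_n)

# Model retrained on all n + m rows
X_all = np.vstack([X_n, X_m])
y_all = np.concatenate([y_n, y_m])
pred_after = SVC(kernel="rbf").fit(X_all, y_all).predict(X_n)

# Some of the first n rows may now receive a different label
print("changed predictions:", (pred_before != pred_after).sum())
```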
Can I use a PCA subspace trained on, say, eight features and one thousand time points to evaluate a single reading? That is, if I keep, say, the top six components, my transformation matrix will be 8x6, and using this to transform test data that is the same size as the training data would give me a 6x1000 matrix.
But what if I want to look for anomalies at each time point independently? That is, rather than use an 8x1000 test set, can I apply 1000 separate transformations to 8x1 test vectors and get the same result? Each vector will be transformed to exactly the same spot as if it were the first row of a much larger data matrix, but the distance of that one vector from the principal axis doesn't appear to be meaningful. When I perform the same procedure on the truncated reference data, this distance isn't zero either; only the sum of all distances over the entire reference data set is zero. So if I can't show that the reference data is not "anomalous", how can I use this on test data?
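Here is a quick check of the "same spot" part, using scikit-learn's PCA purely for illustration (the shapes follow my setup: 1000 time points, 8 features, top 6 components; the random data is a stand-in for my readings):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.normal(size=(1000, 8))   # reference data: 1000 time points x 8 features

pca = PCA(n_components=6).fit(X_train)

batch_scores = pca.transform(X_train)        # 1000 x 6, whole matrix at once
single_score = pca.transform(X_train[:1])    # 1 x 6, the first row on its own

# The single vector lands in exactly the same spot either way
print(np.allclose(batch_scores[0], single_score[0]))   # True
```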
Is it the case that the size of the data "object" used to train the PCA is the size of object that can be evaluated with it?
Thanks for any help you can give.
I'm interested in taking advantage of some partially labeled data that I have in a deep learning task. I'm using a fully convolutional approach, not sampling patches from the labeled regions.
I have masks that outline regions of definite positive examples in an image, but the unmasked regions in the images are not necessarily negative - they may be positive. Does anyone know of a way to incorporate this type of class in a deep learning setting?
Triplet/contrastive loss seems like it may be the way to go, but I'm not sure how to accommodate the "fuzzy" or ambiguous negative/positive space.
Try label smoothing as described in section 7.5.1 of the Deep Learning book:
We can assume that for some small constant eps, the training set label y is correct with probability 1 - eps, and otherwise any of the other possible labels might be correct.
Label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of eps / (k - 1) and 1 - eps, respectively.
See my question about implementing label smoothing in Pandas.
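If it helps, here is a minimal numpy sketch of the smoothing itself (the eps value and the array shapes are illustrative assumptions, not from the book):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Replace hard 0/1 targets with eps/(k-1) and 1-eps."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + (1.0 - one_hot) * eps / (k - 1)

y = np.eye(3)[[0, 2, 1]]          # three one-hot labels for k = 3 classes
print(smooth_labels(y, eps=0.1))
# each row still sums to 1: the true class gets 0.9, the others 0.05 each
```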
Otherwise, if you know for sure that some areas are negative and others are positive while some are uncertain, you can introduce a third, "uncertain" class. I have worked with data sets that contained an uncertain class, which corresponded to samples that could belong to any of the available classes.
I'm assuming that you are struggling with a segmentation task where the background is ill-defined (e.g. you are not sure whether all examples are correctly labeled). I recently ran into a similar problem, and this is what I found during my research:
Before deep learning, and in its early days, the common way to deal with this was to smooth the output with some kind of probability model that takes the possibility of noisy labels into account (you can read about this in the Learning to Label from Noisy Data chapter of this book). It's important to distinguish these probabilistic models from models used to smooth your labels with respect to the image or label structure, such as classical CRFs for bilateral smoothing.
What we finally used (and what worked really well) is the Channel Inhibited Softmax idea from this paper. In terms of its mathematical properties, it makes your network much more robust to objects that are not labeled, because it pushes the network to output much higher positive logits at correctly labeled objects.
You could treat this as a semi-supervised problem. Use the full dataset without labels to train a bottleneck autoencoder structure (or a GAN approach). This pretrained model can then be adjusted (e.g. by removing the last layers and adding a better layer structure on top of the bottleneck features) and fine-tuned on the labeled data.
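As a rough illustration of the idea (using tf.keras; the layer sizes, input dimension and the placeholder array names `X_unlabeled`, `X_labeled`, `y_labeled` are assumptions, not a prescription):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

input_dim, bottleneck_dim = 784, 32

# 1) Train an autoencoder on ALL data, ignoring labels.
inputs = layers.Input(shape=(input_dim,))
h = layers.Dense(128, activation="relu")(inputs)
bottleneck = layers.Dense(bottleneck_dim, activation="relu", name="bottleneck")(h)
h = layers.Dense(128, activation="relu")(bottleneck)
outputs = layers.Dense(input_dim, activation="sigmoid")(h)

autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_unlabeled, X_unlabeled, epochs=20, batch_size=128)

# 2) Keep the encoder, drop the decoder, add a classification head,
#    and fine-tune on the labeled subset only.
encoder = models.Model(inputs, bottleneck)
clf_head = layers.Dense(1, activation="sigmoid")(encoder.output)
classifier = models.Model(encoder.input, clf_head)
classifier.compile(optimizer="adam", loss="binary_crossentropy")
# classifier.fit(X_labeled, y_labeled, epochs=10, batch_size=64)
```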
I am working in bioinformatics and have a data set of amino acid composition sequences. I want to classify these sequences into a positive and a negative class using an SVM, and I am using the libsvm tool for this. My dataset contains 3909 rows, but when I apply libsvm's svm-train function to generate the model file, the resulting model file contains only 2233 rows. So the dimension of my dataset seems to be reduced from 3909 to 2233. I do not understand why this is happening. Kindly help me out.
The model retains only the support vectors needed to define the classes. Frankly, I'm surprised that it retained so many of the original rows.
Your terminology is not correct. "Dimension" is the number of features (columns), not the number of rows. The dimensionality has not been reduced. One way to think of this is that it took 2233 of the observations to define the entire border between positive and negative. The other 1676 points are "behind" other data points, farther from the border.
For a really simple example, consider all integers as the data points. We classify them simply: all points larger than pi (3.14159...) are in the positive set; all smaller ones are marked negative. Feed this to the SVM algorithm -- and what you get back is only two rows: 3 is negative; 4 is positive. All other points are "behind" one or the other.
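If you want to see this in code, here is a small illustration using scikit-learn's SVC (which wraps libsvm); the synthetic data is purely for demonstration, and the exact count of support vectors will vary, but it is typically far smaller than the number of training rows.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.normal(size=(3909, 20))                    # 3909 rows, 20 features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # a simple labeling rule

model = SVC(kernel="rbf").fit(X, y)
print("training rows:   ", X.shape[0])
print("support vectors: ", model.support_vectors_.shape[0])  # much smaller
```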
Does that help?
Suppose one partitions the data into training/validation/test sets for further application of some classification algorithm, and it happens that the training set does not contain all class labels present in the complete dataset - say, records with label "x" appear only in the validation set and not in the training set.
Is this a valid partitioning? It can have several consequences: the confusion matrix would no longer be square, and any error evaluated during training would be affected by labels unseen in the training set.
The second question is: is it common for partitioning algorithms to take care of this issue and partition the data so that the training set contains all existing labels?
This is what stratified sampling is supposed to solve.
https://en.wikipedia.org/wiki/Stratified_sampling
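For example, with scikit-learn (a sketch on synthetic data; any stratified splitter behaves the same way), the `stratify` argument keeps the class proportions in every partition, so even a rare label ends up on both sides of the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = np.array([0] * 90 + [1] * 7 + [2] * 3)     # a rare class with only 3 records

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Each split keeps (approximately) the original class proportions,
# so the rare class appears in both the training and the test set.
print(np.bincount(y_train), np.bincount(y_test))
```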
I have been reading about Self-Organizing Maps, and I understand the algorithm (I think), but something still eludes me.
How do you interpret the trained network?
How would you then actually use it for, say, a classification task (once you have done the clustering with your training data)?
All of the material I can find (printed and digital) focuses on training the algorithm. I believe I may be missing something crucial.
Regards
SOMs are mainly a dimensionality reduction algorithm, not a classification tool. They are used for dimensionality reduction just like PCA and similar methods (once trained, you can check which neuron is activated by your input and use that neuron's position as the reduced value); the main difference is their ability to preserve a given topology in the output representation.
So what a SOM actually produces is a mapping from your input space X to a reduced space Y (most commonly a 2D lattice, making Y a two-dimensional space). To perform actual classification, you transform your data through this mapping and then run some other classification model (SVM, neural network, decision tree, etc.) on the result.
In other words, SOMs are used to find another representation of the data: one that is easy for humans to analyze further (as it is usually two-dimensional and can be plotted) and easy for any subsequent classification model to work with. This is a great method for visualizing high-dimensional data and analyzing what is going on, how classes are grouped geometrically, and so on. But they should not be confused with other neural models like artificial neural networks or even growing neural gas (which is a very similar concept, yet gives a direct clustering of the data), as they serve a different purpose.
Of course, one can use SOMs directly for classification, but this is a modification of the original idea, requires a different data representation, and in general does not work as well as using some other classifier on top of the SOM.
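As a rough sketch of "SOM for the representation, another classifier on top", here is an example using the MiniSom package and scikit-learn purely for illustration (the map size, iteration count and dataset are assumptions; any SOM implementation would do):

```python
import numpy as np
from minisom import MiniSom
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1) Train the SOM: map the 4-dimensional inputs onto a 10x10 lattice.
som = MiniSom(10, 10, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X_train, 1000)

# 2) Map every sample to the 2D position of its winning neuron.
to_2d = lambda data: np.array([som.winner(x) for x in data])

# 3) Run an ordinary classifier on the reduced 2D representation.
clf = KNeighborsClassifier().fit(to_2d(X_train), y_train)
print("accuracy on the 2D SOM representation:", clf.score(to_2d(X_test), y_test))
```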
EDIT
There are at least a few ways of visualizing a trained SOM:
one can render the SOM's neurons as points in the input space, with edges connecting the topologically close ones (this is possible only if the input space has a small number of dimensions, like 2-3)
one can display data classes on the SOM's topology - if your data is labeled with some numbers {1,...,k}, we can bind k colors to them; for the binary case, let us consider blue and red. Then, for each data point, we calculate its corresponding neuron in the SOM and add that label's color to the neuron. Once all data have been processed, we plot the SOM's neurons, each at its original position in the topology, with the color being some aggregate (e.g. mean) of the colors assigned to it. If we use a simple topology like a 2D grid, this gives us a nice low-dimensional representation of the data. In the following image, the subimages from the third one onward are the results of such a visualization, where red means label `1` (the "yes" answer) and blue means label `2` (the "no" answer)
one can also visualize the inter-neuron distances by calculating how far apart connected neurons are and plotting this on the SOM's map (the second subimage in the above visualization)
one can cluster the neurons' positions with some clustering algorithm (like K-means) and visualize the cluster ids as colors (the first subimage); a small sketch of this is given below
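A minimal sketch of that last idea, reusing the `som` object from the earlier snippet (an assumption): the neuron weights are clustered with K-means and the cluster ids are shown as colors on the 10x10 map.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

weights = som.get_weights()                      # shape (10, 10, input_len)
flat = weights.reshape(-1, weights.shape[-1])    # one row per neuron

cluster_ids = KMeans(n_clusters=3, random_state=0).fit_predict(flat)

plt.imshow(cluster_ids.reshape(weights.shape[0], weights.shape[1]))
plt.title("SOM neurons colored by K-means cluster id")
plt.show()
```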