Improving K Means on some data sets - machine-learning

Anyone got an idea on how a simple K-means algorithm could be tuned to handle data sets of this form.

The most direct way to handle data of that form while still using k-means it to use a kernelized version of k-means. 2 implemtations of it exist in the JSAT library (see here https://github.com/EdwardRaff/JSAT/blob/67fe66db3955da9f4192bb8f7823d2aa6662fc6f/JSAT/src/jsat/clustering/kmeans/ElkanKernelKMeans.java)
As Nicholas said, another option is to create a new feature space on which you run k-means. However this takes some prior knowledge of what kind of data you will be clustering.
After that, you really just need to move to a different algorithm. k-means is a simple algorithm that makes simple assumptions about the world, and when those assumptions are too strongly violated (non linearly separable clusters being one of those assumptions) then you just have to accept that and pick a more appropriate algorithm.

One possible solution to this problem is to add another dimension to your data set, for which there is a split between the two classes.
Obviously this is not applicable in many cases, but if you have applied some sort of dimensionality reduction to your data, then it may be something worth investigating.

Related

Transfer Learning for small datasets of structured data

I am looking to implement machine learning for a problems that are built on small data sets related to approvals of expenses in a specific supply chain domain. Typically labelled data is unavailable
I was looking to build models in one data set that I have labelled data and then use that model developed in similar contexts- where the feature set is very similar, but not identical. The expectation is that this allows the starting point for recommendations and gather labelled data in the new context.
I understand this is the essence of Transfer Learning. Most of the examples I read in this domain speak of image data sets- any guidance how this can be leveraged in small data sets using standard tree-based classification algorithms
I can’t really speak to tree-based algos, I don’t know how to do transfer learning with them. But, for deep learning models, the customary method for transfer learning is to load up a pretrained model, then retrain the last layer of the dataset using your new data, and then fine-tune the rest of the network.
If you don’t have much data to go on, you might look into creating synthetic data.
raghu, I believe you are looking for a kernel method when you are saying abstraction layer in deep learning. There are several ML algorithms that support kernel functions. With kernel functions, you might be able to do it; but using kernel functions might be more complex than solving your original problem. I would lean toward Tdoggo's suggestion of using Decision Tree.
Sorry, I want to add a comment, but they won't allow me, so I posted a new answer.
Ok with tree-based algos you can do just what you said: train the tree on one dataset and apply it to another similar dataset. All you would need to do is change the terms/nodes on the second tree.
For instance, let’s say you have a decision tree trained for filtering expenses for a construction company. You will outright deny any reimbursements for workboots, because workers should provide those themselves.
You want to use the trained tree on your accounting firm, and so instead of workboots, you change that term to laptops, because accountants should be buying their own.
Does that make sense, and is that helpful to you?
After some research, we have decided to proceed with random forest models with the intuition that trees in the original model that have common features will form the starting point for decisions.
As we gain more labelled data in the new context, we will start replacing the original trees with new trees that comprise of (a)only new features and (b) combination of old and new features
This has worked to provide reasonable results in initial trials

How to clarify which model layers to use for machine learning?

We are currently doing a little experiment with machine learning with Deeplearning4j.
We have voltage measurements in time series from different devices that I know that depends on each other.
We manage to labeling huge amount of those data with one and zeroes.
Our problem is to figure out the use of layers for the model.
For us it seems that it is experience that it is used among people and examples seems to be random.
We currently using LSTM and RNN
But how can we clarify if there is better models?
We would like to see if the model can figure out some dependencies through predictions that we haven’t noticed.
The best way to go about this, is to start by looking at your data and what you want to get out of it. Then you should start out by setting up a base line. Use the simplest possible modelling technique you are familiar with just so you have anything at all.
In your case it looks like you have a label for each timestep. So, you might just use simple linear regression for each timestep separately to get a feel for what you would get if you don't incorporate any sequence information at all. Anything that works fast is a good candidate for this step.
Once you have that baseline, you can start looking at building a deeplearning model that outperforms this baseline.
For time series data, you have two options at the moment in DL4J, either you use a recurrent layer like LSTM, or you use convolutions over time.
If you want to have an output at each timestep, then a recurrent layer is probably better for you. The convolutional approach usually works best if you want to have just a single result after reading in the whole sequence.
For choosing how wide those layers should be, and how many layers you should use, you will have to experiment a bit.
The first thing that you want to achieve is to build a model that can over-fit on a subset of your data. So you start out, by passing in only a single batch of examples over and over again. If the model can't overfit on that, you make the layers wider. If the layers start getting too wide, you add another layer on top.
If you use the deeplearning4j-ui module, it will tell you how many parameters your model currently has. They should usually be less than the number of total examples you have, or you risk overfitting on your full data set.
As soon as you can train a model to overfit on a small subset of your data, you can start training it with all of your data.
At that point you can then start looking into finding better hyperparameters and seeing by how much you can beat your baseline.

When true positives are rare

Suppose you're trying to use machine learning for a classification task like, let's say, looking at photographs of animals and distinguishing horses from zebras. This task would seem to be within the state of the art.
But if you take a bunch of labelled photographs and throw them at something like a neural network or support vector machine, what happens in practice is that zebras are so much rarer than horses that the system just ends up learning to say 'always a horse' because this is actually the way to minimize its error.
Minimal error that may be but it's also not a very useful result. What is the recommended way to tell the system 'I want the best guess at which photographs are zebras, even if this does create some false positives'? There doesn't seem to be a lot of discussion of this problem.
One of the things I usually do with imbalanced classes (or skewed data sets) is simply generate more data. I think this is the best approach. You could go out in the real world and gather more data of the imbalanced class (e.g. find more pictures of zebras). You could also generate more data by simply making copies or duplicating it with transformations (e.g. flip horizontally).
You could also pick a classifier that uses an alternate evaluation (performance) metric over the one usually used - accuracy. Look at precision/recall/F1 score.
Week 6 of Andrew Ng's ML course talks about this topic: link
Here is another good web page I found on handling imbalanced classes: link
With this type of unbalanced data problem, it is a good approach to learn patterns associated with each class as opposed to simply comparing classes - this can be done via unsupervised learning learning first (such as with autoencoders). A good article with this available at https://www.r-bloggers.com/autoencoders-and-anomaly-detection-with-machine-learning-in-fraud-analytics/amp/. Another suggestion - after running the classifier, the confusion matrix can be used to determine where additional data should be pursued (I.e. many zebra errors)

How to build a good training data set for machine learning and predictions?

I have a school project to make a program that uses the Weka tools to make predictions on football (soccer) games.
Since the algorithms are already there (the J48 algorithm), I need just the data. I found a website that offers football game data for free and I tried it in Weka but the predictions were pretty bad so I assume my data is not structured properly.
I need to extract the data from my source and format it another way in order to make new attributes and classes for my model. Does anyone know of a course/tutorial/guide on how to properly create your attributes and classes for machine learning predictions? Is there a standard that describes the best way of choosing the attributes of a data set for training a machine learning algorithm? What's the approach on this?
here's an example of the data that I have at the moment: http://www.football-data.co.uk/mmz4281/1516/E0.csv
and here is what the columns mean: http://www.football-data.co.uk/notes.txt
The problem may be that the data set you have is too small. Suppose you have ten variables and each variable has a range of 10 values. There are 10^10 possible configurations of these variables. It is unlikely your data set will be this large let alone cover all of the possible configurations. The trick is to narrow down the variables to the most relevant to avoid this large potential search space.
A second problem is that certain combinations of variables may be more significant than others.
The J48 algorithm attempts to to find the most relevant variable using entropy at each level in the tree. each path through the tree can be thought of as an AND condition: V1==a & V2==b ...
This covers the significance due to joint interactions. But what if the outcome is a result of A&B&C OR W&X&Y? The J48 algorithm will find only one and it will be the one where the the first variable selected will have the most overall significance when considered alone.
So, to answer your question, you need to not only find a training set which will cover the most common variable configurations in the "general" population but find an algorithm which will faithfully represent these training cases. Faithful meaning it will generally apply to unseen cases.
It's not an easy task. Many people and much money are involved in sports betting. If it were as easy as selecting the proper training set, you can be sure it would have been found by now.
EDIT:
It was asked in the comments how to you find the proper algorithm. The answer is the same way you find a needle in a haystack. There is no set rule. You may be lucky and stumble across it but in a large search space you won't ever know if you have. This is the same problem as finding the optimum point in a very convoluted search space.
A short-term answer is to
Think about what the algorithm can really accomplish. The J48 (and similar) algorithms are best suited for classification where the influence of the variables on the result are well known and follow a hierarchy. Flower classification is one example where it will likely excel.
Check the model against the training set. If it does poorly with the training set then it will likely have poor performance with unseen data. In general, you should expect the model to performance against the training to exceed the performance against unseen data.
The algorithm needs to be tested with data it has never seen. Testing against the training set, while a quick elimination test, will likely lead to overconfidence.
Reserve some of your data for testing. Weka provides a way to do this. The best case scenario would be to build the model on all cases except one (Leave On Out Approach) then see how the model performs on the average with these.
But this assumes the data at hand are not in some way biased.
A second pitfall is to let the test results bias the way you build the model.For example, trying different models parameters until you get an acceptable test response. With J48 it's not easy to allow this bias to creep in but if it did then you have just used your test set as an auxiliary training set.
Continue collecting more data; testing as long as possible. Even after all of the above, you still won't know how useful the algorithm is unless you can observe its performance against future cases. When what appears to be a good model starts behaving poorly then it's time to go back to the drawing board.
Surprisingly, there are a large number of fields (mostly in the soft sciences) which fail to see the need to verify the model with future data. But this is a matter better discussed elsewhere.
This may not be the answer you are looking for but it is the way things are.
In summary,
The training data set should cover the 'significant' variable configurations
You should verify the model against unseen data
Identifying (1) and doing (2) are the tricky bits. There is no cut-and-dried recipe to follow.

Feature combination/joint features in supervised learning

While trying to come up with appropriate features for a supervised learning problem I had the following idea and wondered if it makes sense and if so, how to algorithmically formulate it.
In an image I want to classify two regions, i.e. two "types" of pixels. Say I have some bounded structure, let's take a circle, and I know I can limit my search space to this circle. Within that circle I want to find a segmenting contour, i.e. a contour that separates my pixels into an inner class A and an outer class B.
I want to implement the following model:
I know that pixels close to the bounding circle are more likely to be in the outer class B.
Of course, I can use the distance from the bounding circle as a feature, then the algorithm would learn the average distance of the inner contour from the bounding circle.
But: I wonder if I can exploit my model assumption in a smarter way. One heuristic idea would be to weigh other features by this distance, so to say, if a pixel further away from the bounding circle wants to belong to the outer class B, it has to have strongly convincing other features.
This leads to a general question:
How can one exploit joint information of features, that were prior individually learned by the algorithm?
And to a specific question:
In my outlined setup, does my heuristic idea make sense? At what point of the algorithm should this information be used? What would be recommended literature or what would be buzzwords if I wanted to search for similar ideas in the literature?
This leads to a general question:
How can one exploit joint information of features, that were prior individually learned by the algorithm?
It is not really clear what you are really asking here. What do you mean by "individually learned by the algorithm" and what would be "joiint information"? First of all, problem is too broad, there is no such tring as "generic supervised learning model", each of them works in at least slightly different way, most falling into three classes:
Building a regression model of some kind, to map input data to the output and then agregate results for classification (linear regression, artificial neural networks)
Building geometrical separation of data (like support vector machines, classification-soms' etc.)
Directly (more or less) estimating probability of given classes (like Naive Bayes, classification restricted boltzmann machines etc.)
in each of them, there is somehow encoded "joint information" regarding features - the classification function is their joint information. In some cases it is easy do interpret (linear regression) and in some it is almost impossible (deep boltzmann machines, generally all deep architectures).
And to a specific question:
In my outlined setup, does my heuristic idea make sense? At what point of the algorithm should this information be used? What would be recommended literature or what would be buzzwords if I wanted to search for similar ideas in the literature?
To my best knowledge this concept is quite doubtfull. Many models tends to learn and work better, if your data is uncorrelated, while you are trying to do the opposite - correlate everything with some particular feature. This leads to one main concern - why are you doing this? To force model to use mainly this feature?
If it is so important - maybe a supervised learning is not the good idea, maybe you can directly model your problem by appling set of simple rules based on this particular feature?
If you know the feature is important, but you are aware that in some cases other things matter, and you cannot model them, then your problem will be how much to weight your feature. Should it be just distance*other_feature? Why not sqrt(distance)*feature? What about log(distance)*feature? There are countless possibilities, and seek for the best weighting scheme may be much more costfull, then finding a better machine learning model, which can learn your data from its raw features.
If you only suspect the importance of the feature, the best possible option would be to... do not trust this belief. Numerous studies have shown, that machine learning models are better in selecting features then humans. In fact, this is the whole point of non-linear models.
In literature, problem they you are trying to solve is generally refered as incorporating expert knowledge into the learning process. There are thousands of examples, where there is some kind of knowledge that cannot be directly encoded in data representation, yet too valuable to omit it. You should research terms like "machine learning expert knowledge", and its possible synomyms.
There's a fair amount of work treating the kind of problem you're looking at (which is called segmentation) as an optimisation to be performed on a Markov Random Field, which can be solved by graph theoretic methods like GraphCut. Some examples are the work of Pushmeet Kohli at Microsoft Research (try this paper).
What you describe is, in that framework, a prior on node membership, where p(B) is inversely proportional to the distance from the edge (in addition to any other connectivity constraints you want to impose, there's normally a connectedness one, and there will certainly be a likelihood term for the pixel's intensity). The advantage of doing this is that if you can express everything as a probability model, you don't need to rely on heuristics and you can use standard mechanisms for performing inference.
The downside is you need a fairly strong mathematical background to attempt this; I don't know what the scale of the project you're proposing is, but if you want results quickly or you're lacking the necessary background this is going to be pretty daunting.

Resources