Binary classification, one or two targets? - machine-learning

I was wondering if there is a more or less correct way to set your targets when doing binary classification. Let's say I have a time series and I want to predict if the next entry will be UP or DOWN.
Personally I would lean towards setting this up with just one target, [1,0]; 1 for UP and 0 for DOWN. But I've also seen examples with two targets, [1,0] for UP and [0,1] for DOWN. Which (if any) is the correct approach?

You should use one target, as that is the simplest model that expresses all of the problem's constraints (Occam's razor). Two outputs will work too, but there is nothing to be gained, while you increase your memory consumption, computational complexity, etc. (by a tiny factor, but you do).
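To make the difference concrete, here is a minimal sketch of both setups, assuming Keras and synthetic window features; the data, layer sizes and training settings are placeholders rather than anything from the question.

import numpy as np
import tensorflow as tf

X = np.random.randn(500, 10).astype("float32")     # hypothetical windowed features
y = (np.random.rand(500) > 0.5).astype("float32")  # 1 = UP, 0 = DOWN

# Option A: one target -- a single sigmoid unit trained with binary cross-entropy.
single = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
single.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
single.fit(X, y, epochs=5, verbose=0)

# Option B: two targets -- a softmax over the two classes; it can express exactly
# the same decision functions, just with a few more parameters.
double = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
double.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])
double.fit(X, y.astype("int32"), epochs=5, verbose=0)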

Related

A more imbalanced approach to compute_class_weight

I have a large multi-label array with numbers between 0 and 65. I'm using the following code to generate class weights:
from sklearn.utils import class_weight
import numpy as np

class_weights = class_weight.compute_class_weight(
    class_weight='balanced', classes=np.unique(labels), y=labels)
where the labels array is the one containing the numbers between 0 and 65.
I'm using this to fit a model with the class_weight flag. The reason is that I have many examples of "0" and "1" but few examples of the classes above 1, and I wanted the model to give more weight to the classes with fewer examples. This helped a lot; however, I can now see that the model gives too much weight to the rare classes and somewhat neglects the most frequent ones (0 and 1). I'm trying to find a middle ground, and would love some tips on how to proceed.
There are two ways you can achieve this, provided you have done the weight assignment correctly in the first place, i.e. given more weight to the rarer labels and less to the frequent ones, which you presumably have already done.
Reduce the number of examples of the most frequent labels (in your case 0 and 1) to a level comparable with the other labels, provided this does not shrink your dataset by too large a margin. This is often not feasible when the other labels are far rarer, so it is something you have to judge for yourself.
The other, and usually more plausible, solution is to either oversample the rarer labels by creating copies of them, or undersample the most frequent labels.
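Another hedged option for a middle ground is to interpolate between uniform and fully "balanced" weights instead of using the balanced ones directly. A minimal sketch, where the blending factor alpha is an illustrative assumption to be tuned on a validation set:

import numpy as np
from sklearn.utils import class_weight

labels = np.random.randint(0, 66, size=10_000)   # placeholder for the real label array (0..65)
classes = np.unique(labels)

balanced = class_weight.compute_class_weight(
    class_weight='balanced', classes=classes, y=labels)

alpha = 0.5  # 0.0 = no re-weighting, 1.0 = fully balanced
softened = alpha * balanced + (1 - alpha) * np.ones_like(balanced)

class_weights = dict(zip(classes, softened))     # pass this via the model's class_weight flag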

Cloud labels affecting % testing accuracy?

I have 96 features, and the labels are represented by 1 and -1 as input to a deep learning model.
1- PCA
Here the 3 axes represent the first 3 principal components. The blue cloud represents the label 1 and the red cloud represents the label -1.
Even though we can visually identify two different clouds, they are stuck together. I think we may face a problem during the training phase because of that.
2- t-SNE
For the same features and labels with t-SNE, we can still distinguish two clouds, but again they are stuck together.
Questions :
1- Can the fact that the two clouds of dots are stuck together affect the % accuracy during the training and testing phases?
2- When we remove the red and blue colours, we have essentially one big cloud. Is there a way to work around the problem of the two clouds being ''stuck'' together?
What you call sticking together means that, in this space, your data isn't linearly separable. It doesn't seem to be nonlinearly separable either. With these components, I would expect you to get poor accuracy for sure.
The way to work around the problem is more or different data. You have some options.
1) What about including more principal components? Maybe 4, 5 or 10 components would solve your problem. That might not work depending on your dataset, but it's the most obvious thing to try first (see the sketch after this list).
2) You could try alternative matrix decomposition techniques. PCA isn't the only one. There's NMF, kernel PCA, LSA, and many others. Which one works best for you will fundamentally be determined by the distribution of your data.
3) Use any other type of feature selection. Frankly, 96 features isn't that many to begin with. You intend to do deep learning? Wouldn't you normally put all 96 features into a deep learning model? There are many other ways to do feature selection besides matrix decomposition if you need to.
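A minimal sketch of options 1) and 2) with scikit-learn, assuming a hypothetical (n_samples, 96) feature matrix in place of the real data:

import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import StandardScaler

X = np.random.randn(1000, 96)            # placeholder for the real features
X = StandardScaler().fit_transform(X)    # scaling matters before PCA

# Option 1: check how much variance the first k components actually retain.
pca = PCA(n_components=10).fit(X)
print(np.cumsum(pca.explained_variance_ratio_))

# Option 2: a non-linear alternative such as kernel PCA.
X_kpca = KernelPCA(n_components=10, kernel="rbf").fit_transform(X)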
Good luck.

finding better neighbour in Simulated annealing

I am solving the TSP using simulated annealing, and I have a question.
In https://en.wikipedia.org/wiki/Simulated_annealing, the "Efficient candidate generation" section says:
the travelling salesman problem above, for example, swapping two consecutive cities in a low-energy tour is expected to have a modest effect on its energy (length); whereas swapping two arbitrary cities is far more likely to increase its length than to decrease it. Thus, the consecutive-swap neighbour generator is expected to perform better than the arbitrary-swap one.
So I generated the first city randomly and took the second one consecutive to the first, but the solution got worse.
Am I doing something wrong?
Initially you need to explore the whole solution surface. You can do that in two ways: either by generating effectively random candidates, or by using a high temperature. If you don't use method one, you must use method two, which means ramping up the temperature until essentially all moves are accepted, and then reducing it as slowly as you are able. A "swap adjacent cities" move will then produce a reasonable result.
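For reference, a minimal, self-contained sketch of that setup: a high starting temperature, slow geometric cooling, and a consecutive-swap neighbour. The coordinates, temperatures and cooling rate are illustrative rather than tuned.

import math
import random

cities = [(random.random(), random.random()) for _ in range(30)]   # placeholder coordinates

def tour_length(tour):
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

tour = list(range(len(cities)))
random.shuffle(tour)
cur_len = tour_length(tour)
best, best_len = tour[:], cur_len

T, cooling = 10.0, 0.999             # start hot, cool slowly
while T > 1e-3:
    i = random.randrange(len(tour))
    j = (i + 1) % len(tour)          # consecutive city -> small change in length
    cand = tour[:]
    cand[i], cand[j] = cand[j], cand[i]
    cand_len = tour_length(cand)
    # Always accept improvements; accept worse tours with Boltzmann probability.
    if cand_len < cur_len or random.random() < math.exp((cur_len - cand_len) / T):
        tour, cur_len = cand, cand_len
        if cur_len < best_len:
            best, best_len = tour[:], cur_len
    T *= cooling

print(best_len)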

Relation between features for classification and clustering

I'm a newbie to machine learning, and I have the following question. Suppose I have implemented a classification algorithm on some data and identified the best combination of features for that algorithm. If I later get data from the same source which lacks the target feature used in the previous classification task, can I use that best combination of features directly for a clustering task? (I know I can use the model I trained to predict the target for the new data, but I just want to know whether the best combination of features is the same for classification and clustering algorithms.)
I have searched websites and every resource I know, but I can't find the answer to my question. Could somebody tell me, or just give me a link? Thanks!
I would say yes, provided the nature of the target is the same in both cases. What we ideally want is a tractable number of features which are orthogonal (perpendicular) to each other in N-dimensional space, so that each can contribute maximally to the prediction.
Take a concrete example: T-shirts and whether they are large or small. You are given data showing that the manufacturing process involves a bit of material shrinkage, so the T-shirts come out a bit irregular, and the shrinkage differs between height and width, though not by much. The data shows height, width and colour, and you want to decide whether each shirt falls in the large group or the small one. You find that height and width are important but colour is not, so you decide to go with height and width as your classification features.
The important point is that these two features have been identified as the most orthogonal to each other, which should apply in a classification or clustering context. The number of clusters remains a factor to be examined.
It may not be good enough.
For example, a decision tree or random forest can be analysed to get the importance of features. But this will not tell you what kind of preprocessing (in particular scaling and weighting) is necessary to be able to cluster on them (categorical features are difficult to use, and anything that is not continuous, or that is skewed, is hard).
Furthermore, data tends to change over time. Features that were important once (e.g. Facebook likes) are useless now.
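A minimal sketch of the workflow both answers touch on: rank features with a random forest, keep the top ones, scale them, then cluster. The dataset, the number of features kept and the number of clusters are all illustrative choices.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           random_state=0)             # placeholder labelled data

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:4]    # indices of the best features

X_top = StandardScaler().fit_transform(X[:, top])      # scaling matters for k-means
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_top)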

Machine learning algorithm for this task?

Trying to write some code that deals with this task:
As a starting point, I have around 20 "profiles" (imagine a landscape profile), i.e. one-dimensional arrays of around 1000 real values.
Each profile has a real-valued desired outcome, the "effective height".
The effective height is some sort of average, but the height, width and position of the peaks play a particular role.
My aim is to generalize from the input data so as to calculate the effective height for further profiles.
Is there a machine learning algorithm or principle that could help?
Principle 1: Extract the most important features, instead of feeding it everything
As you said, "The effective height is some sort of average but height, width and position of peaks play a particular role." So that you have a strong priori assumption that these measures are the most important for learning. If I were you, I would calculate these measures at first, and use them as the input for learning, instead of the raw data.
Principle 2: When choosing a learning algorithm, the first thing to care about is linear separability
Suppose the effective height is a function of those measures; then you have to think about to what extent that function is linear. If it is almost linear, a very simple perceptron would be perfect. If it is far from linear, you might want to pick a multi-layer neural network. If it is very far from linear, please go back to Principle 1 and check whether you are extracting the right features.
Principle 3: More data helps
As you said, you have around 20 "profiles" for training. Generally speaking, that's not enough. Almost all machine learning algorithms were designed for reasonably large datasets; even the ones that claim to learn well from small samples usually don't mean samples as small as 20. Get more data!
Maybe multivariate linear regression suffices?
I would probably use a combination of what you said about which features play the most important role, and then train a regression on those. Basically, you need at least one coefficient corresponding to each feature, and you need substantially more data points than coefficients. So I would pick something like the heights and widths of the two biggest peaks; you've now reduced every profile to just 4 numbers. Now do this trick: divide the data into 5 groups of 4 profiles. Pick the first 4 groups, reduce those profiles to their 4 numbers, and use the desired outcomes to fit a regression. Once you have trained the regression, try it on the remaining 4 profiles and see how well it works. Repeat this procedure 5 times, each time leaving out a different group. This is called cross-validation, and it's very handy (a sketch follows below).
Obviously getting more data would help.
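A sketch of that procedure using scikit-learn's cross-validation helpers, with placeholder arrays standing in for the real peak features and effective heights:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X = np.random.rand(20, 4)      # placeholder: 4 peak features per profile
y = np.random.rand(20)         # placeholder: the known effective heights

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_absolute_error")
print(-scores.mean())          # average absolute error over the 5 held-out groups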
