Above is the image I got after using
plt.scatter(x[feature1],x[classes],c=x[classes])
Now, it appears to me (novice) that the classes are well separated by this feature1.
On applying an RF classifier I get approximately 55% accuracy. Noting that there are 7 classes in total, the prediction accuracy is well above the random baseline (~14%), but I am confused: if there is such a distinct separation (supposedly), why are the results not on similar terms?
Now, it appears to me (novice) that the classes are well separated by this feature1.
Are they? Let's try to see qualitatively, based on your plot...
For a good separation, we should have classes that do not overlap (or overlap minimally) in the given feature. The only classes that seem to exhibit this characteristic are classes 4 & 5, which have a relatively small overlap around 2500.
Let's take a value of 2750; according to your plot, the respective samples can be of class 1, 2, 3, 5, or 6.
Let's move a little higher, around 3000; the respective samples here can be of class 1, 2, 5, or 7.
Moving lower, around 2500 (the overlapping region of classes 4 & 5), it seems that the samples can be anything except class 7.
The only unambiguous separation seems to be for values greater than 3750 (class 7). Even the lowest values of the feature (< 2000) are shared between classes 3 and 6.
Viewing the big picture, it doesn't look like feature1 on its own offers any exceptional separation between your 7 classes; apart from the areas around its minimum & maximum values, all other ranges are shared by at least 3-4 classes...
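One way to make this kind of check less dependent on eyeballing the plot (not part of the original answer, just a sketch assuming x is a pandas DataFrame and feature1/classes hold its column names, as in the scatter call above):
# Print each class's feature1 range; heavily overlapping [min, max] intervals
# mean feature1 alone cannot tell those classes apart.
summary = x.groupby(classes)[feature1].agg(["min", "max", "mean", "count"])
print(summary.sort_values("min"))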
I have a large multi-label array with numbers between 0 and 65. I'm using the following code to generate class weights:
class_weights = class_weight.compute_class_weight('balanced',np.unique(labels),labels)
Here, labels is the array containing the numbers between 0 and 65.
I'm using this to fit a model with the class_weight flag. The reason is that I have many examples of "0" and "1" but a low number of examples with labels > 1, and I wanted the model to give more weight to the examples with lower counts. This helped a lot; however, I can now see that the model gives too much weight to the rarer examples and somewhat neglects the examples with the highest counts (1 and 0). I'm trying to find a middle-ground approach, and would love some tips on how to proceed.
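For reference, the described setup would typically be wired into training like this (a sketch only: the keyword-argument form is required by newer scikit-learn versions, and model/features are hypothetical placeholders for your own objects):
import numpy as np
from sklearn.utils import class_weight

# Balanced weights, one per unique label (0..65), as in the code above.
weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(labels), y=labels)
# A Keras-style fit() expects a {class_index: weight} mapping.
class_weight_dict = {int(c): w for c, w in zip(np.unique(labels), weights)}
model.fit(features, labels, class_weight=class_weight_dict)  # model and features are placeholders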
This is something you can achieve in two ways, provided you have done the weight assignment correctly, that is, giving more weight to the less frequent labels and vice versa, which you presumably have already done.
One option is to merge the most frequent labels (0 and 1 in your case) into a single label together with other labels, provided it does not diminish your dataset by too big a margin. However, this is more often than not infeasible when the other, less frequent labels are significantly rarer, and it is something you will have to decide on.
The other, and most plausible, solution would be to either oversample the less frequent labels by creating copies of them, or undersample the most frequent labels, as sketched below.
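A minimal sketch of the random-oversampling option (an illustration only, assuming features and labels are NumPy arrays of matching length):
import numpy as np

def random_oversample(features, labels, seed=0):
    # Duplicate samples of rare labels until every label is as frequent as the most common one.
    rng = np.random.default_rng(seed)
    unique, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    idx = []
    for lab, cnt in zip(unique, counts):
        lab_idx = np.where(labels == lab)[0]
        extra = rng.choice(lab_idx, size=target - cnt, replace=True)  # resample the shortfall
        idx.append(np.concatenate([lab_idx, extra]))
    idx = np.concatenate(idx)
    rng.shuffle(idx)
    return features[idx], labels[idx]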
I’m making a chess engine using machine learning, and I’m experiencing problems debugging it. I need help figuring out what is wrong with my program, and I would appreciate any help.
I did my research and borrowed ideas from multiple successful projects. The idea is to use reinforcement learning to teach a NN to differentiate between strong and weak positions.
I collected 3 million games with Elo over 2000 and used my own method to label them. After examining hundreds of games, I found that it's safe to assume that in the last 10 turns of any game the balance doesn't change, and the winning side has a strong advantage. So I picked positions from the last 10 turns and used two labels: one for a win for white and zero for a win for black. I didn't include any drawn positions. To avoid bias, I picked equal numbers of positions labeled as wins for each side, and equal numbers of positions with each side to move.
I represented each position by a vector of 773 elements: every piece on every square of the chess board, together with castling rights and the side to move, encoded as ones and zeros. My sequential model has an input layer with 773 neurons and an output layer with a single neuron. I used a three-hidden-layer MLP with 1546, 500, and 50 hidden units in layers 1, 2, and 3 respectively, with a dropout rate of 20% on each. Hidden layers use the non-linear ReLU activation, while the final output layer has a sigmoid output. I used the binary cross-entropy loss function and the Adam optimizer with all default parameters, except for the learning rate, which I set to 0.0001.
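For concreteness, the architecture as described would look roughly like this (a sketch assuming Keras, which the question does not name explicitly):
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(773,)),             # pieces per square + castling rights + side to move
    layers.Dense(1546, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(500, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(50, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid'),  # predicted probability that white wins
])
model.compile(loss='binary_crossentropy',
              optimizer=keras.optimizers.Adam(learning_rate=0.0001),
              metrics=['accuracy'])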
I used 3 percent of the positions for validation. During the first 10 epochs, validation accuracy gradually went up from 90 to 92%, just one percent behind training accuracy. Further training led to overfitting, with training accuracy going up, and validation accuracy going down.
I tested the trained model on multiple positions by hand and got pretty bad results. Overall, the model can predict which side is winning if that side has more pieces, or pawns close to promotion. It also gives the side to move a small advantage (0.1). But overall it doesn't make much sense. In most cases it heavily favors black (by ~0.3) and doesn't properly take the setup into account. For instance, it labels the starting position as ~0.0001, as if black has an almost 100% chance to win. Sometimes an irrelevant transformation of a position results in an unpredictable change in the evaluation. One king and one queen for each side is usually evaluated as a lost position for white (0.32), unless the black king is on a certain square, even though that doesn't really change the balance on the chessboard.
What I did to debug the program:
To make sure I had not made any mistakes, I analyzed, step by step, how each position is recorded. Then I picked a dozen positions from the final numpy array, right before training, and converted them back to analyze them on a regular chess board.
I used various numbers of positions from the same game (1 and 6) to make sure that using too many similar positions is not the cause of the fast overfitting. By the way, even one position per game in my database resulted in a data set of 3 million positions, which should be sufficient according to some research papers.
To make sure that the positions I use are not too simple, I analyzed them: 1.3 million of them had 36 points' worth of pieces (knights, bishops, rooks, and queens; pawns were not included in the count), 1.4 million had 19 points, and only 0.3 million had less.
Some things you could try:
Add unit tests and asserts wherever possible. E.g. if you know that some value is never supposed to go negative, add an assert to check that this condition really holds (see the sketch after this list).
Print shapes of all tensors to check that you have really created the architecture you intended.
Check if your model outperforms some simple baseline model.
You say your model overfits, so maybe simplify it / add regularization?
Check how your model performs on the simplest positions. E.g. can it recognize a checkmate?
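A minimal sketch of the first three suggestions, assuming a Keras-style model and NumPy arrays X of shape (n_samples, 773) with binary labels y (all three names are placeholders):
import numpy as np

assert X.shape[1] == 773, 'board encoding should have exactly 773 features'
assert set(np.unique(y)) <= {0, 1}, 'labels should be binary (white win = 1, black win = 0)'
assert not np.isnan(X).any(), 'encoded positions must not contain NaNs'

model.summary()  # prints every layer with its output shape and parameter count

# Trivial baseline: always predict the majority class; the network should beat this comfortably.
majority_accuracy = max(np.mean(y), 1 - np.mean(y))
print('majority-class baseline accuracy:', majority_accuracy)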
I have 96 features and the labels are represented by 1 and -1 for inputting to a deep learning model.
1- PCA
Here the 3 axes represent the first 3 principal components. The blue cloud represents the label 1 and the red cloud represents the label -1.
Even though we can identify two different clouds visually, they are stuck together. I think we may face a problem during the training phase because of that.
2- t-SNE
For the same features and labels with t-SNE, we can still distinguish two clouds, but again they are stuck together.
Questions:
1- Can the fact that the two clouds of dots are stuck together affect the accuracy during the training and testing phases?
2- When we remove the red and blue colors, we essentially have only one big cloud. Is there a way to work around the problem of the two clouds being "stuck" together?
What you call sticking together means that, in this space, your data isn't linearly separable. It doesn't seem to be nonlinearly separable either. With these components, I would certainly expect poor accuracy.
The way to work around the problem is more or different data. You have some options.
1) What about including more principal components? Maybe 4, 5, or 10 components would solve your problem. That might not work, depending on your dataset, but it's the most obvious thing to try first (see the sketch after this list).
2) You could try alternative matrix decomposition techniques. PCA isn't the only one. There's NMF, kernel PCA, LSA, and many others. Which one works best for you will fundamentally be determined by the distribution of your data.
3) Use any other type of feature selection. Frankly, 96 isn't that many to begin with. You intend on doing deep learning? Wouldn't you normally put all 96 features into a deep learning model? There are many other ways to do feature selection besides matrix decomposition if you need to.
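A quick sketch of option (1), assuming X is the (n_samples, 96) feature matrix:
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k in (3, 5, 10, 20):
    print(k, 'components explain', round(100 * cumulative[k - 1], 1), '% of the variance')
# If 3 components explain only a small fraction, the "stuck together" picture is
# hiding structure that lives in the remaining dimensions.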
Good luck.
I use example code to compare HSV histograms using EMD.
I want to find similar images in people's (mobile) picture library. It's quite common that people take several images of the same subject (in a row) with just slight changes: zooming in/out a bit, different angle, different exposure as a result of changing position, other pose, ....
I selected 4 sets of 4 similar images to test this algorithm. When comparing the images inside the sets, I get 22 EMD-L1 values between roughly 0.25 and 2.25 (average 1.47) and 2 outliers around 7.2.
When I cross-compare between sets, I get values between 2 and 15 with an average around 8.
Yes, there is a significant range difference between the two result sets. But I was disappointed that there was no gap between these ranges, and instead a small overlap [2.0, 2.25]. I'm hoping to improve the algorithm.
How can I optimise my comparison for my particular use-case? There are various histogram forms, various histogram comparison algorithms, and then each has various parameters.
Does OpenCV implement the fastest known EMD algorithm? I was surprised that the comparison of some histograms took up to a second, especially with the relatively small bin numbers.
Then, some cross-comparisons give good EMD results, but have totally different RGB histograms. Here are two images:
My current EMD-L1 says 1.95, but the RGB histograms are totally different.
Probably you've already refined your comparison method by now, but in case it isn't obvious: you could divide the image into overlapping subregions and then compute the EMD for all 4 parts.
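A sketch of that idea (using a non-overlapping 2x2 grid for brevity; overlapping regions would simply use larger slices), assuming same-size BGR images and OpenCV's cv2.EMD, which takes Nx(1+dims) float32 signatures of (weight, coordinates) rows:
import cv2
import numpy as np

def hs_signature(region, h_bins=8, s_bins=8):
    hsv = cv2.cvtColor(region, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [h_bins, s_bins], [0, 180, 0, 256])
    hist /= max(hist.sum(), 1e-9)  # normalize so region size doesn't matter
    rows = [[hist[h, s], h, s] for h in range(h_bins) for s in range(s_bins) if hist[h, s] > 0]
    return np.array(rows, dtype=np.float32)

def region_emd(img_a, img_b, grid=2):
    # Average EMD over a grid x grid split of the two images.
    h, w = img_a.shape[:2]
    dists = []
    for i in range(grid):
        for j in range(grid):
            ra = img_a[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            rb = img_b[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            emd, _, _ = cv2.EMD(hs_signature(ra), hs_signature(rb), cv2.DIST_L1)
            dists.append(emd)
    return float(np.mean(dists))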
My project is to create a software application that recognizes certain objects like an apple or a coin, etc. I want to use the Kinect. My question is: do I need a machine learning algorithm like a Haar classifier to recognize an object, or can the Kinect itself do that?
The Kinect itself cannot recognize objects; it will give you a dense depth map. You can then use the depth features along with some simple features (in your case, maybe color or gradient features would do the job). You feed those features to a classifier (an SVM or Random Forest, for example) to train the system, and use the trained model for testing on new samples.
Regarding Haar features, I think they could do the job but you would need a sufficiently large database of features. It all depends on what you want to detect. In the case of an apple and a coin, just color would suffice.
Refer to this paper to get an idea of how to perform human pose recognition with the Kinect camera. You just have to pay attention to their depth features and their classifiers. Do not apply their approach directly; your problem is simpler.
Edit: simple gradient orientation histogram
Gradient orientations can give you a coarse idea of the shape of the object (it is not a shape feature strictly speaking, and better shape features exist, but this one is extremely fast to calculate).
Code snippet:
%calculate gradient
[dx,dy] = gradient(double(img));
A = (atan(dy./(dx+eps))*180)/pi; %eps added to avoid division by zero.
A will contain an orientation for each pixel. Segment your original image according to the depth values. For a segment with similar depth values, calculate a color histogram. Extract the pixel orientations corresponding to that region, call it A_r, and calculate a 9-bin histogram over it (you can use more bins; nine bins mean each bin covers 180/9 = 20 degrees). Concatenate the color features and the gradient histogram. Do this for a sufficient number of samples, and then you can give the result to a classifier for training.
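A rough Python/NumPy equivalent of the same per-segment feature, assuming img is a grayscale image array and mask is a boolean array marking one depth segment (both names are hypothetical):
import numpy as np

def orientation_histogram(img, mask, n_bins=9):
    dy, dx = np.gradient(img.astype(np.float64))
    # arctan(dy/dx) gives angles in (-90, 90) degrees; eps avoids division by zero
    angles = np.degrees(np.arctan(dy / (dx + np.finfo(float).eps)))
    a_r = angles[mask]  # orientations of this segment's pixels
    hist, _ = np.histogram(a_r, bins=n_bins, range=(-90.0, 90.0))
    return hist / max(hist.sum(), 1)  # normalize so segment size doesn't matter

# The final descriptor is then np.concatenate([color_histogram, orientation_histogram(img, mask)]).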
Edit: This is a reply to a comment below.
Regarding MaxDepth parameter in opencv_traincascade
The documentation says, "Maximal depth of a weak tree. A decent choice is 1, that is case of stumps". When you perform binary classification, it takes a form of:
if yourFeatureValue >= learntThresh
    class = 1;
else
    class = 0;
end
The above type of classifier, which performs thresholding on a single feature value (a scalar), is called a decision stump. There is only one split between the positive and negative classes (therefore maxDepth is one). For example, it would work in the following scenario. Imagine you have a 1-D feature:
f=[1 2 3 4 -1 -2 -3 -4]
The first 4 are class 1, the rest are class 0. A decision stump gets 100% accuracy on this data by setting the threshold to zero. Now, imagine a more complicated feature space such as:
f=[1 2 3 4 5 6 7 8 9 10 11 12];
The first 4 and the last 4 are class 1, the rest are class 0. Here you cannot get 100% accuracy with a decision stump; you need two thresholds/splits, so you construct a tree with depth 2. The level at depth d can hold up to 2^(d-1) thresholds: 2 at depth 2, 4 at depth 3, 8 at depth 4, and so on. (Here I assume a tree with a single node has height 1.)
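To see the difference concretely, here is the toy example above with scikit-learn decision trees (illustrative only; opencv_traincascade boosts its own weak trees internally):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

f = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]).reshape(-1, 1)
y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1])  # first 4 and last 4 are class 1

stump = DecisionTreeClassifier(max_depth=1).fit(f, y)
tree2 = DecisionTreeClassifier(max_depth=2).fit(f, y)
print('stump accuracy:', stump.score(f, y))    # below 1.0: one threshold is not enough
print('depth-2 accuracy:', tree2.score(f, y))  # 1.0: two thresholds separate the classes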
You may feel that with more levels you can achieve higher accuracy, but then there is the problem of overfitting (and of computation, memory, storage, etc.). Therefore, you have to set a good value for the depth. I usually set it to 3.