I have a large multi-label array with numbers between 0 and 65. I'm using the following code to generate class weights:
class_weights = class_weight.compute_class_weight('balanced', np.unique(labels), labels)
where the labels array is the one containing the numbers between 0 and 65.
I'm using this to fit a model with the class_weight flag. The reason is that I have many examples of "0" and "1" but very few examples of labels greater than 1, and I wanted the model to give more weight to the examples with lower counts. This helped a lot; however, I can now see that the model gives too much weight to the rare examples and somewhat neglects the examples with the highest counts (1 and 0). I'm trying to find a middle ground here and would love some tips on how to proceed.
This is something you can achieve in two ways, provided you have done the weight assignment correctly, that is, giving more weight to the less frequent labels and vice versa, which you presumably have already done.
Reduce the number of examples of the most frequent labels (in your case 0 and 1) to a level closer to the other labels, provided this does not shrink your dataset by too big a margin. This is often not feasible when the other, less frequent labels are significantly rarer, so it is something you will have to judge for your data.
The other, and most plausible, solution would be to either oversample the less frequent labels by creating copies of them, or undersample the most frequent labels.
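For example, oversampling the rare labels could be sketched like this, a minimal sketch using sklearn.utils.resample; the feature array, label array, and target count per class below are placeholders/assumptions you would replace and tune:

```python
import numpy as np
from sklearn.utils import resample

# `features` and `labels` stand in for your data; random placeholders here.
features = np.random.rand(10000, 16)
labels = np.random.randint(0, 66, size=10000)

target = 200   # assumed minimum number of examples per class, to be tuned
X_parts, y_parts = [], []
for cls in np.unique(labels):
    mask = labels == cls
    X_cls, y_cls = features[mask], labels[mask]
    if len(y_cls) < target:
        # Oversample the rare class by drawing copies with replacement.
        X_cls, y_cls = resample(X_cls, y_cls, replace=True,
                                n_samples=target, random_state=0)
    X_parts.append(X_cls)
    y_parts.append(y_cls)

X_balanced = np.concatenate(X_parts)
y_balanced = np.concatenate(y_parts)
```

Undersampling the most frequent labels works the same way, with replace=False and a lower target count. Whichever you pick, compare the results on a held-out set to check that the frequent labels don't get neglected again.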
Related Question
I'm trying to handle the following task with machine learning, but the performance is not good. I'm not very familiar with machine learning and data science, so I don't have much background knowledge. Do you know of similar tasks from the past, for example on Kaggle?
Task
The dataset consists of several queries and a list of contents for each query.
Each content in a query has a label of 0 or 1.
Almost all contents in each query have label 0.
Each query has at most one content with label 1.
I want the model to give the highest output to the content with label 1 in each query.
I don't care about the ordering or the differences in output among contents with label 0; I just want the content with label 1 to be ranked No. 1 in each query.
Of course, if the model gives exactly the same output to all contents, the content with label 1 may end up at No. 1 by tie-breaking, but that is meaningless.
What I did
At first I didn't look at the dataset query by query, and I treated this as a classification task, classifying each content into 0 or 1. The model could often classify a content with label 1 as 1, but sometimes a content with label 0 was classified as 1 with a higher score than the content with label 1 in the same query. Since the order (the content with label 1 coming in at No. 1 in each query) is the most important thing, I'm using learning to rank now.
Problems I'm facing
I've visited many websites describing learning to rank, but I can't find a case like this, and I don't know what to call it; maybe "binary ranking".
I'm currently using the LambdaRank method, which scales the gradient of the loss function (cross entropy), because I expect it to help bring the content with label 1 to the top of the list in each query. I'm using LightGBM or PyTorch. But now I'm facing several problems like the following.
Because almost all contents have label 0, the model can make the loss small by predicting 0 for all of them. Then the gradient of the loss function is almost 0, training does not progress, and all contents end up tied at No. 1 in the ranking.
(In PyTorch) training depends heavily on how it starts. In many cases the model predicts 0 for every content in the first epoch, and then training does not progress, as described above. I'm not sure why, but sometimes the content with label 1 is at the top of the ranking for about 10% of queries after the first epoch, and in that case training does progress.
(Not yet confirmed with PyTorch.) After training, about 80% of the contents with label 1 are No. 1 in their queries. But there are several queries that have no content with label 1, only contents with label 0. I'd like to filter out such queries, so I tried a score cut-off, but it was not effective. So I suspect there is no consistency of predictions across queries. Say there are 2 queries and the predictions are query1[A:0.9, B:0.6, C:0.1] and query2[D:0.7, E:0.2]; does that mean A is more relevant to its query than D is to its own?
Ideas I haven't tried yet
For the problem that training does not progress because of the many contents with label 0:
Use another loss function, such as focal loss.
Use only the gradient of the loss function at the prediction for the content with label 1 when updating the model parameters.
To reduce the number of contents with label 0 placed at No. 1 in the ranking, even when the content with label 1 is also at No. 1:
Create a custom metric that penalizes them.
Create a custom metric that compares the prediction for the content with label 1 against the highest-scoring content in the list other than the content with label 1.
I guess these metrics are not differentiable, but I think I can use them to scale the gradient of the loss function instead of NDCG in the LambdaRank method.
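For reference, a minimal LightGBM LambdaRank setup of the kind described in this question might look like the sketch below; the data shapes, group sizes, and hyperparameters are assumptions, not part of the question.

```python
import numpy as np
import lightgbm as lgb

# Toy data: 100 queries, 20 contents per query, 30 features each (all assumed shapes).
n_queries, per_query, n_features = 100, 20, 30
X = np.random.rand(n_queries * per_query, n_features)
y = np.zeros(n_queries * per_query, dtype=int)
y[::per_query] = 1                       # one relevant content (label 1) per query
group = [per_query] * n_queries          # LightGBM needs the size of each query group

ranker = lgb.LGBMRanker(
    objective='lambdarank',              # LambdaRank gradients, as mentioned above
    metric='ndcg',
    n_estimators=200,
    learning_rate=0.05,
)
ranker.fit(X, y, group=group)

# Rank the contents of the first query; ideally the label-1 content gets the top score.
scores = ranker.predict(X[:per_query])
print(np.argsort(-scores)[:3])
```

This only shows how the pieces fit together; whether it escapes the all-zero plateau on real data is a separate question. With at most one relevant content per query, NDCG@1 (equivalently, the fraction of queries where the label-1 content ranks first) is a natural quantity to monitor.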
I’m making a chess engine using machine learning, and I’m experiencing problems debugging it. I need help figuring out what is wrong with my program, and I would appreciate any help.
I did my research and borrowed ideas from multiple successful projects. The idea is to use reinforcement learning to teach a NN to differentiate between strong and weak positions.
I collected 3 million games with Elo over 2000 and used my own method to label them. After studying hundreds of games, I found that it's safe to assume that in the last 10 turns of any game the balance doesn't change, and the winning side has a strong advantage. So I picked positions from the last 10 turns and used two labels: one for a win by white and zero for a win by black. I didn't include any drawn positions. To avoid bias, I picked equal numbers of positions labeled as wins for each side, and equal numbers of positions for each side to move.
I represented each position by a vector of 773 elements. Every piece on every square of the chess board, together with the castling rights and the side to move, is encoded with ones and zeros. My sequential model has an input layer with 773 neurons and an output layer with a single neuron. I used an MLP with three hidden layers of 1546, 500, and 50 units respectively, each with a dropout rate of 20%. The hidden layers use the non-linear ReLU activation, while the final output layer has a sigmoid activation. I used the binary cross-entropy loss function and the Adam optimizer with all default parameters, except for the learning rate, which I set to 0.0001.
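For concreteness, a minimal Keras sketch of the architecture as described above; the question does not name the framework, so Keras here is an assumption:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the described MLP: 773-dim input, 3 hidden ReLU layers with 20% dropout,
# sigmoid output, binary cross-entropy, Adam with learning rate 0.0001.
model = keras.Sequential([
    keras.Input(shape=(773,)),
    layers.Dense(1546, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(500, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(50, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss='binary_crossentropy',
              metrics=['accuracy'])
```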
I used 3 percent of the positions for validation. During the first 10 epochs, validation accuracy gradually went up from 90% to 92%, just one percentage point behind training accuracy. Further training led to overfitting, with training accuracy going up and validation accuracy going down.
I tested the trained model on multiple positions by hand and got pretty bad results. The model can generally predict which side is winning if that side has more pieces or pawns close to a promotion square. It also gives the side to move a small advantage (0.1). But overall its output doesn't make much sense. In most cases it heavily favors black (by ~0.3) and doesn't properly take the setup into account. For instance, it labels the starting position as ~0.0001, as if black had almost a 100% chance to win. Sometimes an irrelevant transformation of a position results in an unpredictable change of the evaluation. A position with one king and one queen per side is usually evaluated as lost for white (0.32), unless the black king is on a certain square, even though that doesn't really change the balance on the board.
What I did to debug the program:
To make sure I had not made any mistakes, I analyzed, step by step, how each position is recorded. Then I picked a dozen positions from the final numpy array, right before training, converted them back, and analyzed them on a regular chess board.
I used various numbers of positions from the same game (1 and 6) to make sure that using too many similar positions is not the cause of the fast overfitting. Even with one position per game, my database yields a dataset of 3 million positions, which should be sufficient according to some research papers.
To make sure that the positions I use are not too simple, I analyzed them: 1.3 million of them had 36 points' worth of pieces (knights, bishops, rooks, and queens; pawns were not counted), 1.4 million had 19 points, and only 0.3 million had less.
Some things you could try:
Add unit tests and asserts wherever possible. E.g. if you know that some value is never supposed to become negative, add an assert to check that this condition really holds (see the sketch after this list).
Print shapes of all tensors to check that you have really created the architecture you intended.
Check if your model outperforms some simple baseline model.
You say your model overfits, so maybe simplify it / add regularization?
Check how your model performs on the simplest positions. E.g. can it recognize a checkmate?
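For instance, a couple of such checks on the position encoding might look like the sketch below; the encode_position helper and the use of the python-chess library are assumptions made purely for illustration, and the question's actual encoding scheme may differ.

```python
import chess
import numpy as np

def encode_position(board: chess.Board) -> np.ndarray:
    """Hypothetical 773-element encoding: 12 piece planes x 64 squares,
    4 castling rights, 1 side-to-move bit (replace with your own encoder)."""
    vec = np.zeros(773, dtype=np.float32)
    for square, piece in board.piece_map().items():
        plane = (piece.piece_type - 1) + (0 if piece.color == chess.WHITE else 6)
        vec[plane * 64 + square] = 1.0
    vec[768] = board.has_kingside_castling_rights(chess.WHITE)
    vec[769] = board.has_queenside_castling_rights(chess.WHITE)
    vec[770] = board.has_kingside_castling_rights(chess.BLACK)
    vec[771] = board.has_queenside_castling_rights(chess.BLACK)
    vec[772] = board.turn == chess.WHITE
    return vec

# Sanity checks of the kind suggested above, on the starting position.
x = encode_position(chess.Board())
assert x.shape == (773,), x.shape          # shape really matches the input layer
assert np.isin(x, (0.0, 1.0)).all()        # encoding is strictly binary
assert x[:768].sum() == 32                 # starting position has 32 pieces
```

A checkmate-recognition check, as suggested above, works the same way: encode a known mating position and verify that the model's output is close to the correct extreme.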
I am solving the TSP using simulated annealing, and I have a question:
In https://en.wikipedia.org/wiki/Simulated_annealing, in the "Efficient candidate generation" section, it says:
the travelling salesman problem above, for example, swapping two consecutive cities in a low-energy tour is expected to have a modest effect on its energy (length); whereas swapping two arbitrary cities is far more likely to increase its length than to decrease it. Thus, the consecutive-swap neighbour generator is expected to perform better than the arbitrary-swap one.
So I generated the first city randomly and picked the second as the one consecutive to the first, but the solution got worse.
Am I doing something wrong?
Initially you need to explore the whole solution surface. You can do this in two ways: either by generating effectively random candidates, or by using a high temperature. If you don't use the first method, you must use the second, which means ramping the temperature up until essentially all moves are accepted, and then reducing it as slowly as you can. A "swap adjacent cities" move will then produce a reasonable result.
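A minimal sketch of that recipe, with an adjacent-swap neighbour and a slow geometric cooling schedule; the starting temperature, cooling rate, and toy instance below are all assumptions to tune:

```python
import math
import random

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def anneal(dist, t_start=100.0, t_end=1e-3, cooling=0.999):
    """Simulated annealing for TSP with an adjacent-swap neighbour generator.
    Starts hot enough that almost every move is accepted, then cools slowly."""
    n = len(dist)
    tour = list(range(n))
    random.shuffle(tour)
    cost = tour_length(tour, dist)
    t = t_start
    while t > t_end:
        i = random.randrange(n)                  # swap city i with its successor
        j = (i + 1) % n
        tour[i], tour[j] = tour[j], tour[i]
        new_cost = tour_length(tour, dist)
        if new_cost <= cost or random.random() < math.exp((cost - new_cost) / t):
            cost = new_cost                      # accept the move
        else:
            tour[i], tour[j] = tour[j], tour[i]  # reject: undo the swap
        t *= cooling                             # geometric cooling schedule
    return tour, cost

# Tiny usage example on random points in the unit square.
pts = [(random.random(), random.random()) for _ in range(30)]
dist = [[math.hypot(a[0] - b[0], a[1] - b[1]) for b in pts] for a in pts]
print(anneal(dist)[1])
```

If the result is still poor, the usual levers are exactly the ones mentioned above: a higher starting temperature and a slower cooling rate.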
Given a dataset with 23 points spread over 6 dimensions, the first part of this exercise asks us to do the following (I am stuck on the second half of it):
Compute the first step of the CLIQUE algorithm (detection of all dense cells). Use three equal intervals per dimension in the domain 0..100, and consider a cell as dense if it contains at least five objects.
Now this is trivial and simply a matter of counting. The next part asks the following though:
Identify a way to compute the above CLIQUE result by only using the functions of Weka provided in the tabs Preprocess, Classify, Cluster, or Associate.
Hint: just two tabs are needed.
I've been trying this for over an hour now, but I can't seem to get anywhere near a solution. If anyone has a hint, or maybe a useful tutorial that gives me a little more insight into Weka, it would be very much appreciated!
I am assuming you have 23 instances (rows) and 6 attributes (dimensions).
Use three equal intervals per dimension
Use the Preprocess tab to discretize your data into 3 equal bins; the filter command line is shown below, using 3 bins for the intervals. You may also try switching useEqualFrequency between false and true; I think true may give better results.
weka.filters.unsupervised.attribute.Discretize -B 3 -M -1.0 -R first-last
After that, cluster your data in the Cluster tab; this will show you which instances are near each other. Since you would like to find dense cells, I think SOM may be appropriate.
a cell as dense if it contains at least five objects.
You have 23 instances, so try 2x2=4 cluster centers first, then 2x3=6, 2x4=8, and 3x3=9. If your data points lie close together, some of the cluster centers should always hold at least 5 instances, no matter how many cluster centers you choose.
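For comparison, the "trivial counting" part of the exercise (the one-dimensional dense cells, with three equal-width intervals over 0..100 and a density threshold of five) could be sketched outside Weka like this; the data array below is a random placeholder:

```python
import numpy as np

# Placeholder dataset: 23 points in 6 dimensions, values in [0, 100].
rng = np.random.default_rng(0)
data = rng.uniform(0, 100, size=(23, 6))

bins = [0, 100 / 3, 200 / 3, 100]   # three equal intervals per dimension
threshold = 5                       # a cell is dense if it holds >= 5 objects

for d in range(data.shape[1]):
    counts, _ = np.histogram(data[:, d], bins=bins)
    dense = [i for i, c in enumerate(counts) if c >= threshold]
    print(f"dimension {d}: counts per cell {counts.tolist()}, dense cells {dense}")
```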
Trying to write some code that deals with this task:
As a starting point, I have around 20 "profiles" (imagine a landscape profile), i.e. one-dimensional arrays of around 1000 real values.
Each profile has a real-valued desired outcome, the "effective height".
The effective height is some sort of average, but the height, width, and position of peaks play a particular role.
My aim is to generalize from the input data so as to calculate the effective height for further profiles.
Is there a machine learning algorithm or principle that could help?
Principle 1: Extract the most important features, instead of feeding it everything
As you said, "The effective height is some sort of average but height, width and position of peaks play a particular role." So that you have a strong priori assumption that these measures are the most important for learning. If I were you, I would calculate these measures at first, and use them as the input for learning, instead of the raw data.
Principle 2: While choosing a learning algorithm, the first thing to care about would be the linear separability
Suppose the height is a function of those measures; then you have to think about to what extent the function is linear. For example, if the function is almost linear, a very simple perceptron would be perfect. If it's far from linear, you might want to pick a multi-layer neural network. If it's very, very far from linear, please go back to Principle 1 and check whether you are extracting the right features.
Principle 3: More data helps
As you said, you have around 20 "profiles" for training. Generally speaking, that's not enough. Almost all machine learning algorithms were designed with reasonably large datasets in mind. Even the ones that claim to learn well from small samples usually don't mean samples as small as 20. Get more data!
Maybe multivariate linear regression suffices?
I would probably use a combination of what you said about which features play the most important role, and then train a regression on those. Basically, you need at least one coefficient per feature, and you need substantially more data points than coefficients. So I would pick something like the heights and widths of the two biggest peaks. You've now reduced every profile to just 4 numbers. Now do this trick: divide the data into 5 groups of 4. Pick the first 4 groups, reduce all those profiles to 4 numbers, and use the desired outcomes to fit a regression. Once you have trained the regression, try it on the remaining group of 4 profiles and see how well it works. Repeat this procedure 5 times, each time leaving out a different group. This is called cross-validation, and it's very handy.
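A minimal scikit-learn sketch of that regression with 5-fold cross-validation; the feature matrix below is a random placeholder for the 4 peak features described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# X: one row per profile with the 4 peak features (heights and widths of the
# two biggest peaks); y: the effective heights. Random placeholders here.
rng = np.random.default_rng(0)
X = rng.random((20, 4))
y = rng.random(20)

model = LinearRegression()
# 5-fold cross-validation: train on 16 profiles, test on the held-out 4, repeated 5 times.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print(-scores.mean())
```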
Obviously getting more data would help.