Question
I'm trying to solve the following task with machine learning, but the performance is not good. I'm not very familiar with machine learning and data science, so I don't have much background knowledge. Do you know of any similar tasks from the past, for example on Kaggle?
Task
The dataset consists of several queries, each with a list of contents.
Each content in a query has a label of 0 or 1.
Almost all contents in each query have label 0.
Each query has at most one content with label 1 (one or zero).
I want the model to give its highest output to the content with label 1 in each query.
I don't care about the order of, or the differences between, the model outputs for the label-0 contents; I just want the label-1 content to be ranked first in each query.
Of course, when the model gives exactly the same output to all contents, the label-1 content can end up at rank 1 in the query by tie-breaking, but that is meaningless.
What I did
At first I ignored the query structure and treated this as a binary classification task (0 or 1). The model could sometimes classify a label-1 content as 1, but there were also label-0 contents classified as 1 with a higher score than the label-1 content in the same query. Since the order (the label-1 content coming first in each query) is what really matters, I'm now using learning to rank.
Problems I'm facing
I've visited many websites describing learning to rank, but I can't find a case like this, and I don't even know what to call it; "binary ranking", maybe.
I'm using the LambdaRank method, which scales the gradient of the loss function (cross entropy), because I expect it to help bring the label-1 content to the top of the list in each query. I'm implementing it with LightGBM or PyTorch (a sketch of the LightGBM setup follows the problem list below). But now I'm facing several problems:
Because almost all contents have label 0, the model can make the loss small by predicting 0 for everything. The gradient of the loss is then almost 0, so training does not progress, and every content ends up tied at rank 1.
(In PyTorch) training depends heavily on how it starts. In many cases the model predicts 0 for every content in the first epoch, and then training does not progress, as described above. I'm not sure why, but sometimes the label-1 content happens to be at the top of the ranking for about 10% of the queries, and in that case training does progress.
(Not yet confirmed with PyTorch.) After training, about 80% of the label-1 contents are ranked first in their queries. However, several queries contain no label-1 content at all, only label-0 contents. I want to filter out such queries, so I tried a score cutoff, but it was not effective, so I suspect the predictions are not consistent across queries. Say there are 2 queries and the predictions are query1[A:0.9, B:0.6, C:0.1] and query2[D:0.7, E:0.2]; is A then more relevant to its query than D is to its query?
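For reference, here is a minimal sketch of how I wire up the LightGBM side (the data below is a synthetic stand-in for my real queries, and the parameter values are only placeholders):

import numpy as np
import lightgbm as lgb

# toy data: 200 queries with 20 contents each, exactly one positive per query
rng = np.random.default_rng(0)
n_queries, per_query = 200, 20
X = rng.normal(size=(n_queries * per_query, 8))
y = np.zeros(n_queries * per_query, dtype=int)
y[::per_query] = 1                      # the first content of each query is the positive
group = [per_query] * n_queries         # number of contents in each query

ranker = lgb.LGBMRanker(
    objective="lambdarank",             # LambdaRank-style gradients
    n_estimators=200,
    learning_rate=0.05,
)
ranker.fit(
    X, y,
    group=group,                        # tells LightGBM where each query starts and ends
    eval_set=[(X, y)],
    eval_group=[group],
    eval_at=[1],                        # NDCG@1: did the positive land at rank 1?
)
scores = ranker.predict(X)              # scores are only comparable within a query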
Ideas I haven't tried yet
To deal with training not progressing because of the many label-0 contents:
Use a different loss function, such as focal loss.
Use only the gradient of the loss at the prediction for the label-1 content when updating the model parameters.
To reduce the number of label-0 contents at rank 1 even when the label-1 content is also at rank 1:
Create a custom metric that penalizes them.
Create a custom metric that compares the prediction for the label-1 content with the highest-scoring label-0 content in the same query (see the sketch after this list).
I guess these metrics are not differentiable, but I think I can use them to scale the gradient of the loss function instead of NDCG in the LambdaRank method.
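As a concrete version of the custom-metric idea above, here is a rough sketch of a per-query "positive at rank 1" metric (the names are my own placeholders; it is not differentiable, so it would only be used for monitoring or for scaling gradients as described):

import numpy as np

def hit_at_1(preds, labels, group_sizes):
    # Fraction of queries whose label-1 content gets the single highest score.
    # Ties count as misses, so predicting the same score for everything
    # does not look like a success.
    hits, total, start = 0, 0, 0
    for size in group_sizes:
        p = preds[start:start + size]
        l = labels[start:start + size]
        start += size
        if l.sum() == 0:                # query without a positive: skip it
            continue
        total += 1
        pos_score = p[l == 1].max()
        best_neg = p[l == 0].max() if (l == 0).any() else -np.inf
        if pos_score > best_neg:        # strict >, so ties do not count
            hits += 1
    return hits / max(total, 1)

# example: two queries of sizes 3 and 2
preds = np.array([0.9, 0.6, 0.1, 0.2, 0.7])
labels = np.array([1, 0, 0, 0, 1])
print(hit_at_1(preds, labels, [3, 2]))  # 1.0 - the positive tops both queries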
Related
I have a large multi-label array with numbers between 0 and 65. I'm using the following code to generate class weights:
class_weights = class_weight.compute_class_weight('balanced',np.unique(labels),labels)
where the labels array is the array containing the numbers between 0 and 65.
I'm using this to fit a model with the class_weight argument. The reason is that I have many examples of "0" and "1" but very few examples of labels greater than 1, and I wanted the model to give more weight to the examples with lower counts. This helped a lot; however, I can now see that the model gives too much weight to the rare examples and somewhat neglects the examples with the highest counts (0 and 1). I'm trying to find a middle ground and would love some tips on how to proceed.
This is something you can achieve in two ways, provided you have done the weight assignment correctly, i.e. given more weight to the less frequent labels and vice versa, which you presumably have already done.
Group the highly frequent labels (in your case 0 and 1) into a single label together with other labels, provided this does not degrade your dataset by too big a margin. However, this is often not feasible when the other, less frequent labels are very rare; it is something you will have to decide on.
The other, and most plausible, solution would be to either oversample the less frequent labels by creating copies of them, or undersample the most frequent labels, as in the sketch below.
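A rough sketch of the oversampling option, assuming scikit-learn, with X and labels as placeholders for your own arrays:

import numpy as np
from sklearn.utils import resample

# placeholders for your data: features X and integer labels 0..65
X = np.random.rand(1000, 10)
labels = np.random.choice([0, 0, 0, 1, 1, 2, 3], size=1000)

target = np.bincount(labels).max()      # size of the largest class
X_parts, y_parts = [], []
for cls in np.unique(labels):
    X_cls = X[labels == cls]
    if len(X_cls) < target:
        # duplicate rare-class rows (with replacement) up to the target size
        X_cls = resample(X_cls, replace=True, n_samples=target, random_state=0)
    X_parts.append(X_cls)
    y_parts.append(np.full(len(X_cls), cls))

X_balanced = np.vstack(X_parts)
y_balanced = np.concatenate(y_parts)

For the middle ground you asked about, you can also resample the rare classes only part of the way toward the majority count (some fraction of target) instead of matching it exactly.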
In each tiny step of the doc2vec training process, it takes a word and its neighbors within a certain distance (called the window size). The neighbors are summed, averaged, concatenated, and so on.
My question is: what if the window exceeds the boundary of a document, like this? Then how are the neighbors summed, averaged, or concatenated? Or are they simply discarded?
I am doing some NLP work and most documents in my dataset are quite short. I'd appreciate any ideas.
The pure PV-DBOW mode (dm=0), which trains quickly and often performs very well (especially on short documents), makes use of no sliding window at all. Each per-document vector is just trained to be good at directly predicting the document's words - neighboring words don't make any difference.
Only when you either switch to PV-DM mode (dm=1), or add interleaved skip-gram word-vector training (dm=0, dbow_words=1) is the window relevant. And then, the window is handled the same as in Word2Vec training: if it would go past either end of the text, it's just truncated to not go over the end, perhaps leaving the effective window lop-sided.
So if you have a text "A B C D E", and a window of 2, when predicting the 1st word 'A', only the 'B' and 'C' to the right contribute (because there are zero words to the left). When predicting the 2nd word 'B', the 'A' to the left and the 'C' and 'D' to the right contribute. And so forth.
An added wrinkle is that to effect a stronger weighting of nearby words in a computationally-efficient manner, the actual window used for any one target prediction is actually of a random size from 1 up to the configured window value. So for window=2, half the time it's really only using a window of 1 on each side, and the other half the time using the full window of 2. (For window=5, it's using an effective value of 1 for 20% of the predictions, 2 for 20% of the predictions, 3 for 20% of the predictions, 4 for 20% of the predictions, and 5 for 20% of the predictions.) This effectively gives nearer words more influence, without the full computational cost of including all full-window words every time or any extra partial-weighting calculations.
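Not gensim's actual code, but a small sketch of the truncated, randomly shrunk window described above:

import random

def effective_window(words, center, window):
    # Pick the context words for one target position: the window size is
    # drawn uniformly from 1..window, then clipped at the text boundaries.
    reduced = random.randint(1, window)
    start = max(0, center - reduced)
    end = min(len(words), center + reduced + 1)
    return [words[i] for i in range(start, end) if i != center]

words = ["A", "B", "C", "D", "E"]
print(effective_window(words, 0, 2))   # e.g. ['B', 'C'] or just ['B'] - nothing to the left of 'A'
print(effective_window(words, 1, 2))   # e.g. ['A', 'C', 'D'] - one word left, up to two right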
I am working on a case where the dimension of the labels increases over time. For example, at time t the output is a 10 by 1 vector; later, at time t+5, the output becomes a 15 by 1 vector.
For the same input, the first 10 entries of the output at time t+5 are the same as those at time t, but the remaining 5 are new. The output dimension grows because every time we receive a new training sample, the label dimension of all previous training samples increases by 1, so the expected output of the neural network changes correspondingly.
The trivial solution is to retrain the whole model so that it can handle the desired output dimension. I know it might sound strange, but I am wondering whether there is a smart way to build a dynamic network that can be trained incrementally as the labels change incrementally.
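One direction I'm considering (just a sketch; the helper name is made up) is to keep the rest of the network fixed and grow only the final layer, copying the rows that were already trained:

import torch
import torch.nn as nn

def grow_output_layer(old_head: nn.Linear, extra_outputs: int) -> nn.Linear:
    # Build a wider final layer, reuse the weights learned for the existing
    # outputs, and leave only the new rows randomly initialised.
    new_head = nn.Linear(old_head.in_features, old_head.out_features + extra_outputs)
    with torch.no_grad():
        new_head.weight[:old_head.out_features] = old_head.weight
        new_head.bias[:old_head.out_features] = old_head.bias
    return new_head

# hypothetical usage: a shared trunk plus a growable head
trunk = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
head = nn.Linear(64, 10)            # 10-dimensional output at time t
head = grow_output_layer(head, 5)   # 15-dimensional output at time t+5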
Suppose that one partitions the data into training/validation/test sets for later application of some classification algorithm, and it happens that the training set does not contain all class labels present in the complete dataset - say some records with label "x" appear only in the validation set and not in the training set.
Is this a valid partitioning? It can have many consequences: the confusion matrix would no longer be square, and any error evaluated during the algorithm would be affected by labels unseen in the training set.
The second question is: is it common for partitioning algorithms to take care of this issue and split the data so that the training set contains all existing labels?
This is what stratified sampling is supposed to solve.
https://en.wikipedia.org/wiki/Stratified_sampling
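For example, with scikit-learn (X and y below are placeholders for your own data):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)
y = np.random.randint(0, 3, size=100)

# stratify=y keeps the class proportions (and therefore every label) the same
# in both parts, as long as each class has at least a couple of samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)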
Given a dataset with 23 points spread over 6 dimensions, the first part of this exercise asks us to do the following, and I am stuck on the second half of it:
Compute the first step of the CLIQUE algorithm (detection of all dense cells). Use three equal intervals per dimension in the domain 0..100, and consider a cell as dense if it contains at least five objects.
Now this is trivial and simply a matter of counting. The next part asks the following though:
Identify a way to compute the above CLIQUE result by only using the functions of Weka provided in the tabs Preprocess, Classify, Cluster, or Associate.
Hint: just two tabs are needed.
I've been trying this for over an hour now, but I can't seem to get anywhere near a solution. If anyone has a hint, or maybe a useful tutorial that gives me a little more insight into Weka, it would be very much appreciated!
I am assuming you have 23 instances (rows) and 6 attributes (dimensions)
Use three equal intervals per dimension
Use the Preprocess tab to discretize your data into 3 equal bins (see the command line below); the 3 bins correspond to the 3 intervals. You may also try changing useEqualFrequency between false and true and compare; I think true may give better results.
weka.filters.unsupervised.attribute.Discretize -B 3 -M -1.0 -R first-last
After that, cluster your data using the Cluster tab. This will show you which instances are near each other. Since you would like to find dense cells, I think SOM may be appropriate.
a cell as dense if it contains at least five objects.
You have 23 instances, so try 2x2=4 cluster centers, then 2x3=6, 2x4=8, and 3x3=9. If your data points are close together, some of the cluster centers should always hold at least 5 instances no matter how many cluster centers you choose.
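If you want to double-check the manual counting outside Weka, here is a small sketch (the dataset below is a made-up stand-in for the 23x6 one from the exercise):

import numpy as np

data = np.random.uniform(0, 100, size=(23, 6))   # placeholder for the real 23x6 data
edges = [100 / 3, 200 / 3]                       # three equal intervals over 0..100
min_points = 5                                   # density threshold from the exercise

# CLIQUE step 1 on one-dimensional cells (the first level of the bottom-up search):
# count how many points fall into each interval of each dimension and keep the dense ones.
for dim in range(data.shape[1]):
    cell_ids = np.digitize(data[:, dim], edges)  # 0, 1 or 2 for the three intervals
    counts = np.bincount(cell_ids, minlength=3)
    for cell, count in enumerate(counts):
        if count >= min_points:
            print(f"dimension {dim}, interval {cell}: dense ({count} points)")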