How to compute histograms using weka - histogram

Given a dataset with 23 points spread out over 6 dimensions, in the first part of this exercise we should do the following, and I am stuck on the second half of this:
Compute the first step of the CLIQUE algorithm (detection of all dense cells). Use
three equal intervals per dimension in the domain 0..100,and consider a cell as dense if it contains at least five objects.
Now this is trivial and simply a matter of counting. The next part asks the following though:
Identify a way to compute the above CLIQUE result by only using the functions of
Weka provided in the tabs of Preprocess, Classify , Cluster , or Associate .
Hint : Just two tabs are needed.
I've been trying this for over an hour now, but I can't seem to get anywhere near a solution here. If anyone has a hint, or maybe a useful tutorial which gives me a little more insight into weka it would be very much appreciated!

I am assuming you have 23 instances (rows) and 6 attributes (dimensions)
Use three equal intervals per dimension
Use pre-process tab to discretize your data to 3 equal bins. See image or command line. You use 3 bins for intervals. You may choose to change useEqualFrequency to false and true and try again. I think true may give better results.
weka.filters.unsupervised.attribute.Discretize -B 3 -M -1.0 -R first-last
After that cluster your data. This will give show you near instances. Since you would like to find dense cells. I think SOM may be appropriate.
a cell as dense if it contains at least five objects.
You have 23 instances. Therefore try for 2x2=4 cluster centers, then go for 2x3=6,2x4=8 and 3x3=9. If your data points are near. Some of the cluster centers should always hold 5 instances no matter how many cluster centers your choose.

Related

What is wrong with my approach of using MLP to make a chess engine?

I’m making a chess engine using machine learning, and I’m experiencing problems debugging it. I need help figuring out what is wrong with my program, and I would appreciate any help.
I made my research and borrowed ideas from multiple successful projects. The idea is to use reinforcement learning to teach NN to differentiate between strong and weak positions.
I collected 3 million games with Elo over 2000 and used my own method to label them. After researching hundreds of games, I found out, that it’s safe to assume that in the last 10 turns of any game, the balance doesn’t change, and the winning side has a strong advantage. So I picked positions from the last 10 turns and made two labels: one for a win for white and zero for black. I didn’t include any draw positions. To avoid bias, I have picked even numbers of positions labeled with wins for both sides and even number of positions for both sides with the next turn.
Each position I represented by a vector with the length of 773 elements. Every piece on every square of a chess board, together with castling rights and a next turn, I coded with ones and zeros. My sequential model has an input layer with 773 neurons and an output layer with one single neuron. I have used a three hidden layer deep MLP with 1546, 500 and 50 hidden units for layers 1, 2, and 3 respectively with dropout regularization value of 20% on each. Hidden layers are connected with the non- linear activation function ReLU, while the final output layer has a sigmoid output. I used binary crossentropy loss function and the Adam algorithm with all default parameters, except for the learning rate, which I set to 0.0001.
I used 3 percent of the positions for validation. During the first 10 epochs, validation accuracy gradually went up from 90 to 92%, just one percent behind training accuracy. Further training led to overfitting, with training accuracy going up, and validation accuracy going down.
I tested the trained model on multiple positions by hand, and got pretty bad results. Overall the model can predict which side is winning, if that side has more pieces or pawns close to a conversion square. Also it gives the side with a next turn a small advantage (0.1). But overall it doesn’t make much sense. In most cases it heavily favors black (by ~0.3) and doesn’t properly take into account the setup. For instance, it labels the starting position as ~0.0001, as if the black side has almost 100% chance to win. Sometimes irrelevant transformation of a position results in unpredictable change of the evaluation. One king and one queen from each side usually is viewed as lost position for white (0.32), unless black king is on certain square, even though it doesn’t really change the balance on the chessboard.
What I did to debug the program:
To make sure I have not made any mistakes, I analyzed, how each position is being recorded, step by step. Then I picked a dozen of positions from the final numpy array, right before training, and converted it back to analyze them on a regular chess board.
I used various numbers of positions from the same game (1 and 6) to make sure, that using too many similar positions is not the cause for the fast overfitting. By the way, even one position for each game in my database resulted in 3 million data set, which should be sufficient according to some research papers.
To make sure that the positions I use are not too simple, I analyzed them. 1.3 million of them had 36 points in pieces (knights, bishops, rooks, and queens; pawns were not included in the count), 1.4 million - 19 points, and only 0.3 million - had less.
Some things you could try:
Add unit tests and asserts wherever possible. E.g. if you know that some value is never supposed to get negative, add an assert to check that this condition really holds.
Print shapes of all tensors to check that you have really created the architecture you intended.
Check if your model outperforms some simple baseline model.
You say your model overfits, so maybe simplify it / add regularization?
Check how your model performs on the simplest positions. E.g. can it recognize a checkmate?

Cloud labels affecting % testing accuracy?

I have 96 features and the labels are represented by 1 and -1 for inputting to a deep learning model.
1- PCA
Here the 3 axis represent the 3 first principal components. The blue cloud represents the labels 1 and the red cloud represents the labels -1.
Even if we can identify two different clouds visually, they are stick together. I think we can face a problem during the training phase because of that.
2- t-SNE
For the same features and labels with t-SNE, we can still distinguish two clouds, but again they are stick together.
Questions :
1- Does the fact that the two clouds of dots are stick together can affect the % accuracy during the training and testing phase?
2- When we remove the red and blue color, we have somehow only one big cloud. Is there a way to work around the problem the two clouds ''stuck'' together?
What you call sticking together, means that in this space, your data isn't linearly separable. It doesn't seem to be nonlinearly separable either. I would expect with this these components, that you get poor accuracy for sure.
The way to work around the problem is more or different data. You have some options.
1) What about including more principal components? Maybe, 4, 5, 10 components would solve your problem. That might not work depending on your dataset, but it's the most obvious thing to try first.
2) You could try alternative matrix decomposition techniques. PCA isn't the only one. There's NMF, kernel PCA, LSA, and many others. Which one works best for you will fundamentally be determined by the distribution of your data.
3) Use any other type of feature selection. Frankly, 96 isn't that many, to begin with. You intend on doing deep learning? Wouldn't you normally put all 96 features into a deep learning model? There any many other ways to do feature selection besides matrix decomposition if you need to.
Good luck.

k-medoids How are new centroids picked?

My understanding of K-medoids is that centroids are picked randomly from existing points. Clusters are calculated by dividing remaining points to the nearest centroid. Error is calculated (absolute distance).
a) How are new centroids picked? From examples seams that they are picked randomly? And error is calculated again to see if those new centroids are better or worse.
b) How do you know that you need to stop picking new centroids?
It's worth to read the wikipedia page of the k-medoid algorithm. You are right about that the k medoid from the n data points selected randomly at the first step.
The new medoids are picked by swapping every medoid m and every non-medoid o in a loop and calculating the distance again. If the cost increased you undo the swap.
The algorithm stops if there is no swap for a full iteration.
The process for choosing the initial medoids is fairly complicated.. many people seem to just use random initial centers instead.
After this k medoids always considers every possible change of replacing one of the medoids with one non-medoid. The best such change is then applied, if it improves the result. If no further improvements are possible, the algorithm stops.
Don't rely on vague descriptions. Read the original publications.
Before answering a brief about k-medoids would be needed which i have stated in the first two steps and the last two would answer your questions.
1) The first step of k-medoids is that k-centroids/medoids are randomly picked from your dataset. Suppose your dataset contains 'n' points so these k- medoids would be chosen from these 'n' points. Now you can choose them randomly or you could use approaches like smart initialization that is used in k-means++.
2) The second step is the assignment step wherein you take each point in your dataset and find its distance from these k-medoids and find the one that is minimum and add this datapoint to set S_j corresponding to C_j centroid (as we have k-centroids C_1,C_2,....,C_k).
3) The third step of the algorithm is updation step.This will answer your question regarding how new centroids are picked after they have been initialized. I will explain updation step with an example to make it more clear.
Suppose you have ten points in your dataset which are
(x_1,x_2,x_3,x_4,x_5,x_6,x_7,x_8,x_9,x_10). Now suppose our problem is 2-cluster one so we firstly choose 2-centroids/medoids randomly from these ten points and lets say those 2-medoids are (x_2,x_5). The assignment step will remain same. Now in updation, you will choose those points which are not medoids (points apart from x_2,x_5) and again repeat the assigment and update step to find the loss which is the square of the distance between x_i's from medoids. Now you will compare the loss found using medoid x_2 and the loss found by non-medoid point. If the loss is reduced then you will swap x_2 point with any non-medoid point that has reduced the loss.If the loss is not reduced then you will keep x_2 as your medoid and won't swap.
So, there can be lot of swaps in the updation step which also makes this algorithm computationally high.
4) The last step will answer your second question i.e. when should one stop picking new centroids. When you compare the loss of medoid/centroid point with a loss computed by non-medoid, If the difference is very negligible, the you can stop and keep the medoid point as a centroid only.But if the loss is quite significant then you will have to perform the swapping until the loss reduces.
I Hope that would answer your questions.

Algorithm for variability analysis

I work with a lot of histograms. In particular, these histograms are of basecalls along segments on the human genome.
Each point along the x-axis is one of the four nitrogenous bases(A,C,T,G) that compose DNA and the y-axis represents how many times a base was able to be "called" (or recognized by a sequencer machine, so as to sequence the genome, which is simply determining the identity of each base along the genome).
Many of these histograms display roughly linear dropoffs (when the machines aren't able to get sufficient read depth) that fall to 0 or (almost-0) from plateau-like regions. When the score drops to zero, it means the sequencer isn't able to determine the identity of the base. If you've seen the double helix before, it means the sequencer can't figure out the identify of one half of a rung of the helix. Certain regions of the genome are more difficult to characterize than others. Bases (or x data points) with high numbers of basecalls, on the order of >=100, are able to be definitively identified. For example, if there were a total of 250 calls for one base, and we had 248 T's called, 1 G called, and 1 A called, we would call that a T. Regions with 0 basecalls are of concern because then we've got to infer from neighboring regions what the identity of the low-read region could be. Is there a straightforward algorithm for assigning these plots a score that reflects this tendency? See box.net/shared/nbygq2x03u for an example histo.
You could just use the count of base numbers where read depth was 0... The slope of that line could also be a useful indicator (steep negative slope = drop from plateau).

How can I recognize slightly modified images?

I have a very large database of jpeg images, about 2 million. I would like to do a fuzzy search for duplicates among those images. Duplicate images are two images that have many (around half) of their pixels with identical values and the rest are off by about +/- 3 in their R/G/B values. The images are identical to the naked eye. It's the kind of difference you'd get from re-compressing a jpeg.
I already have a foolproof way to detect if two images are identical: I sum the delta-brightness over all the pixels and compare to a threshold. This method has proven 100% accurate but doing 1 photo against 2 million is incredibly slow (hours per photo).
I would like to fingerprint the images in a way that I could just compare the fingerprints in a hash table. Even if I can reliably whittle down the number of images that I need to compare to just 100, I would be in great shape to compare 1 to 100. What would be a good algorithm for this?
Have a look at O. Chum, J. Philbin, and A. Zisserman, Near duplicate image detection: min-hash and tf-idf weighting, in Proceedings of the British Machine Vision Conference, 2008. They solve the problem you have and demonstrate the results for 146k images. However, I have no first-hand experience with their approach.
Naive idea: create a small thumbnail (50x50 pixels) to find "probably identical" images, then increase thumbnail size to discard more images.
Building on the idea of minHash...
My idea is to make 100 look-up tables using all the images currently in the database. The look-up tables are mapping from the brightness of a particular pixel to a list of images that have that same brightness in that same pixel. To search for an image just input it into the hash tables, get 100 lists, and score a point for each image when it shows up in a list. Each image will have a score from 0 to 100. The image with the most points wins.
There are many issues with how to do this within reasonable memory constraints and how to do it quickly. Proper data structures are needed for storage on disk. Tweaking of the hashing value, number of tables, etc, is possible, too. If more information is needed, I can expand on this.
My results have been very good. I'm able to index one million images in about 24 hours on one computer and I can lookup 20 images per second. Accuracy is astounding as far as I can tell.
I don't think this problem can be solved by hashing. Here's the difficulty: suppose you have a red pixel, and you want 3 and 5 to hash to the same value. Well, then you also want 5 and 7 to hash to the same value, and 7 and 9, and so on... you can construct a chain that says you want all pixels to hash to the same value.
Here's what I would try instead:
Build a huge B-tree, with 32-way fanout at each node, containing all of the images.
All images in the tree are the same size, or they're not duplicates.
Give each colored pixel a unique number starting at zero. Upper left might be numbered 0, 1, 2 for the R, G, B components, or you might be better off with a random permutation, because you're going to compare images in order of that numbering.
An internal node at depth n discriminates 32 ways on the value of the pixel n divided by 8 (this gets out some of the noise in nearby pixels.
A leaf node contains some small number of images, let's say 10 to 100. Or maybe the number of images is an increasing function of depth, so that if you have 500 duplicates of one image, after a certain depth you stop trying to distinguish them.
One all two million nodes are inserted in the tree, two images are duplicate only if they're at the same node. Right? Wrong! If the pixel value in two images are 127 and 128, one goes into outedge 15 and the other goes into outedge 16. So actually when you discriminate on a pixel, you may insert that image into one or two children:
For brightness B, insert at B/8, (B-3)/8, and (B+3)/8. Sometimes all 3 will be equal, and always 2 of 3 will be equal. But with probability 3/8, you double the number of outedges on which the image appears. Depending on how deep things go you could have lots of extra nodes.
Someone else will have to do the math and see if you have to divide by something larger than 8 to keep images from being duplicated too much. The good news is that even if the true fanout is only around 4 instead of 32, you only need a tree of depth 10. Four duplications in 10 takes you up to 32 million images at the leaves. I hope you have plenty of RAM at your disposal! If not, you can put the tree in the filesystem.
Let me know how this goes!
Also good about hash from thumbnails: scaled duplicates are recognized (with little modification)

Resources