External linkage - what to do when there is a tie - machine-learning

I am considering implementing a complete linkage clustering algorithm from scratch for study purposes. I've seen that there is a big difference compared to single linkage:
Unlike single linkage, the complete linkage method can be strongly affected by ties (where 2 pairs of groups/clusters have the same distance value in the distance matrix).
I'd like to see an example of a distance matrix where this occurs and understand why it happens.

Consider the 1-dimensional data set
1 2 3 4 5 6 7 8 9 10
Depending on how you do the first merges, you can get pretty good or pretty bad results. For example, first merge 2-3, 5-6 and 8-9. Then 2-3-4 and 7-8-9. Compare this to the "obvious" result that most humans would produce.
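To make the tie dependence concrete, here is a minimal sketch in plain Python (not a reference implementation). The function name complete_linkage and the two tie-breaking rules ("take the first tied pair" and "take the last tied pair") are assumptions made for illustration; they are not the exact merge order described above, but they show the same effect on the same data.

```python
def complete_linkage(points, pick_tie, n_clusters):
    """Merge 1-D clusters until n_clusters remain.

    pick_tie(ties) chooses which of the equally close cluster pairs to merge.
    """
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > n_clusters:
        # Complete linkage: cluster distance = largest pairwise point distance.
        dists = {
            (i, j): max(abs(a - b) for a in clusters[i] for b in clusters[j])
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        }
        best = min(dists.values())
        ties = sorted(pair for pair, d in dists.items() if d == best)
        i, j = pick_tie(ties)
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        clusters.sort()
    return clusters


data = range(1, 11)
first = complete_linkage(data, lambda ties: ties[0], 3)
last = complete_linkage(data, lambda ties: ties[-1], 3)
print(first)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
print(last)   # [[1, 2], [3, 4, 5, 6], [7, 8, 9, 10]]
```

Both runs use the same data and the same linkage criterion; only the tie-breaking rule differs, and the three-cluster result changes with it.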

Related

Interpreting Seasonality in Time Series

I have a discrete time series covering 49 quarters between January 2007 and March 2019, which I am trying to analyse. Before undertaking various forms of analysis I wanted to check for the existence of seasonality and have tried two methods for this in R. In the first I used the WO function (Webel and Ollech) from the seastests package, which informed me that the data did not display seasonality.
library(seastests)
summary(wo(tt))
> summary(wo(tt))
Test used: WO
Test statistic: 0
P-value: 0.8174965 0.5785041 0.2495668
The WO - test does not identify seasonality
However, I wanted to check this again and used the decompose function, from which I got the output below, which would appear to suggest a seasonal component. Can anyone advise:
Am I reading the decomposed data correctly?
AND
Why is there such disagreement between the decompose and seastests results?
The decompose function is a simple function that basically estimates the (moving) period average. The volatility of your time series increases strongly in the last years. Thus the averages may pick up on some random increases. Also, the seasonal component that you obtain using the decompose() function will basically always look seasonal.
set.seed(1234)
x <- ts(rnorm(80), frequency=4)
seastests::wo(x)
plot(decompose(x))
Therefore, seasonality tests are preferable for assessing whether a time series really is seasonal.
Still, if you have information that the data generating process has changed, you may want to use the test on the last few years of observations.

Anomaly detection with machine learning without labels

I am tracing multiple signals for a certain period of time and associating them with a timestamp as follows:
t0 1 10 2 0 1 0 ...
t1 1 10 2 0 1 0 ...
t2 3 0 9 7 1 1 ... // pressed a button to change the mode
t3 3 0 9 7 1 1 ...
t4 3 0 8 7 1 1 ... // pressed button to adjust a certain characteristic like temperature (signal 3)
where t0 is the timestamp, 1 is the value for signal 1, 10 the value for signal 2 and so on.
The data captured during that period of time should be considered the normal case. Now significant deviations from the normal case should be detected. By significant deviation I do NOT mean that one signal value just changes to a value that has not been seen during the tracing phase, but rather that a lot of values change in combinations that have not yet been seen together. I do not want to hardcode rules, since in the future more signals might be added or removed and other "modes" that have other signal values might be implemented.
Can this be achieved with a certain machine learning algorithm? If a small deviation occurs, I want the algorithm to first see it as a minor change to the training set, and if it occurs multiple times in the future it should be "learned". The major goal is to detect the bigger changes / anomalies.
I hope I have explained my problem in enough detail. Thanks in advance.
You could just calculate the nearest neighbor in your feature space and set a threshold on how far away it is allowed to be from your test point for that point not to be an anomaly.
Let's say you have 100 values in your "certain period of time",
so you use a 100-dimensional feature space with your training data (which doesn't contain anomalies).
If you get a new data set you want to test, you find the (k) nearest neighbor(s) and calculate the (e.g. Euclidean) distance in your feature space.
If that distance is larger than a certain threshold, it's an anomaly.
To optimize this, you have to find a good k and a good threshold, e.g. by grid search.
(1) Note that something like this probably only works well if your data has a fixed starting and ending point. Otherwise you would need a huge amount of data, and even then it will not perform as well.
(2) Note that it should be worth trying to create a separate detector for every "mode" you mentioned in your question.
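The nearest-neighbor idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: each feature vector here is one row of signal values from the trace in the question (rather than a whole 100-value window), and the names knn_distance, is_anomaly, the threshold of 3.0 and the two test rows are hypothetical.

```python
import math


def knn_distance(train, sample, k=1):
    """Mean Euclidean distance from sample to its k nearest training rows."""
    dists = sorted(math.dist(row, sample) for row in train)
    return sum(dists[:k]) / k


# "Normal" rows, copied from the trace in the question (first six signals).
train = [
    [1, 10, 2, 0, 1, 0],
    [1, 10, 2, 0, 1, 0],
    [3, 0, 9, 7, 1, 1],
    [3, 0, 9, 7, 1, 1],
    [3, 0, 8, 7, 1, 1],
]

THRESHOLD = 3.0  # hypothetical; in practice tuned together with k


def is_anomaly(sample, k=1):
    return knn_distance(train, sample, k) > THRESHOLD


minor = [3, 0, 7, 7, 1, 1]  # one signal moved slightly
major = [1, 0, 9, 0, 5, 1]  # combination of values never seen together

print(is_anomaly(minor), is_anomaly(major))
```

A small change stays close to a known row and passes, while a combination far from every training row exceeds the threshold. Adding accepted rows to train over time would be one way to let repeated minor deviations be "learned".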

understanding "Deep MNIST for Experts"

I am trying to understand Deep MNIST for Experts. I have quite a clear idea of how neural networks and deep learning work at a high level, but I struggle to understand the details.
In the tutorial they first write and run a simple one-layer model. This includes defining the model x*W+b, calculating the cross-entropy, minimizing it by gradient descent and evaluating the result.
The first part I found quite easy to run and understand.
In the second part they build a simple multilayer network and apply some convolutions and pooling. However, here things start to get tricky. They write:
We can now implement our first layer. It will consist of convolution, followed by max pooling. The convolution will compute 32 features for each 5x5 patch.
A 5x5 patch should equal 25 pixels. Right? Why would you extract 32 features from 25 pixels? Why do you want more features than you have datapoints? How does this even make sense? It feels like they are "upscaling" a problem from 25 dimensions to 32 dimensions. It feels like 7 of the 32 dimensions should be redundant.
Secondly. The convolution uses the function truncated_normal which just picks random values close to the mean. Why is this a good model for modelling handwritten numbers?
Thirdly. The second layer in the network seems to do the same thing again. Are more layers just better? Could I have achieved the same results with a single layer?
I think a visual model can greatly reduce the difficulty of understanding, so perhaps this can help you understand better:
http://scs.ryerson.ca/~aharley/vis/conv/
This is a 3D visualization of a convolutional neural network. It has two convolution layers and two max pooling layers, and you can click a 3D cube in each layer to check its value.
So in general you have to read a lot about CNNs/NNs before trying to understand what is really going on. These examples are not really supposed to be an introductory course on NNs; they assume you know what CNNs are.
A 5x5 patch should equal 25 pixels. Right? Why would you extract 32 features from 25 pixels? Why do you want more features than you have datapoints? How does this even make sense? It feels like they are "upscaling" a problem from 25 dimensions to 32 dimensions. It feels like 7 of the 32 dimensions should be redundant.
This is a completely different 'level of abstraction'; you are comparing unrelated objects to each other, which is obviously confusing. They are creating 32 filters, each of which linearly maps your whole image by moving a 5x5 convolution kernel across it. For example, one such filter could be an edge detector:
0 0 0 0 0
0 0 0 0 0
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
Another can detect diagonal lines:
1 0 0 0 0
1 1 0 0 0
1 1 1 0 0
1 1 1 1 0
1 1 1 1 1
etc. Why 32? It is just a magic number, tested empirically. This is actually quite a small number in terms of CNNs (notice that just to detect basic edges in greyscale images you already need 8 different filters!).
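To see why the number of filters is unrelated to the 25 pixels of a patch, here is a rough sketch in plain Python (not the tutorial's TensorFlow code). The helper conv2d_valid and the 8x8 test image are hypothetical; the two kernels are the ones shown above. Each filter is slid over the whole image and produces its own feature map, so 32 filters simply give 32 such maps.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding) and return one feature map.

    As in most CNN libraries, this is technically a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# The two example 5x5 filters from the answer.
edge = [[0] * 5, [0] * 5, [1] * 5, [1] * 5, [1] * 5]
diagonal = [[1] * (i + 1) + [0] * (4 - i) for i in range(5)]

# Hypothetical 8x8 image: dark top half, bright bottom half.
image = [[0] * 8 for _ in range(4)] + [[1] * 8 for _ in range(4)]

filters = [edge, diagonal]  # the tutorial's first layer would hold 32 of these
feature_maps = [conv2d_valid(image, f) for f in filters]

for fmap in feature_maps:
    print(fmap)
```

Every filter responds differently to the same horizontal edge, which is the sense in which each one extracts a different "feature".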
Secondly. The convolution uses the function truncated_normal which just picks random values close to the mean. Why is this a good model for modelling handwritten numbers?
This is the initializer of the weights. It is not a "model for modelling handwritten numbers"; it is simply a starting point for the optimization of this part of the parameter space. Why a normal distribution? We have some mathematical intuition about how to initialize neural nets, especially assuming ReLU activations. It is important to initialize in a random way, which ensures that many of your neurons will be initially active, so you do not get zero derivatives (and thus an inability to learn using typical optimizers).
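As a sketch of what such an initializer does, the following plain-Python function mimics the documented behaviour of TensorFlow's truncated_normal: samples more than two standard deviations from the mean are dropped and redrawn. The function below is a stand-in written for illustration, not the library code, and the stddev of 0.1 and the seed are assumptions.

```python
import random


def truncated_normal(n, mean=0.0, stddev=0.1, seed=0):
    """Draw n normal samples, redrawing any beyond 2 standard deviations."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2 * stddev:
            samples.append(value)
    return samples


# Starting weights for 32 filters of size 5x5 on a single-channel image.
weights = truncated_normal(5 * 5 * 1 * 32)
print(len(weights), min(weights), max(weights))
```

The values are small, random and different from each other, which is all that is needed as a starting point; the optimizer then moves them towards useful filters.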
Thirdly. The second layer in the network seems to do the same thing again. Are more layers just better? Could I have achieved the same results with a single layer?
In principle you can model everything with a single-hidden-layer feed-forward net, even without convolutions. However, it might require exponentially many hidden units, and perfect optimization strategies which we do not have (and maybe they do not even exist!). Depth of the network gives you the ability to express more complex (and in some cases more useful) features with fewer parameters, plus we know more or less how to optimize it. However, you should avoid the common pitfall of assuming "deeper is better". This is not true in general. It is true if important features of your data can be efficiently expressed as a hierarchical structure of abstraction. It is true for images (more and more complex patterns: first edges, then some lines and curves, then patches, then more complex concepts etc.) as well as text, sound etc., but before you try to apply DL to your new task you should ask yourself whether this is (or at least might be) true for your case. Using too complex a model is usually way worse than using too simple a one.

Explanation for Values in Scharr-Filter used in OpenCV (and other places)

The Scharr-Filter is explained in Scharr's dissertation. However, the values given on page 155 (167 in the pdf) are [47 162 47] / 256. Multiplying this with the derivative filter would yield:
Yet all other references I found use
Which is roughly the same as the ones given by Scharr, scaled by a factor of 32.
Now my guess is that the range can be represented better, but I'm curious if there is an official explanation somewhere.
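The "factor of 32" can be checked numerically with a short sketch. Two things here are assumptions rather than statements from the question: the derivative filter is taken to be the central difference [-1 0 1] / 2, and the widely used kernel is taken to be the integer [3 10 3] version documented for OpenCV's Scharr operator.

```python
# Smoothing filter from the dissertation, as quoted in the question.
smooth = [47 / 256, 162 / 256, 47 / 256]
# Assumed derivative filter (central difference).
deriv = [-1 / 2, 0, 1 / 2]

# Outer product: 3x3 kernel with smoothing down the rows, derivative across.
scharr_exact = [[s * d for d in deriv] for s in smooth]

# Integer kernel commonly given for the Scharr operator (x direction).
common = [[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]]

# Ratio between the two kernels, taken on the non-zero right-hand column.
ratios = [common[r][2] / scharr_exact[r][2] for r in range(3)]
print(ratios)
```

Under these assumptions the ratios come out slightly above and below 32 rather than exactly 32, which matches the "roughly the same" wording above.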
To get the ball rolling on this question in case no "expert" can be found...
I believe the values [3, 10, 3] ... instead of [47 162 47] / 256 ... are used simply for speed. Recall that this method is competing against the Sobel operator, whose coefficient values are 0 and positive/negative 1's and 2's.
Even though the divisor, 256 or 512, is a power of 2 and the division can be performed by a shift, doing that and multiplying by 47 or 162 is going to take more time. A multiplication by 3, however, can in fact be done on some RISC architectures like the IBM POWER series in a single shift-and-add operation. That is, 3x = (x << 1) + x. (On these architectures, the shifter and adder are separate units and can operate independently.)
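The shift-and-add identity can be written out as a small sketch. The helper names are hypothetical, and the decomposition of 10x into two shifts is an addition for illustration that the answer does not mention; Python is used only to show the arithmetic, not the speed.

```python
def times3(x):
    """3x as one shift and one add, as in the answer."""
    return (x << 1) + x


def times10(x):
    """10x as two shifts and one add (8x + 2x); an illustrative extension."""
    return (x << 3) + (x << 1)


def smooth_3_10_3(pixels):
    """Apply the [3, 10, 3] smoothing weights along a row of integer pixels."""
    return [
        times3(pixels[i - 1]) + times10(pixels[i]) + times3(pixels[i + 1])
        for i in range(1, len(pixels) - 1)
    ]


print(smooth_3_10_3([1, 2, 4, 8]))
```

No general-purpose multiplier is involved anywhere, which is the point the answer makes about the small integer coefficients.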
I don't find it surprising that the PhD thesis used the more complicated and probably more precise formula; it needed to prove or demonstrate something, and the author probably wasn't totally certain or concerned that it would be used and implemented alongside other methods. The purpose in the thesis was probably to have "perfect rotational symmetry". Afterwards, when someone decided to implement it, I suspect that person used the approximation and gave up a little of the perfect rotational symmetry to gain speed. That person's goal, as I said, was to have something that was competitive in speed, at the expense of a little bit of rotational symmetry.
Since I'm guessing you are willing to do the work on this as it is your thesis, my suggestion is to implement the original algorithm and benchmark it against both the OpenCV Scharr and Sobel code.
The other thing to try to get an "official" answer is: "Use the 'source', Luke!". The code is on github so check it out and see who added the Scharr filter there and contact that person. I won't put the person's name here, but I will say that the code was added 2010-05-11.

Learning how to map numeric values into an array

Dear all,
I am looking for an appropriate algorithm which can allow me to learn how some numeric values are mapped into an array.
Try to imagine that I have a training data set like this:
1 1 2 4 5 --> [0 1 5 7 8 7 1 2 3 7]
2 3 2 4 1 --> [9 9 5 6 6 6 2 4 3 5]
...
1 2 1 8 9 --> [1 4 5 8 7 4 1 2 3 4]
So that given a new set of numeric values, I would like to predict this new array
5 8 7 4 2 --> [? ? ? ? ? ? ? ? ? ?]
Thank you very much in advance.
Best regards!
Some considerations:
Let us suppose that all numbers are integers and the length of the arrays is fixed.
The quality of each predicted array can be determined by means of a distance function which tries to measure the likeness between the ideal and the predicted array.
This is a challenging task in general. Are your array lengths fixed? What's the loss function (for example, is it better to be "closer" for single digits -- is predicting 2 instead of 1 better than predicting 9, or does it not matter? Do you get credit for partial matches on the array, such as predicting the first half correctly? etc.)?
In any case, classical regression or classification techniques would likely not work very well for your scenario. I think the best bet would be to try a genetic programming approach. The fitness function would then be the loss measure I mentioned earlier. You can check this nice comparison for genetic programming libraries for different languages.
This is called a structured output problem, where the target you are trying to predict is a complex structure, rather than a simple class (classification) or number (regression).
As mentioned above, the loss function is an important thing you will have to think about. Minimum edit distance, RMS or simple 0-1 loss could be used.
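The three candidate losses named above behave quite differently on the same prediction, as this small plain-Python sketch shows. The function names and the predicted array are hypothetical; the target array is the first one from the question.

```python
import math


def zero_one_loss(target, predicted):
    """0 if the whole array matches exactly, 1 otherwise."""
    return 0 if list(target) == list(predicted) else 1


def rms_error(target, predicted):
    """Root-mean-square difference between equal-length arrays."""
    total = sum((t - p) ** 2 for t, p in zip(target, predicted))
    return math.sqrt(total / len(target))


def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions (Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


target = [0, 1, 5, 7, 8, 7, 1, 2, 3, 7]
predicted = [0, 1, 5, 7, 8, 7, 1, 2, 3, 4]  # hypothetical: last element wrong

print(zero_one_loss(target, predicted))
print(rms_error(target, predicted))
print(edit_distance(target, predicted))
```

The 0-1 loss gives no credit for nine correct positions, RMS also accounts for how far off the wrong value is, and edit distance only counts how many positions differ.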
Structured support vector machines and variations on ridge regression for structured output problems are two known algorithms that can tackle this problem. See Wikipedia, of course.
We have a research group on this topic at Universite Laval (Canada), led by Mario Marchand and Francois Laviolette. You might want to search for their publications like "Risk Bounds and Learning Algorithms for the Regression Approach to Structured Output Prediction" by Sebastien Giguere et al.
Good luck!
