Normal distribution with integer numbers

I'm running into some quite interesting problems over here. It is about creating a normal distribution with integer numbers in the range of 1 to 5 (1,2,3,4,5). Technically, it is a Poisson distribution with the shape of a normal distribution.
My question: When I create a distribution as described above, tests for normality fail (p < 0.01) (Shapiro-Wilk test, Kolmogorov-Smirnov test), because I generated a pool of normally distributed numbers and then rounded them:
xRND<-round(rnorm(179,mean=2.9,sd=1))
table(xRND)
xRND
0 1 2 3 4 5 6
2 14 41 67 45 9 1
Is there any test that helps me to check for a normal distribution shape?
Best regards,
St.

From my perspective, the Shapiro-Wilk test is the test with the best power. Working with rounded values will affect the result. There is a method called Sheppard's correction that computes an adapted W-value (test statistic) which accounts for the rounding of the tested values:
with ω as the rounding difference.
(here the link to the German source)
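The formula for the adapted W-value is not reproduced in this answer, so the sketch below does not attempt it. As a related illustration only, it applies the classical Sheppard's correction for the sample variance, which subtracts ω²/12 from the variance of rounded data. It assumes ω is the width of the rounding interval (1 for rounding to integers) and uses the counts from the table in the question; the function names are made up for this example.

```python
# Counts of the rounded values, copied from the table in the question.
counts = {0: 2, 1: 14, 2: 41, 3: 67, 4: 45, 5: 9, 6: 1}

def grouped_mean_and_variance(counts):
    """Return n, the mean and the sample variance of grouped integer data."""
    n = sum(counts.values())
    mean = sum(x * c for x, c in counts.items()) / n
    var = sum(c * (x - mean) ** 2 for x, c in counts.items()) / (n - 1)
    return n, mean, var

def sheppard_variance(var, omega=1.0):
    """Sheppard's correction for the variance of data rounded to a grid of width omega.

    Assumption: omega is the rounding interval width (1 for integers).
    """
    return var - omega ** 2 / 12.0

n, mean, raw_var = grouped_mean_and_variance(counts)
corrected_var = sheppard_variance(raw_var)
print(n, mean, raw_var, corrected_var)
```

The corrected variance is slightly smaller than the raw one, which matches the intuition that rounding adds a small amount of spread to the underlying continuous values.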


The random error term is assumed to follow the normal distribution with a mean of 0 in linear regression at each point in x axis

While going through http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis I noticed that they plot a normal distribution at X=65 and X=90 and say that the error term follows a normal distribution. Since linear regression is a function that outputs only one value for a given X, how did they plot the distribution?
I would recommend posting this question on Cross Validated (https://stats.stackexchange.com/).
That being said, what the example is trying to explain is that each individual value of X will return a prediction and an error term. The error terms generated for all X should follow a normal distribution with mean 0. Here, 65 and 90 are just examples of possible values of X.
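A small simulation can make this concrete. The sketch below assumes a hypothetical true line (the intercept, slope and sigma are invented for illustration, not taken from the linked page) and draws many observations at X=65 and X=90. The regression line gives one prediction per X, while repeated observations at that X scatter around it with normally distributed errors of mean 0.

```python
import random
import statistics

def simulate_errors(x, n=10000, intercept=17.0, slope=2.0, sigma=3.0, seed=0):
    """Draw n observations at a fixed x and return the prediction and the errors.

    intercept, slope and sigma are hypothetical values chosen for this example.
    """
    rng = random.Random(seed)
    predicted = intercept + slope * x            # single value on the line
    observed = [predicted + rng.gauss(0.0, sigma) for _ in range(n)]
    errors = [y - predicted for y in observed]   # error term at this x
    return predicted, errors

for x in (65, 90):
    predicted, errors = simulate_errors(x, seed=x)
    print(x, predicted, statistics.mean(errors), statistics.stdev(errors))
```

At both values of X the errors average close to 0 and have a spread close to sigma, which is what the bell curves drawn at X=65 and X=90 are meant to depict.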

External linkage - what to do when there is a tie

I am considering implementing a complete linkage clustering algorithm from scratch for study purposes. I've seen that there is a big difference compared to single linkage:
Unlike single linkage, the complete linkage method can be strongly affected by ties (where 2 pairs of groups/clusters have the same distance value in the distance matrix).
I'd like to see an example of distance matrix where this occurs and understand why it happens.
Consider the 1-dimensional data set
1 2 3 4 5 6 7 8 9 10
Depending on how you do the first merges, you can get pretty good or pretty bad results. For example, first merge 2-3, 5-6 and 8-9. Then merge 2-3-4 and 7-8-9. Compare this to the "obvious" result that most humans would produce.
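The effect of ties can be reproduced with a naive complete linkage implementation in which the tie-breaking rule is a parameter. This is a minimal sketch, not an efficient implementation; the function name and the pick callback are invented for this example. It clusters the data set above into 3 clusters, once always taking the first tied pair and once always taking the last.

```python
def complete_linkage(points, k, pick):
    """Naive complete linkage on 1-D points, stopping at k clusters.

    pick receives the list of tied closest pairs (i, j) and returns one of them.
    """
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > k:
        best, tied = None, []
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best:
                    best, tied = d, [(i, j)]
                elif d == best:
                    tied.append((i, j))
        i, j = pick(tied)
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters

data = range(1, 11)
first = complete_linkage(data, 3, lambda tied: tied[0])
last = complete_linkage(data, 3, lambda tied: tied[-1])
print(first)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
print(last)   # [[1, 2], [3, 4, 5, 6], [7, 8, 9, 10]]
```

Both runs use the same data and the same linkage criterion. Only the choice among tied pairs differs, yet the final partitions are not the same.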

Consistent prediction results in Weka despite different seed values

I am using the Weka 3.8.3 multilayer perceptron on the Iris dataset. I have 75 training instances and 75 test instances. No matter how I change the 'seed' parameter, it hardly affects the accuracy. I almost always get the stats below. Is the seed used to randomly initialize the weights? Could someone please help explain why it behaves this way? Many thanks.
=== Summary ===
Correctly Classified Instances 70 93.3333 %
Incorrectly Classified Instances 5 6.6667 %
I tried the same thing (Training and test 50% percentage split using the radio button) and got 72 and 3 with a random seed for XVal / % Split of 1.
When I change the random seed to 777 (or 666 or 54321) I get 73 and 2, which is a different result, so I can't replicate what you are seeing.
With a random seed of 0 I get 71 and 4.

VLFeat: computation of number of octaves for SIFT

I am trying to go through and understand some of VLFeat code to see how they generate the SIFT feature points. One thing that has me baffled early on is how they compute the number of octaves in their SIFT computation.
So according to the documentation, if one provides a negative value for the initial number of octaves, it will compute the maximum which is given by log2(min(width, height)). The code for the corresponding bit is:
if (noctaves < 0) {
noctaves = VL_MAX (floor (log2 (VL_MIN(width, height))) - o_min - 3, 1) ;
}
This code is in the vl_sift_new function. Here o_min is supposed to be the index of the first octave (I guess one does not need to start with the full resolution image). I am assuming this can be set to 0 in most use cases.
I still do not understand why they subtract 3 from this value. This seems very confusing. I am sure there is a good reason but I have not been able to figure it out.
The reason they subtract 3 is to ensure a minimum size for the patch you're looking at, so that you get some appreciable output. When analyzing patches and extracting features, the feature detection algorithm needs a minimum patch size to produce good output, and subtracting 3 ensures that this minimum is still met once you get to the last (smallest) octave.
Let's take a numerical example. Say we have a 64 x 64 image. At each octave, each dimension is divided by 2. Taking the log2 of the smaller of the rows and columns therefore tells you how many times the image can be halved before it shrinks to a single pixel, as you have noticed in the above code. In our case the rows and columns are equal, and log2(64) = 6, so six halvings are possible and there are seven sizes in total:
Octave | Size
--------------------
1 | 64 x 64
2 | 32 x 32
3 | 16 x 16
4 | 8 x 8
5 | 4 x 4
6 | 2 x 2
7 | 1 x 1
However, looking at the smallest octaves will probably not give you anything useful, so there's no point in analyzing them. With o_min = 0, the code evaluates floor(log2(64)) - 0 - 3 = 3, so only octaves 1 to 3 are analyzed and the smallest image analyzed is 16 x 16. The 8 x 8 level and everything below it are skipped.
As such, this subtraction is commonly performed when looking at scale-spaces in images because it enforces that the last octave is of a good size to analyze features. The number 3 is arbitrary. I've seen people subtract 4 and even 5. From all of the feature detection code that I have seen, 3 seems to be the most widely used number. So with what I said, it wouldn't really make much sense to look at an octave whose size is 1 x 1, right?
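To check the arithmetic, the formula from vl_sift_new quoted in the question can be transcribed into Python. This is only a sketch of that one expression, not VLFeat's API: the function names are invented here, o_min is assumed to be non-negative, and the octave sizes are approximated by integer halving.

```python
import math

def num_octaves(width, height, o_min=0):
    """Python transcription of the noctaves expression from the question."""
    return max(math.floor(math.log2(min(width, height))) - o_min - 3, 1)

def octave_sizes(width, height, o_min=0):
    """Approximate image size at each analyzed octave (assumes o_min >= 0)."""
    n = num_octaves(width, height, o_min)
    return [(width >> o, height >> o) for o in range(o_min, o_min + n)]

print(num_octaves(64, 64))    # 3
print(octave_sizes(64, 64))   # [(64, 64), (32, 32), (16, 16)]
print(num_octaves(640, 480))  # 5
print(num_octaves(8, 8))      # 1, because VL_MAX(..., 1) keeps at least one octave
```

For a 64 x 64 image with o_min = 0 the expression yields 3 octaves, so the smallest image that is analyzed is 16 x 16.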

Learning how to map numeric values into an array

Dear all,
I am looking for an appropriate algorithm which can allow me to learn how some numeric values are mapped into an array.
Try to imagine that I have a training data set like this:
1 1 2 4 5 --> [0 1 5 7 8 7 1 2 3 7]
2 3 2 4 1 --> [9 9 5 6 6 6 2 4 3 5]
...
1 2 1 8 9 --> [1 4 5 8 7 4 1 2 3 4]
Given a new set of numeric values, I would then like to predict the corresponding array:
5 8 7 4 2 --> [? ? ? ? ? ? ? ? ? ?]
Thank you very much in advance.
Best regards!
Some considerations:
Let us suppose that all numbers are integer and the length of the arrays is fixed
The quality of each predicted array can be determined by means of a distance function which measures the likeness between the ideal and the predicted array.
This is a challenging task in general. Are your array lengths fixed? What is the loss function? For example, is it better to be "closer" on single digits (is predicting 2 instead of 1 better than predicting 9, or does it not matter)? Do you get credit for partial matches on the array, such as predicting the first half correctly?
In any case, classical regression or classification techniques would likely not work very well for your scenario. I think the best bet would be to try a genetic programming approach. The fitness function would then be the loss measure I mentioned earlier. You can check this nice comparison of genetic programming libraries for different languages.
This is called a structured output problem, where the target you are trying to predict is a complex structure, rather than a simple class (classification) or number (regression).
As mentioned above, the loss function is an important thing you will have to think about. Minimum edit distance, RMS or simple 0-1 loss could be used.
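The three losses named above behave quite differently on the same pair of arrays. Here is a minimal sketch, assuming fixed-length integer arrays as in the question; the function names are invented for this example, and the edit distance is the standard Levenshtein distance with unit costs.

```python
import math

def rms_loss(ideal, predicted):
    """Root mean square of the element-wise differences (equal lengths assumed)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ideal, predicted)) / len(ideal))

def zero_one_loss(ideal, predicted):
    """0 if the whole array is predicted exactly, 1 otherwise."""
    return 0 if list(ideal) == list(predicted) else 1

def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

ideal = [0, 1, 5, 7, 8, 7, 1, 2, 3, 7]
predicted = [1, 4, 5, 8, 7, 4, 1, 2, 3, 4]
print(rms_loss(ideal, predicted))
print(zero_one_loss(ideal, predicted))
print(edit_distance(ideal, predicted))
```

RMS rewards predictions that are numerically close, the 0-1 loss gives no credit for partial matches, and edit distance gives credit for correct subsequences even when they are shifted, so the choice depends on what "likeness" should mean for these arrays.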
Structured support vector machines and variations on ridge regression for structured output problems are two known approaches that can tackle this problem. See Wikipedia for an overview.
We have a research group on this topic at Universite Laval (Canada), led by Mario Marchand and Francois Laviolette. You might want to search for their publications like "Risk Bounds and Learning Algorithms for the Regression Approach to Structured Output Prediction" by Sebastien Giguere et al.
Good luck!
