How can the normality of the distribution of variables from a very large sample be checked? - normal-distribution

I have a sample with 10,000 observations, and I would like to test whether the variables in this sample are normally distributed, in order to work with Z-scores. Shapiro-Wilk and Kolmogorov-Smirnov tests seem to reach their limits on such a large sample: with n this large, even trivial deviations from normality produce significant p-values. I have drawn Q-Q plots, but I wonder whether they are sufficient?
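For context, here is a minimal sketch of the kind of check I ran (Python with scipy, statsmodels, and matplotlib; the data below is synthetic, standing in for one of my variables):

    # Visual and numeric normality checks on a large sample.
    import numpy as np
    from scipy import stats
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)
    x = rng.normal(loc=0, scale=1, size=10_000)  # stand-in for my variable

    # Q-Q plot against a fitted normal distribution.
    sm.qqplot(x, line='45', fit=True)
    plt.show()

    # With n = 10,000, formal tests flag even tiny deviations, so
    # effect-size style summaries may be more informative.
    print("skewness:", stats.skew(x))
    print("excess kurtosis:", stats.kurtosis(x))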
Thanks for your answers!
Claire

Related

Building a MINLP Heuristic Model in Python

I am currently building a MINLP model which has around 200k decision variables and up to 100 constraints. The only solvers I have access to are the open-source BONMIN and COUENNE.
When I try to solve the problem, I see that the solver keeps on running for more than 2 hours.
I have been reading the BONMIN documentation, where I see various heuristic algorithms offered as options. Is there an options list I can pass to BONMIN that will trigger a heuristic algorithm and give me a suboptimal solution in about 15 minutes?
I am working with the Pyomo package.
Thanks in Advance!
See this section of the Pyomo documentation on sending options to a solver: https://pyomo.readthedocs.io/en/latest/working_models.html#sending-options-to-the-solver
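As a hedged sketch of what that can look like in Pyomo (the option names follow the BONMIN manual's bonmin.* convention, but verify them against your installed version; the model below is a toy placeholder so the example runs):

    import pyomo.environ as pyo

    # Toy MINLP standing in for your real model.
    model = pyo.ConcreteModel()
    model.x = pyo.Var(bounds=(0, 10))
    model.y = pyo.Var(domain=pyo.Integers, bounds=(0, 10))
    model.obj = pyo.Objective(expr=(model.x - 3) ** 2 + model.y)
    model.c = pyo.Constraint(expr=model.x + model.y >= 4)

    solver = pyo.SolverFactory('bonmin')
    # Switch from the default branch-and-bound to a (possibly faster,
    # possibly suboptimal) hybrid algorithm, and cap the run time.
    solver.options['bonmin.algorithm'] = 'B-Hyb'
    solver.options['bonmin.time_limit'] = 900  # seconds, ~15 minutes
    result = solver.solve(model, tee=True)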

Is LIBSVM suitable for many categories and samples?

I'm building a text classifier which should be able to give the probability that a document belongs to each of several categories (e.g. 80% fiction, 30% marketing, etc.).
I believe LIBSVM does this via the "predict" method, but the problem is that I have approximately 20 categories to test for. I also have several hundred documents that can be used for training.
The problem is that the training file grows to 1-2 GB, and this makes LIBSVM extremely slow.
How can this issue be solved? Should I go for LIBLINEAR instead, or are there better options?
Regarding this specific question, I had to use LIBLINEAR, as LIBSVM kept running forever.
But if anyone wants to know how it eventually turned out: I switched from PHP/C++ to Python, which was tremendously easier, and I did not run into any memory issues. My case was "multi-labelling" (each document can carry several labels at once). This article put me in the right direction, and the magpie project helped me accomplish the task.
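For anyone following the same route, here is a rough sketch of the kind of pipeline that can work, using scikit-learn's liblinear-backed LinearSVC with calibration to get per-category probabilities (the corpus and labels below are toy stand-ins):

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    # Tiny synthetic corpus; replace with your documents and label sets.
    docs = [
        "a thrilling space adventure novel",
        "limited time offer, buy our product now",
        "the detective solved the mystery",
        "discount sale on all marketing services",
    ]
    labels = [["fiction"], ["marketing"], ["fiction"], ["marketing"]]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)

    # LinearSVC uses liblinear under the hood; calibration turns its
    # decision scores into per-category probabilities.
    clf = OneVsRestClassifier(CalibratedClassifierCV(LinearSVC(), cv=2))
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(docs, Y)

    probs = pipeline.predict_proba(["an exciting story about a spaceship"])
    print(dict(zip(mlb.classes_, probs[0].round(2))))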

Machine Learning Algorithm for Dynamic Environments

Which methods are best for managing, predicting, and labeling data in a dynamic environment? The system's data distribution changes over time; it is not static. The system can operate under different normal settings, and under different settings we have different normal data distributions. Suppose we have two classes, normal and abnormal. What happens? We cannot rely on historical data and train a simple classification method to predict future observations, since one day after training the model the data distribution can change and the old observations become irrelevant to the new ones. Consider the following figure:
The blue distribution and the red distribution are both normal data, but under different settings, and at training time we have just one setting. This data is from a single sensor. So, suppose we train a model on the blue distribution, together with some abnormal samples; imagine the abnormal samples as normal samples with a little noise or a measurement fault. Then we want to test the model, but the setting has changed and the red distribution now describes our test observations, so the model misclassifies the samples.
What are the best methods for a situation like this? Please note that I have tried several clustering algorithms, but they cannot distinguish between normal and abnormal samples.
Any suggestions and help are highly welcome. Thanks
There are plenty of books on time series data.
In particular, on change detection. Your example can presumably be treated as a change in mean, and there are statistical models to detect this, e.g.:
Basseville, Michèle, and Igor V. Nikiforov. Detection of abrupt changes: theory and application. Vol. 104. Englewood Cliffs: Prentice Hall, 1993.
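To illustrate the flavor of those models, here is a minimal two-sided CUSUM sketch for detecting a shift in mean (thresholds and data are purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic sensor stream: the mean jumps from 0 to 2 at t = 500.
    x = np.concatenate([rng.normal(0, 1, 500), rng.normal(2, 1, 500)])

    mu0, k, h = 0.0, 0.5, 5.0  # reference mean, slack, decision threshold
    g_pos = g_neg = 0.0
    for t, xt in enumerate(x):
        g_pos = max(0.0, g_pos + (xt - mu0) - k)  # upward shift statistic
        g_neg = max(0.0, g_neg - (xt - mu0) - k)  # downward shift statistic
        if g_pos > h or g_neg > h:
            print(f"change detected at t = {t}")
            break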

Recommended local search optimization algorithm for control domain

Background: I am trying to find a set of floating-point parameters for a low-level controller that will keep a robot balanced while it is walking.
Question: Can anybody recommend any local search algorithms that will perform well in the domain I just described? The main criterion for me is the speed of convergence to a good solution.
Any help will be greatly appreciated!
P.S. I also did some research and found that "Evolution Strategy" algorithms are a good fit for continuous state spaces. However, I am not entirely sure whether they will fit my particular problem well.
More info: I am trying to optimize 8 parameters (although it is possible for me to reduce that to 4). I do have a simulator, and my criterion is speed measured in number of trials, because simulation resets are costly (they take 10-15 seconds on average).
One of the best local search algorithms for a low number of dimensions (up to about 10 or so) is the Nelder-Mead simplex method; it is, by the way, the default optimizer behind MATLAB's fminsearch function. I personally used this method to find the parameters of a simple textbook 2nd- or 3rd-order dynamic system.
Another option is the already-mentioned evolution strategies. Currently the best known is the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). There are variants of this algorithm, e.g. BIPOP-CMA-ES, that are probably better than the vanilla version.
You just have to try what works best for you.
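As a sketch of how little code Nelder-Mead takes with SciPy (simulate is a hypothetical stand-in for your walking simulator, replaced here by a toy quadratic so the example runs):

    import numpy as np
    from scipy.optimize import minimize

    def simulate(params):
        # Stand-in cost: the real version would run one simulation trial
        # and return a balance cost (lower is better).
        target = np.linspace(0.1, 0.8, 8)
        return float(np.sum((params - target) ** 2))

    x0 = np.zeros(8)  # initial guess for the 8 controller parameters
    res = minimize(simulate, x0, method='Nelder-Mead',
                   options={'maxfev': 200, 'xatol': 1e-3, 'fatol': 1e-4})
    print(res.x, res.fun)

For CMA-ES, the cma package on PyPI offers a similar black-box interface, so trying both against your simulator is cheap in terms of code.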
In addition to evolutionary algorithms, I recommend you also look into reinforcement learning.
The right method depends a lot on the details of your problem. How many parameters? Do you have a simulator? Do you work in simulation only, or also with real hardware? Is speed measured in number of trials, or in CPU time?

Should I remove test samples that are identical to some training sample?

I've been having a bit of a debate with my adviser about this issue, and I'd like to get your opinion on it.
I have a fairly large dataset that I've used to build a classifier. I have a separate, smaller testing dataset that was obtained independently from the training set (in fact, you could say that each sample in either set was obtained independently). Each sample has a class label, along with metadata such as collection date and location.
There is no sample in the testing set that has the same metadata as any sample in the training set (as each sample was collected at a different location or time). However, it is possible that the feature vector itself could be identical to some sample in the training set. For example, there could be two virus strains that were sampled in Africa and Canada, respectively, but which both have the same protein sequence (the feature vector).
My adviser thinks that I should remove such samples from the testing set. His reasoning is that these are like "freebies" when it comes to testing, and may artificially boost the reported accuracy.
However, I disagree and think they should be included, because it may actually happen in the real world that the classifier sees a sample that it has already seen before. To remove these samples would bring us even further from reality.
What do you think?
It would be nice to know if you're talking about a couple of repetitions in million samples or 10 repetitions in 15 samples.
In general I don't find what you're doing reasonable. I think your advisor has a very good point. Your evaluation needs to be as close as possible to using your classifier outside your control -- you can't just assume you're going to be evaluated on a data point you've already seen. Even if each data point is independent, you're going to be evaluated on never-before-seen data.
My experience is in computer vision, and it would be highly questionable to train and test with the same picture of one subject. In fact, I wouldn't be comfortable training and testing with frames of the same video (let alone the same frame).
EDIT:
There are two separate questions here:
1. Does the distribution permit these repetitions to happen naturally? I believe you; you know your experiment and your data, you're the expert.
2. Are you getting a boost from the repetitions, and is that boost unfair? One possible way to address your advisor's concern is to measure how much leverage the repeated data points give you. Generate 20 test cases: 10 in which you train on 1000 samples and test on 33 while making sure there are no repetitions in the 33, and another 10 in which you train on 1000 and test on 33 with repetitions allowed as they occur naturally. Report the mean and standard deviation of both experiments (see the sketch after this list).
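A minimal sketch of that comparison, assuming a generic scikit-learn classifier (all data here is synthetic; with truly continuous features the dedup step will rarely fire, so treat this purely as scaffolding for the protocol):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1033, 10))
    y = (X[:, 0] + rng.normal(scale=0.5, size=1033) > 0).astype(int)

    def run(dedup, n_runs=10, n_test=33):
        scores = []
        for _ in range(n_runs):
            idx = rng.permutation(len(X))
            tr, te = idx[:1000], idx[1000:1000 + n_test]
            if dedup:
                # Drop test rows whose feature vector also occurs in training.
                seen = {tuple(row) for row in X[tr]}
                te = np.array([i for i in te if tuple(X[i]) not in seen])
            clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
            scores.append(accuracy_score(y[te], clf.predict(X[te])))
        return np.mean(scores), np.std(scores)

    print("repetitions allowed: mean=%.3f sd=%.3f" % run(dedup=False))
    print("repetitions removed: mean=%.3f sd=%.3f" % run(dedup=True))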
It depends... Your adviser is suggesting the common practice: you usually test a classifier on samples that have not been used for training. If the test samples matching the training set are very few, your results will show no statistical difference due to the reappearance of the same vectors. If you want to be formal and still keep your logic, you would have to show that the reappearance of the same vectors has no statistically significant effect on the testing process. If you showed this, I would accept your logic. See this ebook on statistics in general, and this chapter as a starting point on statistical significance and null-hypothesis testing.
Hope I helped!
Insofar as the training and testing datasets are representative of the underlying data distribution, I think it's perfectly valid to leave the repetitions in. The test data should be representative of the kind of data you expect your method to encounter. If you genuinely can get exact replicates, that's fine. However, I would question what your domain is that makes it possible to generate exactly the same sample multiple times. Are your data synthetic? Are you using a tiny feature set with few possible values per feature, such that different points in input space map to the same point in feature space?
The fact that you're able to encounter the same instance multiple times is suspicious to me. Also, if you have 1,033 instances, you should be using far more than 33 of them for testing. The variance in your test accuracy will be huge. See the answer here.
Having several duplicate or very similar samples seems somewhat analogous to the distribution of the population you're attempting to classify being non-uniform. That is, certain feature combinations are more common than others, and the high occurrence of them in your data is giving them more weight. Either that, or your samples are not representative.
Note: Of course, even if a population is uniformly distributed there is always some likelihood of drawing similar samples (perhaps even identical depending on the distribution).
You could probably make some argument that identical observations are a special case, but are they really? If your samples are representative it seems perfectly reasonable that some feature combinations would be more common than others (perhaps even identical depending on your problem domain).
