How to randomly create an outlier dataset? - machine-learning

I am trying to create an outlier dataset with 8 columns; some columns contain categorical values and the others contain positive numerical values. The data should contain only two types of data points: normal points and outliers.
Do you know of any tools, libraries, or approaches that can help me create this type of dataset automatically? I have heard that numpy can generate samples from standard distributions, but I don't think it can create categorical values.
As always, thank you very much for your help.

Foreword: you should ask yourself a very important question first: what counts as an outlier for your purposes, and then try to simulate those. You can find rough guidelines below:
Numerical values
You could easily do it by creating one dataset from some predefined distribution (say a standard normal with mean 0 and variance 1) and sampling some data points from it (say 10_000). The outliers would come from another distribution (even a Gaussian, but with a different mean and variance), say 50 points.
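For instance, a minimal numpy sketch of this idea (the number of columns and the outlier mean/variance below are arbitrary choices):

import numpy as np

rng = np.random.default_rng(42)

# 10_000 "normal" points from a standard normal distribution (mean 0, variance 1)
normal = rng.normal(loc=0.0, scale=1.0, size=(10_000, 3))

# 50 outliers from a Gaussian with a shifted mean and larger variance
outliers = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

X = np.vstack([normal, outliers])
y = np.concatenate([np.zeros(len(normal)), np.ones(len(outliers))])  # 1 = outlier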
Categorical values
It depends on the range of possible categorical values and on whether you want the outlier and non-outlier data to share the same range.
Categorical values same range
Say the categorical values are within [0, 10]. You generate them with numpy's np.random.randint over the whole range, for, say, 5 columns, so a single example would look something like:
[1, 4, 7, 9, 3]
Now outliers could have values from a narrower range contained within [0, 10], say [7, 9], so their values could be:
[7, 7, 8, 9, 8]
Given that combination it should be flagged as an outlier (with some false positives of course, since the full range [0, 10] could produce something similar in principle).
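A minimal numpy sketch of this setup (the newer Generator API is used here; the high bound is exclusive, so [0, 10] and [7, 9] inclusive become 11 and 10):

import numpy as np

rng = np.random.default_rng(0)

# non-outliers: 5 categorical columns drawn uniformly from the full range [0, 10]
normal_cat = rng.integers(low=0, high=11, size=(10_000, 5))

# outliers: same columns, but restricted to the narrower range [7, 9]
outlier_cat = rng.integers(low=7, high=10, size=(50, 5))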
Categorical values different range
This case is simpler; just use a different range and you can be sure that no non-outlier data point will take those values.
Summary
All in all, you can mix those approaches and vary the degree of separation to make the outlier detection algorithm's task harder (similar data-generating processes) or easier (features that differ greatly between the two groups).
It should be pretty easy to parametrize the above and create a function with a varying degree of difficulty. Unless you need something more complicated, don't reach for a library (of course you could make the whole idea far more elaborate).
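As a rough illustration rather than a definitive implementation, a parametrized generator matching the question's 8 columns (3 numerical + 5 categorical) could look like this; the shift and the category ranges control how easy the task is:

import numpy as np

def make_outlier_dataset(n_normal=10_000, n_outliers=50, shift=5.0,
                         cat_range=(0, 10), outlier_cat_range=(7, 9), seed=0):
    """Toy generator: 3 numerical + 5 categorical columns, label 1 marks outliers."""
    rng = np.random.default_rng(seed)
    num_normal = rng.normal(0.0, 1.0, size=(n_normal, 3))
    num_outlier = rng.normal(shift, 2.0, size=(n_outliers, 3))
    cat_normal = rng.integers(cat_range[0], cat_range[1] + 1, size=(n_normal, 5))
    cat_outlier = rng.integers(outlier_cat_range[0], outlier_cat_range[1] + 1,
                               size=(n_outliers, 5))
    X = np.vstack([np.hstack([num_normal, cat_normal]),
                   np.hstack([num_outlier, cat_outlier])])
    y = np.concatenate([np.zeros(n_normal), np.ones(n_outliers)])
    return X, y

X, y = make_outlier_dataset()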

Related

Continuous or categorical data in data science

I am building an automated cleaning process that removes null values from a dataset. I discovered a few functions such as mode, median, and mean which could be used to fill NaN values in the data. But which one should I select? If the data is categorical it has to be either mode or median, while for continuous data it has to be mean or median. So, to determine whether a column is categorical or continuous, I decided to build a machine learning classification model.
I used a few features such as:
1) standard deviation of the data
2) number of unique values
3) total number of rows
4) ratio of unique values to total rows
5) minimum value
6) maximum value
7) number of data points between the median and the 75th percentile
8) number of data points between the median and the 25th percentile
9) number of data points between the 75th percentile and the upper whisker
10) number of data points between the 25th percentile and the lower whisker
11) number of data points above the upper whisker
12) number of data points below the lower whisker
With these 12 features and around 55 training examples, I trained a logistic regression model on the normalized features to predict label 1 (continuous) or 0 (categorical).
The fun part is that it worked!!
But did I do it the right way? Is this a sound method for predicting the nature of the data? Please advise me if I could improve it further.
The data analysis seems great. Regarding the part
But which one should I select?
The mean has always been the winner in my tests. For every dataset I try out all of the options and compare accuracy.
There is a better but more time-consuming approach. If you want to take this system further, it can help.
For each column with missing data, find its nearest neighbor and use that neighbor's value as the replacement. Suppose you have N columns excluding the target: for each such column, treat it as the dependent variable and the remaining N-1 columns as independent variables, find its nearest neighbor, and then its output (the dependent variable) is the desired value for the missing attribute.
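If you don't want to build this by hand, something in the spirit of this suggestion is available off the shelf as KNNImputer in scikit-learn; a minimal sketch with toy data:

import numpy as np
from sklearn.impute import KNNImputer

# toy matrix with missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# each missing value is filled from the nearest row(s), with distances
# computed only on the columns that are present in both rows
imputer = KNNImputer(n_neighbors=1)
X_filled = imputer.fit_transform(X)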
But which one should I select? If the data is categorical it has to be either mode or median, while for continuous data it has to be mean or median.
Usually the mode is used for categorical data and the mean for continuous data, but I recently saw an article where the geometric mean was used for categorical values.
If you build a model on columns containing NaN, you can include a mean-replacement column, a median-replacement column, and also a boolean 'value is NaN' indicator column. But it is better not to use linear models in that case, since you can run into correlation between these columns.
Besides, there are many other methods for replacing NaN, for example the MICE algorithm.
Regarding the features you use: they are fine, but I would advise adding some more features related to the distribution, for example:
skewness
kurtosis
similarity to a Gaussian distribution (and to other distributions)
the number of 1D Gaussians needed to fit the column (via a GMM; this won't perform well with only 55 rows)
All of these can be computed on the original data as well as on transformed data (log, exp).
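For example, a sketch of such distribution features using scipy on a toy column (the log-normal column here is just a stand-in for one of your real columns):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
col = rng.lognormal(mean=0.0, sigma=1.0, size=55)  # stand-in for one column, 55 rows

# p-value of a normality test as a rough "similarity to a Gaussian" score
_, shapiro_p = stats.shapiro(col)

features = {
    "skewness": stats.skew(col),
    "kurtosis": stats.kurtosis(col),
    "shapiro_p": shapiro_p,
    # the same feature again on a log-transformed copy of the column
    "log_skewness": stats.skew(np.log(col)),
}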
To explain: you can have a column with many categories in it, and with the current approach it may simply look like a numerical column even though it is not numerical. A distribution-matching feature can help here.
You can also try different normalizations. RobustScaler from sklearn may work well (it can help in cases where some categories have levels very similar to 'outlying' values).
And one last piece of advice: you can use a random forest model for this and look at its feature importances. That list may give some direction for feature engineering/generation.
And, of course, taking a look at the confusion matrix and at which kinds of columns the errors happen for is also a good idea!
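A short sketch of that last idea, with random placeholder data standing in for your 55 x 12 feature matrix and labels:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(55, 12))       # placeholder for the 55 x 12 hand-crafted features
y = rng.integers(0, 2, size=55)     # placeholder labels: 1 = continuous, 0 = categorical

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(clf.feature_importances_)               # which features drive the decision
print(confusion_matrix(y, clf.predict(X)))    # where the misclassifications happen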

Approximate missing value for input given a data set

I have a data set with x attributes and y records. Given an input record which has up to x-1 missing values, how would I reasonably approximate one of the remaining missing values?
So in the example below, the input record has two values (for attribute 2 and 6, with the rest missing) and I would like to approximate a value for attribute 8.
I know missing values are dealt with through 'imputation', but I generally only find examples about pre-processing whole datasets. I'm looking for a solution which uses regression to determine the missing value and ideally makes use of a model which is built once (if possible, so I don't have to generate one each time).
The number of possibilities for which attributes are present or absent makes it seem impractical to maintain a collection of models, like linear regressions, that would cover all of the cases. The one model that seems practical to me is the one where you don't really build a model at all: Nearest Neighbors Regression. My suggestion would be to use whatever attributes you have available and compute the distance to your training points. You could use the value from the nearest neighbor or the (possibly weighted) average of several nearest neighbors. In your example, we would use only attributes 2 and 6 to compute the distance. The nearest point is the last one (3.966469, 8.911591). That point has value 6.014256 for attribute 8, so that is your estimate of attribute 8 for the new point.
Alternatively, you could use three nearest neighbors. Those are points 17, 8 and 12, so you could use the average of the values of attribute 8 for those points, or a weighted average. People sometimes use the weights 1/dist. Of course, three neighbors is just an example. You could pick another k.
This is probably better than using the global average (8.4) for all missing values of attribute 8.
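A hedged sketch of this with scikit-learn's KNeighborsRegressor, using random placeholder data instead of the question's actual table; only the observed attributes (2 and 6) are used to measure distance:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
train = rng.normal(size=(20, 8))      # placeholder training set with 8 attributes

observed_cols = [1, 5]                # attributes 2 and 6 (0-indexed)
target_col = 7                        # attribute 8

# fit only on the attributes the new record actually has
knn = KNeighborsRegressor(n_neighbors=3, weights="distance")  # weights ~ 1/dist
knn.fit(train[:, observed_cols], train[:, target_col])

new_record = np.array([[3.97, 8.91]])  # the known values for attributes 2 and 6
estimate = knn.predict(new_record)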

Reshaping Inputs that contain continuous and discrete values

The inputs I am using are 2xN, where the first 1xN row contains continuous numbers and the second 1xN row contains discrete numbers (each encoding a specific class out of 7 possible classes). I expect there to be a relation between vertically adjacent pairs.
I am looking to use a neural net for a multi-class classifier on this input, but am unsure of how to reshape my data for forward propagation in a way that makes sense.
What is a feasible way to reshape my data into 1x2N for forward propagation that makes sense?
edit:
Example input:
input_features = [[99.3, 22.1, 41.7], [1, 3, 4]]
Unless you know something more than "there might be some kind of relation", you should just flatten the array and pass it as a vector - an NN can (in theory) find such relations on its own (given enough data).
What are the other options? If you suspect there is a single relation that holds for every column, then you might want to construct a specific neural net. One option is to have a convolution of size 2x1 (a single column) in the input layer. On the other hand, if you create a large enough set of kernels, this will be able to model more complex relations too. In that case, leave it as a matrix (think of it as an image). There is nothing wrong with discrete values, as long as they are on a reasonable scale.
In general, you will actually just work with a specific wiring of the net, not a reshaping of an array (however, implementations of conv nets actually use the shape to do the work for you, as described).
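A small numpy sketch of the two layouts described above (framework-agnostic; the convolutional wiring itself would live in whatever library you use):

import numpy as np

# 2 x N input: first row continuous, second row the class id (one of 7 classes)
input_features = np.array([[99.3, 22.1, 41.7],
                           [1.0,   3.0,  4.0]])

# simplest option: flatten to a 1 x 2N vector and let the network find the pairing
flat = input_features.reshape(1, -1)            # shape (1, 6)

# alternative wiring: keep it as a 2 x N "image" with one channel, so that 2x1
# convolution kernels each see one (continuous, discrete) column pair
as_image = input_features.reshape(1, 2, -1, 1)  # (batch, height, width, channels)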

Neural Network Normalization of Nominal Data for 1 Output Neuron

I am new to machine learning and AI and started with NN recently.
I already got some information here on Stack Overflow, but at the moment I don't yet understand the logic behind everything I've gathered.
Let's take 4 nominal (but not ordinal) values [A, B, C, D] and 2 numerical values that are already normalized, [0.35, 0.55] - so 2 input neurons, one for the nominal value and one for the numerical one.
In the NN literature I mostly see that you have to use 4 input neurons for the encoding. But I don't need to predict those nominal values. I have only one output neuron, which at most represents a relationship in the way I would use it with expert systems and rules.
If I normalized them to [0.2, 0.4, 0.6, 0.8], for example, wouldn't the NN be able to distinguish between them? For the NN it's only a number, isn't it?
Naive approach and thinking:
A with 0.35 numerical leads to ideal 1.
B with 0.55 numerical leads to ideal 0.
C with 0.35 numerical leads to ideal 0.
D with 0.55 numerical leads to ideal 1.
Is there a mistake in my way of thinking about this approach?
Additional info (edit):
Those nominal values are included in the decision making (their significance, when measured with statistical tools, comes from combining them with the numerical values), depending on whether they are true or not. I know they can be encoded in binary, but the list of nominal values is a little bit larger.
Other example:
Symptom A with blood test 1 leads to diagnosis X (the ideal)
Symptom B with blood test 1 leads to diagnosis Y (the ideal)
Actually, expert systems are used for this. Symptoms are nominal values, but in combination with the blood test value you get the diagnosis. The main question, finally: do I have to encode the symptoms in a binary way, or can I replace the symptoms with numbers? If I can't replace them with numbers, why is a binary representation the only way to use them in an NN?
INPUTS
Theoretically it doesn't really matter how you encode your inputs. As long as different samples are represented by different points in the input space, it is possible to separate them with a line - and that is what the input layer (if it is linear) does: it combines the inputs linearly. However, the way the data is laid out in the input space can have a huge impact on convergence time during learning. A simple way to see this: imagine a set of lines crossing the origin in 2D space. If your data is scattered around the origin, then it is likely that some of those lines will already separate the data into parts, and few "moves" will be required, especially if the data is linearly separable. On the other hand, if your input data is dense and far from the origin, then most of the initial discrimination lines won't even "hit" the data. It will then require a large number of weight updates to reach the data, and a large number of precise steps to "cut" it into the initial categories.
OUTPUTS
If you have categories, then encoding them as binary is quite important. Imagine you have three categories: A, B and C. If you encode them with three neurons as 1;0;0, 0;1;0 and 0;0;1, then during learning, and later with noisy data, a point the network is "not sure" about can end up as 0.5;0.0;0.5 on the output layer. That makes sense if it really is something conceptually between A and C, but it is surely not B. If you instead chose one output neuron and encoded A, B and C as 1, 2 and 3, then in the same situation the network would output the average of 1 and 3, which gives you 2! So the answer would be "definitely B" - clearly wrong!
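A tiny numerical illustration of that argument (just arithmetic, not a trained network):

import numpy as np

# one-hot outputs for classes A, B, C
a = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 0.0, 1.0])
print((a + c) / 2)   # [0.5, 0.0, 0.5]: "somewhere between A and C", B stays at zero

# single integer-coded output: A=1, B=2, C=3
print((1 + 3) / 2)   # 2.0, which decodes to "definitely B" - clearly wrong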
Reference:
ftp://ftp.sas.com/pub/neural/FAQ.html

How to pre-process dataset for maximum effectiveness with LibSVM Weka implementation

So I read a paper that said that processing your dataset correctly can increase LibSVM classification accuracy dramatically...I'm using the Weka implementation and would like some help making sure my dataset is optimal.
Here are my (example) attributes:
Power Numeric (real numbers, range is from 0 to 1.5132, 9000+ unique values)
Voltage Numeric (similar to Power)
Light Numeric (0 and 1 are the only 2 possible values)
Day Numeric (1 through 20 are the possible values, equal number of each value)
Range Nominal {1,2,3,4,5} <----these are the classes
My question is: which Weka pre-processing filters should I apply to make this dataset more effective for LibSVM?
Should I normalize and/or standardize the Power and Voltage data values?
Should I use a Discretization filter on anything?
Should I be binning the Power/Voltage values into a lot smaller number of bins?
Should I make the Light value Binary instead of numeric?
Should I normalize the Day values? Does it even make sense to do that?
Should I be using the Nominal to Binary or Nominal to some thing else filter for the classes "Range"?
Please advise on these questions and anything else you think I might have missed...
Thanks in advance!!
Normalization is very important, as it influences the concept of distance which is used by SVM. The two main approaches to normalization are:
Scale each input dimension to the same interval, for example [0, 1]. This is by far the most common approach. It is necessary to prevent some input dimensions from completely dominating others. It is recommended by the LIBSVM authors in their beginner's guide (see Appendix B for examples).
Scale each instance to a given length. This is common in text mining / computer vision.
As to handling types of inputs:
Continuous: no work needed, SVM works on these implicitly.
Ordinal: treat as continuous variables. For example cold, lukewarm, hot could be modeled as 1, 2, 3 without implicitly defining an unnatural structure.
Nominal: perform one-hot encoding, e.g. for an input with N levels, generate N new binary input dimensions. This is necessary because you must avoid implicitly defining a varying distance between nominal levels. For example, modelling cat, dog, bird as 1, 2 and 3 implies that a dog and a bird are more similar than a cat and a bird, which is nonsense.
Normalization must be done after substituting inputs where necessary.
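The question is about Weka filters, but the same preprocessing expressed in scikit-learn terms looks roughly like this (the "Colour" column is a hypothetical nominal input added purely to illustrate the one-hot step; the question's own nominal attribute is the class):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# toy frame mirroring the attributes in the question
df = pd.DataFrame({
    "Power":   [0.1, 0.7, 1.5],
    "Voltage": [0.2, 0.9, 1.1],
    "Light":   [0, 1, 1],
    "Day":     [1, 7, 20],
    "Colour":  ["red", "blue", "red"],   # hypothetical nominal input, for illustration only
})

pre = ColumnTransformer([
    # continuous / ordinal / binary inputs: scale each to [0, 1]
    ("num", MinMaxScaler(), ["Power", "Voltage", "Light", "Day"]),
    # nominal inputs: one-hot encode (the resulting dummies are already in [0, 1])
    ("cat", OneHotEncoder(), ["Colour"]),
])

X = pre.fit_transform(df)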
To answer your questions:
Should I normalize and/or standardize the Power and Voltage data values?
Yes, scale all (final) input dimensions to the same interval (including dummies!).
Should I use a Discretization filter on anything?
No.
Should I be binning the Power/Voltage values into a lot smaller number of bins?
No. Treat them as continuous variables (e.g. one input each).
Should I make the Light value Binary instead of numeric?
No, SVM has no concept of binary variables and treats everything as numeric. So converting it will just lead to an extra type-cast internally.
Should I normalize the Day values? Does it even make sense to do that?
If you want to use 1 input dimension, you must normalize it just like all others.
Should I be using the Nominal to Binary or Nominal to some thing else filter for the classes "Range"?
Nominal to binary, using one-hot encoding.
