I have a data set with x attributes and y records. Given an input record which has up to x-1 missing values, how would I reasonably approximate one of the remaining missing values?
So in the example below, the input record has two values (for attributes 2 and 6, with the rest missing) and I would like to approximate a value for attribute 8.
I know missing values are dealt with through 'imputation', but the examples I find are generally about pre-processing whole datasets. I'm looking for a solution which uses regression to determine the missing value and, ideally, makes use of a model which is built once (so that, if possible, I don't have to generate one each time).
The number of possible combinations of present and absent attributes makes it impractical to maintain a collection of models (such as linear regressions) that would cover all of the cases. The one approach that seems practical to me is the one where you don't really build a model at all: nearest neighbors regression.

My suggestion would be to use whatever attributes you have available and compute the distance to your training points. You could use the value from the nearest neighbor or the (possibly weighted) average of several nearest neighbors. In your example, we would use only attributes 2 and 6 to compute distance. The nearest point is the last one (3.966469, 8.911591). That point has value 6.014256 for attribute 8, so that is your estimate of attribute 8 for the new point.
Alternatively, you could use three nearest neighbors. Those are points 17, 8 and 12, so you could use the average of the values of attribute 8 for those points, or a weighted average. People sometimes use the weights 1/dist. Of course, three neighbors is just an example. You could pick another k.
This is probably better than using the global average (8.4) for all missing values of attribute 8.
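A minimal sketch of this idea in Python (numpy only); the helper name, the choice of k = 3, and the inverse-distance weighting are just illustrative choices, not prescriptions:

```python
import numpy as np

def knn_impute(X_train, y_train, observed_idx, observed_vals, k=3):
    """Estimate a missing target attribute from the k nearest training rows,
    measuring distance only on the attributes the new record actually has."""
    # Euclidean distance using only the observed columns (e.g. attributes 2 and 6).
    d = np.linalg.norm(X_train[:, observed_idx] - np.asarray(observed_vals), axis=1)
    nearest = np.argsort(d)[:k]
    w = 1.0 / (d[nearest] + 1e-12)           # inverse-distance weights (1/dist)
    return np.average(y_train[nearest], weights=w)

# Hypothetical usage: X_train holds the training records, y_train their attribute-8
# values, and the new record supplies values v2 and v6 for attributes 2 and 6
# (0-based columns 1 and 5):
# estimate = knn_impute(X_train, y_train, observed_idx=[1, 5], observed_vals=[v2, v6], k=3)
```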
Related
I am currently trying to find a machine learning algorithm that can predict about 5-15 parameters used in a mathematical model (MM). The MM has 4 different ordinary differential equations (ODEs), and a few more will be added, so more parameters will be needed. Most of the parameters can be measured, but others need to be guessed. We know all 15 parameters, but we want the computer to guess 5 or even 10 of them. To test whether guessed parameters are correct, we plug them into the MM and solve the ODEs with a numerical method. We then calculate the error between the run of the model with the parameters we know (and want to guess) and the run of the MM with the guessed parameters. The ODEs are evaluated many times: each step represents one minute of real time and we simulate 24 hours, so 1440 calculations.
Currently we are using a particle filter to guess the parameters. This works okay, but we want to see whether there are better methods out there to guess parameters in a model. The particle filter takes a random value for each parameter that needs to be guessed, drawn from a range we know for that parameter, e.g. 0.001 - 0.01.
If you can run a lot of full simulations (tens of thousands) you can try black-box optimization. I'm not sure if black-box is the right approach for you (I'm not familiar with particle filters). But if it is, CMA-ES is a clear match here and easy to try.
You have to specify a loss function (e.g. the total sum of squared errors over a whole simulation) and an initial guess (mean and sigma) for your parameters. Among black-box algorithms, CMA-ES is a well-established baseline. It is hard to beat if you have only a few (at most a few hundred) continuous parameters and no gradient information. However, anything less black-box that can, for example, exploit the ODE structure of your problem will do better.
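A minimal sketch of what that could look like with the cma Python package (pip install cma); the toy quadratic loss below merely stands in for your real 1440-step ODE simulation error, and the 10-parameter setup, bounds, and starting values are placeholders:

```python
import numpy as np
import cma  # pip install cma

# Stand-in for the real objective: in practice this would run the full 1440-step
# ODE simulation with the candidate parameters and return the total squared error
# against the reference run. A toy quadratic keeps the sketch self-contained.
TRUE_PARAMS = np.linspace(0.002, 0.009, 10)
def loss(params):
    return float(np.sum((np.asarray(params) - TRUE_PARAMS) ** 2))

x0 = np.full(10, 0.005)      # initial guess for 10 parameters
sigma0 = 0.002               # initial step size, roughly a quarter of the known range
es = cma.CMAEvolutionStrategy(x0, sigma0, {"bounds": [0.001, 0.01]})
while not es.stop():
    candidates = es.ask()                           # sample a population of parameter vectors
    es.tell(candidates, [loss(c) for c in candidates])
print(es.result.xbest)       # best parameter vector found
```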
I am trying to create an outlier dataset with 8 columns; some columns contain categorical values and others contain positive numerical values. The data contains only two types of data points: normal points and outliers.
I wonder whether you know of any tools, libraries, or other ways that can help me create this type of dataset automatically. I hear that numpy has tools to generate data from standard distributions, but I don't think it can create categorical values.
And as always, thank you so much for your help.
Foreword: you should first ask yourself a very important question: what is an outlier according to you? Then try to simulate exactly that. Rough guidelines below:
Numerical values
You could easily do it by drawing the normal points from some predefined distribution (say a standard normal with mean 0 and variance 1), creating, say, 10_000 of them. The outliers would come from another distribution (also Gaussian, but with a different mean and variance), say 50 points.
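A sketch of that numerical part with numpy; the shifted mean of 5 and scale of 2 for the outliers are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(10_000, 3))   # "normal" rows
outliers = rng.normal(loc=5.0, scale=2.0, size=(50, 3))            # shifted distribution
X = np.vstack([normal_points, outliers])
y = np.concatenate([np.zeros(10_000), np.ones(50)])                # 1 marks an outlier
```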
Categorical values
This depends on the number of possible categorical values and on whether you want the outlier and non-outlier data to lie within the same range.
Categorical values same range
Say the categorical values lie within [0, 10]. You generate the normal points with numpy's np.random.randint over the whole range, for, say, 5 columns, so one example would look something like:
[1, 4, 7, 9, 3]
Now the outliers could take values from a narrower sub-range of [0, 10], say [7, 9], so their values could be:
[7, 7, 8, 9, 8]
Given that combination, it should be flagged as an outlier (with some false positives of course, as sampling from [0, 10] could produce something similar in principle).
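A sketch of this same-range case with np.random.randint (whose upper bound is exclusive, hence the 11 and 10):

```python
import numpy as np

# Normal rows: 5 categorical columns drawn uniformly from the full range [0, 10].
normal_cat = np.random.randint(0, 11, size=(10_000, 5))
# Outlier rows: same columns, but restricted to the narrower sub-range [7, 9].
outlier_cat = np.random.randint(7, 10, size=(50, 5))
```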
Categorical values different range
This case is simpler: just use a different range for the outliers and you can be sure no non-outlier data point will take those values.
Summary
All in all, you can mix those approaches and vary the degree of overlap to make the outlier-detection algorithm's task harder (similar data-generating processes) or easier (features that differ greatly between the two groups).
It should be pretty easy to parametrize the above and create a function with a varying degree of difficulty. Unless you need something more complicated, don't go for a library (although you could of course make the whole idea far more elaborate).
I am building an automated cleaning process that cleans null values from a dataset. I discovered a few functions, like mode, median, and mean, which could be used to fill NaN values in the given data. But which one should I select? If the data is categorical it has to be either the mode or the median, while for continuous data it has to be the mean or the median. So, to determine whether data is categorical or continuous, I decided to build a machine learning classification model.
I took a few features, like the following (a sketch of how they can be computed is shown after the list):
1) standard deviation of the data
2) number of unique values in the data
3) total number of rows in the data
4) ratio of unique values to total rows
5) minimum value of the data
6) maximum value of the data
7) number of data points between the median and the 75th percentile
8) number of data points between the median and the 25th percentile
9) number of data points between the 75th percentile and the upper whisker
10) number of data points between the 25th percentile and the lower whisker
11) number of data points above the upper whisker
12) number of data points below the lower whisker
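For reference, a minimal sketch (with pandas) of how such features could be computed for a single column; the 1.5 * IQR whisker definition is an assumption, since whiskers are not defined above:

```python
import pandas as pd

def column_features(col: pd.Series) -> dict:
    """Compute the 12 summary features above for one (numeric) column."""
    col = col.dropna()
    q1, med, q3 = col.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_whisker, upper_whisker = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed whisker rule
    return {
        "std": col.std(),
        "n_unique": col.nunique(),
        "n_rows": len(col),
        "unique_ratio": col.nunique() / len(col),
        "min": col.min(),
        "max": col.max(),
        "median_to_q3": ((col >= med) & (col <= q3)).sum(),
        "q1_to_median": ((col >= q1) & (col <= med)).sum(),
        "q3_to_upper_whisker": ((col > q3) & (col <= upper_whisker)).sum(),
        "lower_whisker_to_q1": ((col >= lower_whisker) & (col < q1)).sum(),
        "above_upper_whisker": (col > upper_whisker).sum(),
        "below_lower_whisker": (col < lower_whisker).sum(),
    }
```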
With these 12 features and around 55 training examples, I used a logistic regression model on the normalized features to predict label 1 (continuous) or 0 (categorical).
Fun part is it worked!!
But, did I do it the right way? Is it a correct method to predict nature of data? Please advise me if I could improve it further.
The data analysis seems awesome. For the part
But which one should I select?
The mean has always been the winner as far as I have tested. For every dataset, I run the test for all the cases and compare accuracy.
There is a better approach, though it is a bit more time consuming. If you want to take this system forward, it can help.
For each column with missing data, find the record's nearest neighbor and replace the missing entry with that neighbor's value. Suppose you have N columns excluding the target; for each column with missing values, treat it as the dependent variable and the remaining N-1 columns as independent variables. Then find the nearest neighbor of the incomplete record, and that neighbor's value of the dependent variable is the desired value for the missing attribute.
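A minimal sketch of that column-by-column idea using sklearn's KNeighborsRegressor (n_neighbors=1 mirrors the single nearest neighbor described above); it assumes the other columns are numeric and complete for the rows involved:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

def knn_fill_column(df: pd.DataFrame, target_col: str, n_neighbors: int = 1) -> pd.DataFrame:
    """Fill NaNs in one column using the rows where it is known: treat that column
    as the dependent variable and all other columns as independent features."""
    df = df.copy()
    features = [c for c in df.columns if c != target_col]
    known = df[df[target_col].notna()]
    missing = df[df[target_col].isna()]
    if len(missing) == 0:
        return df
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    model.fit(known[features], known[target_col])
    df.loc[missing.index, target_col] = model.predict(missing[features])
    return df
```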
But which one should I select? If the data is categorical it has to be either the mode or the median, while for continuous data it has to be the mean or the median.
Usually, for categorical data the mode is used; for continuous data, the mean. But I recently saw an article where the geometric mean was used for categorical values.
If you build a model that uses columns with NaN, you can include columns with mean replacement, median replacement, and also a boolean 'is NaN' indicator column. But it is better not to use linear models in this case, since you can run into correlation between these columns.
Besides, there are many other methods to replace NaN, for example the MICE algorithm.
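For reference, a sketch of both ideas in sklearn: SimpleImputer with an added 'is NaN' indicator column, and IterativeImputer, sklearn's (still experimental) MICE-style imputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, activates the import below
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Mean replacement plus a boolean "was NaN" indicator column for each feature with missing values.
mean_with_flag = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

# MICE-style imputation: each feature with missing values is modelled from the others, iteratively.
mice_like = IterativeImputer(random_state=0).fit_transform(X)
```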
Regarding the features you use: they are OK, but I'd advise adding some more features related to the distribution, for example:
skewness
kurtosis
similarity to Gaussian Distribution (and other distributions)
the number of 1D Gaussians you need to fit your column (a GMM; it won't perform well for 55 rows)
All of these you can compute both on the raw data and on transformed data (log, exp); see the sketch below.
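A sketch of how such distribution features could be computed with scipy and sklearn; the BIC-based component count is one possible way to realise the "number of 1D Gaussians" idea:

```python
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

def distribution_features(values) -> dict:
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    log_values = np.log(values) if (values > 0).all() else values  # log only if it makes sense
    feats = {
        "skewness": stats.skew(values),
        "kurtosis": stats.kurtosis(values),
        "normality_pvalue": stats.normaltest(values).pvalue,       # similarity to a Gaussian
        "skewness_log": stats.skew(log_values),
    }
    # Rough "how many 1D Gaussians" feature: pick the GMM component count by BIC.
    col = values.reshape(-1, 1)
    bics = [GaussianMixture(n_components=k, random_state=0).fit(col).bic(col) for k in (1, 2, 3)]
    feats["n_gaussians"] = int(np.argmin(bics)) + 1
    return feats
```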
To explain: you can have a column with many categories inside. With the old approach it may simply look like a numerical column, even though it is not numerical. A distribution-matching approach can help here.
You can also try different normalizations. RobustScaler from sklearn may work well (it can help in cases where categories have levels very similar to outlying values).
And one last piece of advice: you can use a random forest model for this task and look at the important features. That list may give some direction for feature engineering/generation.
And, of course, taking a look at the confusion matrix and seeing for which features the errors happen is also a good idea!
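A sketch of those last two suggestions (random forest importances plus a cross-validated confusion matrix); the data here is a purely hypothetical stand-in for your ~55 feature rows:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Hypothetical stand-in data: one row per column of your original dataset, described by
# the engineered features above; y = 1 (continuous) / 0 (categorical).
rng = np.random.default_rng(0)
X = rng.normal(size=(55, 12))
y = rng.integers(0, 2, size=55)
feature_names = [f"feature_{i}" for i in range(12)]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, importance in sorted(zip(feature_names, forest.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {importance:.3f}")

# Cross-validated predictions keep the confusion matrix honest with only ~55 examples.
y_pred = cross_val_predict(RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5)
print(confusion_matrix(y, y_pred))
```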
I am working on k-means clustering.
I have a 3D dataset with the attributes no. of days, frequency, and food.
-> day is normalized by mean & standard deviation (SD), i.e. standardization, which gives me a range of about [-2, 14]
-> frequency and food, which are NOMINAL data in my dataset, are normalized by dividing by the max ( x/max(x) ), which gives me a range of [0, 1]
The problem is that k-means only considers the day axis for grouping, since there are obvious gaps between points on this axis, and it almost ignores the other two, frequency and food (I think because of the negligible gaps in those dimensions).
If I apply k-means on the day axis alone (1D), I get exactly the same result as when I apply it in 3D (days, frequency, food).
(Before, I used x/max(x) for days as well, but the results were not acceptable.)
So I want to know: is there any way to normalize the other two nominal attributes, frequency and food, so that the scaling is fair relative to the DAY axis?
food => 1,2,3
frequency => 1-36
The point of normalization is not just to get the values small.
The purpose is to have comparable value ranges - something which is really hard for attributes of different units, and may well be impossible for nominal data.
For your kind of data, k-means is probably the worst choice, because k-means relies on continuous values to work. If you have nominal values, it usually gets stuck easily. So my main recommendation is to not use k-means.
For k-means to work on your data, a difference of 1 must mean the same thing in every attribute. So a 1-day difference = the difference between food 1 and food 2. And because k-means is based on squared errors, the difference from food 1 to food 3 counts 4x as much as the difference from food 1 to food 2.
Unless you have above property, don't use k-means.
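To make the "comparable value ranges" point concrete, here is a small sketch with made-up numbers in your three columns; z-scoring every column puts them on the same footing, though it does not fix the nominal-attribute problem discussed above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows of [days, frequency, food] after the current scaling:
# days is standardized (large spread), frequency and food are divided by their max (tiny spread).
X = np.array([[-1.8, 0.11, 0.33],
              [ 0.5, 0.25, 0.66],
              [13.2, 0.08, 0.33],
              [ 6.1, 0.97, 1.00]])

print(X.var(axis=0))                          # the day column dominates the variance...
X_scaled = StandardScaler().fit_transform(X)  # ...z-scoring every column equalizes it
print(X_scaled.var(axis=0))                   # now each column has variance 1
```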
You can try to use the Value Difference Metric, VDM (or any variant), to convert pretty much every nominal attribute you encounter into a valid numeric representation. And after that you can just apply standardisation to the whole dataset as usual.
The original definition is here:
http://axon.cs.byu.edu/~randy/jair/wilson1.html
Although it should be easy to find implementations for every common language elsewhere.
N.B. for ordered nominal attributes such as your 'frequency' most of the time it is enough to just represent them as integers.
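A rough sketch of a VDM lookup table; note that VDM is defined with respect to a class label, so this assumes you have (or construct) one, and uses q = 2 as is common:

```python
import numpy as np
import pandas as pd

def vdm_table(attribute: pd.Series, labels: pd.Series, q: int = 2) -> pd.DataFrame:
    """Pairwise Value Difference Metric between the values of one nominal attribute:
    vdm(x, y) = sum over classes c of |P(c|x) - P(c|y)|**q."""
    # P(class | attribute value), one row per attribute value.
    cond = pd.crosstab(attribute, labels, normalize="index")
    values = cond.index
    dist = pd.DataFrame(0.0, index=values, columns=values)
    for x in values:
        for y in values:
            dist.loc[x, y] = np.sum(np.abs(cond.loc[x] - cond.loc[y]) ** q)
    return dist

# Hypothetical usage: food codes 1-3 against some known class label.
food = pd.Series([1, 1, 2, 3, 3, 3, 2])
label = pd.Series(["a", "a", "b", "b", "a", "b", "b"])
print(vdm_table(food, label))
```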
I'm trying to find ways to normalize my dataset (represented as a matrix with documents as rows and columns as features) and I came across a technique called feature scaling. I found a Wikipedia article on it here.
One of the methods listed is Standardization which says "Feature standardization makes the values of each feature in the data have zero-mean and unit-variance." What does that mean (no pun intended)?
In this method, "we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation." When they say 'subtract the mean', is it the mean of the entire matrix or the mean of the column pertaining to that feature?
Also, if this feature scaling method is applied, does the mean not have to be subtracted from columns when performing Principal Component Analysis (PCA) on the data?
The basic idea is to do a simple (and reversible) transformation on your dataset to make it easier to handle. You are subtracting a constant from each column and then dividing each column by a (different) constant. Those constants are column-specific.
When they say 'subtract the mean', is it the mean of the entire matrix
or the mean of the column pertaining to that feature?
The mean of the column pertaining to that feature.
...does the mean not have to be subtracted from columns when performing Principal Component Analysis (PCA) on the data?
Correct. PCA requires data with a mean of zero. Usually this is enforced by subtracting the mean as a first step. If the mean has already been subtracted, that step is not required. However, there is no harm in performing the "subtract the mean" operation twice, because the second time the mean will already be zero, so nothing will change. Formally, we might say that standardization is idempotent.
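A small sketch of the column-wise standardization described above, including the idempotence remark (standardizing already-standardized data changes nothing, up to floating point):

```python
import numpy as np

def standardize(X):
    """Subtract each column's mean and divide by each column's standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(100, 5))
Z = standardize(X)
print(Z.mean(axis=0).round(10))        # ~0 for every column
print(Z.std(axis=0).round(10))         # 1 for every column
print(np.allclose(standardize(Z), Z))  # True: applying it again changes nothing
```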
From looking at the article, my understanding is that you would subtract the mean of that feature. This gives you data for the feature with the same layout, but normalized (centred around zero).
Imagine you added data for a new feature. You're probably going to want the data for your original features to remain the same, and not be influenced by the new feature.
I guess you would still get a "standardized" range of values if you subtracted the mean of the whole data set, but that would be something different - you're probably more interested in how the data of a single feature lies around its mean.
You could also have a look (or ask the question) on math.stackexchange.com.