I have a dataset with some outliers that are 10 or 100 times larger than the normal values. I cannot throw out these rows, and I want to normalize the data to the interval [0, 1].
First of all, here's what I have thought of doing:
Simply rank the dataset's rows and use the ranked positions as the variable to normalize. Since the ranks are uniformly distributed, this is easy. The problem is that the actual differences between values are lost, so two values that differ a lot could get similar normalized values if there are no intermediate values in the dataset.
Use sklearn.preprocessing.RobustScaler. But I got scaled values between -0.4 and 300, which is still not useful for normalizing to this interval.
Distribute normalized values between 0 and 0.8 linearly for all values at or below the 0.8 quantile, and distribute the remaining values between 0.8 and 1.0 in a way similar to the ranking strategy above.
Run a 1D k-means to group nearby values and obtain a cluster of non-outlier values. For these values, distribute normalized values between 0 and the quantile they represent, simply by doing (value - min) / (max - min) scaled to that range, and for the remaining outlier values, distribute the range between that quantile and 1 with the ranking strategy.
Create a squashing function, like a sigmoid, and apply it to the values: smaller values remain roughly unchanged, while the outliers are pulled towards the non-outlier range. Then I normalize the result. But how should I choose this sigmoid's parameters? (A rough sketch of what I mean is below.)
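To make strategy 5 concrete, here is roughly what I have in mind, using tanh as the sigmoid-shaped function and the 0.8 quantile as the scale; both are just guesses, and the scale is exactly the parameter I don't know how to choose:

import numpy as np

def squash_normalize(values, scale_quantile=0.8):
    # squash with a sigmoid-shaped function, then min-max to [0, 1];
    # the quantile used as the scale is the parameter I'm unsure about
    values = np.asarray(values, dtype=float)
    scale = np.quantile(values, scale_quantile)
    squashed = np.tanh(values / scale)   # small values stay roughly linear,
                                         # outliers saturate close to each other near 1
    return (squashed - squashed.min()) / (squashed.max() - squashed.min())

# normal values around 10, outliers ~100x larger
print(squash_normalize([8, 9, 9.5, 9.7, 10.1, 10.2, 10.8, 11, 950, 1200]))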
First of all, I would like to get some feedback on these strategies. What do you think of them?
And also, how is this problem normally solved? Are there any references you can recommend?
Thank you =)
In a database there are time-series records of the form:
device - timestamp - temperature - min limit - max limit
device - timestamp - temperature - min limit - max limit
device - timestamp - temperature - min limit - max limit
...
For every device there are 4 hours of time-series data (at a 5-minute interval) before an alarm was raised and 4 hours of time-series data (again at a 5-minute interval) that didn't raise any alarm.
I need to use an RNN class in Python for alarm prediction. We define an alarm as the temperature going below the min limit or above the max limit.
After reading the official documentation from TensorFlow here, I'm having trouble understanding how to set up the input to the model. Should I normalise the data beforehand, and if so, how?
Reading the answers here also didn't give me a clear view of how to transform my data into an acceptable format for the RNN model.
Any help on how the X and Y in model.fit should look for my case?
If you see any other issue regarding this problem, feel free to comment on it.
PS. I have already set up Python in Docker with TensorFlow, Keras etc., in case this information helps.
You can begin with the snippet that you mention in the question.
Any help on how the X and Y in model.fit should look for my case?
X should be a numpy array of shape [num samples, sequence length, D], where D is the number of values per timestamp. I suppose D=1 in your case, because you only pass the temperature value.
y should be a vector of target values (as in the snippet), either binary (alarm/not_alarm) or continuous (e.g. max temperature deviation). In the latter case you'd need to change the sigmoid activation to something else.
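For illustration, a minimal sketch of building such an X and y, assuming a 48-step window (4 hours at 5-minute intervals) and a binary target; the segments list and its contents are made up:

import numpy as np

SEQ_LEN = 48  # 4 hours of 5-minute readings

# suppose `segments` holds (temperature_window, alarm_label) pairs,
# one per device and 4-hour window, pulled from your database
segments = [
    (np.random.normal(70, 2, SEQ_LEN), 0),   # window with no alarm
    (np.random.normal(70, 8, SEQ_LEN), 1),   # window that ended in an alarm
]

X = np.stack([temps for temps, _ in segments])[..., np.newaxis]   # (num_samples, SEQ_LEN, 1)
y = np.array([label for _, label in segments], dtype=np.float32)  # (num_samples,)

print(X.shape, y.shape)   # (2, 48, 1) (2,)
# these are what you would pass to model.fit(X, y, ...)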
Should I normalise the data beforehand
Yes, it's essential to preprocess your raw data. I see two crucial things to do here:
Normalise the temperature values with min-max scaling or standardization (wiki, sklearn preprocessing). Plus, I'd add a bit of smoothing.
Drop some fraction of the last timestamps from each of the time series to avoid an information leak (a minimal sketch of both steps follows this list).
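A minimal sketch of both steps, assuming X has the shape described above; the 10% drop fraction is an arbitrary choice:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

def preprocess(X, drop_fraction=0.1):
    # X: (num_samples, seq_len, 1) raw temperatures
    num_samples, seq_len, d = X.shape
    # 1) min-max normalise (in practice, fit the scaler on training data only)
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X.reshape(-1, d)).reshape(num_samples, seq_len, d)
    # 2) drop the last fraction of timestamps so the alarm itself doesn't leak in
    keep = int(seq_len * (1.0 - drop_fraction))
    return X_scaled[:, :keep, :], scaler

X_clean, scaler = preprocess(np.random.normal(70, 5, (10, 48, 1)))
print(X_clean.shape)   # (10, 43, 1)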
Finally, I'd say that this task is more complex than it seems. You might want to either find a good starter tutorial on time-series classification or take a course on machine learning in general. I believe you can find a better method than an RNN.
Yes, you should normalize your data. I would look at differencing by day, i.e. a difference interval of 24 hours / 5 minutes (see the short sketch below). You could also try a yearly difference, but that depends on your choice of window size (remember that RNNs don't do well with large windows). You may possibly want to use a log transformation like the user above said, but this data also seems somewhat stationary, so I could see that not being needed.
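At a 5-minute sampling rate that daily difference works out to 288 steps; a short sketch (the series itself is made up):

import numpy as np

steps_per_day = (24 * 60) // 5           # 288 five-minute steps per day
temps = np.random.normal(70, 5, 5000)    # made-up temperature series
daily_diff = temps[steps_per_day:] - temps[:-steps_per_day]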
For your model.fit, you are technically training the equivalent of a language model, where you predict the next output. So your inputs will be the preceding x values and preceding normalized y values of whatever window size you choose, and your target value will be the normalized output at a given time step t. Just so you know, a 1D conv net is good for classification, but good call on the RNN because of the temporal aspect of temperature spikes.
Once you have trained a model on the x values and normalized y values, and can tell that it is actually learning (converging), you can use model.predict with the preceding x values and preceding normalized y values. Take the output and un-normalize it to get an actual temperature value, or just keep the normalized value and feed it back into the model to get the time+2 prediction.
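A rough sketch of that feedback loop; `model`, `scaler` and the window size are placeholders for whatever you trained with:

import numpy as np

def forecast(model, history, scaler, n_steps, window=48):
    # `history` holds the most recent normalized values;
    # each prediction is appended and fed back in as the newest input
    seq = list(history[-window:])
    preds = []
    for _ in range(n_steps):
        x = np.array(seq[-window:]).reshape(1, window, 1)
        y_norm = float(model.predict(x, verbose=0)[0, 0])
        preds.append(y_norm)
        seq.append(y_norm)
    # un-normalize to get actual temperature values
    return scaler.inverse_transform(np.array(preds).reshape(-1, 1)).ravel()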
I have a 2D time-series dataset with integer outputs ranging from 1,000,000 to 2,000,000 on any given day. Of course my data is not limited to that range, as I can sum up to weekly values, which increases the range to over 10,000,000.
I'm able to achieve RMSE = 0.02 whenever I normalize my data, but when I feed the raw data (in the millions range) into the algorithm, the RMSE can end up in the 30k - 150k range.
Why does one version give an RMSE "global minimum" of 0.02, while the other outputs values in much higher ranges? I've been testing with AdaDelta.
The definition of RMSE is:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$
The scale of this value directly depends on the scale of the predictions and actuals, so it's quite normal that you get a higher RMSE value when you don't normalize the dataset.
This is why normalization is important, as it lets us compare error metrics across models and datasets.
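A quick illustration of that scale dependence with made-up numbers:

import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.2e6, 1.5e6, 1.8e6])
y_pred = y_true + np.array([3e4, -4.5e4, 6e4])     # same relative errors either way

print(rmse(y_true, y_pred))                        # ~46,600 on the raw scale
scale = y_true.max() - y_true.min()                # simple min-max style scaling
print(rmse(y_true / scale, y_pred / scale))        # ~0.078 after scaling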
I'm working on an advanced vision system which consists of two static cameras (used for obtaining accurate 3D object locations) and a targeting device. The object detection and stereovision modules are already done. Unfortunately, due to the delay of the targeting system, it is necessary to develop a proper prediction module.
I did some tests using a Kalman filter, but it is not accurate enough.
kalman = cv2.KalmanFilter(6,3,0)
...
kalman.statePre[0,0] = x
kalman.statePre[1,0] = y
kalman.statePre[2,0] = z
kalman.statePre[3,0] = 0
kalman.statePre[4,0] = 0
kalman.statePre[5,0] = 0
kalman.measurementMatrix = np.array([[1,0,0,0,0,0],[0,1,0,0,0,0],[0,0,1,0,0,0]],np.float32)
kalman.transitionMatrix = np.array([[1,0,0,1,0,0],[0,1,0,0,1,0],[0,0,1,0,0,1],[0,0,0,1,0,0],[0,0,0,0,1,0],[0,0,0,0,0,1]],np.float32)
kalman.processNoiseCov = np.array([[1,0,0,0,0,0],[0,1,0,0,0,0],[0,0,1,0,0,0],[0,0,0,1,0,0],[0,0,0,0,1,0],[0,0,0,0,0,1]],np.float32) * 0.03
kalman.measurementNoiseCov = np.array([[1,0,0],[0,1,0],[0,0,1]],np.float32) * 0.003
I noticed that the time period between two frames is different each time (due to varying detection time).
How could I use the last timestamp difference as an input? (Transition matrix? controlParam?)
I want to determine the prediction time, e.g. predict the position of the object in 0.5 s or 1.5 s.
I could provide example input 3D points.
Thanks in advance
1. How could I use the last timestamp difference as an input? (Transition matrix? controlParam?)
The step size is controlled through the transition (prediction) matrix. You also need to adjust the process noise covariance matrix to control the growth of uncertainty.
You are using a constant-velocity prediction model, so p_x(t+dt) = p_x(t) + v_x(t)·dt predicts the position in X with a time step dt (and the same for the Y and Z coordinates). In that case, your prediction matrix should be:
kalman.transitionMatrix = np.array([[1,0,0,dt,0,0],[0,1,0,0,dt,0],[0,0,1,0,0,dt],[0,0,0,1,0,0],[0,0,0,0,1,0],[0,0,0,0,0,1]],np.float32)
I left the process noise cov. formulation as an exercise. Be careful with squaring or not squaring the dt term.
2. I want to determine the prediction time, e.g. I want to predict the position of the object in 0.5 s or 1.5 s
You can follow two different approaches:
Use a small fixed dt (e.g. 0.02 s for 50 Hz) and calculate predictions in a loop until you reach your goal (e.g. you get a new observation from your cameras); see the sketch after this list.
Adjust the prediction and process noise matrices online to the desired dt (0.5 / 1.5 s in your question) and execute a single prediction step.
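A rough sketch of the first approach; note that repeated predict() calls advance the filter's internal state, so you may want to save and restore statePost and errorCovPost if you still intend to correct with the next real measurement:

def predict_ahead(kalman, horizon, dt=0.02):
    # run repeated predict() steps of size dt until `horizon` seconds are covered;
    # assumes transitionMatrix (and process noise) were already set up for this dt
    state = None
    for _ in range(int(round(horizon / dt))):
        state = kalman.predict()
    return state[:3].ravel()   # predicted x, y, z

# e.g. position 0.5 s ahead:
# pos = predict_ahead(kalman, 0.5)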
If you are asking about how to anticipate the detection time of your cameras, that should be a different question and I am afraid I can't help you :-)
According to "Introduction to Neural Networks with Java" by Jeff Heaton, the input to a Kohonen neural network must consist of values between -1 and 1.
It is possible to normalize inputs where the range is known beforehand:
For instance, RGB (125, 125, 125), where the range is known to be between 0 and 255 (a small snippet doing both steps is below):
1. Divide by 255: 125/255 ≈ 0.5 >> (0.5, 0.5, 0.5)
2. Multiply by two and subtract one: (0.5 × 2) − 1 = 0 >> (0, 0, 0)
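In code, both steps at once (just the arithmetic above, generalized to any known range):

import numpy as np

def to_minus1_1(x, lo=0.0, hi=255.0):
    # map values with a known range [lo, hi] into [-1, 1]
    return (np.asarray(x, dtype=float) - lo) / (hi - lo) * 2.0 - 1.0

print(to_minus1_1([125, 125, 125]))   # roughly (0, 0, 0), since 125/255 is about 0.5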
The question is how we can normalize inputs whose range is unknown, like height or weight.
Also, some other papers mention that the input must be normalized to values between 0 and 1. Which is the proper way, "-1 and 1" or "0 and 1"?
You can always use a squashing function to map an infinite interval to a finite interval. E.g. you can use tanh.
You might want to use tanh(x * l) with a manually chosen l, though, in order not to put too many objects in the same region. So if you have a good guess that the maximal values of your data are +/- 500, you might want to use tanh(x / 1000) as a mapping, where x is the value of your object. It might even make sense to subtract your guess of the mean from x, yielding tanh((x - mean) / max).
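For instance, with the same made-up guesses (maximum around +/- 500, mean around 0):

import numpy as np

def squash(x, guess_max=500.0, guess_mean=0.0):
    # tanh maps the whole real line into (-1, 1); dividing by 2 * guess_max
    # keeps typical values out of the saturated region
    return np.tanh((np.asarray(x, dtype=float) - guess_mean) / (2.0 * guess_max))

print(squash([-400.0, 0.0, 50.0, 2000.0]))   # extreme values saturate towards +/- 1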
From what I know about Kohonen SOMs, the specific normalization does not really matter.
Well, it might matter through specific choices for the values of the learning algorithm's parameters, but the most important thing is that the different dimensions of your input points are of the same magnitude.
Imagine that each data point is not a pixel with three RGB components but a vector of statistical data for a country, e.g. area, population, etc.
It is important for the convergence of the learning part that all these numbers are of the same magnitude.
Therefore, it does not really matter if you don't know the exact range; you just have to know approximately the characteristic amplitude of your data.
For weight and height, I'm sure that if you divide them by 200 kg and 3 meters respectively, all your data points will fall in the ]0, 1] interval. You could even use 50 kg and 1 meter; the important thing is that all coordinates would be of order 1.
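A small sketch of that idea with made-up (weight, height) data; the exact constants don't matter, only that every column ends up of order 1:

import numpy as np

# made-up data points: (weight in kg, height in m)
people = np.array([[80.0, 1.75],
                   [60.0, 1.62],
                   [95.0, 1.90]])

scaled = people / np.array([200.0, 3.0])   # divide each column by a characteristic magnitude
print(scaled)                              # all coordinates now of order 1, within ]0, 1]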
Finally, you could consider running some linear analysis tool like POD on the data, which would automatically give you a way to normalize your data and a subspace for the initialization of your map.
Hope this helps.