time series forecasting using Support Vector Machine - time-series

Which attributes of a time series are used when forecasting with an SVM? I have two values: the date and the value at that date. For the class, I already know that I can use 1 and -1 for when the price goes up or down, but I still don't know how to plot the time series to calculate the hyperplane.

There are some papers that show some ways to do it:
Financial time series forecasting using support vector machine
Using Support Vector Machines in Financial Time Series Forecasting
Financial Forecasting Using Support Vector Machines
I really recommend that you go through the existing literature, but just for fun I will describe an easy way (probably not the best) to do it.
Let's say you have N pairs (t_i, y_ti), where t_i is the particular date/time of the pair and y_ti its corresponding value. The pairs are sorted by their time component.
Let's say you want to predict whether, given a new date/time, the corresponding unknown value will go up or down (notice that you could also use regression and instead try to predict the value itself).
Then we could train a model with a training set like this:
Input                      Value
=======================    ==================================
y_t0, y_t1, ..., y_ti-1    1 if y_ti > y_ti-1, -1 otherwise
y_t1, y_t2, ..., y_ti      1 if y_ti+1 > y_ti, -1 otherwise
y_t2, y_t3, ..., y_ti+1    1 if y_ti+2 > y_ti+1, -1 otherwise
y_t3, y_t4, ..., y_ti+2    1 if y_ti+3 > y_ti+2, -1 otherwise
y_t4, y_t5, ..., y_ti+3    1 if y_ti+4 > y_ti+3, -1 otherwise
Basically, you will be training the algorithm to make an educated guess of the next "tick" in the future by giving it a glimpse of the past. Once your model is trained, to make a prediction you feed it the N values (where N is the number of values you used as input in your training phase) that precede the value you want to predict.
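The sliding-window training set described above could be built roughly like this. It is only a sketch: the function name `make_windows`, the window length, and the sample prices are made up for illustration.

```python
def make_windows(values, window):
    """Build (inputs, labels) pairs from a chronologically sorted series.

    Each input is `window` consecutive values; the label is 1 if the value
    right after the window is higher than the last value in it, else -1.
    """
    inputs, labels = [], []
    for start in range(len(values) - window):
        past = values[start:start + window]
        next_value = values[start + window]
        inputs.append(past)
        labels.append(1 if next_value > past[-1] else -1)
    return inputs, labels


# Hypothetical prices, already sorted by date.
prices = [10.0, 10.5, 10.2, 10.8, 11.0, 10.9, 11.3]
X, y = make_windows(prices, window=3)

for row, label in zip(X, y):
    print(row, "->", label)
```

The resulting `X` and `y` have the usual samples-by-features layout, so they could be handed to an SVM classifier (for example scikit-learn's `SVC().fit(X, y)`, if that library is what you use).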

Related

Setting correct input for RNN

In a database there are time-series data with records:
device - timestamp - temperature - min limit - max limit
device - timestamp - temperature - min limit - max limit
device - timestamp - temperature - min limit - max limit
...
For every device there are 4 hours of time series data (at 5-minute intervals) before an alarm was raised and 4 hours of time series data (again at 5-minute intervals) that didn't raise any alarm. This graph gives a better picture of the data for every device:
I need to use an RNN in Python for alarm prediction. We define an alarm as the temperature going below the min limit or above the max limit.
After reading the official TensorFlow documentation here, I'm having trouble understanding how to set the input to the model. Should I normalise the data beforehand, and if so, how?
Reading the answers here also didn't give me a clear view of how to transform my data into an acceptable format for the RNN model.
Any help on how the X and Y in model.fit should look like for my case?
If you see any other issue with this problem, feel free to comment on it.
PS. I have already set up Python in Docker with TensorFlow, Keras etc., in case this information helps.
You can begin with a snippet that you mention in the question.
Any help on how the X and Y in model.fit should look like for my case?
X should be a NumPy array of shape [num samples, sequence length, D], where D is the number of values per timestamp. I suppose D=1 in your case, because you only pass the temperature value.
y should be a vector of target values (as in the snippet). Either binary (alarm/not_alarm), or continuous (e.g. max temperature deviation). In the latter case you'd need to change sigmoid activation for something else.
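To make the shape concrete, here is a small sketch using plain nested lists. The helper names and the two temperature sequences are invented for illustration; the sequence length of 48 follows from 4 hours at 5-minute intervals.

```python
def to_rnn_input(sequences):
    """Wrap each scalar reading in a list so every timestamp has D=1 features."""
    return [[[reading] for reading in seq] for seq in sequences]


def shape(nested):
    """Return (num_samples, sequence_length, D) for a rectangular nested list."""
    return (len(nested), len(nested[0]), len(nested[0][0]))


# 4 hours at 5-minute intervals gives 48 readings per sequence.
SEQ_LEN = 4 * 60 // 5

# Hypothetical sequences: one drifting upwards before an alarm, one flat.
alarm_seq = [20.0 + 0.1 * t for t in range(SEQ_LEN)]
normal_seq = [20.0 for _ in range(SEQ_LEN)]

X = to_rnn_input([alarm_seq, normal_seq])
y = [1, 0]  # binary targets: 1 = alarm followed, 0 = no alarm

print(shape(X), len(y))
```

Before passing it to `model.fit` you would typically convert the nested list to an array, e.g. `numpy.asarray(X, dtype="float32")`.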
Should i normalise the data beforehand
Yes, it's essential to preprocess your raw data. I see 2 crucial things to do here:
Normalise temperature values with min-max or standardization (wiki, sklearn preprocessing). Plus, I'd add a bit of smoothing.
Drop some fraction of last timestamps from all of the time-series to avoid information leak.
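As a rough illustration of the two normalisation options, the sketch below uses only the standard library. The function names and the sample temperatures are made up; in practice you would probably compute the scaling parameters on the training data only and reuse them for validation and test data.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


temps = [18.0, 20.0, 22.0, 26.0]  # hypothetical readings
print(min_max_scale(temps))
print(standardize(temps))
```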
Finally, I'd say that this task is more complex than it seems to be. You might want to either find a good starter tutorial on time-series classification, or a course on machine learning in general. I believe you can find a better method than RNN.
Yes, you should normalize your data. I would look at differencing by day, i.e. a difference interval of 24 hours / 5 minutes. You can also try a yearly difference, but that depends on your choice of window size (remember that RNNs don't do well with large windows). You may want to use a log-transformation like the above user said, but the data also seems to be somewhat stationary, so I could see that not being needed.
For your model.fit, you are technically training the equivalent of a language model, where you predict the next output. So your inputs will be the preceding x values and preceding normalized y values for whatever window size you choose, and your target value will be the normalized output at a given time step t. Just so you know, a 1-D conv net is good for classification, but good call on the RNN because of the temporal aspect of temperature spikes.
Once you have trained a model on the x values and normalized y values and can tell that it is actually learning (converging), you can use model.predict with the preceding x values and preceding normalized y values. Take the output and un-normalize it to get an actual temperature value, or just keep the normalized value and feed it back into the model to get the time+2 prediction.

Estimating both the category and the magnitude of output using neural networks

Let's say I want to predict which courses a final-year student will take and which grades they will receive in those courses. We have data on previous students' courses and grades for each year (not just the final year) to train with. We also have the grades and courses from previous years for the students we want to estimate the results for. I want to use a recurrent neural network with long short-term memory to solve this problem. (I know this problem can be solved by regression, but I specifically want to see whether a neural network can solve it properly.)
The way I want to set up the output (label) space is to have a feature for each of the possible courses a student can take, with a value between 0 and 1 in each entry describing whether the student will attend the class (if not, the entry for that course would be 0) and, if so, what their mark would be (i.e. if the student attends class A and gets 57%, then the label for class A will be 0.57).
Am I setting up the output space properly?
If yes, what optimization and activation functions should I use?
If no, how can I reshape my output space to get good predictions?
If I understood you correctly, you want that the network is given the history of a student, and then outputs one entry for each course. This entry is supposed to simultaneously signify whether the student will take the course (0 for not taking the course, 1 for taking the course), and also give the expected grade? Then the interpretation of the output for a single course would be like this:
0.0 -> won't take the course
0.1 -> will take the course and get 10% of points
0.5 -> will take the course and get half of points
1.0 -> will take the course and get full points
If this is indeed your plan, I would definitely advise to rethink it.
Some obviously realistic cases do not fit into this pattern. For example, how would you represent that an A+ student is unlikely to take a course? Should the network output 0.9999, because (s)he is very likely to get the maximum number of points if (s)he takes the course, OR should the network output 0.0001, because the student is very unlikely to take the course?
Instead, you should output two values in [0, 1] for each student and each course.
First value in [0, 1] gives the probability that the student will participate in the course
Second value in [0, 1] gives the expected relative number of points.
As loss, I'd propose something like binary cross-entropy on the first value, and simple square error on the second, and then combine all the losses using some L^p metric of your choice (e.g. simply add everything up for p=1, square and add for p=2).
A few examples:
(0.01, 1.0) : very unlikely to participate, would probably get 100%
(0.5, 0.8): 50%-50% whether participates or not, would get 80% of points
(0.999, 0.15): will participate, but probably pretty much fail
The quantity that you wanted to output seemed to be something like the product of these two, which is a bit difficult to interpret.
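A minimal sketch of the two-value loss proposed above, for a single student/course pair with the two terms simply added (the p=1 case). The function names are made up, and counting the grade error only when the student actually took the course is my own assumption; the answer does not say how to treat courses that were not taken.

```python
import math


def binary_cross_entropy(p, target, eps=1e-12):
    """Cross-entropy between a predicted probability p and a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def course_loss(pred_take, pred_grade, took, grade):
    """Loss for one student/course pair.

    pred_take:  predicted probability of taking the course, in [0, 1]
    pred_grade: predicted relative number of points, in [0, 1]
    took:       1 if the student took the course, else 0
    grade:      actual relative number of points (ignored if took == 0)
    """
    loss = binary_cross_entropy(pred_take, took)
    if took:  # assumption: no grade penalty for courses that were not taken
        loss += (pred_grade - grade) ** 2
    return loss


print(course_loss(0.5, 0.8, took=1, grade=0.8))
print(course_loss(0.01, 1.0, took=0, grade=0.0))
```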
There is more than one way to solve this problem. Andrey's answer gives one good approach.
I would like to suggest simplifying the problem by bucketing grades into categories and adding an additional category for "did not take", for both input and output.
This turns the task into a classification problem only, and solves the issue of trying to differentiate between receiving a low grade and not taking the course in your output.
For example your training set might have m students, n possible classes, and six possible results: ['A', 'B', 'C', 'D', 'F', 'did_not_take'].
And you might choose the following architecture:
Input -> Dense Layer -> RELU -> Dense Layer -> RELU -> Dense Layer -> Softmax
Your input shape is (m, n, 6) and your output shape could be (m, n*6), where you apply softmax for every group of 6 outputs (corresponding to one class) and sum into a single loss value. This is an example of multiclass, multilabel classification.
I would start by trying 2n neurons in each hidden layer.
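The "softmax for every group of 6 outputs" step could look roughly like this for a single student. This is a sketch with invented function names, not tied to any particular framework; the summed cross-entropy is one natural way to "sum into a single loss value".

```python
import math


def grouped_softmax(logits, group_size=6):
    """Apply softmax independently to each consecutive group of logits."""
    probs = []
    for start in range(0, len(logits), group_size):
        group = logits[start:start + group_size]
        peak = max(group)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in group]
        total = sum(exps)
        probs.extend(e / total for e in exps)
    return probs


def grouped_cross_entropy(probs, true_indices, group_size=6):
    """Sum of per-course cross-entropies; true_indices[c] is the true result index for course c."""
    return -sum(
        math.log(probs[course * group_size + index])
        for course, index in enumerate(true_indices)
    )


# Hypothetical raw outputs for n = 2 courses, 6 possible results each.
logits = [2.0, 1.0, 0.0, 0.0, 0.0, -1.0,
          0.0, 0.0, 0.0, 0.0, 0.0, 3.0]
probs = grouped_softmax(logits)
loss = grouped_cross_entropy(probs, true_indices=[0, 5])
print(loss)
```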
If you really want a continuous output for grades, however, then I recommend using separate classification and regression networks. This way you don't have to combine classification and regression loss into one number, which can get messy with scaling issues.
You can keep the grade buckets for the input data only, so the two networks take the same input data, but for the grade regression network your last layer can be n sigmoid units with log loss. These will output numbers between 0 and 1, corresponding to the predicted grade for each class.
If you want to go even further, consider using an architecture that considers the order in which students took previous classes. For example if a student took French I the previous year, it is more likely he/she will take French II this year than if he/she took French Freshman year and did not continue with French after that.

Predictive modelling

How do I perform regression (Random Forest, Neural Networks) on this kind of data?
The data contains features, and we need to predict sales quantity based on the week and the other attributes.
I am attaching the sample data here.
Multivariate linear regression
Assuming
input variables x[][] (each row corresponds to a sample, each column corresponds to a variable such as week, season, ..)
expected output y[] (as many rows as x)
parameters being learned theta[] (as many as there are input variables + 1)
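you are learning a prediction function h and choosing theta so that the summed squared error J is minimal:
h(x[j]) = theta[0] + sum for all i of { theta[i+1] * x[j][i] }
J(theta) = sum for all j of { (h(x[j]) - y[j])^2 }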
This can easily be achieved through gradient descent.
You can also include combinations of input variables (and simply include more thetas for those pseudo-variables).
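For illustration, a bare-bones batch gradient descent for this setup might look like the sketch below. It assumes a squared-error cost, uses theta[0] as the intercept, and runs on toy data; the learning rate and number of epochs are arbitrary choices, and this is not the code from the repository mentioned in this answer.

```python
def predict(theta, row):
    """Linear prediction: theta[0] is the intercept, theta[1:] the weights."""
    return theta[0] + sum(t * v for t, v in zip(theta[1:], row))


def fit_linear_regression(x, y, learning_rate=0.05, epochs=5000):
    """Minimise the mean squared error with batch gradient descent."""
    n_features = len(x[0])
    m = len(x)
    theta = [0.0] * (n_features + 1)
    for _ in range(epochs):
        errors = [predict(theta, row) - target for row, target in zip(x, y)]
        # Gradient of (1 / 2m) * sum(errors^2) with respect to each theta.
        grad_intercept = sum(errors) / m
        grads = [
            sum(e * row[i] for e, row in zip(errors, x)) / m
            for i in range(n_features)
        ]
        theta[0] -= learning_rate * grad_intercept
        for i, g in enumerate(grads):
            theta[i + 1] -= learning_rate * g
    return theta


# Toy data generated from y = 1 + 2*x1 + 3*x2.
x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
theta = fit_linear_regression(x, y)
print(theta)
```

With real data such as week or season, the input columns would usually need scaling first, otherwise a single learning rate may converge slowly or diverge.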
I have some code lying around in a GitHub repository that performs basic multivariate linear regression (for a course I sometimes teach).
https://github.com/jorisschellekens/ml/tree/master/linear_regression

Categorical and ordinal feature data difference in regression analysis?

I am trying to completely understand difference between categorical and ordinal data when doing regression analysis. For now, what is clear:
Categorical feature and data example:
Color: red, white, black
Why categorical: red < white < black is logically incorrect
Ordinal feature and data example:
Condition: old, renovated, new
Why ordinal: old < renovated < new is logically correct
Categorical-to-numeric and ordinal-to-numeric encoding methods:
One-Hot encoding for categorical data
Arbitrary numbers for ordinal data
Example for categorical:
data = {'color': ['blue', 'green', 'green', 'red']}
Numeric format after One-Hot encoding:
   color_blue  color_green  color_red
0           1            0          0
1           0            1          0
2           0            1          0
3           0            0          1
Example for ordinal:
data = {'con': ['old', 'new', 'new', 'renovated']}
Numeric format after using mapping: Old < renovated < new → 0, 1, 2
   con
0    0
1    2
2    2
3    1
In my data, price increases as the condition changes from "old" to "new". "Old" was encoded as 0 and "new" as 2, so as the condition value increases, the price also increases. Correct.
Now let's have a look at the 'color' feature. In my case, different colors also affect price. For example, 'black' will be more expensive than 'white'. But in the numeric representation of categorical data shown above, I do not see an increasing dependency as there was with the 'condition' feature. Does it mean that a change in color does not affect price in a regression model when using one-hot encoding? Why use one-hot encoding for regression if it does not affect price anyway? Can you clarify?
UPDATE TO QUESTION:
First, I introduce the formula for linear regression:
Let's have a look at the data representations for color:
Let's predict the price of the 1st and 2nd items using the formula for both data representations:
One-hot encoding:
In this case different thetas for different colors will exist and prediction will be:
Price (1 item) = 0 + 20*1 + 50*0 + 100*0 = 20$ (thetas are assumed for example)
Price (2 item) = 0 + 20*0 + 50*1 + 100*0 = 50$ (thetas are assumed for example)
Ordinal encoding for color:
In this case all colors have common theta but multipliers differ:
Price (1 item) = 0 + 20*10 = 200$ (theta assumed for example)
Price (2 item) = 0 + 20*20 = 400$ (theta assumed for example)
In my model, White < Red < Black in price. The predictions seem logical in both cases, for the ordinal and the categorical representation. So can I use any encoding for my regression regardless of the data type (categorical or ordinal)? Is this division just a matter of convention and software-oriented representation rather than a matter of regression logic itself?
You will not see an increasing dependency. The whole point of this distinction is that colour is not a feature you can meaningfully place on a continuum, as you've already noted.
The one-hot encoding makes it very convenient for the software to analyze this dimension. Instead of having a feature "colour" with the listed values, you have a set of boolean (present / not-present) features. For instance, your row 0 above has features color_blue = true, color_green = false, and color_red = false.
The prediction data you get should show each of these as a separate dimension. For instance, presence of color_blue may be worth $200, while green is -$100.
Summary: don't look for a linear regression line running across a (non-existent) color axis; rather, look for color_* factors, one for each color. As far as your analysis algorithm is concerned, these are utterly independent features; the "one-hot" encoding (a term from digital circuit design) is merely our convention for dealing with this.
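To make the two encodings concrete, here is a small sketch in plain Python that reproduces the tables from the question. The helper names are made up, and the coefficients are only illustrative: blue and green use the example values mentioned above, while the value for red is invented.

```python
def one_hot(values, prefix):
    """Return (column names, rows of 0/1 flags), one column per category."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


def ordinal(values, order):
    """Map each value to its position in an explicitly given order."""
    rank = {category: position for position, category in enumerate(order)}
    return [rank[value] for value in values]


colors = ['blue', 'green', 'green', 'red']
conditions = ['old', 'new', 'new', 'renovated']

columns, color_rows = one_hot(colors, "color")
condition_codes = ordinal(conditions, ['old', 'renovated', 'new'])

# Each one-hot column gets its own, independent coefficient.
coefficients = {"color_blue": 200, "color_green": -100, "color_red": 0}
color_effects = [
    sum(coefficients[name] * flag for name, flag in zip(columns, row))
    for row in color_rows
]

print(columns, color_rows)
print(condition_codes)
print(color_effects)
```

Because every colour has its own coefficient, the model can assign any price effect to any colour without assuming an order among them.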
Does this help your understanding?
After your edit of the question 02:03 Z 04 Dec 2015:
No, your assumption is not correct: the two representations are not merely a matter of convenience. The ordering of colors works for this example -- because the effect happens to be a neat, linear function of the chosen encoding. As your example shows, your simpler encoding assumes that White-to-Red-to-Black pricing is a linear progression. What do you do when Green, Blue, and Brown are all $25, the rare Yellow is worth $500, and Transparent reduces the price by $1,000?
Also, how is it that you know in advance that Black is worth more than White, in turn worth more than Red?
Consider the case of housing prices based on elementary school district, with 50 districts in the area. If you use a numerical coding -- school district number, ordinal position alphabetically, or some other arbitrary ordering -- the regression software will have great trouble finding a correlation between that number and the housing price. Is PS 107 a more expensive district than PS 32 or PS 15? Are Addington and Bendemeer preferred to Union City and Ventura?
Splitting these into 50 different features under the one-hot principle decouples the feature from the encoding, and allows the analysis software to treat them in a mathematically meaningful manner. It's not perfect by any means -- expanding from, say, 20 features to 70 means that it will take longer to converge -- but we do get meaningful results for the school district.
If you wish, you could now encode that feature in the expected order of value, and get a reasonable fit with little loss of accuracy and faster prediction from your model (fewer variables).
You cannot use ordinal encoding for a categorical variable where order doesn't matter. Main purpose of building a regression model is to see how much change in one variable has how much effect on the response variable. When you obtain the regression formula this is how you read it: "1 unit change in variable X causes theta_x change in response variable".
For example, let's say you built a regression model on housing prices and you got this: price = 1000 + (-50)*age_of_house. This means a 1-year increase in the age of the house causes the price to go down by 50.
When you have a categorical variable you cannot mention a unit change in that variable. You cannot say 1 unit increase/decrease in the color... etc. So, one-hot encoding, as Prune said in his/her answer, is merely a convention for dealing with categorical variables. It allows you to interpret the results like, if the house is white it adds $200 to the value when coefficient of color_white in your final model is +200. If the house is not white, that variable has no impact on your response variable because the value will be 0.
Don't forget that "Linear Regression" models can only explain linear relations between variables.
I hope this helps.

Machine Learning Model for Multi-Label Classification where we know relationship between the labels

I am having a problem at hand where,
I need to classify the input data to one or more of the labels S1, S2, S3, S4
There is a relationship between the labels S1, S2, S3 and S4 which is,
If input is labelled Sn it must be labelled S1..Sn.
S1, S2, S3 and S4 are like different stages for an entity X to pass through. Based on input data X might get through one or many of the stages, X must pass through S1 to go to S2, S2 to go to S3 and so on
We want to ensure that only those X are allowed to pass which reach S3, so based on input data we decide whether to allow X to go through S1 or not
What machine learning models can we choose to predict if X reaches S3 if we have information like, input data and what stages X has passed for that input data
I am thinking in the direction of multi-label classification. There might be some relationship between the input data and stages S1 and S2.
Update: I have to train with examples like
1. Input data is s1
2. Input data is s2
3. ..
4 ..
Some doubts
Your question is far from being clear, for example:
We want to optimize that most X reaches S3, so based on input data we decide whether to allow X to go through S1 or not
This actually suggests that the best model would be "always answer yes", as it maximizes the number of objects reaching S3 (it simply lets any object reach this point).
General ideas
I assume two possible interpretations:
You have a label "pipeline", which simply means that an object cannot be labelled S_n if it has not already been labelled with all S_i for i < n.
This does not seem to be a problem for one single model; you can pipeline models in a natural way, i.e. train a model 1 which recognizes whether object x should have label S_1. Next, you train a model 2 on all data that has label S_1 in the training set to predict label S_2, and so on. During execution you simply ask each model i if it accepts (labels) the incoming object x, and stop when the first one says "no".
You have some more complex constraints on the labels, which may be strict or not. For such cases, you should try one of the many methods of multi-label classification with constraints; in particular, there is a tech report regarding this aspect of ML.
Solution 1 - approximating test functions
If your problem can be described as:
You have data points X, such that for each of them you know the maximum number of some pipelineable tests T_i which x passes
You want to train a classifier able to predict the maximum number of consecutive tests that your point x passes
You do not have access to actual tests T_i or they are very inefficient
Then the simplest way would be to apply the following training procedure instead of one classifier:
Take all your data points, label those with y=0 as 0 and those with y>=1 as 1, and train some binary classifier (for example an SVM). So you simply relabel your data temporarily so that it separates points that pass the first test from those that don't. Let's call this classifier cl_1.
Now take your data points, label those with y=1 as 0 and those with y>=2 as 1, and again train a binary classifier; call it cl_2.
Repeat until all tests have their classifier. In general, we call a classifier cl_i when it can distinguish between points labeled with y=i-1 and those with y>=i.
Now, to classify a new point, you simply check iteratively all your cl_i for i=1,..,tests and answer with the largest i such that cl_i(x)=1. So you "simulate" your tests with classifiers, and simply report how many of these test approximations it passed.
To sum up: each test can be approximated with one binary classifier, and then the question "What is the biggest consecutive test number that our point passes?" is approximated with "What is the biggest consecutive classifier number for which our point is classified as true?".
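The relabelling and prediction steps could be sketched as follows. The names are invented, and the "classifiers" here are hypothetical threshold rules standing in for trained binary models such as SVMs. Prediction stops at the first rejection, which matches the "consecutive" reading of the procedure.

```python
def training_set_for(points, labels, i):
    """Relabelled data for cl_i: y = i-1 becomes 0, y >= i becomes 1.

    Points with y < i-1 never reached test i and are left out.
    """
    return [(x, 1 if y >= i else 0)
            for x, y in zip(points, labels) if y >= i - 1]


def predict_stage(classifiers, x):
    """Count how many consecutive classifiers cl_1, cl_2, ... accept x."""
    passed = 0
    for accepts in classifiers:
        if not accepts(x):
            break
        passed += 1
    return passed


def make_threshold_classifier(threshold):
    """Stand-in for a trained binary classifier (purely illustrative)."""
    return lambda x: x >= threshold


# Hypothetical one-dimensional data: y is the number of tests passed.
points = [0.2, 1.4, 2.6, 3.9]
labels = [0, 1, 2, 3]
print(training_set_for(points, labels, 2))

classifiers = [make_threshold_classifier(t) for t in (1.0, 2.0, 3.0)]
print([predict_stage(classifiers, x) for x in (0.5, 2.5, 5.0)])
```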
Solution 2 - simple regression
You can also simply apply regression from your input space to the number of tests a point reaches. Regression actually has a built-in assumption that the output values are correlated. So if you train on pairs (x, y), where y is the number of the last test passed by x, then you are actually using the fact that the output y=3 is highly related to first getting y=2 in the computations. Such regression (non-linear!) could simply be done using neural networks (possibly regularized).

Resources