I am curious whether there is a model that takes both kinds of information into account for time series prediction (e.g. predicting future sales).
The problem is that a model like ARIMA doesn't consider important outside information (such as new promotions added by the company, or other factors like product type).
On the other hand, if I use machine learning models like Random Forests, I lose the information about trends and seasonality.
Do we have something that combines both of these?
ARIMA models can take additional information besides the time series data itself. These are called causal variables or exogenous variables. See ARMAX and ARIMAX models.
It is a little more complicated with Exponential Smoothing type models (Holt, Holt-Winters, etc.).
Machine learning models can be used for time series data too; you just need to format the data in the right way.
For a traditional time series model, the data looks like this:
Train                  Test
[1, 2, 3]              [4]
[1, 2, 3, 4]           [5]
[1, 2, 3, 4, 5]        [6]
[1, 2, 3, 4, 5, 6]     [7]
You can reformat the data so that it looks like a supervised learning problem:
Train: [1, 2, 3 | 4]
       [2, 3, 4 | 5]
       [3, 4, 5 | 6]
-----------------------
Test:  [4, 5, 6 | 7]
Then you can apply most supervised ML methods.
Note, however, that for ML models the time series input will always be a fixed number of lags (in contrast to sequential models like exponential smoothing).
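A minimal sketch of the reformatting described above, in plain Python. The helper name make_supervised and its n_lags parameter are illustrative, not taken from any library. Extra columns (for example a promotion flag for the target period) could be appended to each feature row in the same way, which is one route to combining lag information with outside factors.

```python
def make_supervised(series, n_lags):
    """Turn a 1-D sequence into (lag window, next value) pairs."""
    X, y = [], []
    for start in range(len(series) - n_lags):
        X.append(list(series[start:start + n_lags]))  # the n_lags previous values
        y.append(series[start + n_lags])              # the value to predict
    return X, y


series = [1, 2, 3, 4, 5, 6, 7]
X, y = make_supervised(series, n_lags=3)

# Keep the time order: the last window is held out as the test row.
X_train, y_train = X[:-1], y[:-1]
X_test, y_test = X[-1:], y[-1:]
```

X_train and y_train can then be passed to any regressor that accepts a feature matrix and a target vector. The split is chronological on purpose, since shuffling would leak future values into training.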
Occasionally I see models that use SpatialDropout1D instead of Dropout. For example, in this part-of-speech tagging neural network, they use:
model = Sequential()
model.add(Embedding(s_vocabsize, EMBED_SIZE,
                    input_length=MAX_SEQLEN))
model.add(SpatialDropout1D(0.2))  # <-- this layer
model.add(GRU(HIDDEN_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(RepeatVector(MAX_SEQLEN))
model.add(GRU(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(t_vocabsize)))
model.add(Activation("softmax"))
According to the Keras documentation:
This version performs the same function as Dropout, however it drops
entire 1D feature maps instead of individual elements.
However, I am unable to understand the meaning of "entire 1D feature maps". More specifically, I am unable to visualize SpatialDropout1D in the same model explained on Quora.
Can someone explain this concept using the same model as on Quora?
Also, in what situations should we use SpatialDropout1D instead of Dropout?
To keep it simple, I would first note that the so-called feature maps (1D, 2D, etc.) are just our regular channels. Let's look at some examples:
Dropout(): Let's define a 2D input: [[1, 1, 1], [2, 2, 2]], where each row is a position and each column is a channel. Dropout will consider every element independently, and may result in something like [[1, 0, 1], [0, 2, 2]].
SpatialDropout1D(): In this case the result will look like [[1, 0, 1], [2, 0, 2]]. Notice that the 2nd channel was zeroed at every position.
The noise shape
In order to understand SpatialDropout1D, you should get used to the notion of the noise shape. In plain vanilla dropout, each element is kept or dropped independently. For example, if the tensor has shape [2, 2, 2], each of its 8 elements can be zeroed out depending on a random coin flip (with a certain "heads" probability); in total, there will be 8 independent coin flips and any number of values may become zero, from 0 to 8.
Sometimes there is a need to do more than that. For example, one may need to drop a whole slice along axis 0. The noise_shape in this case is [1, 2, 2] and the dropout involves only 4 independent random coin flips. The elements that differ only in their first index will either be kept together or be dropped together. The number of zeroed elements can be 0, 2, 4, 6 or 8. It cannot be 1 or 5.
Another way to view this is to imagine that the input tensor in fact has shape [2, 2], but each value is double-precision (or multi-precision). Instead of dropping bytes in the middle, the layer drops the full multi-byte value.
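The coin-flip counting above can be checked with a small sketch in plain Python. It assumes a nested-list tensor of shape [2, 2, 2] and a hypothetical helper, dropout_shared_axis0, that draws 4 flips and reuses them along axis 0. The rescaling of kept values by 1 / (1 - rate), which real dropout layers apply during training, is left out to keep the counting visible.

```python
import random


def dropout_shared_axis0(x, rate, rng):
    """Dropout on a [2, 2, 2] nested list using noise_shape [1, 2, 2].

    One coin flip is drawn per (row, column) position and reused for both
    slices along axis 0. Rescaling of the kept values is omitted.
    """
    keep = [[rng.random() >= rate for _ in range(2)] for _ in range(2)]  # 4 flips
    return [
        [[value if keep[i][j] else 0 for j, value in enumerate(row)]
         for i, row in enumerate(plane)]
        for plane in x
    ]


def count_zeros(x):
    """Count zeroed elements in a 3-level nested list."""
    return sum(value == 0 for plane in x for row in plane for value in row)


rng = random.Random(0)
x = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # shape [2, 2, 2], no zeros in the input
zero_counts = {count_zeros(dropout_shared_axis0(x, 0.5, rng)) for _ in range(200)}
```

Because every flip affects two elements at once, zero_counts can only ever contain even numbers.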
Why is it useful?
The example above is just for illustration and isn't common in real applications. A more realistic example is this: shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n]. In this case, each batch and channel component will be kept independently, but each row and column will be kept or dropped together. In other words, the whole [l, m] feature map will be either kept or dropped.
You may want to do this to account for the correlation between adjacent pixels, especially in the early convolutional layers. Effectively, you want to prevent co-adaptation of pixels with their neighbors across the feature maps, and make them learn as if no other feature maps exist. This is exactly what SpatialDropout2D is doing: it promotes independence between feature maps.
SpatialDropout1D is very similar: given shape(x) = [k, l, m] it uses noise_shape = [k, 1, m] and drops entire 1-D feature maps.
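A minimal sketch of the difference, assuming a single sample stored as a nested list of shape [timesteps, channels] (the batch axis is left out). The function names are hypothetical and this is not the Keras implementation: Keras applies dropout only during training and rescales the kept values by 1 / (1 - rate), which is omitted here.

```python
import random


def elementwise_dropout(x, rate, rng):
    """Plain dropout: one coin flip per element."""
    return [[v if rng.random() >= rate else 0 for v in row] for row in x]


def spatial_dropout_1d(x, rate, rng):
    """One coin flip per channel (column), shared by every timestep (row)."""
    n_channels = len(x[0])
    keep = [rng.random() >= rate for _ in range(n_channels)]
    return [[v if keep[c] else 0 for c, v in enumerate(row)] for row in x]


rng = random.Random(42)
x = [[1, 1, 1], [2, 2, 2]]  # 2 timesteps, 3 channels
dropped = spatial_dropout_1d(x, 0.5, rng)
```

In dropped, every column is either fully zero or fully intact, whereas elementwise_dropout can zero any individual entry. In the tagging model from the question, the columns would correspond to the EMBED_SIZE embedding dimensions and the rows to the MAX_SEQLEN positions.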
Reference: Efficient Object Localization Using Convolutional Networks by Jonathan Tompson et al.
I have tried using OneVsRestClassifier with Logistic Regression from Sklearn, but it gives empty labels for some samples (i.e. it doesn't predict any label), even though I do not have any unlabelled training data.
Any idea what might be causing this or how to fix this?
clf = OneVsRestClassifier(LogisticRegression(multi_class='ovr',max_iter=1000,solver='lbfgs'))
clf.fit(X,Y)
self.classifier=clf
self.classifier.predict(test_data)
Whenever you perform multilabel classification, according to the OneVsRestClassifier documentation the targets need to be "a sequence of sequences of labels".
Moreover, depending on how you encode these labels you may get the following warning: "DeprecationWarning: Direct support for sequence of sequences multilabel representation will be unavailable from version 0.17. Use sklearn.preprocessing.MultiLabelBinarizer to convert to a label indicator representation."
So, a neat way to encode your labels is:
from sklearn import preprocessing
mlb = preprocessing.MultiLabelBinarizer()
Y = mlb.fit_transform([(1, 2), (1,2), (1,2),(4,)])
# this means sample one belongs to classes {1,2} and so on.
# Take into account the format if only one class is needed, (4,) not (4)
So Y turns out to be:
array([[1, 1, 0],
       [1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]])
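The encoding above does not by itself explain the empty predictions. One likely cause, not stated in the answer: in the multilabel setting OneVsRestClassifier fits one binary classifier per label, and predict keeps a label only when that classifier's probability exceeds 0.5, so a sample for which no classifier is confident ends up with no label at all. A minimal sketch of a common workaround, falling back to the single most probable label, written in plain Python over a probability matrix such as the one returned by clf.predict_proba(test_data). The helper name predict_at_least_one and the example numbers are hypothetical.

```python
def predict_at_least_one(proba, threshold=0.5):
    """Threshold each label independently, but never return an empty row."""
    predictions = []
    for row in proba:
        labels = [1 if p > threshold else 0 for p in row]
        if not any(labels):
            # No label cleared the threshold: keep the single most probable one.
            best = max(range(len(row)), key=lambda k: row[k])
            labels[best] = 1
        predictions.append(labels)
    return predictions


proba = [[0.9, 0.7, 0.1],   # two labels clear the threshold
         [0.4, 0.3, 0.2]]   # none do, so plain thresholding gives an empty row
result = predict_at_least_one(proba)
```

Whether forcing a label is appropriate depends on the task. If "no label" is a legitimate answer, the empty rows may simply be correct.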
My question: Why does my model fail to learn to play this game of just producing an array of unique elements from 1 to 5 from a partially filled array?
===
I am trying to train a model to perform this task:
Given a fixed array of 5 elements consisting of at most ONE of each element from (1, 2, 3, 4, 5) and ONE OR MORE (0), replace the 0s with appropriate values so that the final array has exactly ONE of each (1, 2, 3, 4, 5).
So, here is how it should be played:
[1, 2, 3, 4, 0] => [1, 2, 3, 4, 5]
[4, 3, 0, 5, 1] => [4, 3, 2, 5, 1]
[0, 3, 5, 4, 0] => [2, 3, 5, 4, 1] OR [1, 3, 5, 4, 2]
...
This is not a complicated game (for a human), but I want to see if a model can identify the rules (replace 0s with values from 1 to 5, so that the final array has exactly one of each element from (1, 2, 3, 4, 5)).
The way I did this is:
Generate N examples of configurations (permutations of [1, 2, 3, 4, 5]) as answers, and randomly replace some of the elements with 0s.
For instance, one training example is [(0, 3, 5, 4, 0), (2, 3, 5, 4, 1)].
The same input can map to different outputs, i.e. [(0, 3, 5, 4, 0), (2, 3, 5, 4, 1)] and [(0, 3, 5, 4, 0), (1, 3, 5, 4, 2)] can both be present as two separate training instances.
Shuffle the data set, split it into 10 folds, and train using a RandomForestClassifier from Scikit-Learn.
A correct output is defined as a final configuration array that has exactly ONE of each element from (1, 2, 3, 4, 5). So (2, 4, 4, 5, 1) is not valid.
===
Surprisingly, using 1000, 10000, 50000, and even 100000 training examples still results in the model getting only ~70% of the test cases right, meaning the model did not learn how to play the game even as the number of training examples increased.
One thing I was thinking is that RandomForestClassifier is just not suited to this type of problem, called structured machine learning, where the output is not a single category or a real-valued output, but a vector of outputs.
More questions:
Why does the model fail to learn this game?
Is this the right way to model this problem?
Is the data not enough to learn this task? But increasing the data from 1000 to 100000 does not seem to help at all.
Thank you!
lejlot's answer is excellent, but I thought I'd add a bit of intuition as to why random forest fails in this case.
You have to keep in mind that Machine Learning isn't some magic way to impart intelligence to computers; it's simply a way of fitting a particular model to your data and using that model to make generalizations. As the old adage goes, "all models are wrong, but some are useful". You've hit on a case where the model is wrong as usual, but also happens to be useless!
The output space: Random forests at their core are basically a clever and generalizable way of mapping inputs to outputs. Your output space has 5^5 = 3125 possible unique outputs, and only 5! = 120 of these are valid (i.e. outputs with one of each number). The only way for a random forest to know whether an output is valid is if it has seen it: so in order to work correctly, your training set will have to include examples with all of those 120 outputs.
The input space: when a random forest encounters an input it has seen before, it will map that directly to an output that it has seen before. But what if it encounters an input it has not seen? For example, what if you ask for the answer to [0, 2, 3, 4, 1] and this is not in the training set? In terms of Euclidean distance (a useful way to think about how things are grouped) the closest result will probably be something like [0, 2, 3, 4, 0], which might map to [1, 2, 3, 4, 5], which is wrong. Thus we see that in order for random forests to work correctly, your training set will have to have all possible inputs. Some quick combinatorics show that your training set will have to be of size at least 5!*32 = 3840, with no duplicates.
The forest itself: even if you have a complete input space, the random forest does not consist of a simple dictionary mapping of inputs to outputs. Depending on the parameters of the model, the mapping is typically from groups of nearby results to a single answer, so that, for example, {[1, 2, 3, 4, 5], [1, 0, 3, 4, 5], [0, 1, 3, 4, 5]...} will all map to [1, 2, 3, 4, 5]. This sort of generalization is useful in most cases, but is not useful for your particular problem. The only way for the random forest to work in your case would be to push max_depth and the min_samples parameters (min_samples_split, min_samples_leaf) to their extreme values, so that the forest is essentially a one-to-one mapping of inputs to their correct outputs: in other words your classifier would be just an extremely complicated way of building a dictionary.
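The counts quoted above can be checked by brute-force enumeration. This is a sketch with illustrative variable names. It follows the 5!*32 reasoning by pairing every valid answer with every possible pattern of hidden positions, including the pattern that hides nothing.

```python
from itertools import permutations, product

values = (1, 2, 3, 4, 5)
valid_outputs = list(permutations(values))       # one of each number
masks = list(product((False, True), repeat=5))   # which positions become 0

pairs = set()
for answer in valid_outputs:
    for mask in masks:
        puzzle = tuple(0 if hide else v for v, hide in zip(answer, mask))
        pairs.add((puzzle, answer))

n_all_outputs = len(values) ** 5                              # 5^5 = 3125
n_valid_outputs = len(valid_outputs)                          # 5! = 120
n_pairs = len(pairs)                                          # 5! * 32 = 3840
n_distinct_inputs = len({puzzle for puzzle, _ in pairs})      # 1546
```

The number of distinct inputs is smaller than the number of (input, output) pairs because several answers share the same puzzle: [0, 3, 5, 4, 0], for instance, has two valid completions.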
To summarize: Machine Learning is just a model applied to data, which is useful in certain cases. In your case, the model is not all that useful: in order for Random Forests to work on your problem, you'd need to over-fit a comprehensive set of inputs and outputs. At that point, you might as well just construct a dictionary and call it a day.
I do assume that this is just a thought exercise, and not an actual problem, because obviously a set-based solution will be better than any ML technique for such a task.
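For comparison, a minimal sketch of what such a set-based solution could look like. The function name fill_missing is hypothetical, and when several zeros are present it returns just one of the valid completions.

```python
def fill_missing(puzzle, values=(1, 2, 3, 4, 5)):
    """Replace each 0 with one of the values not yet present in the puzzle."""
    missing = sorted(set(values) - set(puzzle))
    fill = iter(missing)
    return [next(fill) if v == 0 else v for v in puzzle]


solved = fill_missing([0, 3, 5, 4, 0])
```

Here solved is [1, 3, 5, 4, 2], one of the two valid answers listed in the question for that input.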
In short: because classifiers/regressors are not meant for combinatorial optimization. Your problem has extremely strong constraints: only a very small number of values are "correct" and "observable", and you are looking for a property of the output, not a value. This is not a setting for classification or regression.
What can you do?
In such a constrained scenario you have to give your method knowledge about what is going on. Show it the state space. This is a case for simple state-space AI rather than for ML as such, or for metaheuristic optimization such as hill climbing, simulated annealing, genetic algorithms, etc.
Look at things like General Game Playing; this is somewhat similar, but the important difference is that there you provide the set of rules.
Look at things like Neural Turing Machines; these are sequential methods that try to learn how to manipulate data rather than perform classification/regression.
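To make the first suggestion concrete, here is a rough sketch of hill climbing on this game, under the assumption that the rules are handed to the search through a cost function (the number of values from 1 to 5 still missing). The names cost and hill_climb and the max_steps limit are illustrative choices, not from any library.

```python
import random


def cost(state):
    """Number of values from 1..5 that are missing (0 means a valid answer)."""
    return 5 - len(set(state))


def hill_climb(puzzle, rng, max_steps=10_000):
    """Randomly re-fill the zero positions, never accepting a worse state."""
    free = [i for i, v in enumerate(puzzle) if v == 0]
    state = [v if v != 0 else rng.randint(1, 5) for v in puzzle]
    for _ in range(max_steps):
        if cost(state) == 0:
            return state
        candidate = list(state)
        candidate[rng.choice(free)] = rng.randint(1, 5)
        if cost(candidate) <= cost(state):  # accept improvements and sideways moves
            state = candidate
    return state if cost(state) == 0 else None  # None: nothing found in max_steps


answer = hill_climb([0, 3, 5, 4, 0], random.Random(0))
```

The search only ever changes the positions that were 0, so the given values are preserved. For a problem this small it is overkill, but it shows where the knowledge of the rules enters: in the cost function, not in training data.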
In general this is a very common misconception when one starts learning machine learning. Not every problem is suitable for "just applying" known ML techniques. Most of the problems "out there" require considerable input from the researcher to be able to exploit the strengths of ML.