I have to predict daily footfall at a carnival from the historical daily sequence of footfall, given the theme of the park on each day. The data is shown below.
I want to implement a many-to-many LSTM to predict the footfall for days 9, 10 and 11 given the themes for those days. The above table is just for understanding the data and the problem.
It will be really helpful if you can give me the approach to this problem. Thanks.
To my understanding, an LSTM is simply not suited for this task. An LSTM is good for pretty much any kind of sequence, but as far as I understand your problem, the footfall depends only on the theme. It does not matter which themes appeared on the previous days, so this is not really a sequence problem: a certain theme results in a certain footfall.
Furthermore, you have far too little data for a neural network... Anyway.
The most naive approach would be to calculate the average for your three classes c, x and y. That is the expectation for the days 9, 10 and 11.
The expectations are:
c = (24 + 15 + 33) / 3 = 24
x = (20 + 32 + 17) / 3 = 23
y = (13 + 22) / 2 = 17.5
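A minimal pandas sketch of this baseline. The per-theme footfall values below are taken from the averages above; the day ordering and column names are purely illustrative assumptions.

import pandas as pd

# hypothetical history table (day ordering and column names are assumptions)
history = pd.DataFrame({
    "day":      [1, 2, 3, 4, 5, 6, 7, 8],
    "theme":    ["c", "x", "y", "c", "x", "y", "c", "x"],
    "footfall": [24, 20, 13, 15, 32, 22, 33, 17],
})

# expected footfall per theme = mean of the historical footfall for that theme
expected = history.groupby("theme")["footfall"].mean()
print(expected)                                # c 24.0, x 23.0, y 17.5

# prediction for days 9-11, given their themes
print([expected[t] for t in ["c", "x", "y"]])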
If you really want to use a neural network, do the following (a minimal sketch follows the list):
1. Convert your three classes to one-hot representations (those are your classes)
e.g. c = (0 0 1), y = (0 1 0), x = (1 0 0); that is your input
2. Convert your footfall numbers into a binary format; that is your label.
3. Run the network
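A minimal Keras sketch of that recipe. The one-hot mapping matches the encoding above; the bit width, layer sizes and training settings are assumptions, not part of the original answer.

import numpy as np
from tensorflow import keras

# hypothetical history; footfall values as used in the averages above
themes   = ["c", "x", "y", "c", "x", "y", "c", "x"]
footfall = [24, 20, 13, 15, 32, 22, 33, 17]

# 1. one-hot encode the theme (input): x = (1 0 0), y = (0 1 0), c = (0 0 1)
theme_index = {"x": 0, "y": 1, "c": 2}
X = np.eye(3)[[theme_index[t] for t in themes]]

# 2. encode the footfall count as a fixed-width binary vector (label)
N_BITS = 8                                     # assumption: counts fit in 8 bits
Y = np.array([[(n >> b) & 1 for b in range(N_BITS)] for n in footfall], dtype=float)

# 3. run a tiny network: one sigmoid output per bit (architecture is an assumption)
model = keras.Sequential([
    keras.layers.Input(shape=(3,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(N_BITS, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, Y, epochs=200, verbose=0)

# decode a prediction for a day with theme c back into an integer
bits = (model.predict(np.eye(3)[[theme_index["c"]]]) > 0.5).astype(int)[0]
print(sum(int(b) << i for i, b in enumerate(bits)))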
I'm trying to implement a neural network model using Keras, where the output is a vector of five elements.
Basically the target contains elements from 0 to 4 and nan, so I can have some targets like
[0,3,2,1,4] and others like [nan, 0, nan, 1, 2]. The important thing is that the elements in the vector are not repeated; only nan can be.
One solution I tried was to use something like a one-hot encoder for the target: in this way I transformed a target into a 25-component vector, with all zeros and a 1 in correspondence with the number to map (i.e. [nan, 0, nan, 1, 2] -> [(0,0,0,0,0), (1,0,0,0,0), (0,0,0,0,0), (0,1,0,0,0), (0,0,1,0,0)] - I'm using the round brackets only to highlight groups of five elements).
Any ideas please?
As far as I have understood, what you're trying to predict is a list of 5 elements, each of which takes a discrete value from the set {nan, 0, 1, 2, 3, 4}.
What you'll need to do is train 5 neural networks (one for each position of the list), each of which predicts a value from that set; thus, you need to one-hot encode the outputs, apply a softmax and select the highest probability for each neural network.
When you want to predict the output list for a new sample, you predict every position, put the results in a list and voila!
import numpy as np

# nn0 ... nn4 are the five trained models, one per position of the output list
def predict_sample(sample):
    # each model outputs softmax probabilities; take the most likely class
    pos_0 = np.argmax(nn0.predict(sample)[0])
    pos_1 = np.argmax(nn1.predict(sample)[0])
    pos_2 = np.argmax(nn2.predict(sample)[0])
    pos_3 = np.argmax(nn3.predict(sample)[0])
    pos_4 = np.argmax(nn4.predict(sample)[0])
    outp = np.array([pos_0, pos_1, pos_2, pos_3, pos_4], dtype=float)
    # if nan is encoded as class 5, map it back to np.nan
    outp[outp == 5] = np.nan
    return outp
You cannot assume the uniqueness constraint will hold at prediction time; only the data affects that. What you can do, for example, is take the second-highest probability when a value has already been predicted at another position of the list.
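A minimal Keras sketch of the training side, assuming the inputs are feature vectors of width n_features and that nan has already been replaced by class 5 in the targets; layer sizes and hyper-parameters are placeholders.

import numpy as np
from tensorflow import keras

N_CLASSES = 6          # 0, 1, 2, 3, 4 and 5 (= nan)
n_features = 10        # placeholder: width of your input vectors

def build_position_model():
    # one small softmax classifier per list position; architecture is an assumption
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

# X: (n_samples, n_features); Y: (n_samples, 5) with nan already encoded as 5
def train_position_models(X, Y):
    models = []
    for pos in range(5):
        m = build_position_model()
        y_onehot = keras.utils.to_categorical(Y[:, pos], num_classes=N_CLASSES)
        m.fit(X, y_onehot, epochs=20, verbose=0)
        models.append(m)
    return models

# nn0, nn1, nn2, nn3, nn4 = train_position_models(X, Y)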
I am having trouble with a classification problem.
I have almost 400k vectors in the training data with two labels, and I'd like to train an MLP which classifies the data into two classes.
However, the dataset is very imbalanced: 95% of the samples have label 1 and the rest have label 0. The accuracy grows as training progresses and stops after reaching 95%. I guess this is because the network predicts label 1 for all vectors.
So far, I have tried adding dropout layers with 0.5 probability, but the result is the same. Are there any ways to improve the accuracy?
I think the best way to deal with unbalanced data is to use weights for your classes. For example, you can weight your classes such that the sum of the weights for each class is equal.
import pandas as pd

df = pd.DataFrame({'x': range(7),
                   'y': [0] * 2 + [1] * 5})

# per-sample weight: len(df) / (2 * class count), so each class sums to len(df) / 2
df['weight'] = df['y'].map(len(df) / 2 / df['y'].value_counts())
print(df)
print(df.groupby('y')['weight'].agg(samples='count', weight='sum'))
output:
x y weight
0 0 0 1.75
1 1 0 1.75
2 2 1 0.70
3 3 1 0.70
4 4 1 0.70
5 5 1 0.70
6 6 1 0.70
   samples  weight
y
0        2     3.5
1        5     3.5
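If the MLP is built with Keras, these per-sample weights can then be passed to fit via sample_weight, or equivalently as a per-class dictionary via class_weight. A minimal sketch, assuming a compiled model and a feature matrix X already exist:

# one weight per training example, as computed above
model.fit(X, df['y'].values, sample_weight=df['weight'].values, epochs=10)

# or, equivalently, weight whole classes (values mirror the table above)
model.fit(X, df['y'].values, class_weight={0: 1.75, 1: 0.70}, epochs=10)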
You could try another classifier on a subset of the examples. SVMs may work well with small data sets, so you could take, let's say, only 10k examples with a 5/1 proportion between the classes.
You could also oversample the small class somehow and under-sample the other.
You can also simply weight your classes.
Think also about a proper metric. It's good that you noticed that your model predicts only one label; this is, however, not easily seen using accuracy.
Some nice ideas about unbalanced datasets here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Remember not to change your test set.
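To follow up on the metric point above: a confusion matrix or per-class F1 score makes the "always predict 1" failure visible where plain accuracy does not. A minimal scikit-learn sketch, assuming y_true and y_pred are already computed on the test set:

from sklearn.metrics import confusion_matrix, classification_report

# y_true, y_pred: true and predicted labels on the (untouched) test set
print(confusion_matrix(y_true, y_pred))       # all-1 predictions show up as a zero column for class 0
print(classification_report(y_true, y_pred))  # per-class precision, recall and F1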
That's a common situation: the network learns a constant and can't get out of this local minimum.
When the data is very unbalanced, like in your case, one possible solution is a weighted cross-entropy loss function. For instance, in TensorFlow you can apply the built-in tf.nn.weighted_cross_entropy_with_logits function. There is also a good discussion of this idea in this post.
But I should say that getting more data to balance both classes (if that's possible) will always help.
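A minimal TensorFlow sketch of that loss, assuming binary labels and a model that outputs raw logits; the pos_weight value of 19 is just an illustrative 95%/5% ratio, and treating the minority class as the positive one is an assumption.

import tensorflow as tf

def weighted_bce(labels, logits):
    labels = tf.cast(labels, tf.float32)
    # pos_weight > 1 up-weights errors on the rare positive class
    loss = tf.nn.weighted_cross_entropy_with_logits(
        labels=labels, logits=logits, pos_weight=19.0)
    return tf.reduce_mean(loss)

# usage with Keras (the final layer must output logits, not sigmoid probabilities):
# model.compile(optimizer="adam", loss=weighted_bce)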
I want to use k-means to discretise my time series data into two values (0 or 1). My time series data is a matrix of time by genes (rows = time points, columns = genes). For example:
t\x x1 x2 x3
1 0.122 0.324 0.723
2 0.543 0.573 0.329
3 0.901 0.445 0.343
4 0.612 0.353 0.435
5 0.192 0.233 0.023
My question: should I use k clusters for all the data in the matrix, or k clusters for each column (so I would have k clusters per column, for a total of k * number_of_columns)? Note that my genes are independent.
Either may work.
Discretising all attributes at once has the benefit of giving you only one symbol per time point, i.e. a univariate series.
But on the other hand, if the columns are independent, the quality could be better if you discretise them individually. Note that for one-dimensional data, if it is noisy, quantiles may work much better than k-means (which is sensitive to noise).
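A minimal sketch of the per-column variant, using the values from the table above; each gene gets its own 2-cluster k-means, and a median (quantile) split is shown as the noise-robust alternative. Column names and k = 2 follow the question.

import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "x1": [0.122, 0.543, 0.901, 0.612, 0.192],
    "x2": [0.324, 0.573, 0.445, 0.353, 0.233],
    "x3": [0.723, 0.329, 0.343, 0.435, 0.023],
})

# k-means per column: one independent 2-cluster model per gene
discrete_km = df.copy()
for col in df.columns:
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    discrete_km[col] = km.fit_predict(df[[col]])

# noise-robust alternative: split each column at its median (a quantile)
discrete_q = (df > df.median()).astype(int)

print(discrete_km)
print(discrete_q)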
I am using libsvm on 62 classes with 2000 samples each. The problem is that I want to optimize my parameters using grid search. I set the ranges to C = [0.0313, 0.125, 0.5, 2, 8] and gamma = [0.0313, 0.125, 0.5, 2, 8] with 5 folds. The cross-validation does not finish even for the first two parameter values of each. Is there a faster way to do the optimization? Can I reduce the number of folds to 3, for instance? The number of iterations printed stays in the (1629, 1630, 1627) range; I don't know if that is related:
optimization finished,
#iter = 1629 nu = 0.997175 obj = -81.734944, rho = -0.113838 nSV = 3250, nBSV = 3247
Finding a good model is simply an expensive task. Let's do some calculations:
62 classes x 5 folds x 5 values of C x 5 values of gamma = 7750 SVMs
You can always reduce the number of folds, which will decrease the quality of the search but will cut the total number of trained SVMs by about 40%.
The most expensive part is the fact that SVMs are not well suited for multi-class classification. You need to train at least O(log n) models (in the error-correcting-code scenario), O(n) (one-vs-all), or even O(n^2) (one-vs-one, which libsvm uses by default and which tends to achieve the best results).
Maybe it would be more worthwhile to switch to some fast multi-class model, for example an ELM (Extreme Learning Machine)?
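A minimal scikit-learn sketch of a cheaper search: use 3 folds instead of 5, as discussed above, and (my own added assumption) run the grid on a stratified 10k subsample first. The grid matches the one in the question; X and y stand for the user's full dataset.

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# X, y: the full dataset (62 classes x 2000 samples); take a stratified subsample
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=10_000, stratify=y, random_state=0)

param_grid = {
    "C":     [0.0313, 0.125, 0.5, 2, 8],
    "gamma": [0.0313, 0.125, 0.5, 2, 8],
}

# 3 folds instead of 5, all CPU cores, to cut the search time
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, n_jobs=-1, verbose=1)
search.fit(X_sub, y_sub)
print(search.best_params_, search.best_score_)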
Does it help classification to add linear or non-linear combinations of the existing features? For example, does it help to add the mean or variance as new features computed from the existing features? I believe it definitely depends on the classification algorithm: in the case of PCA, the algorithm itself generates new features which are orthogonal to each other and are linear combinations of the input features. But what is the effect in the case of decision-tree-based classifiers or others?
Yes, combinations of existing features can give new features and help with classification. Moreover, a combination of a feature with itself (e.g. a polynomial of the feature) can be used as additional data during classification.
As an example, consider a logistic regression classifier with the following linear formula at its core:
g(x, y) = 1*x + 2*y
Imagine, that you have 2 observations:
x = 6; y = 1
x = 3; y = 2.5
In both cases g() will be equal to 8. If the observations belong to different classes, you have no way to distinguish them. But let's add one more variable (feature) z, which is a combination of the previous 2 features, z = x * y:
g(x, y, z) = 1*x + 2*y + 0.5*z
Now for same observations we have:
x = 6; y = 1; z = 6 * 1 = 6 ==> g() = 11
x = 3; y = 2.5; z = 3 * 2.5 = 7.5 ==> g() = 11.75
So now we get 2 different points and can distinguish between 2 observations.
Polynomial features (x^2, x^3, y^2, etc.) do not separate points in that way, but instead change the shape of the function. For example, g(x) = a0 + a1*x is a line, while g(x) = a0 + a1*x + a2*x^2 is a parabola and can thus fit the data much more closely.
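The same arithmetic in a couple of lines of Python, just to make the separation effect concrete:

# the two observations from the example above
obs = [(6, 1), (3, 2.5)]

# without the combined feature: both observations score 8 and are indistinguishable
print([1*x + 2*y for x, y in obs])               # [8, 8.0]

# with z = x * y added: the scores separate (11 vs 11.75)
print([1*x + 2*y + 0.5*(x*y) for x, y in obs])   # [11.0, 11.75]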
In general, it's always better to have more features. Unless you already have very predictive features (i.e. ones that allow for perfect separation of the classes to predict), I would always recommend adding more. In practice, many classification algorithms (and in particular decision tree inducers) select the best features for their purposes anyway.
There are open-source Python libraries that automate feature creation / combination:
We can automate polynomial feature creation with sklearn.
We can automatically create spline features with sklearn.
We can combine features mathematically with Feature-engine. With MathFeatures we combine feature groups, and with RelativeFeatures we combine feature pairs.
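A minimal sketch of those options. PolynomialFeatures and SplineTransformer are standard scikit-learn transformers; the Feature-engine calls follow its documented MathFeatures/RelativeFeatures API, but treat the exact parameter names as an assumption if your version differs.

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer

# toy data; column names are arbitrary
X = pd.DataFrame({"x": [6, 3, 1, 4, 2, 5],
                  "y": [1, 2.5, 4, 0.5, 3, 2]})

# polynomial combinations: adds x^2, x*y and y^2 alongside x and y
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# spline basis expansion of each feature
X_spline = SplineTransformer(degree=3, n_knots=4).fit_transform(X)

# Feature-engine equivalents (treat the parameter names as an assumption):
# from feature_engine.creation import MathFeatures, RelativeFeatures
# MathFeatures(variables=["x", "y"], func=["mean", "sum"]).fit_transform(X)
# RelativeFeatures(variables=["x"], reference=["y"], func=["mul"]).fit_transform(X)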