Predict future values (time series) iteratively on subsets of data within a dataframe

I am trying to predict future values iteratively on multiple subsets of a dataframe. How do I split the dataframe into chunks based on a unique value (Key) and then apply a moving average to each subset rather than to the entire dataframe? For example, I have actual scores for IndiaPrint and ANZLaptop, and I want to predict their future values (highlighted in yellow in the original screenshot).
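As a starting point, here is a minimal pandas sketch, assuming columns named Key, Date, and Score (the real column names are not visible here), where groupby keeps the moving average from ever crossing group boundaries:

```python
import pandas as pd

df = pd.DataFrame({
    "Key":   ["IndiaPrint"] * 4 + ["ANZLaptop"] * 4,
    "Date":  pd.date_range("2023-01-01", periods=4, freq="MS").tolist() * 2,
    "Score": [10, 12, 11, 13, 20, 22, 21, 23],
})

# Moving average computed within each Key, never across groups.
df["MA3"] = df.groupby("Key")["Score"].transform(lambda s: s.rolling(3).mean())

# One-step-ahead forecast per group: the mean of each group's last 3 actuals.
forecast = df.groupby("Key")["Score"].apply(lambda s: s.tail(3).mean())
print(forecast)
```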

Related

Normalize time-series data before or after split of training and testing data?

I use a classification model on time-series data, where I normalize the data before splitting it into train and test sets. Now, I know that train and test data should be treated separately to prevent data leakage. What would be the proper order of the normalization steps here? Should I apply steps 1-3 separately to train and test after I split the data with a sliding window? I use a sliding window here to compare each hour (test) with its previous 24 hours of data (train). Here is the order I am currently using in the pipeline:
1) Moving averages (mean)
2) Resampling every hour
3) Standardization
4) Split the data into train and test using a sliding window (24 hrs of train data, sliding forward 1 hr for each test hour)
5) Fit the model using the train data
6) Predict using the test data
Steps 1 and 2 can be done safely; you just have to take into account that the moving average must use only past values: X'_i = mean(X_i, X_{i-1}, X_{i-2}, ..., X_{i-n}).
However, in step 3 the normalization/standardization parameters (max and min if you are using a min-max scaler, mean and standard deviation if you are using standardization) should be computed from the training data only and then applied to the whole dataset, so your pipeline would be something like this (a code sketch follows the list):
1) Moving average (using only past values)
2) Resample every hour
3) Split the data into train and test
4) Compute the standardization parameters (mean and std) from the train data
5) Standardize the whole dataset (train and test) using the parameters computed in step 4
6) Fit the model using the train data
7) Predict using the test data
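Here is a minimal sketch of that pipeline in pandas/scikit-learn. The column name value and the 5-point smoothing window are assumptions for illustration; a simple chronological split stands in for the sliding window, for which you would repeat steps 4-5 per window.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy series with a DatetimeIndex; replace with your own data.
idx = pd.date_range("2023-01-01", periods=500, freq="10min")
df = pd.DataFrame({"value": np.random.randn(500).cumsum()}, index=idx)

# 1) Causal moving average: each point sees only itself and past values.
df["smoothed"] = df["value"].rolling(window=5, min_periods=1).mean()

# 2) Resample to hourly means.
hourly = df["smoothed"].resample("1h").mean().dropna().to_frame()

# 3) Split into train and test (chronological split shown for brevity).
split = int(len(hourly) * 0.8)
train, test = hourly.iloc[:split], hourly.iloc[split:]

# 4) Fit the scaler on the training data only...
scaler = StandardScaler().fit(train)

# 5) ...then apply those parameters to both train and test.
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
```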

Time series forecasting

I've been following a lot of tutorials that use LSTMs to forecast time-series data. My question is: how do we predict on new data that is not part of the dataset, since almost all the tutorials show Keras's predict function being used only on the test split?
How do we actually forecast into the future?
Usually, you create your training data such that the model receives n points and predicts the following m points. Once your model is trained, you take the last n available points of your dataset (or new points from the present), and the model outputs a prediction of the next m points.
If you want to predict more than m points into the future, you can predict m points, use them as input to predict another m points, and so on. However, be aware that with this technique you will probably get worse results, as you are accumulating errors.
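A minimal sketch of this recursive scheme, assuming a trained Keras-style model that maps n past points to the next m points (model, n, m, and the univariate input shape are all assumptions for illustration):

```python
import numpy as np

def forecast(model, history, n, m, steps):
    """Roll the model forward until at least `steps` future points exist."""
    window = list(history[-n:])                       # last n observed values
    future = []
    while len(future) < steps:
        x = np.asarray(window[-n:]).reshape(1, n, 1)  # (batch, timesteps, features)
        y = model.predict(x).reshape(-1)              # the next m points
        future.extend(y.tolist())
        window.extend(y.tolist())                     # feed predictions back in
    return future[:steps]
```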

Is it possible to do rolling operations on Dask DataFrames where the entire DataFrame slice is passed to apply?

I am trying to do a rolling operation on a Dask DataFrame and need to apply a function to two columns of the DataFrame (to calculate a cross-correlation). DataFrame rolling appears to operate on each column separately and sequentially. Is there a way to roll through a DataFrame and give the applied function access to more than one column?
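One possible approach (a sketch, not a confirmed answer): Dask's map_overlap passes each partition, padded with rows from its neighbours, to a function as a plain pandas DataFrame, so the function can see every column at once. The column names a and b and the window size are assumptions.

```python
import dask.dataframe as dd
import numpy as np
import pandas as pd

window = 10

def rolling_corr(pdf: pd.DataFrame) -> pd.Series:
    # pandas supports pairwise rolling correlation across two columns
    return pdf["a"].rolling(window).corr(pdf["b"])

pdf = pd.DataFrame({"a": np.random.randn(1000), "b": np.random.randn(1000)})
ddf = dd.from_pandas(pdf, npartitions=4)

# before=window-1 copies enough trailing rows from the previous partition
# so every window is complete; after=0 because the window only looks back.
result = ddf.map_overlap(rolling_corr, before=window - 1, after=0).compute()
```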

LSTM prediction output for multiple values given a certain time step

I came across a time-series prediction problem where I have a dataset with multiple entries. Each entry represents the value of a certain category at a given time. All the entries are indexed by their timestamp, and consecutive entries are separated by a constant interval (2 minutes in my case). My goal is to predict all, or a subset, of the dataset's values for a timestamp in the future. However, the majority of the tutorials online focus on predicting a single value from the dataset.
My question: can an LSTM be used to model such a problem?
If I understood correctly, you have the beginning (of some length) of a sequence of values, and you need to predict how the sequence continues from there, potentially for multiple steps.
You could do that with, e.g., a sequence-to-sequence model. See
www.tensorflow.org/versions/r0.12/tutorials/seq2seq/
or
https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
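As a rough illustration of the idea (a sketch only; the sizes n, m, k and the layer widths are assumptions), a common Keras encoder-decoder pattern reads n past timesteps of k parallel values and emits m future timesteps of all k values at once:

```python
import numpy as np
from tensorflow.keras import layers, models

n, m, k = 48, 12, 5  # 48 past steps, 12 future steps, 5 parallel series

model = models.Sequential([
    layers.Input(shape=(n, k)),
    layers.LSTM(64),                          # encode the input window
    layers.RepeatVector(m),                   # repeat the encoding m times
    layers.LSTM(64, return_sequences=True),   # decode into m output steps
    layers.TimeDistributed(layers.Dense(k)),  # k predicted values per step
])
model.compile(optimizer="adam", loss="mse")

# X: (samples, n, k) past windows; y: (samples, m, k) future windows
X, y = np.random.randn(100, n, k), np.random.randn(100, m, k)
model.fit(X, y, epochs=2, verbose=0)
```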

Continuous or categorical data in data science

I am building an automated cleaning process that removes null values from a dataset. I discovered a few functions, like mode, median, and mean, that can be used to fill NaN values in the data. But which one should I select? If the data is categorical, it has to be either mode or median, while for continuous data it has to be mean or median. So, to determine whether a column is categorical or continuous, I decided to build a machine learning classification model.
I used a few features:
1) standard deviation of the data
2) number of unique values in the data
3) total number of rows of data
4) ratio of unique values to total rows
5) minimum value of the data
6) maximum value of the data
7) number of data points between the median and the 75th percentile
8) number of data points between the median and the 25th percentile
9) number of data points between the 75th percentile and the upper whisker
10) number of data points between the 25th percentile and the lower whisker
11) number of data points above the upper whisker
12) number of data points below the lower whisker
With these 12 features and around 55 training examples, I trained a logistic regression model on the normalized features to predict label 1 (continuous) or 0 (categorical).
The fun part is that it worked!!
But did I do it the right way? Is this a correct method for predicting the nature of the data? Please advise me on how I could improve it further.
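For concreteness, here is a sketch of how a subset of these features might be computed per column and fed to logistic regression (the feature subset and the toy data are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def column_features(x: np.ndarray) -> list:
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    iqr = q75 - q25
    upper, lower = q75 + 1.5 * iqr, q25 - 1.5 * iqr
    return [
        x.std(),                         # 1) standard deviation
        len(np.unique(x)),               # 2) number of unique values
        len(x),                          # 3) total number of rows
        len(np.unique(x)) / len(x),      # 4) unique-to-total ratio
        x.min(),                         # 5) minimum
        x.max(),                         # 6) maximum
        ((x > q50) & (x <= q75)).sum(),  # 7) median..75th percentile
        ((x >= q25) & (x < q50)).sum(),  # 8) 25th percentile..median
        (x > upper).sum(),               # 11) above the upper whisker
        (x < lower).sum(),               # 12) below the lower whisker
    ]

# One feature row per labelled column; y would hold your ~55 hand labels
# (1 = continuous, 0 = categorical).
columns = [np.random.randn(200), np.random.randint(0, 3, 200).astype(float)]
X = StandardScaler().fit_transform([column_features(c) for c in columns])
clf = LogisticRegression()  # then: clf.fit(X, y)
```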
The data analysis seems awesome. As for the part
But which one should I select?
The mean has always been the winner as far as I have tested. For every dataset, I try all the candidates and compare accuracy.
There is a better, but somewhat more time-consuming, approach. If you want to take this system further, it can help.
For each column with missing data, find its nearest neighbor and impute with that neighbor's value. Suppose you have N columns excluding the target; for each column with missing values, treat it as the dependent variable and the other N-1 columns as independent variables. Find the row's nearest neighbor on the independent variables; that neighbor's value of the dependent variable is the desired replacement for the missing attribute.
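A minimal sketch of this nearest-neighbour idea using sklearn's KNNImputer, which implements a close variant (it measures distance on the remaining columns, ignoring NaNs):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [9.0, 8.0, 7.0],
])
imputer = KNNImputer(n_neighbors=1)  # use only the single nearest neighbour
X_filled = imputer.fit_transform(X)  # NaN -> 3.0, copied from the second row
print(X_filled)
```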
But which one should I select? If the data is categorical, it has to be either mode or median, while for continuous data it has to be mean or median.
Usually, the mode is used for categorical data and the mean for continuous data. But I recently saw an article where the geometric mean was used for categorical values.
If you build a model that uses columns containing NaNs, you can include columns with mean replacement, median replacement, and also a boolean 'value is NaN' indicator column. But it is better not to use linear models in this case, since those columns can be strongly correlated.
Besides, there are many other methods for replacing NaNs, for example the MICE algorithm.
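For reference, a minimal sketch of MICE-style imputation via sklearn's IterativeImputer (scikit-learn's analogue of MICE; it is still marked experimental, hence the extra import):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
X_filled = IterativeImputer(random_state=0).fit_transform(X)
```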
Regarding the features you use: they are OK, but I would advise adding some more features related to the distribution, for example:
skewness
kurtosis
similarity to a Gaussian distribution (and other distributions)
the number of 1D Gaussians needed to fit your column (via a GMM; this won't perform well with only 55 rows)
All of these items can be computed on the raw data as well as on transformed data (log, exp).
To explain: a column can contain many categories and may simply look like a numerical column under the old approach, even though it is not numerical. A distribution-matching algorithm may help here.
You can also try a different normalization. RobustScaler from sklearn may work well (it can help when categories have levels very similar to outlier values).
One last piece of advice: you can fit a random forest model to this task and extract the important columns. That list may give some direction for feature engineering/generation.
And, of course, looking at the misclassification (confusion) matrix, and at which features the errors happen on, is also a good thing! A sketch combining these suggestions follows.
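A minimal sketch of the distribution-shape features plus random forest importances (the feature set and the toy data are assumptions for illustration):

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier

def shape_features(x: np.ndarray) -> list:
    return [
        stats.skew(x),                  # skewness
        stats.kurtosis(x),              # kurtosis
        stats.normaltest(x).statistic,  # similarity to a Gaussian
    ]

rng = np.random.default_rng(0)
X = np.array([shape_features(rng.normal(size=200)) for _ in range(60)])
y = rng.integers(0, 2, size=60)  # 1 = continuous, 0 = categorical (toy labels)

clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.feature_importances_)  # a direction for feature engineering
```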
