Do large, consistent residuals after STL decomposition signify non-periodicity? - time-series

I used STL decomposition on the power consumption data from an air-conditioner over a period of 10 weeks. I would expect this data to be periodic over a week. What I observe is that the residuals are huge compared to the seasonal and trend components.
Does this mean that the data that I have does not accurately represent the weekly cycle of air-conditioner usage? Or is this model still good enough to be used for anomaly detection? Also, the trend seems to have a periodicity. What does this signify?
[Figure: STL decomposition of power consumption (freq = 1 week)]

Case Closed.
The data indeed did not accurately represent the weekly cycle of AC usage. The AC was being used on a whim.
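One rough way to check how much of a series a fixed weekly cycle explains is to compare the residual variance with the variance of the seasonal-plus-residual part. The sketch below is only an illustration under assumptions: it uses a plain seasonal-means decomposition rather than STL itself, it assumes there is no strong trend, and `seasonal_strength` is a hypothetical helper name. Values near 1 suggest a stable cycle; values near 0 suggest the residual dominates.

```python
import math
import random
from statistics import mean, pvariance

def seasonal_strength(series, period):
    """Return max(0, 1 - Var(resid) / Var(seasonal + resid))."""
    # Crude detrending (assumption): subtract the overall mean only.
    level = mean(series)
    detrended = [x - level for x in series]
    # Seasonal profile: average detrended value at each position in the cycle.
    profile = [mean(detrended[i::period]) for i in range(period)]
    seasonal = [profile[i % period] for i in range(len(series))]
    resid = [d - s for d, s in zip(detrended, seasonal)]
    denom = pvariance(detrended)  # detrended == seasonal + resid
    if denom == 0:
        return 0.0
    return max(0.0, 1.0 - pvariance(resid) / denom)

# Synthetic illustration data, not the air-conditioner measurements.
random.seed(0)
weekly = [math.sin(2 * math.pi * t / 7) for t in range(70)]  # clean 7-step cycle
erratic = [random.gauss(0, 1) for _ in range(70)]            # no cycle at all
print(seasonal_strength(weekly, 7), seasonal_strength(erratic, 7))
```

A low score on real data would be consistent with usage that does not follow a weekly pattern.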

Related

How to deal with a skewed Time series data

I have hourly data on the number of minutes people spent online over 2 years. The values are therefore distributed between 0 and 60, and most of the data is either 0 or 60. My goal is to predict the number of minutes a person will spend online in the future (next day/hour/month etc.). What kind of approach or machine learning model can I use to predict this data? Can this be modelled as a regression/forecasting problem in spite of the skewness?
[Figure: hourly data]
For time series prediction, it is better to use a regression model than a classification or clustering model, because the task is to compute specific numeric values.
It can be modeled as a regression problem to some extent, but greater skewness means the data are further from a normal distribution, which can affect how the model fits, lower prediction accuracy, and so forth. In any case, data with significant skewness cannot be regarded as well-refined data, so you might need to rearrange the samples so that the skewness decreases.
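To judge how skewed the target actually is, a sample skewness coefficient can be computed first. The following is a minimal sketch using the moment-based definition (third central moment divided by the 1.5 power of the second central moment). The `minutes` values are made-up illustration data, not the asker's, and `sample_skewness` is a hypothetical helper name.

```python
from statistics import mean

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (0 for symmetric data)."""
    values = list(values)
    mu = mean(values)
    m2 = mean([(v - mu) ** 2 for v in values])
    m3 = mean([(v - mu) ** 3 for v in values])
    if m2 == 0:
        return 0.0  # constant data: skewness is undefined, report 0 here
    return m3 / m2 ** 1.5

# Hypothetical hourly minutes online: mostly 0, sometimes 60, rarely in between.
minutes = [0] * 80 + [15, 30, 45] + [60] * 17
print(sample_skewness(minutes))
```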

Determine if one time series forecasts another (in terms of trend only)

I have 2 time series, X_t and Y_t, which are on different scales.
Y_t can range from 0 to infinity, while X_t is limited to 0 to 100.
How can I determine if the trend of X_t forecasts the trend of Y_t? In other words, if there is a peak in Xt, then a peak in Yt will follow after some lag.
If this is indeed the case, what is the lag?
I am not interested in forecasting the actual value of Yt.
Using the following chart as an illustration, the red line is Xt (whose values in my data are between 27 and 34), and the black line is Yt (which is about 40000).
I tried to use time-lagged Pearson correlation, but I am aware that the Pearson correlation of the 2 time series has no concept of time. It simply treats the time series as lists of data.
I have read some guides on Granger causality, but it seems to check whether the value of Xt is useful in forecasting the value of Yt, which is similar to a regression framework, whereas I am mostly interested in forecasting the trend of Yt.
I am a newbie in time series analysis. Thanks for your time!
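A time-lagged Pearson correlation, as mentioned in the question, can be turned into a lag estimate by shifting Y_t against X_t and keeping the shift with the highest correlation. The sketch below is only an illustration under assumptions: `pearson` and `best_lag` are hypothetical helper names and the synthetic series are invented. Because Pearson correlation is invariant to scale and offset, the different ranges of X_t and Y_t do not matter here. A high correlation at some lag suggests, but does not prove, that X_t leads Y_t.

```python
import math

def pearson(a, b):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    if va == 0 or vb == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / math.sqrt(va * vb)

def best_lag(x, y, max_lag):
    """Correlate x[t] with y[t + k] for k = 0..max_lag; return (best k, all scores)."""
    scores = {}
    for k in range(max_lag + 1):
        scores[k] = pearson(x[:len(x) - k], y[k:])
    return max(scores, key=scores.get), scores

# Synthetic data: y copies the shape of x three steps later, on another scale.
x = [30 + 3 * math.sin(t / 5) + math.sin(t / 2) for t in range(200)]
y = [40000 + 1000 * x[max(t - 3, 0)] for t in range(200)]
lag, scores = best_lag(x, y, max_lag=10)
print(lag, round(scores[lag], 3))
```

If the series have strong trends, correlating their first differences instead of the raw values may give a less misleading picture.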

Isolation Forest for time series data

I wonder whether Isolation Forest (iForest) can work with time-series data. As far as I know, iForest is used for anomaly detection, and it is based on randomization techniques that randomly and recursively partition the data and then store the partitions in a tree structure.
I have a theoretical question. Since iForest is based on randomization, would it violate the characteristics of time series data, as the randomization may break the time dependencies?
By default, Isolation Forest will help with detecting point anomalies, since in principle it just works on the rarity of these observations.
But let's say I am interested in anomalies in time series data. Isolation Forest will be able to pick out the extreme peaks and troughs that occur as point anomalies, but for collective anomalies you may need to transform the data so that each observation represents a collection of observations (rolling window operations, etc.).
The reason is that in time series data you are interested in additive outliers or temporal changes, and thus each observation must represent that individually if you plan to use Isolation Forest. But you can try other techniques such as STL decomposition, ARIMA, regression trees, or exponential smoothing. You should find a lot of material on how to use these for anomaly detection in time series.
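The rolling-window transformation mentioned above can be sketched as follows: each row summarizes a short stretch of the series, so a detector that scores rows independently sees local context rather than single points. The function name `window_features`, the window width, and the chosen summary statistics are illustrative assumptions.

```python
from statistics import mean, pstdev

def window_features(series, width):
    """Turn a series into one feature row per sliding window of `width` points."""
    rows = []
    for start in range(len(series) - width + 1):
        w = series[start:start + width]
        rows.append([
            mean(w),          # local level
            pstdev(w),        # local variability
            max(w) - min(w),  # local range
            w[-1] - w[0],     # net change across the window
        ])
    return rows

# Made-up series with a short level shift in the middle.
series = [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 1.0, 1.1]
features = window_features(series, width=3)
for row in features:
    print([round(v, 2) for v in row])
```

The resulting rows could then be passed to an Isolation Forest implementation, for example scikit-learn's `IsolationForest`, in place of the raw observations.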

Why do TensorFlow tf.learn classification results vary a lot?

I use the TensorFlow high-level API tf.learn to train and evaluate a DNN classifier for a series of binary text classifications (actually I need multi-label classification, but at the moment I check every label separately). My code is very similar to the tf.learn tutorial:

    import tensorflow as tf  # TensorFlow 1.x; tf.contrib was removed in 2.x

    # training_set and validation_set are assumed to be loaded beforehand (not shown)
    classifier = tf.contrib.learn.DNNClassifier(
        hidden_units=[10],
        n_classes=2,
        dropout=0.1,
        feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(training_set.data))
    classifier.fit(x=training_set.data, y=training_set.target, steps=100)
    val_accuracy_score = classifier.evaluate(x=validation_set.data, y=validation_set.target)["accuracy"]

The accuracy score varies roughly from 54% to 90% across runs, with 21 documents in the validation (test) set, which are always the same.
What does this very large variation mean? I understand there are some random factors (e.g. dropout), but to my understanding the model should converge towards an optimum.
I use words (lemmas), bi- and trigrams, sentiment scores and LIWC scores as features, so I do have a very high-dimensional feature space, with only 28 training and 21 validation documents. Can this cause problems? How can I consistently improve the results apart from collecting more training data?
Update: To clarify, I generate a dictionary of occurring words and n-grams and discard those that occur only 1 time, so I only use words (n-grams) that exist in the corpus.
This has nothing to do with TensorFlow. This dataset is ridiculously small, so you can obtain almost any result. You have 28 + 21 points in a space with an "infinite" number of dimensions (there are around 1,000,000 English words, thus 10^18 trigrams; some of them do not exist, and most certainly do not occur in your 49 documents, but you still have at least 1,000,000 dimensions). For such a problem, you have to expect huge variance in the results.
How can I consistently improve the results apart from collecting more training data?
You pretty much cannot. This is simply way too small a sample for any statistical analysis.
Consequently, the best you can do is change the evaluation scheme: instead of splitting the data 28/21, do 10-fold cross-validation. With ~50 points this means you will have to run 10 experiments, each with about 45 training documents and 4 or 5 test ones, and average the results. This is the only thing you can do to reduce the variance. However, remember that even with CV, a dataset this small gives you no guarantee of how well your model will actually behave "in the wild" (once applied to never-before-seen data).
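The 10-fold scheme described above could be sketched like this. It is a minimal, library-free illustration: `kfold_indices` is a hypothetical helper, and `train_and_score` in the comment stands in for whatever training and evaluation routine is actually used.

```python
import random

def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx

splits = list(kfold_indices(49, 10))  # 28 + 21 documents pooled together
for train_idx, test_idx in splits:
    # accuracy = train_and_score(train_idx, test_idx)  # hypothetical routine
    print(len(train_idx), len(test_idx))
```

Averaging the per-fold accuracies gives a single estimate that is less sensitive to one particular split than the fixed 28/21 partition.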

What are the different strategies for detecting noisy data in a pile of text?

I have around 10 GB of text from which I extract features based on a bag-of-words model. The problem is that the feature space is very high-dimensional (1 million words), and I cannot discard words based on their counts, as both the most and least frequent words are important for the model to perform better. What are the different strategies for reducing the size of the training data and the number of features while still maintaining or improving model performance?
Edit:
I want to reduce the size of the training data because of both overfitting and training time. I am using FastRank (boosted trees) as my ML model. My machine has a Core i5 processor with 8 GB of RAM. The number of training instances is of the order of 700-800 million. Including processing, it takes more than an hour for the model to train. I currently randomly sample the training and test data to reduce the size to 700 MB or so, so that training finishes in minutes.
I'm not totally sure if this will help you because I don't know what your study is about, but if there is a logical way to divide up the 10 GB of text (into documents or paragraphs, perhaps), you can try tf-idf. http://en.wikipedia.org/wiki/Tf%E2%80%93idf
This will allow you to discard words that appear very often across all partitions; the usual understanding is that they don't contribute significant value to the overall document, paragraph, etc.
And if your only requirement is to keep the most and least frequent words, would looking at the distribution of the word frequencies help? Get rid of the words within 1 standard deviation of the average (or whatever number you see fit).
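As a rough sketch of the tf-idf idea mentioned above, the snippet below scores each word per document using the common tf × log(N / df) weighting, so a word that appears in every document gets weight 0 and becomes a candidate for removal. The toy documents and the helper name `tfidf` are assumptions for illustration, and real libraries often use smoothed variants of this formula.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: tf-idf score} dict per document (whitespace tokens)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Toy corpus standing in for the partitions of the 10 GB of text.
docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tfidf(docs)
print(scores[0])
```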
