FBProphet: Understanding Regressor Impact on Multivariate Forecast - time-series

Please see this example, as the project I am working on is quite similar but with ~8 regressors instead of 2, and I need to understand how each regressor impacts the forecast model: https://towardsdatascience.com/forecast-model-tuning-with-additional-regressors-in-prophet-ffcbf1777dda
Given a scenario like the one above with 2 additional regressors: how can we understand the impact of each regressor on the 'yhat' forecast (e.g. 'temp' has a 30% impact on the yhat prediction and 'weathersit' has a 70% impact, or something similar)? I have tried using "from fbprophet.utilities import regressor_coefficients" to see the regressor coefficients, but I'm not sure if that's the right approach.
Additionally, how should the regressor columns in the 'forecast' dataframe returned by '.predict()' be interpreted?
Thanks for your help.

After running regressor_coefficients(model), you will get the center and coef of each additive regressor. For example:
regressor_coefficients(my_model)
| | regressor | regressor_mode | center | coef_lower | coef | coef_upper |
|--|-----------|----------------|--------|------------|------|------------|
| 0 | temperature | additive | 6.346457 | -51.124462 | -51.124462 | -51.124462 |
| 1 | humidity | additive | 66.665910 | 7.736604 | 7.736604 | 7.736604 |
So the results from your prediction should be (for additive seasonal trends):
yhat = trend + yearly + extra_regressors_additive
where
extra_regressors_additive = (temperature_data - temperature_center) * temperature_coef
                          + (humidity_data - humidity_center) * humidity_coef
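If you want a single "X% impact" number per regressor, one hedged approach is to average the absolute per-regressor contribution implied by the formula above and normalize the results. A minimal sketch, assuming df holds the regressor input columns and my_model is the fitted Prophet model:
import numpy as np
from fbprophet.utilities import regressor_coefficients

coefs = regressor_coefficients(my_model).set_index("regressor")
impact = {}
for name in ["temperature", "humidity"]:
    # contribution of this regressor per the formula above
    centered = df[name] - coefs.loc[name, "center"]
    impact[name] = np.abs(centered * coefs.loc[name, "coef"]).mean()

total = sum(impact.values())
for name, value in impact.items():
    print(f"{name}: {100 * value / total:.0f}% of the combined regressor effect")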

You can find more detail about the regressors in the "forecast" dataframe: look for the columns named after your regressors. If you feel that fbprophet is underestimating the impact of a regressor, you can declare your regressor input values as binary instead. You can also cluster your regressor input values if binary values are not appropriate. If you still feel that your regressor is underestimated, have a look at the historical data of your regressor: does the y value change on the same day your regressor's behaviour changes? If not, you need to fix that first.
You can also refer to the section "Coefficients of additional regressors" of this website: https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html#additional-regressors

Related

ML accuracy for a particular group/range

General terms that I used to search on Google, such as localised accuracy, custom accuracy, and biased cost functions, all seem wrong, and maybe I am not even asking the right questions.
Imagine I have some data, be it:
The famous Iris Classification Problem
Pictures of felines
The following dataset that I made up on predicting house prices:
In all these scenarios, I am really interested in the accuracy of one class/one regression range of the data.
For irises, I really need Iris setosa to be classified correctly; I don't really care if Iris virginica and Iris versicolor are all wrong.
For felines, I really need the model to tell me if it spotted a tiger (for obvious reasons); whether it is a Persian or a ragdoll I don't really care.
For the house prices, I want the error on higher-end houses to be minimised, because errors in those are costly.
How do I do this? If I want setosa to be classified correctly, removing virginica or versicolor both seem wrong. Trying different algorithms like linear models/SVMs etc. is all well and good, but that only improves the OVERALL accuracy. I really need, for example, "tigers" to be predicted correctly, even at the expense of the "overall" accuracy of the model.
Is there a way to have a custom cost function that allows me to get high accuracy in a localised region of a regression problem, or for a specific category in a classification problem?
If this cannot be answered, pointing me to some terms I can search/research would still be greatly appreciated.
You can use weights to achieve that. If you're using the SVC class of scikit-learn, you can pass class_weight to the constructor. You could also pass sample_weight to the fit method.
For example:
from sklearn import svm
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

# class 0 (setosa) gets three times the weight of the other classes
clf = svm.SVC(class_weight={0: 3, 1: 1, 2: 1})
clf.fit(X, y)
This way setosa is more important than the other classes.
Example for regression:
from sklearn.linear_model import LinearRegression

X = ...  # features
y = ...  # house prices

# weight higher-end houses three times as heavily as the rest;
# `threshold` is whatever price marks a house as higher-end
weights = []
for house_price in y:
    if house_price > threshold:
        weights.append(3)
    else:
        weights.append(1)

reg = LinearRegression()
reg.fit(X, y, sample_weight=weights)

Problem with XGboost Classification & eli5 package

When training an XGBoost classification model, I am using the eli5 function "explain_prediction()" to look at the feature contributions to individual predictions.
However, the eli5 package seems to be treating my model as a regressor rather than a classifier.
Below is a snippet of code, showing my model, my prediction, and then the output from the "explain_prediction" method.
As you can see, the output gives a score that is 3.016 rather than a probability between 0 and 1. In this case I would have expected 0.953.
Any help appreciated.
the eli5 package seems to be treating my model as a regressor rather than a classifier.
The boosting score is converted to the probability score by applying the inverse logit function to it.
The probability scale is non-linear, which would make the numeric interpretation of feature contributions more difficult.
.. the output gives a score that is 3.016 .. I would have expected 0.953
1 / (1 + exp(-3.016)) = 0.9532917416863492
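You can verify this conversion yourself; a minimal check using scipy's expit (the inverse logit):
from scipy.special import expit

score = 3.016        # raw boosting score reported by eli5
print(expit(score))  # 0.9532917416863492, the probability the model reports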

Can i use dataframe with sparse vector to do cross-validation tuning?

I'm training my multilayer perceptron classifier. Here's my training set; the features are in sparse vector format.
df_train.show(10,False)
+------+---------------------------+
|target|features |
+------+---------------------------+
|1.0 |(5,[0,1],[164.0,520.0]) |
|1.0 |[519.0,2723.0,0.0,3.0,4.0] |
|1.0 |(5,[0,1],[2868.0,928.0]) |
|0.0 |(5,[0,1],[57.0,2715.0]) |
|1.0 |[1241.0,2104.0,0.0,0.0,2.0]|
|1.0 |[3365.0,217.0,0.0,0.0,2.0] |
|1.0 |[60.0,1528.0,4.0,8.0,7.0] |
|1.0 |[396.0,3810.0,0.0,0.0,2.0] |
|1.0 |(5,[0,1],[905.0,2476.0]) |
|1.0 |(5,[0,1],[905.0,1246.0]) |
+------+---------------------------+
First of all, I want to evaluate my estimator with a hold-out method; here's my code:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

layers = [4, 5, 4, 3]
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
param = trainer.setParams(featuresCol="features", labelCol="target")
train, test = df_train.randomSplit([0.8, 0.2])
model = trainer.fit(train)
result = model.transform(test)
evaluator = MulticlassClassificationEvaluator(
    labelCol="target", predictionCol="prediction", metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(result)))
But it fails with the error: Failed to execute user defined function($anonfun$1: (vector) => double). Is this because I have sparse vectors in my features? What can I do?
And for the cross-validation part, I coded as following:
X=df_train.select("features").collect()
y=df_train.select("target").collect()
from sklearn.model_selection import cross_val_score,KFold
k_fold = KFold(n_splits=10, random_state=None, shuffle=False)
print(cross_val_score(trainer, X, y, cv=k_fold, n_jobs=1,scoring="accuracy"))
And I get: "it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' method."
But when I look at the documentation, I can't find a get_params method. Can someone help me with this?
There are a number of issues with your question...
Focusing on the second part (it is actually a separate question): the error message's claim, i.e. that
it does not seem to be a scikit-learn estimator
is indeed correct, since you are using the MultilayerPerceptronClassifier from PySpark ML as trainer in the scikit-learn method cross_val_score (they are not compatible).
Additionally, your 2nd code snippet is not PySpark-like at all, but scikit-learn-like: while you use the input correctly in your 1st snippet (a single 2-column dataframe, with the features in one column and the labels/targets in the other), you seem to have forgotten this lesson in your 2nd snippet, where you build separate dataframes X and y as input to your classifier (which is how it should be in scikit-learn, but not in PySpark). See the CrossValidator docs for a straightforward example of the correct usage.
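For illustration, here is a minimal PySpark-style sketch along those lines (the maxIter grid values are my assumption; trainer and evaluator are the objects from your first snippet):
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

grid = ParamGridBuilder().addGrid(trainer.maxIter, [50, 100]).build()
cv = CrossValidator(estimator=trainer, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=10)
cv_model = cv.fit(df_train)  # a single dataframe holding both features and target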
From a more general viewpoint: if your data fit in the main memory (i.e. you can collect them as you do for your CV), there is absolutely no reason to bother with Spark ML, and you would be far better off with scikit-learn.
--
Regarding the 1st part: the data you have shown seem to have only 2 labels, 0.0 and 1.0; I cannot be sure (since you show only 10 records), but if you indeed have only 2 labels you should not use MulticlassClassificationEvaluator but BinaryClassificationEvaluator - which, however, does not have a metricName="accuracy" option... [EDIT: against all odds, it seems that MulticlassClassificationEvaluator can indeed work for binary classification too, and it is a handy way to get the accuracy, which is not provided by its binary counterpart!]
But this is not why you get the error (which, BTW, has nothing to do with the evaluator - you also get it with result.show() or result.collect()); the reason for the error is that the number of nodes in your first layer (layers[0]) is 4, while your input vectors are evidently 5-dimensional. From the docs:
Number of inputs has to be equal to the size of feature vectors
Changing layers[0] to 5 resolves the issue (sketched below). Similarly, if you indeed have only 2 classes, you should also change layers[-1] to 2 (you won't get an error if you don't, but it won't make much sense from a classification point of view).
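Putting both changes together, the layer definition would become something like this (hidden layer sizes kept from your snippet):
layers = [5, 5, 4, 2]  # 5 inputs to match the feature vectors, 2 output classes
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers,
                                         blockSize=128, seed=1234)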

sklearn - Predict each class's probability

So far I have consulted another post and the sklearn documentation.
In general, I want to produce the following example:
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array(['A', 'B', 'B', 'C'])  # one label per row of X
Xt = np.array([[11, 22], [22, 33], [33, 44], [44, 55]])

model = model.fit(X, y)  # model: some classifier, unspecified here
pred = model.predict(Xt)
However, for the output I would like to see 3 columns per observation from pred:
A | B | C
.5 | .2 | .3
.25 | .25 | .5
...
and a different probability for each class showing up in my prediction.
I believe that the best approach would be multilabel classification (from the second link I provided above). Additionally, I think it might be a good idea to look into one of the multi-label or multi-output models listed below:
Support multilabel:
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.neighbors.KNeighborsClassifier
sklearn.neural_network.MLPClassifier
sklearn.neighbors.RadiusNeighborsClassifier
sklearn.ensemble.RandomForestClassifier
sklearn.linear_model.RidgeClassifierCV
Support multiclass-multioutput:
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.neighbors.KNeighborsClassifier
sklearn.neighbors.RadiusNeighborsClassifier
sklearn.ensemble.RandomForestClassifier
However, I am looking for someone who has more confidence and experience doing this the right way. All feedback is appreciated.
-bmc
From what I understand, you want to obtain the probability of each potential class from a multi-class classifier.
In scikit-learn this is done with the generic method predict_proba, which is implemented for most classifiers. You basically call:
clf.predict_proba(X)
where clf is the trained classifier.
As output you will get an array of probabilities, one per class, for each input value.
One word of caution - not all classifiers naturally estimate class probabilities. For instance, an SVM doesn't. You can still obtain the class probabilities, but to do that you must instruct the classifier to perform probability estimation when constructing it. For an SVM it would look like:
SVC(probability=True)
After you fit it, you will be able to use predict_proba as before.
I need to warn you that if a classifier doesn't naturally estimate probabilities, they will be computed with rather expensive methods that may significantly increase training time. So I advise you to use classifiers that naturally estimate class probabilities (neural networks with a softmax output, logistic regression, gradient boosting, etc.).
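For example, a minimal sketch with a classifier that models probabilities natively (logistic regression on the iris data):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:2]))  # one row per sample, one column per class
Each row sums to 1, with the columns ordered as in clf.classes_.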
Try using a calibrated model:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

# define model
model = SVC()
# define and fit calibration model
calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
calibrated.fit(trainX, trainy)
# predict probabilities
print(calibrated.predict_proba(testX)[:, 1])

Neural Network for 10 inputs and 10 outputs

I've got a physical problem: to construct a product, 10 output parameters (width, length, material, etc.) are determined based on 10 input parameters (performance, temperature, capacity, etc.). The output parameters obviously depend on the input parameters, but I don't know how. For example, output parameter O1 could depend on input parameters I1, I2 and I3.
I've got the data of, let's say, 30k products with their input/output parameters. The database looks like this:
| Product | I1  | I2  | I3  | ... | O1  | O2  | O3  |
|---------|-----|-----|-----|-----|-----|-----|-----|
| Prod A  | 1.2 | 2.3 | 4.2 | ... | 5.3 | 6.2 | 1.2 |
| Prod B  | 2.3 | 4.1 | 1.2 | ... | 8.2 | 5.2 | 5.0 |
| Prod C  | 6.3 | 3.7 | 9.1 | ... | 3.1 | 4.1 | 7.7 |
| ...     |     |     |     |     |     |     |     |
So what I need to do is find the output parameters O1-O10 based on the input parameters I1-I10.
First question: if I understand it right, this is a regression problem; based on some input values I want to find some output values (somewhere in the data there is a function/formula that determines the correct values). Is this correct?
My idea is to use/train a neural network (using Keras with TensorFlow as the backend).
What would such a neural network look like? What is the best practice?
This is what I have so far:
An input layer with 10 inputs, two fully connected hidden layers with 100 neurons each, and a layer with 10 outputs. In Keras this looks like this:
def baseline_model(self, callback):
    model = Sequential()
    model.add(Dense(100, input_dim=10, activation="relu"))
    model.add(Dense(100, activation="relu"))
    model.add(Dense(10))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=["accuracy"])
    model.fit(input_train, output_train, batch_size=5, epochs=2000, verbose=2,
              callbacks=[callback], shuffle=True, validation_data=(input_val, output_val))
    scores = model.evaluate(input_val, output_val, verbose=1)
    print("Scores:", scores)
Of course the model does not work as expected; that's why I'm asking for help... the training fails:
Epoch 1999/2000
7s - loss: 47634520366153.6016 - acc: 0.0000e+00 - val_loss: 9585392308285.4395 - val_acc: 0.0000e+00
Any suggestions on what I should change? I thought about using "sigmoid" as the activation and normalizing the data to [0, 1].
Thanks for any advice.
If I get it right, this is a regression problem, based on some input values I want to find some output values
Yes, I think you are right.
How would such a neuronal network look like? What is the best practice?
That's a very broad question. I think you should split your data into a training and a validation set, start from the simplest network (maybe no hidden layer, or only one hidden layer) and then make it more and more complicated (add more layers and hidden units) as long as your validation error keeps decreasing. Once your net becomes quite deep, it's a good idea to add Batch Normalization layers between your dense layers. You can also look at residual connections, but I'm not sure you really need them.
Any suggestions what I should change? I thought about using "sigmoid" as activation and to normalize the Data to [0,1].
The activation function depends on your output type: for categorical outputs sigmoid/softmax is probably a good choice, while linear should be OK for floating-point numbers.
Also, if one of your inputs is categorical (material type, for example), it may be better to split it into several binary inputs.
It's almost always a good idea to normalize your inputs and outputs; non-normalized data can really hurt the training process.
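As a minimal sketch of that normalization step (assuming input_train/output_train are the NumPy arrays from your question; StandardScaler is one common choice, and MinMaxScaler would give you the [0, 1] scaling you mention):
from sklearn.preprocessing import StandardScaler

x_scaler = StandardScaler().fit(input_train)
y_scaler = StandardScaler().fit(output_train)

input_train_n = x_scaler.transform(input_train)
output_train_n = y_scaler.transform(output_train)
# train on the normalized arrays, then map predictions back to the
# original scale with y_scaler.inverse_transform(model.predict(...))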
Plot the error and check how it changes over time. A loss of 47634520366153.6016 is really big, but by itself it tells us little about the optimization. If it decreases, maybe you can increase the learning rate. If it grows, try decreasing the learning rate or try another optimization algorithm.
Check your gradients; if they are too big, try gradient clipping.
Also try starting from a simpler model, maybe linear regression.
Strictly speaking, neural network debugging is a big and complicated field, and I am not sure it's appropriate for a Stack Overflow discussion.
PS Sorry for my English
As @Dark_davier has already said, this is a field where you need some experience; it is not really possible to answer without actually running some tests. But as a guideline, be careful with the size of your network. Your network has roughly 10^4 parameters (a bit more), and you said you have "only" 30k observations, so there is a high probability of overfitting. You need to be careful: you may need more sophisticated techniques to avoid it (first cross-validation to check, then possibly regularisation). But this requires some experience in NN optimisation...
