The log-likelihood function of the ARMA model after the deprecation of statsmodels.tsa.arima_model.ARMA

I am following a tutorial that builds an AR model using the ARMA class from statsmodels.tsa.arima_model. I got an error saying that it has been deprecated and replaced by ARIMA from statsmodels.tsa.arima.model. I am trying to write a function that calculates the test statistic of a log-likelihood ratio test, which requires the llf attribute of a fitted ARMA model. What is the counterpart of the llf attribute in the ARIMA model?
I tried statsmodels.tsa.arima.model.ARIMA.fit().loglike(), but I did not know what parameters it should get.
ar_2_model = ARIMA(data, order=(2,0,0)).fit()
I tried ar_2_model.loglike(), but I could not figure out what parameters to pass to it.
I want to get the value of the log-likelihood function of ar_2_model, similar to what I would have gotten with
ar_3_model = ARMA(data, order=(2,0)).fit()
ll_f = ar_3_model.llf

You can still use llf:
ar_2_model = ARIMA(data, order=(2,0,0) ).fit()
print(ar_2_model.llf)
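For the log-likelihood ratio test mentioned in the question, the llf values of two nested fitted models can be plugged straight into the test statistic. Here is a minimal sketch, assuming the same data and, purely for illustration, an AR(3) model as the unrestricted alternative:
from scipy.stats import chi2
from statsmodels.tsa.arima.model import ARIMA
ar_2_model = ARIMA(data, order=(2, 0, 0)).fit()  # restricted model
ar_3_model = ARIMA(data, order=(3, 0, 0)).fit()  # unrestricted model with one extra AR lag
# LR statistic: twice the difference of the log-likelihoods.
lr_stat = 2 * (ar_3_model.llf - ar_2_model.llf)
p_value = chi2.sf(lr_stat, df=1)  # df = number of extra parameters
print(lr_stat, p_value)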

Related

Find the importance of each column to the model

I have an ML.NET project, and so far everything has gone great. I have a motor that collects a power reading 256 times per rotation, and I push those readings into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values at a time, so I have been spending several rotations collecting the full 256 samples for my training data.
I would like to cut the sample size down to 38 so that I can determine the motor's state every rotation. If I just evenly space the samples down to 38, my model degrades by a lot. I know I am not feeding the model the features it thinks are most important, but am just making a guess and randomly selecting data for it.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this and I found the below statement about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out such things as the weight for each column, and how would I do that?
I have actually only been working with ML.NET for a couple of weeks now, so I apologize if the question is naive; I assure you I have googled this as many ways as I can think of. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer; I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the condition as one of five text strings. This is the condition I am trying to predict.
The first time I tried it, it threw the error "expecting single but got string". No problem; I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values, and then it threw the error "expected Single, got Key UInt32". Any ideas on how to push that into this function?
At any rate, thank you for the reply, but I guess my upvotes don't count yet, sorry. Hopefully I can upvote it later or someone else here can. Below is the code example.
// Create MLContext
MLContext mlContext = new MLContext();
// Load Data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);
// 1. Get the column name of input features.
string[] featureColumnNames =
    data.Schema
        .Select(column => column.Name)
        .Where(columnName => columnName != "Label")
        .ToArray();
// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", featureColumnNames)
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));
// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data); // error here
// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);
// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();
// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);
ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext
        .Regression
        .PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);
// Order features by importance
var featureImportanceMetrics =
    permutationFeatureImportance
        .Select((metric, index) => new { index, metric.RSquared })
        .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));
Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
    Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance. It tells you which features are most important by changing each feature in isolation and then measuring how much that change affects the model's performance metrics.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.
You can also use an open-source set of packages that are much more sophisticated than what is found in ML.NET. I have an example on my GitHub of how to use R with advanced explainer packages to explain ML.NET models. You can get local instance explanations as well as global model breakdowns, details, diagnostics, feature interactions, etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX

How to fetch values using Permutation Feature Importance

I have a dataset with 5K records (and 60 features) focused on binary classification.
Please note that this solution doesn't work here.
I am trying to generate feature importance using Permutation Feature Importance. However, I get the below error. Can you please look at my code and let me know whether I am making any mistake?
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
model = logreg.fit(X_train_std, y_train)
perm = PermutationImportance(model, random_state=1)
eli5.show_weights(perm, feature_names=X.columns.tolist())
I get the error shown below:
AttributeError: 'PermutationImportance' object has no attribute 'feature_importances_'
Can you help me resolve this error?
If you look at the attributes of your PermutationImportance object via
dir(perm)
you can see all attributes and methods, BUT only after you fit your PI object, meaning that you need to do:
perm = PermutationImportance(model, random_state=1).fit(X_train, y)
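Putting it all together with the code from the question, the corrected snippet would look roughly like this (a sketch, assuming the same X_train_std, y_train and X as in the question):
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
model = logreg.fit(X_train_std, y_train)
# Fitting the PermutationImportance object is what populates feature_importances_.
perm = PermutationImportance(model, random_state=1).fit(X_train_std, y_train)
eli5.show_weights(perm, feature_names=X.columns.tolist())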

Is there a way to specify dispersion parameter in glmer (binomial family)?

I am trying to get estimates and standard errors from a glmer model. Once I know the dispersion parameter, I want to call summary and get the information. Is there a way to do it?
EDITED: I have seen that you can do what I want for glm with summary(model, dispersion = 2.5) for model inference. However, it does not work for glmer models.

save binarizer together with sklearn model

I'm trying to build a service that has 2 components. In component 1, I train a machine learning model using sklearn by creating a Pipeline. This model gets serialized using joblib.dump (really numpy_pickle.dump). Component 2 runs in the cloud, loads the model trained by (1), and uses it to label text that it gets as input.
I'm running into an issue where, during training (component 1) I need to first binarize my data since it is text data, which means that the model is trained on binarized input and then makes predictions using the mapping created by the binarizer. I need to get this mapping back when (2) makes predictions based on the model so that I can output the actual text labels.
I tried adding the binarizer to the pipeline like this, thinking that the model would then have the mapping itself:
p = Pipeline([
    ('binarizer', MultiLabelBinarizer()),
    ('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(clf))
])
But I get the following error:
model = p.fit(training_features, training_tags)
*** TypeError: fit_transform() takes 2 positional arguments but 3 were given
My goal is to make sure the binarizer and model are tied together so that the consumer knows how to decode the model's output.
What are some existing paradigms for doing this? Should I be serializing the binarizer together with the model in some other object that I create? Is there some other way of passing the binarizer to Pipeline so that I don't have to do that, and would I be able to get the mappings back from the model if I did that?
Your intuition that you should add the MultiLabelBinarizer to the pipeline was the right way to solve this problem. It would have worked, except that MultiLabelBinarizer.fit_transform does not take the fit_transform(self, X, y=None) method signature which is now standard for sklearn estimators. Instead, it has a unique fit_transform(self, y) signature which I had never noticed before. As a result of this difference, when you call fit on the pipeline, it tries to pass training_tags as a third positional argument to a function with two positional arguments, which doesn't work.
The solution to this problem is tricky. The cleanest way I can think of to work around it is to create your own MultiLabelBinarizer that overrides fit_transform and ignores its third argument. Try something like the following.
class MyMLB(MultiLabelBinarizer):
    def fit_transform(self, X, y=None):
        # Ignore the extra y argument and delegate to MultiLabelBinarizer.fit_transform.
        return super(MyMLB, self).fit_transform(X)
Try adding this to your pipeline in place of the MultiLabelBinarizer and see what happens. If you're able to fit() the pipeline, the last problem that you'll have is that your new MyMLB class has to be importable on any system that will de-pickle your now trained, pickled pipeline object. The easiest way to do this is to put MyMLB into its own module and place a copy on the remote machine that will be de-pickling and executing the model. That should fix it.
I misunderstood how the MultiLabelBinarizer worked. It is a transformer of outputs, not of inputs. Not only does this explain the alternative fit_transform() method signature for that class, but it also makes it fundamentally incompatible with inclusion in a single classification pipeline, which is limited to transforming inputs and making predictions of outputs. However, all is not lost!
Based on your question, you're already comfortable with serializing your model to disk as [some form of] a .pkl file. You should be able to also serialize a trained MultiLabelBinarizer, and then unpack it and use it to decode the outputs from your pipeline. I know you're using joblib, but I'll write up this sample code as if you're using pickle. I believe the idea will still apply.
X = <training_data>
y = <training_labels>
# Binarize the class labels for multi-label classification.
mlb = MultiLabelBinarizer()
multilabel_y = mlb.fit_transform(y)
p = Pipeline([
    ('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(clf))
])
# Use the multilabel classes to fit the pipeline.
p.fit(X, multilabel_y)
# Serialize both the pipeline and the binarizer to disk.
with open('my_sklearn_objects.pkl', 'wb') as f:
    pickle.dump((mlb, p), f)
Then, after shipping the .pkl files to the remote box...
# Hydrate the serialized objects.
with open('my_sklearn_objects.pkl', 'rb') as f:
    mlb, p = pickle.load(f)
X = <input data>  # Get your input data from somewhere.
# Predict the classes using the pipeline.
mlb_predictions = p.predict(X)
# Turn those classes into labels using the binarizer.
classes = mlb.inverse_transform(mlb_predictions)
# Do something with the predicted classes.
<...>
Is this the paradigm for doing this? As far as I know, yes. Not only that, but if you desire to keep them together (which is a good idea, I think) you can serialize them as a tuple as I did in the example above so they stay in a single file. No need to serialize a custom object or anything like that.
Model serialization via pickle et al. is the sklearn-approved way to save estimators between runs and move them between computers. I've used this process successfully many times before, including in production systems.
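As a side note, the same idea works with joblib, which the question already uses; a minimal sketch (the file name is just an example):
from joblib import dump, load
# Save the binarizer and the fitted pipeline together in one file.
dump((mlb, p), 'my_sklearn_objects.joblib')
# ... and later, on the remote box:
mlb, p = load('my_sklearn_objects.joblib')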

Train and test set are not compatible error in weka?

I'm trying to test my model with a new dataset. I have done the same preprocessing steps as I did when building my model. I have compared the two files and there are no issues: all the attributes (train vs. test dataset) are in the same order, with the same attribute names and data types. But I'm still not able to resolve the issue. Both files, train and test, seem to be similar, but the Weka Explorer gives me an error saying "Train and test set are not compatible". How do I resolve this error? Is there any way to make the test.arff file format match train.arff? Can somebody please help me?
The same as the comment that I left after the problem statement:
All three attributes are nominal attributes, followed by all the possible values enclosed in '{}'. My guess is that the possible values are not the same. For example, for the RESOURCE attribute there is no 199 in the test file, while it is in the training file.
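One quick way to spot such mismatches is to compare the @attribute declarations of the two headers, for example with a small script along these lines (a sketch; the file names are placeholders and quoted attribute names are not handled):
def arff_attributes(path):
    # Collect '@attribute NAME ...' declarations up to the @data marker.
    attrs = {}
    with open(path) as f:
        for line in f:
            if line.lower().startswith('@attribute'):
                name = line.split()[1]
                attrs[name] = line.strip()
            elif line.lower().startswith('@data'):
                break
    return attrs
train_attrs = arff_attributes('train.arff')
test_attrs = arff_attributes('test.arff')
for name, declaration in train_attrs.items():
    if test_attrs.get(name) != declaration:
        print('Mismatch for attribute', name)
        print('  train:', declaration)
        print('  test: ', test_attrs.get(name))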
After struggling with the same problem for a day, I figured out two ways to make the trained model work on a supplied test set.
Method 1.
Use the KnowledgeFlow. For example, something like below:
CSVLoader (for the train set) -> ClassAssigner -> TrainingSetMaker -> (classifier of your choice) -> ClassifierPerformanceEvaluator -> TextViewer.
CSVLoader (for the test set) -> ClassAssigner -> TestSetMaker -> (the same classifier instance as above) -> PredictionAppender -> CSVSaver.
Then load the data from the CSVLoader or ArffLoader for the training set. The model will be trained.
After that, load data from the loader for the test set. It will evaluate the model (classifier, for example) on the supplied test set, and you can see the result in the TextViewer (connected to the ClassifierPerformanceEvaluator) and get the saved result from the CSVSaver or ArffSaver connected to the PredictionAppender. An additional column, "classified as", will be added to the output file.
In my case, I used "?" for the class column in the supplied test set, since the class labels are not available.
Method 2.
Combine the training and test sets into one file. Then the exact same filters can be applied to both the training and test data, and afterwards you can separate the training and test sets again by applying an instance filter. Since I use "?" as the class label in the test set, it is not visible in the instance filter indices; hence, just select the indices that you can see in the attribute values to be removed when applying the instance filter, and you will be left with only the test data. Save it and load it as the supplied test set on the Classify page. This time it will work. I guess it is the class attribute that causes the "train and test set are not compatible" issue, as many classifiers require a nominal class attribute, whose value is converted to an index into the available values of the class attribute, according to
http://weka.wikispaces.com/Why+do+I+get+the+error+message+%27training+and+test+set+are+not+compatible%27%3F
See the following answer: your train.arff and test.arff should have the same header. According to your comparison they are similar, but not the same.
I just encountered the same problem and I found a bare-bones solution. The format of my files is .csv, and I simply open my files (for training and testing, respectively) and use the Save button on the Preprocess panel of Weka to save them in .arff format.
Then the problem is solved.
Look, there is a difference between similar and the same: your train.arff and test.arff should have the same header, and if not, you should copy the header of train.arff and paste it into your test.arff as the new header.
trainPath = ""
otherPadelPath = ""
testPath = ""
trainFile = open(trainPath,"r")
trainAttributes = trainFile.readlines()[0].split(",")
trainFile.close()
otherPadelFile = open(otherPadelPath,"r")
otherPadelLines = otherPadelFile.readlines()
otherPadelFile.close()
otherPadelColumns = []
testLines = []
for attribute in trainAttributes:
if attribute in otherPadelLines[0].split(","):
otherPadelColumns += [otherPadelLines[0].split(",").index(attribute)]
for line in otherPadelLines:
rearrangedLine = []
for inDex in otherPadelColumns:
rearrangedLine += [line.split(",")[inDex]]
testLines += [",".join(rearrangedLine)]
testFile = open(testPath,"w")
testFile.writelines(testLines)
testFile.close()
This script can rearrange your test dataset to contain the same order and number of attribute columns as your training set, provided that each attribute has the same type and title. Also (in keeping with the Weka default), the class attribute should be in the last column for both datasets.
