I want to print the results of spatial regression models built with the spdep package into a nice table, but apparently conventional packages (e.g. stargazer, gtsummary, sjPlot, vtable) don't work well or don't function at all with these types of models. Is there any way to do this properly?
I'm developing an application that calculates the similarity between a query and a list of products, using not the full sentences but only the features (named entities) extracted from the query and the products. I already have an NER model fine-tuned on DistilBERT using the spaCy transformers library, so I have access to word embeddings for my sentences and for the extracted named entities.
So I want to calculate the similarity between the query and the products in my database using the word embeddings of only their extracted entities. This way I can focus on just the features the user is looking for rather than the whole query. This is just a theory I'm testing and I would like to see the results. My problem now is: how do I do that?
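One simple way to test the idea, assuming you can pull a vector for each extracted entity out of the spaCy `Doc` (e.g. `ent.vector` on each entity span), is to mean-pool the entity embeddings on each side and compare the pooled vectors with cosine similarity. A minimal numpy sketch, with toy 2-d vectors standing in for the real embeddings:

```python
import numpy as np

def entity_vector(entity_vectors):
    """Mean-pool the embeddings of the extracted entities into one vector."""
    return np.mean(entity_vectors, axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy vectors standing in for the embeddings of extracted entities
query_entities = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
product_entities = [np.array([1.0, 1.0])]

score = cosine_similarity(entity_vector(query_entities),
                          entity_vector(product_entities))
```

You can then rank products by `score`. Mean-pooling is just one choice; max-pooling per dimension, or averaging the pairwise entity-to-entity similarities, are equally easy to try under the same setup.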
I have run a multivariable regression analysis using the glm.mids() function after multiple imputation with mice on a large dataset.
I now need to calculate the interaction between two of the covariates included in the model. What I need are the results that the allEffects() function would provide, but it does not work after multiple imputation.
Is there any other package providing odds ratios for the interaction model obtained using glm.mids()?
I have recently been looking into different filter feature selection approaches and have noted that some are better suited for numerical data (Pearson) and some are better suited for categorical data (Chi-Square).
I am working with a dataset with a mixture of both data types and am unsure about what the best practice is in terms of applying the filter methods.
Is it best to split the dataset into categorical and numerical features, apply a different filter method to each subset, and then join the results?
Or should only one filter method be applied to the whole dataset?
You can have a look at Permutation Importance. The idea is to randomly shuffle the values of a feature and observe the change in error. If the feature is important, ideally the error should increase. It does not depend on the data type of the feature, unlike some statistical tests. Also it is very straightforward to implement and analyze. link1, link2
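To make the shuffle-and-measure idea concrete, here is a minimal permutation-importance sketch in pure numpy; a simple least-squares fit stands in for whatever model you actually use (with scikit-learn you could call `sklearn.inspection.permutation_importance` instead):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: y depends on feature 0 only, feature 1 is pure noise
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# fit a least-squares linear model as the stand-in predictor
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

baseline = mse(y, X @ coef)

importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # shuffle one column, breaking its link to y
    importances.append(mse(y, Xp @ coef) - baseline)  # error increase = importance
```

Here `importances[0]` comes out far larger than `importances[1]`, matching the point above: the procedure only looks at how predictions degrade, so it works the same whether the shuffled column is numerical or categorical.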
I am currently learning to do stacking in a machine learning problem. I am going to get the outputs of the first model and use these outputs as features for the second model.
My question is: does the order matter? I am using a lasso regression model and a boosted tree. In my problem the regression model outperforms the boosted tree, so I am thinking that I should use the regression model second and the boosted tree first.
What are the factors I need to think about when making this decision?
Why don't you try feature engineering to create more features?
Don't try to use predictions from one model as features for another model.
You can try using K-means to cluster similar training samples.
For stacking, just use different models and then average the results (assuming that you have a continuous y variable).
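For the averaging flavour of ensembling suggested above, here is a minimal numpy sketch; two least-squares fits on different feature subsets stand in for the lasso and the boosted tree. A useful property of averaging: by convexity of squared error, the averaged prediction's MSE is never worse than the average of the two individual MSEs.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=300)

# two "models": linear fits on different feature subsets, standing in
# for two different learners (e.g. a lasso and a boosted tree)
coef_a, *_ = np.linalg.lstsq(X[:, :2], y, rcond=None)
coef_b, *_ = np.linalg.lstsq(X[:, 1:], y, rcond=None)

pred_a = X[:, :2] @ coef_a
pred_b = X[:, 1:] @ coef_b
pred_avg = (pred_a + pred_b) / 2.0  # simple unweighted average ensemble

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))
```

In practice you would compute the base-model predictions out-of-fold on held-out data before averaging, otherwise the ensemble inherits the base models' overfitting.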
I made a classifier to classify search queries into one of the following classes: {Artist, Actor, Politician, Athlete, Facility, Geo, Definition, QA}. I have two CSV files: one for training the classifier (containing 300 queries) and one for testing it (currently containing about 200 queries). When I use the training set and test set for training/evaluating the classifier with the Weka knowledge flow, most classes reach pretty good accuracy. Setup of the Weka knowledge flow training/testing situation:
After training I saved the MultiLayer Perceptron classifier from the knowledge flow into classifier.model, which I used in Java code to classify queries.
When I deserialize this model in Java code and use it to classify all the queries of the testing-set CSV file (using the distributionForInstance() method on the deserialized classifier), it classifies all 'Geo' queries as 'Facility' queries and all 'QA' queries as 'Definition' queries. This surprised me, as the ClassifierPerformanceEvaluator in the knowledge flow showed a confusion matrix in which 'Geo' and 'QA' queries scored really well, and the testing queries are the same (the same CSV file was used). All other query classifications made with the distributionForInstance() method work normally and show the behavior that could be expected from the confusion matrix in the knowledge flow. Does anyone know the possible causes for the difference in classification between the distributionForInstance() method in the Java code and the knowledge flow evaluation results?
One thing that I can think of is the following:
The testing CSV file contains, among other attributes, many nominal attributes whose values are in all-capital casing. When I print the attribute values of the instances before classification in the Java code, these values appear to have been converted to lower case (the DataSource.getDataSet() method seems to behave this way). Could the casing of these attributes be the reason that some instances of my testing CSV file get classified differently? I read in the Weka specification that nominal attribute values are case sensitive. I do convert these values back to uppercase in the Java code, though, because otherwise Weka throws an exception that the values are not predefined for the nominal attribute.
Weka is likely using the same class in the knowledge flow as in your Java code to interpret the CSV. This is why it works (produces data sets -- Instances objects -- that match) without tweaking and fails when you change things: the items no longer match. In other words, Weka is handling the case of the input strings consistently, and does not require you to change it.
Check that you are looking at the Error on Test Data value and not the Error on Training Data value in the knowledge flow output, because the latter will be artificially low (optimistic) given that you built the model using those exact examples. It is possible that your classifier is performing the same in both places, but you are looking at different statistics.