Error when trying to get tidy summaries with broomstick for multiple randomForest models - random-forest

I'm grouping my data frame and fitting each group's data with a random forest model, and then using broomstick to get tidy outputs for each group's model. I'm running into trouble when I get to tidy and augment.
I can group the data and fit the models.
library(tidyverse)
library(broomstick)
library(randomForest)
data<-data.frame(y=rep(rep(c(1,0),each=100),5),
group=rep(c("A","B","C","D","E"), each=200),
x1=rnorm(2000),
x2=rnorm(2000),
x3=rnorm(2000),
x4=rnorm(2000),
x5=rnorm(2000))
GroupModels<-data%>%
nest(data= -group)%>%
mutate(fit = map(data, ~ randomForest(y ~ ., ntree=101, mtry=2, data = .x, importance=TRUE)))
I then map glance to the fitted models and that works. I get mse and rsq for each group.
GroupModels%>%
mutate(glanced = map(fit, glance))%>%
unnest(glanced)%>%
select(-data, -fit)%>%
as.data.frame()
If I map tidy to the fitted models I get an output and a deprecation warning and I don't understand where tibble::as_tibble() should come into play.
GroupModels%>%
mutate(tidied = map(fit, tidy))%>%
unnest(tidied)%>%
select(-data, -fit)%>%
as.data.frame()
1: Problem with mutate() column tidied. i tidied = map(fit, tidy). i This function is deprecated as of broom 0.7.0 and will be
removed from a future release. Please see tibble::as_tibble().
If I map augment to the models I get an error and I'm not sure what to do with that.
GroupModels%>%
mutate(augmented = map(fit, augment))%>%
unnest(augmented)%>%
select(-data, -fit)%>%
as.data.frame()
Error: Problem with mutate() column augmented. i augmented = map(fit, augment). x argument must be coercible to non-negative
integer

I can now get augment to work using map2, which I hadn't known about; it's handy when a function needs both the fit and the data. I'll deal with the deprecation warning when the function is actually removed.
GroupModels%>%
mutate(augmented = map2(fit, data, augment))%>%
unnest(augmented)%>%
select(-data, -fit)%>%
as.data.frame()

Related

Error in predict.NaiveBayes : "Not all variable names used in object found in newdata"-- (Although no variables are missing)

I'm still learning to use the caret package through The caret Package by Max Kuhn, and got stuck in section 16.2, Partial Least Squares Discriminant Analysis, while trying to predict with the plsBayesFit model through predict(plsBayesFit, head(testing), type = "prob"), as shown in the book.
The data used is data(Sonar) from mlbench package, with the data being split as:
inTrain <- createDataPartition(Sonar$Class, p = 2/3, list = FALSE)
sonarTrain <- Sonar[ inTrain, -ncol(Sonar)]
sonarTest <- Sonar[-inTrain, -ncol(Sonar)]
trainClass <- Sonar[ inTrain, "Class"]
testClass <- Sonar[-inTrain, "Class"]
and then preprocessed as follows:
centerScale <- preProcess(sonarTrain)
centerScale
training <- predict(centerScale, sonarTrain)
testing <- predict(centerScale, sonarTest)
After this, the model is trained using plsBayesFit <- plsda(training, trainClass, ncomp = 20, probMethod = "Bayes"), followed by prediction using predict(plsBayesFit, head(testing), type = "prob").
When I'm trying to do this I get the following error:
Error in predict.NaiveBayes(object$probModel[[ncomp[i]]], as.data.frame(tmpPred[, : Not all variable names used in object found in newdata
I've checked both the training and testing sets for any missing variable, but there isn't one. I've also tried to predict using version 2.7.1 of the pls package, which was used to render the book at that time, but that gives me the same error. What's happening?
I've tried to replicate your problem using different models, as I have encountered this error as well, but I failed; and caret seems to behave differently now from when I used it.
In any case, I stumbled upon this GitHub issue, and it seems that there is a specific problem with the klaR package. So my guess is that this is simply a bug, and not something that can be readily fixed here!

Different access methods to Pyro Paramstore give different results

I am following the Pyro introductory tutorial in forecasting, and trying to access the learned parameters after training the model, I get different results using different access methods for some of them (while getting identical results for others).
Here is the stripped-down reproducible code from the tutorial:
import torch
import pyro
import pyro.distributions as dist
from pyro.contrib.examples.bart import load_bart_od
from pyro.contrib.forecast import ForecastingModel, Forecaster
pyro.enable_validation(True)
pyro.clear_param_store()
pyro.__version__
# '1.3.1'
torch.__version__
# '1.5.0+cu101'
# import & prepare the data
dataset = load_bart_od()
T, O, D = dataset["counts"].shape
data = dataset["counts"][:T // (24 * 7) * 24 * 7].reshape(T // (24 * 7), -1).sum(-1).log()
data = data.unsqueeze(-1)
T0 = 0 # beginning
T2 = data.size(-2) # end
T1 = T2 - 52 # train/test split
# define the model class
class Model1(ForecastingModel):
    def model(self, zero_data, covariates):
        data_dim = zero_data.size(-1)
        feature_dim = covariates.size(-1)
        bias = pyro.sample("bias", dist.Normal(0, 10).expand([data_dim]).to_event(1))
        weight = pyro.sample("weight", dist.Normal(0, 0.1).expand([feature_dim]).to_event(1))
        prediction = bias + (weight * covariates).sum(-1, keepdim=True)
        assert prediction.shape[-2:] == zero_data.shape
        noise_scale = pyro.sample("noise_scale", dist.LogNormal(-5, 5).expand([1]).to_event(1))
        noise_dist = dist.Normal(0, noise_scale)
        self.predict(noise_dist, prediction)
# fit the model
pyro.set_rng_seed(1)
pyro.clear_param_store()
time = torch.arange(float(T2)) / 365
covariates = torch.stack([time], dim=-1)
forecaster = Forecaster(Model1(), data[:T1], covariates[:T1], learning_rate=0.1)
So far so good; now I want to inspect the learned latent parameters stored in the Paramstore. There seems to be more than one way to do this. Using the get_all_param_names() method:
for name in pyro.get_param_store().get_all_param_names():
    print(name, pyro.param(name).data.numpy())
I get
AutoNormal.locs.bias [14.585433]
AutoNormal.scales.bias [0.00631594]
AutoNormal.locs.weight [0.11947815]
AutoNormal.scales.weight [0.00922901]
AutoNormal.locs.noise_scale [-2.0719821]
AutoNormal.scales.noise_scale [0.03469057]
But using the named_parameters() method:
pyro.get_param_store().named_parameters()
gives the same values for the location (locs) parameters, but different values for all scales ones:
dict_items([
('AutoNormal.locs.bias', Parameter containing: tensor([14.5854], requires_grad=True)),
('AutoNormal.scales.bias', Parameter containing: tensor([-5.0647], requires_grad=True)),
('AutoNormal.locs.weight', Parameter containing: tensor([0.1195], requires_grad=True)),
('AutoNormal.scales.weight', Parameter containing: tensor([-4.6854], requires_grad=True)),
('AutoNormal.locs.noise_scale', Parameter containing: tensor([-2.0720], requires_grad=True)),
('AutoNormal.scales.noise_scale', Parameter containing: tensor([-3.3613], requires_grad=True))
])
How is this possible? According to the documentation, Paramstore is a simple key-value store; and there are only these six keys in it:
pyro.get_param_store().get_all_param_names() # .keys() method gives identical result
# result
dict_keys([
'AutoNormal.locs.bias',
'AutoNormal.scales.bias',
'AutoNormal.locs.weight',
'AutoNormal.scales.weight',
'AutoNormal.locs.noise_scale',
'AutoNormal.scales.noise_scale'])
so there is no way that one method accesses one set of items and the other a different one.
Am I missing something here?
In this case pyro.param() returns the transformed (constrained) parameters, i.e. the scales mapped to the positive reals.
Here is the situation, as revealed in the Github thread I opened in parallel with this question...
Paramstore is no longer just a simple key-value store; it also performs constraint transformations. Quoting a Pyro developer from the above link:
here's some historical background. The ParamStore was originally just a key-value store. Then we added support for constrained parameters; this introduced a new layer of separation between user-facing constrained values and internal unconstrained values. We created a new dict-like user-facing interface that exposed only constrained values, but to keep backwards compatibility with old code we kept the old interface around. The two interfaces are distinguished in the source files [...] but as you observe it looks like we forgot to mark the old interface as DEPRECATED.
I guess in clarifying docs we should:
clarify that the ParamStore is no longer a simple key-value store
but also performs constraint transforms;
mark all "old" style interface methods as DEPRECATED;
remove "old" style interface usage from examples and tutorials.
As a consequence, it turns out that, while pyro.param() returns the results in the constrained (user-facing) space, the older method named_parameters() returns the unconstrained (i.e. for internal use only) values, hence the apparent discrepancy.
It is indeed not difficult to verify that the scales values returned by the two methods above are related by a logarithmic transformation:
import numpy as np
items = list(pyro.get_param_store().named_parameters()) # unconstrained space
i = 0
for name in pyro.get_param_store().keys():
    if 'scales' in name:
        temp = np.log(
            pyro.param(name).item() # constrained space
        )
        print(temp, items[i][1][0].item() , np.allclose(temp, items[i][1][0].item()))
    i+=1
# result:
-5.027793402915326 -5.0277934074401855 True
-4.600319371162187 -4.6003193855285645 True
-3.3920585732532835 -3.3920586109161377 True
Why does this discrepancy affect only the scales parameters? Because scales (i.e. standard deviations) are by definition constrained to be positive; locs (i.e. means) are unconstrained, so the two representations coincide for them.
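As a quick sanity check that needs neither Pyro nor PyTorch, the numbers printed in the question can be compared directly. This is only a sketch: it assumes the positive constraint acts as an exponential map (so the unconstrained value is the log of the constrained one), as the developer quote and the check above suggest, and it hard-codes the rounded values copied from the two printouts. The helper name matches_log is made up for illustration.

```python
import math

# Constrained values, as printed via pyro.param(name) in the question.
constrained = {
    "AutoNormal.scales.bias": 0.00631594,
    "AutoNormal.scales.weight": 0.00922901,
    "AutoNormal.scales.noise_scale": 0.03469057,
}

# Unconstrained values, as printed via named_parameters() in the question.
unconstrained = {
    "AutoNormal.scales.bias": -5.0647,
    "AutoNormal.scales.weight": -4.6854,
    "AutoNormal.scales.noise_scale": -3.3613,
}


def matches_log(name, tol=1e-3):
    """Return True if log(constrained) agrees with the unconstrained value."""
    return math.isclose(math.log(constrained[name]), unconstrained[name], abs_tol=tol)


for name in constrained:
    print(name, round(math.log(constrained[name]), 4), unconstrained[name], matches_log(name))
```

Each line should report True, which is consistent with the two access methods exposing the same parameter in two different spaces.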
As a result of the question above, a new bullet has now been added in the Paramstore documentation, giving a relevant hint:
in general parameters are associated with both constrained and unconstrained values. for example, under the hood a parameter that is constrained to be positive is represented as an unconstrained tensor in log space.
as well as in the documentation of the named_parameters() method of the old interface:
Note that, in the event the parameter is constrained, unconstrained_value is in the unconstrained space implicitly used by the constraint.

reduce_max function in tensorflow

Screenshot
>>> boxes = tf.random_normal([ 5])
>>> with s.as_default():
... s.run(boxes)
... s.run(keras.backend.argmax(boxes,axis=0))
... s.run(tf.reduce_max(boxes,axis=0))
...
array([ 0.37312034, -0.97431135, 0.44504794, 0.35789603, 1.2461706 ],
dtype=float32)
3
0.856236
Why am I getting 0.8564? I expected the value to be 1.2461, since that is the largest element.
I get the correct answer if I use tf.constant, but not when I use random_normal.
Each time you call s.run(), random_normal generates a new boxes tensor, so your three results come from three different draws. If you want consistent results, you should call s.run() only once.
result = s.run([boxes,keras.backend.argmax(boxes,axis=0),tf.reduce_sum(boxes,axis=0)])
print(result[0])
print(result[1])
print(result[2])
#print
[ 0.69957364 1.3192859 -0.6662426 -0.5895929 0.22300807]
1
0.9860319
In addition, the code should be given in text format rather than picture format.
TensorFlow (in graph mode, as used here) is different from NumPy because its operations are symbolic. That means when you instantiate random_normal, you don't get numeric values but a symbolic normal distribution, so each time you evaluate it, you get different numbers.
Every separate evaluation of any operation that depends on this distribution draws new numbers, and that explains the results you see.
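The effect can be imitated without TensorFlow. The following is only an analogy in plain Python, not TensorFlow code: sample_boxes is a made-up stand-in for evaluating the random tensor, and each call to it plays the role of one separate s.run(...).

```python
import random


def sample_boxes(n=5):
    """Stand-in for evaluating a random tensor: every call draws fresh values."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


# Three separate evaluations (like three s.run(...) calls): three unrelated draws,
# so the printed list, the argmax and the max need not agree with each other.
first_values = sample_boxes()
second_argmax = argmax(sample_boxes())
third_max = max(sample_boxes())

# One evaluation reused for everything (like a single s.run([...]) call).
boxes = sample_boxes()
idx = argmax(boxes)
largest = max(boxes)
print(boxes, idx, largest)
```

Only in the second variant is largest guaranteed to be the element of boxes at position idx.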

Seaborn FacetGrid: while mapping a stripplot dodge not implemented

Using Seaborn, I'm trying to generate a factorplot with each subplot showing a stripplot. In the stripplot, I'd like to control a few aspects of the markers.
Here is the first method I tried:
import seaborn as sns
tips = sns.load_dataset("tips")
g = sns.FacetGrid(tips, col="time", hue="smoker")
g = g.map(sns.stripplot, 'day', "tip", edgecolor="black",
          linewidth=1, dodge=True, jitter=True, size=10)
And produced the following output without dodge
While most of the keywords were implemented, the hue wasn't dodged.
I was successful with another approach:
kws = dict(s=10, linewidth=1, edgecolor="black")
tips = sns.load_dataset("tips")
sns.factorplot(x='day', y='tip', hue='smoker', col='time', data=tips,
kind='strip',jitter=True, dodge=True, **kws, legend=False)
This gives the correct output:
In this output, the hue is dodged.
My question is: why did g.map(sns.stripplot...) not dodge the hue?
The hue variable needs to be passed to sns.stripplot through g.map, instead of being set as hue on the FacetGrid.
import seaborn as sns
tips = sns.load_dataset("tips")
g = sns.FacetGrid(tips, col="time")
g = g.map(sns.stripplot, 'day', "tip", "smoker", edgecolor="black",
linewidth=1, dodge=True, jitter=True, size=10)
This is because map calls sns.stripplot separately for each value in the time column and, if hue is specified for the whole FacetGrid, for each hue value as well, so that dodge loses its meaning within each individual call.
I agree that this behaviour is not very intuitive unless you look at the source code of map itself.
Note that the above solution causes a Warning:
lib\site-packages\seaborn\categorical.py:1166: FutureWarning:elementwise comparison failed;
returning scalar instead, but in the future will perform elementwise comparison
hue_mask = self.plot_hues[i] == hue_level
I honestly don't know what this is telling us; but it seems not to corrupt the solution for now.
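The calling pattern described above (one call per facet and, when hue is set on the grid, one call per hue level) can be imitated in plain Python. This is a toy sketch under that assumption, not seaborn's actual implementation; toy_map and the four data rows are invented for illustration.

```python
from itertools import product

# A few rows in the spirit of the tips dataset (values are made up).
rows = [
    {"time": "Lunch", "smoker": "Yes", "day": "Thur", "tip": 2.0},
    {"time": "Lunch", "smoker": "No", "day": "Thur", "tip": 3.0},
    {"time": "Dinner", "smoker": "Yes", "day": "Sat", "tip": 4.0},
    {"time": "Dinner", "smoker": "No", "day": "Sat", "tip": 1.5},
]


def toy_map(rows, col, hue=None):
    """Return, for each simulated plotting call, the smoker levels it receives."""
    col_levels = sorted({r[col] for r in rows})
    hue_levels = sorted({r[hue] for r in rows}) if hue else [None]
    calls = []
    for c, h in product(col_levels, hue_levels):
        subset = [r for r in rows if r[col] == c and (h is None or r[hue] == h)]
        if subset:
            # The plotting function would be called here with only `subset`.
            calls.append((c, sorted({r["smoker"] for r in subset})))
    return calls


grid_hue = toy_map(rows, col="time", hue="smoker")  # hue set on the grid
func_hue = toy_map(rows, col="time")                # hue left to the plotting function
print(grid_hue)
print(func_hue)
```

With hue on the grid, every simulated call sees a single smoker level, so there is nothing to dodge against; without it, each per-facet call sees both levels.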

Scikit-learn: How to extract features from the text?

Assume I have an array of Strings:
['Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD, 15"Retina, MacOS' ... 'another device description']
I'd like to extract from this description features like:
item=Laptop
brand=Apple
model=Macbook Air A1465
cpu=Core i7
...
Should I prepare the pre-defined known features first? Like
brands = ['apple', 'dell', 'hp', 'asus', 'acer', 'lenovo']
cpu = ['core i3', 'core i5', 'core i7', 'intel pdc', 'core m', 'intel pentium', 'intel core duo']
I am not sure that I need CountVectorizer or TfidfVectorizer here; DictVectorizer seems more appropriate, but how can I build dicts whose keys are features and whose values are extracted from the entire string?
Is this possible with scikit-learn's Feature Extraction? Or should I write my own .fit() and .transform() methods?
UPDATE:
@sergzach, please check whether I understood you correctly:
data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...']  # ... and so on
for d in data:
    for brand in brands:
        if brand in d:
            # ok brand is found
            pass
    for model in models:
        if model in d:
            # ok model is found
            pass
So creating N-loops per each feature? This might be working, but not sure if it is right and flexible.
Yes, something like the following.
Apologies in advance; you will probably need to correct the code below.
import re
data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...']  # ... and so on
features = {
    'brand': [r'apple', r'dell', r'hp', r'asus', r'acer', r'lenovo'],
    'cpu': [r'core\s+i3', r'core\s+i5', r'core\s+i7', r'intel\s+pdc', r'core\s+m', r'intel\s+pentium', r'intel\s+core\s+duo']
    # and other features
}
cat_data = [] # your categories which you should convert into numbers
not_found_columns = []
for line in data:
    line_cats = {}
    for col, patterns in features.items():
        found = False
        for i, pattern in enumerate(patterns):
            if re.search(pattern, line.lower(), flags=re.UNICODE):
                line_cats[col] = i + 1 # found numeric category in column. For ex., for dell it's 2, for acer it's 5.
                found = True
                break # the category is determined by the first match
        # the loop has ended without a match: use 0 as the default "feature not found" value
        if not found:
            line_cats[col] = 0
            not_found_columns.append((col, line))
    cat_data.append(line_cats)
# now cat_data holds, for each line, a dict mapping each column to its category (index + 1) if a feature was matched, otherwise 0.
not_found_columns now holds the (column, line) pairs for which no feature was found. Review them; you probably forgot some features.
We could also store strings (instead of numbers) as categories and then use DictVectorizer. The two approaches are equivalent in their result.
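The string-category variant mentioned above might look like the following sketch. It assumes shortened pattern lists and an "unknown" placeholder for missing features; FEATURES, extract and the two sample descriptions are illustrative names and values, not part of the original answer.

```python
import re

# Hypothetical, shortened pattern lists in the spirit of the answer above.
FEATURES = {
    "brand": [r"apple", r"dell", r"hp"],
    "cpu": [r"core\s+i3", r"core\s+i5", r"core\s+i7"],
}


def extract(line):
    """Return {column: first matching text}, with 'unknown' when nothing matches."""
    row = {}
    for col, patterns in FEATURES.items():
        row[col] = "unknown"
        for pattern in patterns:
            match = re.search(pattern, line.lower())
            if match:
                row[col] = match.group(0)
                break  # the first matching pattern wins
    return row


rows = [extract(d) for d in [
    "Laptop Apple Macbook Air A1465, Core i7, 8Gb",
    "Laptop Dell Latitude",
]]
print(rows)
# These dicts could then be passed to
# sklearn.feature_extraction.DictVectorizer().fit_transform(rows) (not run here).
```

DictVectorizer one-hot encodes string values, which avoids implying an order between brands the way the numeric codes 1, 2, 3 might.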
Scikit-learn's vectorizers convert an array of strings into a document-term matrix (a 2D array with one column for each term/word found). Each row (1st dimension) in the original array maps to a row in the output matrix. Each cell holds a count or a weight, depending on which kind of vectorizer you use and its parameters.
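To make the shape of that output concrete, here is a minimal plain-Python sketch of a count matrix. It only illustrates the idea and is not scikit-learn's implementation; count_matrix, its simple \w+ tokenization and the two sample strings are assumptions made for the example.

```python
import re


def count_matrix(docs):
    """Build a sorted vocabulary and a row-per-document matrix of token counts."""
    tokenized = [re.findall(r"\w+", d.lower()) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, toks in enumerate(tokenized):
        for tok in toks:
            matrix[i][index[tok]] += 1
    return vocab, matrix


vocab, matrix = count_matrix(["Laptop Apple Macbook", "Laptop Dell Latitude Laptop"])
print(vocab)
print(matrix)
```

Each column corresponds to one term of the vocabulary and each row to one input string, so the structure of a description (which word is the brand, which is the CPU) is not captured.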
I am not sure this is what you need, based on your code. Could you tell us where you intend to use the features you are looking for? Do you intend to train a classifier? For what purpose?

Resources