How to get the log-density from a PyMC3 model? - probability-density

I have drawn samples from a simple model using pymc3:
import pymc3 as pm
with pm.Model() as model:
    var_x = pm.Normal(name='var_x', mu=0, sd=1)
    trace = pm.sample(10)
print(trace['var_x'])
I wonder whether trace contains the values of the log-density of pm.Normal for each value in trace['var_x'], and if so, how to extract them.
If trace does not retain the log-density, is there any other way to get these values using pymc3?
Thanks

In your case you can recompute it by doing
[var_x.logp(i) for i in trace]
or, more generally,
[[free.logp(i) for i in trace] for free in model.free_RVs]
You may also want to check how similar expressions are used in PyMC3 to compute information criterion statistics.
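Putting this together, a minimal end-to-end sketch (reusing the model above; each element of the trace is a point, i.e. a dict of values for the free variables):
import pymc3 as pm
with pm.Model() as model:
    var_x = pm.Normal(name='var_x', mu=0, sd=1)
    trace = pm.sample(10)
# Recompute the log-density of var_x at every sampled point
log_densities = [var_x.logp(point) for point in trace]
# The same, generalized over all free random variables in the model
all_log_densities = [[free.logp(point) for point in trace] for free in model.free_RVs]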

Related

Error when trying to get tidy summaries with broomstick for multiple randomForest models

I'm grouping my data frame and fitting each group's data with a random forest model, and then using broomstick to get tidy outputs for each group's model. I'm running into trouble when I get to tidy and augment.
I can group the data and fit the models.
library(tidyverse)
library(broomstick)
library(randomForest)
data <- data.frame(y = rep(rep(c(1, 0), each = 100), 5),
                   group = rep(c("A", "B", "C", "D", "E"), each = 200),
                   x1 = rnorm(2000),
                   x2 = rnorm(2000),
                   x3 = rnorm(2000),
                   x4 = rnorm(2000),
                   x5 = rnorm(2000))
GroupModels <- data %>%
  nest(data = -group) %>%
  mutate(fit = map(data, ~ randomForest(y ~ ., ntree = 101, mtry = 2, data = .x, importance = TRUE)))
I then map glance to the fitted models and that works. I get mse and rsq for each group.
GroupModels %>%
  mutate(glanced = map(fit, glance)) %>%
  unnest(glanced) %>%
  select(-data, -fit) %>%
  as.data.frame()
If I map tidy to the fitted models I get an output and a deprecation warning, and I don't understand where tibble::as_tibble() should come into play.
GroupModels %>%
  mutate(tidied = map(fit, tidy)) %>%
  unnest(tidied) %>%
  select(-data, -fit) %>%
  as.data.frame()
1: Problem with `mutate()` column `tidied`.
ℹ `tidied = map(fit, tidy)`.
ℹ This function is deprecated as of broom 0.7.0 and will be removed from a future release. Please see tibble::as_tibble().
If I map augment to the models I get an error and I'm not sure what to do with that.
GroupModels %>%
  mutate(augmented = map(fit, augment)) %>%
  unnest(augmented) %>%
  select(-data, -fit) %>%
  as.data.frame()
Error: Problem with `mutate()` column `augmented`.
ℹ `augmented = map(fit, augment)`.
✖ argument must be coercible to non-negative integer
I can now get augment to work using map2; I didn't know about this, but it's handy when you need both the fit and the data for a function. I guess I'll worry about the deprecation warning when it happens.
GroupModels %>%
  mutate(augmented = map2(fit, data, augment)) %>%
  unnest(augmented) %>%
  select(-data, -fit) %>%
  as.data.frame()

Seaborn FacetGrid: dodge not implemented while mapping a stripplot

Using Seaborn, I'm trying to generate a factorplot with each subplot showing a stripplot. In the stripplot, I'd like to control a few aspects of the markers.
Here is the first method I tried:
import seaborn as sns
tips = sns.load_dataset("tips")
g = sns.FacetGrid(tips, col="time", hue="smoker")
g = g.map(sns.stripplot, 'day', "tip", edgecolor="black",
          linewidth=1, dodge=True, jitter=True, size=10)
This produced the following output, without dodge.
While most of the keywords were implemented, the hue wasn't dodged.
I was successful with another approach:
kws = dict(s=10, linewidth=1, edgecolor="black")
tips = sns.load_dataset("tips")
sns.factorplot(x='day', y='tip', hue='smoker', col='time', data=tips,
               kind='strip', jitter=True, dodge=True, **kws, legend=False)
This gives the correct output:
In this output, the hue is dodged.
My question is: why did g.map(sns.stripplot...) not dodge the hue?
The hue parameter needs to be passed to the sns.stripplot function via g.map, instead of being set as hue on the FacetGrid.
import seaborn as sns
tips = sns.load_dataset("tips")
g = sns.FacetGrid(tips, col="time")
g = g.map(sns.stripplot, 'day', "tip", "smoker", edgecolor="black",
          linewidth=1, dodge=True, jitter=True, size=10)
This is because map calls sns.stripplot individually for each value in the time column and, if hue is specified for the complete FacetGrid, for each hue value, such that dodge would lose its meaning on each individual call.
I can agree that this behaviour is not very intuitive unless you look at the source code of map itself.
Note that the above solution causes a Warning:
lib\site-packages\seaborn\categorical.py:1166: FutureWarning: elementwise comparison failed;
returning scalar instead, but in the future will perform elementwise comparison
hue_mask = self.plot_hues[i] == hue_level
I honestly don't know what this warning is telling us, but it does not seem to corrupt the solution for now.

Scikit-learn: How to extract features from the text?

Assume I have an array of Strings:
['Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD, 15"Retina, MacOS' ... 'another device description']
I'd like to extract from this description features like:
item=Laptop
brand=Apple
model=Macbook Air A1465
cpu=Core i7
...
Should I prepare the pre-defined known features first? Like
brands = ['apple', 'dell', 'hp', 'asus', 'acer', 'lenovo']
cpu = ['core i3', 'core i5', 'core i7', 'intel pdc', 'core m', 'intel pentium', 'intel core duo']
I am not sure that I need to use CountVectorizer or TfidfVectorizer here; it seems more appropriate to use DictVectorizer, but how can I build the dicts, extracting keys and values from the entire string?
Is this possible with scikit-learn's feature extraction? Or should I write my own .fit() and .transform() methods?
UPDATE:
@sergzach, please review whether I understood you right:
data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...']  # ... more descriptions
for d in data:
    for brand in brands:
        if brand in d:
            pass  # ok, brand is found
    for model in models:
        if model in d:
            pass  # ok, model is found
So, creating N loops, one per feature? This might work, but I'm not sure it is right and flexible.
Yes, something like the following. Excuse me, you may need to correct the code below.
import re
data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...']  # ... and other descriptions
features = {
    'brand': [r'apple', r'dell', r'hp', r'asus', r'acer', r'lenovo'],
    'cpu': [r'core\s+i3', r'core\s+i5', r'core\s+i7', r'intel\s+pdc', r'core\s+m', r'intel\s+pentium', r'intel\s+core\s+duo']
    # and other features
}
cat_data = []  # your categories, which you should convert into numbers
not_found_columns = []
for line in data:
    line_cats = {}
    for col, patterns in features.items():  # loop variable renamed so it does not shadow the `features` dict
        found = False
        for i, pattern in enumerate(patterns):
            if re.findall(pattern, line.lower(), flags=re.UNICODE) != []:
                line_cats[col] = i + 1  # found the numeric category for this column; e.g. for dell it's 2, for acer it's 5
                found = True
                break  # the current category is determined by the first occurrence
        # the loop ended without finding the feature: store the default "not present" value
        if not found:
            line_cats[col] = 0
            not_found_columns.append((col, line))
    cat_data.append(line_cats)
# now we have cat_data, where each column holds a categorical value (index + 1) if a feature was found, otherwise 0
Now you have the column names and lines (not_found_columns) for which no feature was found. Look through them; you probably forgot some features.
We can also write strings (instead of numbers) as the categories and then use DictVectorizer; as a result, the two approaches are equivalent.
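For illustration, a minimal sketch of that string-category variant (the dict values below are hypothetical):
from sklearn.feature_extraction import DictVectorizer
# Hypothetical per-line categories written as strings instead of numeric codes
cat_data = [
    {'brand': 'apple', 'cpu': 'core i7'},
    {'brand': 'dell', 'cpu': 'core i5'},
]
dv = DictVectorizer(sparse=False)
X = dv.fit_transform(cat_data)  # string values are one-hot encoded into columns such as 'brand=apple'
print(dv.get_feature_names())
print(X)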
Scikit Learn's vectorizers will convert an array of strings to an inverted index matrix (2d array, with a column for each found term/word). Each row (1st dimension) in the original array maps to a row in the output matrix. Each cell will hold a count or a weight, depending on which kind of vectorizer you use and its parameters.
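To make that concrete, a minimal sketch of what CountVectorizer produces on descriptions like yours (the second description below is hypothetical):
from sklearn.feature_extraction.text import CountVectorizer
descriptions = [
    'Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD, 15"Retina, MacOS',
    'Laptop Dell Latitude, Core i5, 4Gb, 128Gb SSD',  # hypothetical second device
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(descriptions)  # one row per input string, one column per discovered term
print(vectorizer.get_feature_names())       # the vocabulary found in the descriptions
print(X.toarray())                          # term counts per description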
I am not sure this is what you need, based on your code. Could you tell us where you intend to use these features? Do you intend to train a classifier? For what purpose?

Matlab, Econometrics toolbox - Simulate ARIMA with deterministic time-varying variance

DISCLAIMER: This question is only for those who have access to the econometrics toolbox in Matlab.
The Situation: I would like to use Matlab to simulate N observations from an ARIMA(p, d, q) model using the econometrics toolbox. What's the difficulty? I would like the innovations to be simulated with deterministic, time-varying variance.
Question 1) Can I do this using the built-in Matlab simulate function without altering it myself? As near as I can tell, this is not possible. From my reading of the docs, the innovations can either be specified to have a constant variance (i.e. the same variance for each innovation), or be specified to be stochastically time-varying (e.g. a GARCH model), but they cannot be deterministically time-varying, where I, the user, choose their values (except in the trivial constant case).
Question 2) If the answer to question 1 is "No", then does anyone see any reason why I can't edit the simulate function from the econometrics toolbox as follows:
a) Alter the preamble such that the function won't throw an error if the Variance field in the input model is set to a numeric vector instead of a numeric scalar.
b) Alter line 310 of simulate from:
E(:,(maxPQ + 1:end)) = Z * sqrt(variance);
to
E(:,(maxPQ + 1:end)) = (ones(NumPath, 1) * sqrt(variance)) .* Z;
where NumPath is the number of paths to be simulated, and it can be assumed that I've included an error trap to ensure that the (input) deterministic variance path stored in variance is of the right length (i.e. equal to the number of observations to be simulated per path).
Any help would be most appreciated. Apologies if the question seems basic; I just haven't ever edited one of Mathworks' own functions before and didn't want to do something foolish.
UPDATE (2012-10-18): I'm confident that the code edit I've suggested above is valid, and I'm mostly confident that it won't break anything else. However it turns out that implementing the solution is not trivial due to file permissions. I'm currently talking with Mathworks about the best way to achieve my goal. I'll post the results here once I have them.
It's been a week and a half with no answer, so I think I'm probably okay to post my own answer at this point.
In response to my question 1), no, I have not found any way to do this with the built-in Matlab functions.
In response to my question 2), yes, what I have posted will work. However, it was a little more involved than I imagined due to Matlab file permissions. Here is a step-by-step guide:
i) Somewhere in your Matlab path, create the directory #arima_Custom.
ii) In the command window, type edit arima. Copy the text of this file into a new m-file and save it in the directory #arima_Custom with the filename arima_Custom.m.
iii) Locate the econometrics toolbox on your machine. Once found, look for the directory #arima in the toolbox. This directory will probably be located (on a Linux machine) at something like $MATLAB_ROOT/toolbox/econ/econ/#arima (on my machine, $MATLAB_ROOT is /usr/local/Matlab/R2012b). Copy the contents of #arima to #arima_Custom, except do NOT copy the file arima.m.
iv) Open arima_Custom for editing, ie edit arima_Custom. In this file change line 1 from:
classdef (Sealed) arima < internal.econ.LagIndexableTimeSeries
to
classdef (Sealed) arima_Custom < internal.econ.LagIndexableTimeSeries
Next, change line 406 from:
function OBJ = arima(varargin)
to
function OBJ = arima_Custom(varargin)
Now, change line 993 from:
if isa(OBJ.Variance, 'double') && (OBJ.Variance <= 0)
to
if isa(OBJ.Variance, 'double') && (sum(OBJ.Variance <= 0) > 0)
v) Open the simulate.m located in #arima_Custom for editing (we copied it there in step iii). It is probably best to open this file by navigating to it manually in the Current Folder window, to ensure the correct simulate.m is opened. In this file, alter line 310 from:
E(:,(maxPQ + 1:end)) = Z * sqrt(variance);
to
%Check that the input variance is of the right length (if it isn't scalar)
if isscalar(variance) == 0
    if size(variance, 2) ~= 1
        error('Deterministic variance must be a column vector');
    end
    if size(variance, 1) ~= numObs
        error('Deterministic variance vector is incorrect length relative to number of observations');
    end
else
    variance = variance(ones(numObs, 1));
end
%Scale innovations using deterministic variance
E(:,(maxPQ + 1:end)) = sqrt(ones(numPaths, 1) * variance') .* Z;
And we're done!
You should now be able to simulate with deterministically time-varying variance using the arima_Custom class, for example (for an ARIMA(0,1,0)):
ARIMAModel = arima_Custom('D', 1, 'Variance', ScalarVariance, 'Constant', 0);
ARIMAModel.Variance = TimeVaryingVarianceVector;
[X, e, VarianceVector] = simulate(ARIMAModel, NumObs, 'numPaths', NumPaths);
Further, you should also still be able to use matlab's original arima class, since we didn't alter it.

IllegalArgumentException when using weka.clusterers.HierarchicalClusterer

I searched a lot, but I was not able to find any example code describing how to use the WEKA HierarchicalClusterer. Using the following C# code gives me an IllegalArgumentException at "agg.buildClusterer(insts);".
weka.clusterers.HierarchicalClusterer agg = new weka.clusterers.HierarchicalClusterer();
agg.setNumClusters(NumCluster);
/*
Tag[] TAGS_LINK_TYPE = agg.getLinkType().getTags();
agg.setLinkType(new SelectedTag(1, TAGS_LINK_TYPE));
*/
agg.buildClusterer(insts);
for (int i = 0; i < insts.numInstances(); i++)
{
    int clusterNumber = agg.clusterInstance(insts.instance(i));
}
The StackTrace says:
at java.util.PriorityQueue..ctor(Int32 initialCapacity, Comparator comparator)
at weka.clusterers.HierarchicalClusterer.doLinkClustering(Int32 , Vector[] , Node[] )
at weka.clusterers.HierarchicalClusterer.buildClusterer(Instances data)
but no Message or InnerException is specified.
The variable "insts" is an Instances object, which only holds instances with an equal number of numerical attributes.
Is anyone able to quickly spot my error, or could you post/link some example code?
Further, is the setting of the LinkType (commented code) correct?
Thanks,
Björn
The HierarchicalClusterer class has a TAGS_LINK_TYPE attribute, so something like
agg.setLinkType(new SelectedTag(1, HierarchicalClusterer.TAGS_LINK_TYPE));
will achieve what you are after for setting the link type. Now, what on earth does that 1 mean? From the javadocs we can see what TAGS_LINK_TYPE contains:
-L Link type (Single, Complete, Average, Mean, Centroid, Ward, Adjusted complete, Neighbor Joining)
[SINGLE|COMPLETE|AVERAGE|MEAN|CENTROID|WARD|ADJCOMLPETE|NEIGHBOR_JOINING]
In general, your code looks OK for the C# case. I see you don't set the distance metric in your example above; maybe you would want to do this? I too use Weka as best I can with C# via IKVM. I have found that the dataset allowed for hierarchical clustering cannot be too large. Maybe your dataset exceeds what WEKA can handle, and you would avoid your error if you reduced its size?
