Compare results from Julia MLJ models

I'd like to train 3 models in MLJ.jl: ARDRegressor, AdaBoostRegressor, BaggingRegressor.
Currently, I train them one at a time, for example:
using Pkg; Pkg.activate("."); Pkg.instantiate()
using RDatasets, MLJ, Statistics, PrettyPrinting, GLM
X, y = @load_boston; train, test = 1:406, 407:506
@load ARDRegressor
reg = ARDRegressor
m = machine(reg(), X, y)
fit!(m, rows=train)
ŷ = predict(m, rows=test)
os_ARDRegressor = rms(ŷ, y[test])
I'd like to train them with a loop such as:
modlist = [ARDRegressor; AdaBoostRegressor; BaggingRegressor]
score = []
for (i, mod) in enumerate(modlist)
    @load mod
    reg = mod
    m = machine(reg(), X, y)
    fit!(m, rows=train)
    ŷ = predict(m, rows=test)
    push!(score, (i, mod, rms(ŷ, y[test])))
end

There are a few issues in running your last code block:
1. The loop for jj in eachindex(Models) iterates over the indices of the Models array, so jj takes the values 1, 2, 3. Loop over the Models array directly instead.
2. @load ARDRegressor is a macro invocation; this means @load jj will translate to @load("jj"), so it does not use jj as the variable you intended.
3. The value of os_jj will be overwritten on every iteration of the loop. You rather want to keep the score in an array at that index: os[jj] = ...
4. MLJ requires you to import the packages that contain the models before loading them. Note that using ScikitLearn requires the sklearn Python package to be installed in your current environment.
Consider the following working code example:
using MLJ, RDatasets
X, y = @load_boston; train, test = 1:406, 407:506
models = [@load ARDRegressor; @load AdaBoostRegressor; @load BaggingRegressor]
score = Array{Float64}(undef, 3)
for (i, model) in enumerate(models)
    m = machine(model, X, y)
    fit!(m, rows=train)
    ŷ = predict(m, rows=test)
    score[i] = rms(ŷ, y[test])
end
@show score
Side note: using Pkg; Pkg.activate(".") is unnecessary when you start Julia with julia --project, but this comes down to personal preference.

glmnet: Fit a GLMM with lasso or ridge and add binomial cloglog link

How can one specify link functions in glmnet for lasso / ridge / elastic net regression?
I have found the following post, but I am not sure it helps when I need to specify a cloglog link:
How to specify log link in glmnet?
I have a survey data set with binary response 0/1 (disease no/yes) and several predictor variables, which are mostly binary categorical (yes/no, male/female), some are counts (herd size), and a few are categorical with several levels.
I previously ran a generalized linear mixed model using the glmer() function with binomial family and link = cloglog, as doing so gives the exact interpretation of the resulting intercept that I wanted: in a disease study, the intercept from this setup is equivalent to the mean 'force of infection' (the rate at which susceptibles become infected) across the variation specified in the random effect (in my case the geographic unit: village, subvillage, or household).
As there are several survey variables now available to me, I wanted to try a lasso and a ridge regression using glmnet. It is my understanding that I should best do this by putting the GLMM formula into glmnet. However, I cannot find any documentation about how to add a link. I used the syntax I thought would work, and it did run. But it also ran with nonsense entered in the link argument.
Here is a reproducible example:
library(msm)
library(glmnet)
set.seed(1)
N = 1000
X = cbind(rbinom(n = N, size = 1, prob = 0.5), rnorm(n = N))
beta = c(-0.1, 0.1)
phi.true = exp(X %*% beta)
p = 1 - exp(-phi.true)
y = rbinom(n = N, size = 1, prob = p)
dat <- data.frame(x = X, y = y)
x <- model.matrix(y ~ ., dat)
glmnet(x, y, family = "binomial", link = "logit", alpha = 1, lambda = 2)
I get the same output whether I put in 'logit', 'cloglog', or even a nonsense name such as 'adam'. And I cannot use the same syntax as in the GLMM, since family in glmnet must be a character vector.
OUTPUT:
> glmnet(x, y, family="binomial"(link="logit"), alpha = 1, lambda = 2)
Error in match.arg(family) : 'arg' must be NULL or a character vector

> glmnet(x, y, family="binomial", link="logit", alpha = 1, lambda = 2)
Call: glmnet(x = x, y = y, family = "binomial", alpha = 1, lambda = 2, link = "logit")
  Df      %Dev Lambda
1  0 -7.12e-15      2

> glmnet(x, y, family="binomial", link="cloglog", alpha = 1, lambda = 2)
Call: glmnet(x = x, y = y, family = "binomial", alpha = 1, lambda = 2, link = "cloglog")
  Df      %Dev Lambda
1  0 -7.12e-15      2

> glmnet(x, y, family="binomial", link="adam", alpha = 1, lambda = 2)
Call: glmnet(x = x, y = y, family = "binomial", alpha = 1, lambda = 2, link = "adam")
  Df      %Dev Lambda
1  0 -7.12e-15      2
Is it not possible to change the default link function for binomial family in glmnet?
I think you want to use family = binomial(link = "cloglog")
See the new glmnet vignette: https://cran.r-project.org/web/packages/glmnet/vignettes/glmnetFamily.pdf
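A minimal sketch of the corrected call, assuming glmnet version 4.0 or later, which (per the vignette above) accepts a family object; with a character family such as "binomial", extra arguments like link are silently ignored, which is why the nonsense 'adam' link "worked":
library(glmnet)
# Pass a family object, not a character string, so the link is honoured (glmnet >= 4.0)
fit <- glmnet(x, y, family = binomial(link = "cloglog"), alpha = 1)
print(fit)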

Need a vectorized solution in PyTorch

I'm doing an experiment using face images in the PyTorch framework. The input x is a face image of size 5 x 5 (height x width) with 192 channels.
Objective: to obtain patches of x of size patch_size (given as an argument).
I have obtained the required result with the help of two for loops, but I want a vectorized solution so that the computation cost is much lower than with two for loops.
Used: PyTorch 0.4.1, a 12 GB Nvidia Titan X GPU.
The following is my implementation using two for loops:
def extractpatches(x, patch_size):  # x is bs x 192 x 5 x 5
    patches = x.unfold(2, patch_size, 1).unfold(3, patch_size, 1)
    bs, c, pi, pj, _, _ = patches.size()  # (bs, c, pi, pj, patch_size, patch_size)
    cnt = 0
    p = torch.empty((bs, pi*pj, c, patch_size, patch_size)).to(device)
    s = torch.empty((bs, pi*pj, c*patch_size*patch_size)).to(device)
    # Want a vectorized method instead of the two for loops below
    for i in range(pi):
        for j in range(pj):
            p[:, cnt, :, :, :] = patches[:, :, i, j, :, :]
            s[:, cnt, :] = p[:, cnt, :, :, :].view(-1, c*patch_size*patch_size)
            cnt = cnt + 1
    return s
Thanks for your help in advance.
I think you can try the following; I used some parts of your code for my experiment and it worked for me. Here l and f are the lists of tensor patches:
l = [patches[:, :, i // pj, i % pj, :, :] for i in range(pi * pj)]
f = [l[i].contiguous().view(-1, c * patch_size * patch_size) for i in range(pi * pj)]
You can verify the above code using toy input values.
Thanks.
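If you want to eliminate the Python-level loop entirely, here is a fully vectorized sketch, assuming the same input layout as in the question; it produces the same s as the loop version by moving the unfolded patch-grid dimensions in front of the channel dimension and reshaping:
import torch

def extractpatches_vectorized(x, patch_size):  # x is bs x c x h x w
    patches = x.unfold(2, patch_size, 1).unfold(3, patch_size, 1)
    bs, c, pi, pj, _, _ = patches.size()  # (bs, c, pi, pj, ps, ps)
    # (bs, c, pi, pj, ps, ps) -> (bs, pi, pj, c, ps, ps), then flatten grid and patch dims
    p = patches.permute(0, 2, 3, 1, 4, 5).contiguous()
    return p.view(bs, pi * pj, c * patch_size * patch_size)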

Efficient way to extract and collect a random subsample of a generator in Julia

Consider a generator in Julia that, if collected, would take a lot of memory:
g = (x^2 for x = 1:9999999999999999)
I want to take a small random subsample of it (say 1%), but I do not want to collect() the object, because that would take a lot of memory.
Until now, the trick I was using was this:
temp = collect(((rand() > 0.01 ? nothing : x) for x in g))
random_sample = temp[temp .!= nothing]
But this is not efficient for generators with a lot of elements; collecting something with so many nothing elements doesn't seem right.
Any idea is highly appreciated. I guess the trick is to be able to get random elements from the generator without having to allocate memory for all of it.
Thank you very much.
You can use a comprehension with an if condition, like this:
[v for v in g if rand() < 0.01]
or, if you want a slightly faster but more verbose approach (I have hardcoded 0.01 and the element type of g, and I assume that your generator supports length; otherwise you can remove the sizehint! line):
function collect_sample(g)
    r = Int[]
    sizehint!(r, round(Int, length(g) * 0.01))
    for v in g
        if rand() < 0.01
            push!(r, v)
        end
    end
    r
end
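For example, on a hypothetical smaller generator, so the result is easy to inspect:
g = (x^2 for x = 1:1_000_000)
random_sample = collect_sample(g)  # about 10_000 elements on average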
EDIT
Here you have examples of a self-avoiding sampler and a reservoir sampler, both giving you a fixed output size. The smaller the fraction of the input you want, the better it is to use the self-avoiding sampler:
function self_avoiding_sampler(source_size, ith, target_size)
    rng = 1:source_size
    idx = rand(rng)
    x1 = ith(idx)
    r = Vector{typeof(x1)}(undef, target_size)
    r[1] = x1
    s = Set{Int}(idx)
    sizehint!(s, target_size)
    for i = 2:target_size
        while idx in s
            idx = rand(rng)
        end
        @inbounds r[i] = ith(idx)
        push!(s, idx)
    end
    r
end
function reservoir_sampler(g, target_size)
    r = Vector{Int}(undef, target_size)
    for (i, v) in enumerate(g)
        if i <= target_size
            @inbounds r[i] = v
        else
            j = rand(1:i)
            if j <= target_size
                @inbounds r[j] = v
            end
        end
    end
    r
end
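A usage sketch for both samplers (my example values; the ith argument is assumed to be a function mapping an index to the i-th element of the sequence, here i -> i^2):
g = (x^2 for x = 1:10_000)
sample1 = self_avoiding_sampler(10_000, i -> i^2, 100)
sample2 = reservoir_sampler(g, 100)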

Neural network model not learning?

I tried to model a NN using softmax regression.
After 999 iterations, I got an error of about 0.02% per data point, which I thought was good. But when I visualized the model on TensorBoard, my cost function did not approach 0; instead I got something like this.
And for the weights and bias histograms, this.
I am a beginner and I can't seem to understand the mistake. Maybe I am using a wrong method to define the cost?
Here is my full code for reference.
import tensorflow as tf
import numpy as np
import random

lorange = 1
hirange = 10
amplitude = np.random.uniform(-10, 10)
t = 10
random.seed()
tau = np.random.uniform(lorange, hirange)

x_node = tf.placeholder(tf.float32, (10,))
y_node = tf.placeholder(tf.float32, (10,))
W = tf.Variable(tf.truncated_normal([10, 10], stddev=.1))
b = tf.Variable(.1)
y = tf.nn.softmax(tf.matmul(tf.reshape(x_node, [1, 10]), W) + b)

# Add summaries
W_hist = tf.histogram_summary("weights", W)
b_hist = tf.histogram_summary("biases", b)
y_hist = tf.histogram_summary("y", y)

# Cost function sum((y_-y)**2)
with tf.name_scope("cost") as scope:
    cost = tf.reduce_mean(tf.square(y_node - y))
    cost_sum = tf.scalar_summary("cost", cost)

# Training using Gradient Descent to minimize cost
with tf.name_scope("train") as scope:
    train_step = tf.train.GradientDescentOptimizer(0.00001).minimize(cost)

sess = tf.InteractiveSession()

# Merge all the summaries and write them out to a log file
merged = tf.merge_all_summaries()
writer = tf.train.SummaryWriter("/tmp/mnist_logs_4", sess.graph_def)

error = tf.reduce_sum(tf.abs(y - y_node))

init = tf.initialize_all_variables()
sess.run(init)

steps = 1000
for i in range(steps):
    xs = np.arange(t)
    ys = amplitude * np.exp(-xs / tau)
    feed = {x_node: xs, y_node: ys}
    sess.run(train_step, feed_dict=feed)
    print("After %d iteration:" % i)
    print("W: %s" % sess.run(W))
    print("b: %s" % sess.run(b))
    print('Total Error: ', error.eval(feed_dict={x_node: xs, y_node: ys}))
    # Record summary data every 10 steps
    if i % 10 == 0:
        result = sess.run(merged, feed_dict=feed)
        writer.add_summary(result, i)
I got the same plot as you a couple of times.
That happened mostly when I was running TensorBoard on multiple log files; that is, the logdir I gave to TensorBoard contained multiple log files. Try to run TensorBoard on one single log file and let me know what happens.
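For example, assuming the TensorBoard CLI is available, point it at the single run directory used by the question's SummaryWriter:
tensorboard --logdir=/tmp/mnist_logs_4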

Combine time series plots using R

I want to combine three graphics on one graph. The data is the built-in R dataset nottem. Can someone help me write code to put the seasonal mean, the harmonic (cosine) model, and the original time series plot together, using different colors? I have already written the model code; I just don't know how to combine them for comparison.
Code:
library(TSA)
nottem
month. = season(nottem)
model = lm(nottem ~ month. - 1)
summary(model)
har. = harmonic(nottem, 1)
model1 = lm(nottem ~ har.)
summary(model1)
plot(nottem, type = "l", ylab = "Average monthly temperature at Nottingham castle")
points(y = nottem, x = time(nottem), pch = as.vector(season(nottem)))
Just put your time series inside a matrix:
x = cbind(serie1 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2)),
          serie2 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2)))
plot(x)
Or configure the plot region:
par(mfrow = c(2, 1)) # 2 rows, 1 column
serie1 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2))
serie2 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2))
require(zoo)
plot(serie1)
lines(rollapply(serie1, width = 10, FUN = mean), col = 'red')
plot(serie2)
lines(rollapply(serie2, width = 10, FUN = mean), col = 'blue')
Hope it helps.
PS: the zoo package is not strictly needed in this example; you could use the filter function instead.
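A sketch of that alternative using stats::filter, which computes a (roughly centred) 10-point moving average without any extra package:
plot(serie1)
# rep(1/10, 10) is a 10-point moving-average kernel
lines(stats::filter(serie1, rep(1/10, 10)), col = 'red')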
You can extract the seasonal mean with:
s.mean = tapply(serie, cycle(serie), mean)
# January, assuming serie is monthly data
print(s.mean[1])
This graph is pretty hard to read, because your three sets of values are so similar. Still, if you simply want to graph all of these on the same plot, you can do it pretty easily by using the coefficients generated by your models.
Step 1: Plot the raw data. This comes from your original code.
plot(nottem,type="l",ylab="Average monthly temperature at Nottingham castle")
Step 2: Set up x-values for the mean and cosine plots.
x <- seq(1920, (1940 - 1/12), by=1/12)
Step 3: Plot the seasonal means by repeating the coefficients from the first model.
lines(x=x, y=rep(model$coefficients, 20), col="blue")
Step 4: Calculate the y-values for the cosine function using the coefficients from the second model, and then plot.
y <- model1$coefficients[2] * cos(2 * pi * x) + model1$coefficients[1]
lines(x=x, y=y, col="red")
ggplot variant: if you decide to switch to the popular ggplot2 package for your plot, you would do it like so (melt() comes from the reshape2 package):
library(ggplot2)
library(reshape2)
x <- seq(1920, (1940 - 1/12), by = 1/12)
y.seas.mean <- rep(model$coefficients, 20)
y.har.cos <- model1$coefficients[2] * cos(2 * pi * x) + model1$coefficients[1]
plot_Data <- melt(data.frame(x = x, temp = nottem, seas.mean = y.seas.mean, har.cos = y.har.cos), id = "x")
ggplot(plot_Data, aes(x = x, y = value, col = variable)) + geom_line()
