How can I include survey weights in a Poisson point process model fitted to a logistic regression quadrature scheme?

Is it possible to include weights in a Poisson point process model fitted to a logistic regression quadrature scheme? My data come from a stratified sample, and I would like to account for this sampling strategy in order to obtain valid population-level predictions.

This is a question about the model-fitting function ppm in the R package spatstat.
Yes, you can include survey weights. The easiest way is to create a covariate surveyweight, which could be a function(x,y) or a pixel image or a column of data associated with your quadrature scheme. Then when fitting the model using ppm, add the model term +offset(log(surveyweight)).
The result of ppm will be a fitted model that describes the observed point pattern. You can do prediction, simulation, etc. from this model, but be aware that these will be predictions or simulations of the observed point process, including the effect of non-constant survey effort.
To get a prediction or simulation of the original point process (i.e. after removing the effect of non-constant survey effort) you need to replace the original covariate surveyweight by another covariate that is constant and equal to 1, then pass this to predict.ppm in the argument newdata.

Here are a few lines to elaborate on the answer by @adrian-baddeley.
If you have the setup of your related question and we imagine you have the weights and two covariates in a data.frame in the same order as the points of your quadscheme:
library(spatstat)
# Data pattern (larynx) and dummy pattern (lung) for the logistic quadrature scheme
X <- split(chorley)$larynx
D <- split(chorley)$lung
Q <- quadscheme.logi(X, D)
# Covariates in the same order as the quadrature points (data + dummy)
covar <- data.frame(weights = runif(npoints(chorley)),
                    covar1  = rnorm(npoints(chorley)),
                    covar2  = rnorm(npoints(chorley)))
fit <- ppm(Q ~ offset(log(weights)) + covar1 + covar2, data = covar)

Related

Learning a Sin function

I'm new to machine learning.
I'm building a simple model that should be able to predict a simple sin function.
I generated some sin values and am feeding them into my model.
from math import sin
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

xs = np.arange(-10, 40, 0.1)
squarer = lambda t: sin(t)
vfunc = np.vectorize(squarer)
ys = vfunc(xs)

model = Sequential()
model.add(Dense(units=256, input_shape=(1,), activation="tanh"))
model.add(Dense(units=256, activation="tanh"))
# ... a number of layers here
model.add(Dense(units=256, activation="tanh"))
model.add(Dense(units=1))
model.compile(optimizer="sgd", loss="mse")
model.fit(xs, ys, epochs=500, verbose=0)
I then generate some test data, which overlaps my training data but also introduces some new data:
import matplotlib.pyplot as plt

test_xs = np.arange(-15, 45, 0.01)
test_ys = model.predict(test_xs)
plt.plot(xs, ys)
plt.plot(test_xs, test_ys)
plt.show()
The predicted data and the training data look as follows. The more layers I add, the more curves the network is able to learn, but the training time increases.
Is there a way to make it predict sin for any number of curves? Preferably with a small number of layers.
With a fully connected network I guess you won't be able to get arbitrarily long sequences, but with an RNN it looks like people have achieved this. A Google search will turn up many such efforts; I found this one quickly: http://goelhardik.github.io/2016/05/25/lstm-sine-wave/
An RNN learns a sequence based on a history of inputs, so it's designed to pick up these kinds of patterns.
I suspect the limitation you observed is akin to performing a polynomial fit. If you increase the degree of the polynomial you can better fit a function like this, but a polynomial can only represent a fixed number of inflection points depending on the degree you choose. Your observation here appears to be the same: as you increase the number of layers you add more non-linear transitions, but you are limited by the fixed number of layers you chose as the architecture of a fully connected network.
An RNN does not work on the same principles, because it maintains a state and can make use of the state being passed forward in the sequence to learn the pattern of a single period of the sine wave and then repeat that pattern based on the state information.
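For what it's worth, here is a minimal, illustrative Keras sketch of that idea (not the code from the linked post): an LSTM trained on short sliding windows of the sine wave and then rolled forward to extrapolate. The window length, layer size, and epoch count are arbitrary choices.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Build (window -> next value) training pairs from a sine wave
xs = np.arange(0, 100, 0.1)
ys = np.sin(xs)
window = 20
X = np.array([ys[i:i + window] for i in range(len(ys) - window)])
y = ys[window:]
X = X.reshape((-1, window, 1))  # (samples, timesteps, features)

model = Sequential()
model.add(LSTM(32, input_shape=(window, 1)))
model.add(Dense(1))
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, verbose=0)

# Roll the model forward from the last observed window to extrapolate
seed = ys[-window:].reshape((1, window, 1))
preds = []
for _ in range(200):
    nxt = model.predict(seed, verbose=0)[0, 0]
    preds.append(nxt)
    seed = np.concatenate([seed[:, 1:, :], [[[nxt]]]], axis=1)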

Delta component doesn't show up in the weight learning rule of a sigmoid-activation MLP

As a basic proof of concept, consider a network that classifies K classes, with input x, bias b, output y, S samples, weights v, and teacher signal t, in which t(k) equals 1 if the matching sample belongs to class k.
Variables
Let x_is represent the i-th input feature in the s-th sample.
v_ks represents the vector that holds the weights of the connections to the k-th output from all inputs within the s-th sample.
t_s represents the teacher signal for the s-th sample.
If we extend the above variables to consider multiple samples, the changes below have to be applied while declaring the variable z_k and the activation function f(·), and using the cross entropy as a cost function:
Derivation
Typically in a learning rule the delta term (t_k - y_k) is always included. Why doesn't the delta show up in this equation? Have I missed something, or is the delta rule not a must?
I managed to find the solution. It becomes clear when we consider the Kronecker delta, where δ_ck = 1 if the class matches the classifier and δ_ck = 0 otherwise, which means the derivation takes this shape:
Derivation
which leads to the delta rule.
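For readers who cannot see the images in the post, here is a hedged reconstruction of the usual argument, assuming a softmax output with the cross entropy (this is the setting in which the Kronecker delta appears):

y_k = \frac{e^{z_k}}{\sum_c e^{z_c}}, \qquad
E = -\sum_j t_j \log y_j, \qquad
\frac{\partial y_j}{\partial z_k} = y_j \left( \delta_{jk} - y_k \right)

\frac{\partial E}{\partial z_k}
  = -\sum_j \frac{t_j}{y_j} \, y_j \left( \delta_{jk} - y_k \right)
  = -t_k + y_k \sum_j t_j
  = y_k - t_k

\frac{\partial E}{\partial v_{ki}} = (y_k - t_k) \, x_i

So once the Kronecker delta is summed out, the familiar delta term (t_k - y_k) reappears, up to sign (gradient descent subtracts this gradient), which is what "leads to the delta rule" above refers to.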

How to compute probabilities instead of actual classifications on ML problems

Let's assume that we have a few data points that can be used as the training set. Each row consists of, say, 4 columns (features) that take boolean values. The 5th column expresses the class and also takes boolean values. Here is an example (they are almost random):
1,1,1,0,1
0,1,1,0,1
1,1,0,0,1
0,0,0,0,0
1,0,0,1,0
0,0,0,0,0
Now, what I want to do is build a model such that for any given input (new line) the system does not return the class itself (as in a regular classification problem) but instead the probability that this particular input belongs to class 0 or class 1. How can I do that? What's more, how can I generate a confidence interval or error rate associated with that computation?
Not all classification algorithms return probabilities, because not all of them have an underlying probabilistic model. For example, a classification tree is just a set of rules that you follow to assign each new input to a particular class.
An example of a classification algorithm that does have an underlying probabilistic model is logistic regression. In this algorithm, the probability that a particular input x is in the class is
prob = 1 / (1 + exp( -theta * x ))
where theta is a vector of coefficients with the same number of dimensions as x, and theta * x is their dot product. Generally, to move from probabilities to classifications you simply threshold, e.g.
if prob < 0.5
    return 0;
else
    return 1;
end
Other classification algorithms may have probabilistic interpretations, for example random forests are essentially a voting algorithm with multiple classification trees. If 80% of the trees vote for class 1 and 20% vote for class 2, then you could output an 80% probability of being in class 1. But this is a side effect of how the model works, rather than an explicit underlying probability model.
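As a concrete sketch (assuming scikit-learn; the toy rows are the ones from the question), both logistic regression and random forests expose a predict_proba method that returns class probabilities rather than hard labels:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Toy data: 4 boolean features plus a boolean class label (from the question)
data = np.array([[1, 1, 1, 0, 1],
                 [0, 1, 1, 0, 1],
                 [1, 1, 0, 0, 1],
                 [0, 0, 0, 0, 0],
                 [1, 0, 0, 1, 0],
                 [0, 0, 0, 0, 0]])
X, y = data[:, :4], data[:, 4]

logit = LogisticRegression().fit(X, y)
forest = RandomForestClassifier(n_estimators=100).fit(X, y)

new_row = [[1, 0, 1, 0]]
print(logit.predict_proba(new_row))   # [[P(class 0), P(class 1)]]
print(forest.predict_proba(new_row))  # fraction of trees voting for each class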

complex machine learning application

Two questions concerning machine learning algorithms like linear/logistic regression, ANNs, and SVMs:
The models in these algorithms deal with data sets where each example has a number of features and a single output value (e.g., predicting the price of a house from its features). But what if the features are enough to produce more than one piece of information about the item of interest, i.e., more than one output? Consider this example: a data set about cars where each example (car) has the features initial velocity, acceleration, and time. In the real world these features are enough to determine two quantities: velocity via v = v_i + a*t and distance via s = v_i*t + 0.5*a*t^2. So I want example X with features (x1, x2, ..., xn) to have outputs y1 and y2 in the same step, so that after training the model, if a new car example is given with initial velocity, acceleration, and time, the model will be able to predict the velocity and the distance at the same time. Is this possible?
In the house price prediction example, where example X is given with features (x1, x2, x3) and the model predicts the price, can the process be reversed by any means? Meaning, if I give the model example X with features x1 and x2 together with the price y, can it predict the feature x3?
Depends on the model. A linear model such as linear regression cannot reliably learn the distance formula, since it is a cubic function of the given variables. You'd need to add v_i×t and a×t² as features to get a good prediction of the distance. A non-linear model such as a cubic-kernel SVM regression or a multi-layer ANN should be able to learn this from the given features, though, given enough data.
More generally, predicting multiple values with a single model sometimes works and sometimes doesn't -- when in doubt, just fit several models.
You can try. Whether it'll work depends on the relation between the variables and the model.
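To illustrate the first point, here is a hedged sketch (assuming scikit-learn and synthetic data, not any particular real data set): with the engineered features a*t, v_i*t, and a*t^2, an ordinary LinearRegression can predict velocity and distance jointly, because it accepts a two-column target:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
v_i = rng.uniform(0, 30, n)       # initial velocity
a = rng.uniform(-3, 3, n)         # acceleration
t = rng.uniform(0, 10, n)         # time

v = v_i + a * t                   # target 1: velocity
s = v_i * t + 0.5 * a * t**2      # target 2: distance

# Engineered features make both targets linear in the inputs
X = np.column_stack([v_i, a, t, a * t, v_i * t, a * t**2])
Y = np.column_stack([v, s])       # two outputs predicted in one step

model = LinearRegression().fit(X, Y)
print(model.predict(X[:3]))       # predicts (velocity, distance) pairs
print(Y[:3])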

Probability and Neural Networks

Is it good practice to use sigmoid or tanh output layers in neural networks directly to estimate probabilities?
i.e., the probability of a given input occurring would be the output of the sigmoid function in the NN.
EDIT
I want to use a neural network to learn and predict the probability of a given input occurring.
You may consider the input as a State1-Action-State2 tuple.
Hence the output of the NN is the probability that State2 happens when applying Action to State1.
I hope that clears things up.
EDIT
When training the NN, I apply a random Action to State1 and observe the resultant State2; then I teach the NN that the input State1-Action-State2 should result in output 1.0.
First, just a couple of small points on the conventional MLP lexicon (might help for internet searches, etc.): 'sigmoid' and 'tanh' are not 'output layers' but functions, usually referred to as "activation functions". The return value of the activation function is indeed the output from each layer, but they are not the output layer themselves (nor do they calculate probabilities).
Additionally, your question recites a choice between two "alternatives" ("sigmoid and tanh"), but they are not actually alternatives, rather the term 'sigmoidal function' is a generic/informal term for a class of functions, which includes the hyperbolic tangent ('tanh') that you refer to.
The term 'sigmoidal' is probably due to the characteristic shape of the function: the return (y) values are constrained between two asymptotic values regardless of the x value. The function output is usually normalized so that these two values are -1 and 1 (or 0 and 1). (This output behavior, by the way, is obviously inspired by the biological neuron, which either fires (+1) or it doesn't (-1).) A look at the key properties of sigmoidal functions shows why they are ideally suited as activation functions in feed-forward, backpropagating neural networks: they are (i) real-valued and differentiable, (ii) have exactly one inflection point, and (iii) have a pair of horizontal asymptotes.
In turn, the sigmoidal functions are one category of functions used as activation functions (aka "squashing functions") in FF neural networks solved using backprop. During training or prediction, the weighted sum of the inputs (for a given layer, one layer at a time) is passed in as an argument to the activation function, which returns the output for that layer. Another group of functions sometimes used as activation functions is the piecewise linear functions (PLFs); the step function is the binary variant of a PLF:
def step_fn(x):
    if x <= 0:
        y = 0
    if x > 0:
        y = 1
    return y
(On practical grounds, I doubt the step function is a plausible choice for the activation function, but perhaps it helps understand the purpose of the activation function in NN operation.)
I suppose there is an unlimited number of possible activation functions, but in practice you only see a handful; in fact just two account for the overwhelming majority of cases (both are sigmoidal). Here they are (in Python) so you can experiment for yourself, given that the primary selection criterion is a practical one:
import math
from math import e

# logistic function
def sigmoid2(x):
    return 1 / (1 + e**(-x))

# hyperbolic tangent
def sigmoid1(x):
    return math.tanh(x)
What are the factors to consider in selecting an activation function?
First, the function has to give the desired behavior (arising from, or as evidenced by, the sigmoidal shape). Second, the function must be differentiable. This is a requirement for backpropagation, the optimization technique used during training to adjust the weights of the hidden layers.
For instance, the derivative of the hyperbolic tangent is (in terms of the output, which is how it is usually written) :
def dsigmoid(y):
    return 1.0 - y**2
Beyond those two requirements, what makes one function better than another is how efficiently it trains the network, i.e., which one causes convergence (reaching the local minimum error) in the fewest epochs?
#-------- Edit (see OP's comment below) ---------#
I am not quite sure I understood; sometimes it's difficult to communicate details of a NN without the code, so I should probably just say that it's fine subject to this proviso: what you want the NN to predict must be the same as the dependent variable used during training. So, for instance, if you train your NN using two states (e.g., 0 and 1) as the single dependent variable (which is obviously missing from your testing/production data), then that's what your NN will return when run in "prediction mode" (post-training, or with a competent weight matrix).
You should choose the right loss function to minimize.
The squared error does not lead to the maximum likelihood hypothesis here.
The squared error is derived from a model with Gaussian noise:
P(y|x,h) = k1 * e**-(k2 * (y - h(x))**2)
You estimate the probabilities directly. Your model is:
P(Y=1|x,h) = h(x)
P(Y=0|x,h) = 1 - h(x)
P(Y=1|x,h) is the probability that event Y=1 will happen after seeing x.
The maximum likelihood hypothesis for your model is:
h_max_likelihood = argmax_h product(
    h(x)**y * (1 - h(x))**(1 - y) for x, y in examples)
This leads to the "cross entropy" loss function.
See chapter 6 of Mitchell's Machine Learning for the loss function and its derivation.
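As a small illustrative sketch (assuming numpy; the function name is made up), taking the negative log of that likelihood gives the cross-entropy loss you would minimize instead of the squared error:

import numpy as np

def cross_entropy(y_true, h_x, eps=1e-12):
    """Negative log-likelihood of the Bernoulli model P(Y=1|x) = h(x)."""
    h_x = np.clip(h_x, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(h_x) + (1 - y_true) * np.log(1 - h_x))

# Example: network outputs h(x) for three samples with labels y
y = np.array([1, 0, 1])
h = np.array([0.9, 0.2, 0.6])
print(cross_entropy(y, h))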
There is one problem with this approach: if you have vectors from R^n and your network maps those vectors into the interval [0, 1], it will not be guaranteed that the network represents a valid probability density function, since the integral of the network is not guaranteed to equal 1.
E.g., a neural network could map every input from R^n to 1.0, but that clearly cannot be a valid probability density.
So the answer to your question is: no, you can't.
However, you can just say that your network never sees "unrealistic" input samples and thus ignore this fact. For a discussion of this (and also some more cool information on how to model PDFs with neural networks) see contrastive backprop.
