Drawing a McCulloch-Pitts neuron that expresses (x1 AND x2) OR x3 - machine-learning

I understand that the neuron for the logical AND operator looks like this.
The McCulloch-Pitts neuron for the logical OR looks like this.
My question is: how can I combine these two logical operators if I have to draw a neuron architecture that expresses (x1 AND x2) OR x3?
This is my approach, but I am not sure whether it is correct.
Could anyone confirm it is correct, or, if it is wrong, tell me what I should change?
Thank you in advance

There are two ways to do this:
Make use of two separate models, in your case a boolean AND followed by a boolean OR. x1 and x2 are inputs to the AND model, and its output (x1 AND x2) is one of the inputs, along with x3, to the OR model.
A better method is to create a new model with different weights. You can do this by writing out a truth table for the expression, checking whether it is linearly separable, and choosing the type of model accordingly:
Linearly separable -> single layer
Not linearly separable -> multi-layer (e.g. the XOR gate)
Both options are checked in the sketch below.
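Here is a minimal sketch (plain Python, no libraries) that verifies both approaches against the truth table of (x1 AND x2) OR x3. The specific weights and thresholds are one valid choice picked for illustration, not the only possible one; in particular, the single-unit version works because this expression turns out to be linearly separable.

```python
from itertools import product

def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fires (1) iff the weighted sum reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

for x1, x2, x3 in product([0, 1], repeat=3):
    # Approach 1: two stacked units -- AND(x1, x2) feeds an OR unit together with x3.
    and_out = mp_neuron([x1, x2], weights=[1, 1], threshold=2)          # fires only if both inputs are 1
    two_layer = mp_neuron([and_out, x3], weights=[1, 1], threshold=1)   # fires if either input is 1

    # Approach 2: a single unit. Weights (1, 1, 2) with threshold 2 reproduce the same
    # function, since x3 alone (weight 2) or x1 + x2 together reach the threshold.
    single = mp_neuron([x1, x2, x3], weights=[1, 1, 2], threshold=2)

    assert two_layer == single == int((x1 and x2) or x3)
    print(x1, x2, x3, "->", single)
```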

Related

Can gradient descent itself solve non-linear problem in ANN?

I'm currently studying the theory of neural networks, and I'm a little confused about the roles of gradient descent and the activation function in an ANN.
From what I understand, the activation function is used to turn the model into a non-linear model, so that it can solve problems that are not linearly separable, and gradient descent is the tool that helps the model learn.
So my questions are:
If I use an activation function such as the sigmoid in the model, but instead of gradient descent I use the classic perceptron learning rule w_j = w_j + a*(y - h(x)), where h(x) is the sigmoid applied to the net input, can the model learn a non-linearly separable problem?
If I do not include a non-linear activation function in the model, and just use the plain net input h(x) = w0 + w1*x1 + ... + wj*xj with gradient descent to improve the model, can the model learn a non-linearly separable problem?
I'm really confused about which of the two is the main reason that a model can learn a non-linearly separable problem.
Supervised Learning 101
This is a pretty deep question, so I'm going to review the basics first to make sure we understand each other. In its simplest form, supervised learning, and classification in particular, attempts to learn a function f such that y=f(x), from a set of observations {(x_i,y_i)}. The following problems arise in practice:
You know nothing about f. It could be a polynomial, exponential, or some exotic highly non-linear thing that doesn't even have a proper name in math.
The dataset you're using to learn is just a limited, and potentially noisy, subset of the true data distribution you're trying to learn.
Because of this, any solution you find will have to be approximate. The type of architecture you use will determine a family of functions h_w(x), and each value of w will represent one function in this family. Note that because there is usually an infinite number of possible w, the family of functions h_w(x) is often infinitely large.
The goal of learning is then to determine which w is most appropriate. This is where gradient descent intervenes: it is just an optimisation tool that helps you pick a reasonably good w, and thus select a particular model h(x).
The problem is, the actual f function you are trying to approximate may not be part of the family h_w you decided to pick, and so the best you can hope for is the closest approximation to f within that family.
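To make that picture concrete, here is a tiny, hedged sketch: a one-parameter family h_w(x) = w*x fitted by gradient descent on a mean-squared-error loss. The data, learning rate, and iteration count are made up purely for illustration.

```python
import numpy as np

# Toy data drawn from a "true" function f(x) = 3x plus noise (made up for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 3.0 * x + rng.normal(scale=0.1, size=50)

# Family of candidate functions: h_w(x) = w * x. Gradient descent picks one member.
w = 0.0
lr = 0.1
for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)   # d/dw of the mean squared error
    w -= lr * grad

print(f"learned w = {w:.3f}")  # should land near 3
```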
Answering the actual questions
Now that the basics are covered, let's answer your questions:
Putting a non-linear activation function like the sigmoid at the output of a single-layer ANN will not help it learn a non-linear function. Indeed, a single-layer ANN is equivalent to linear regression, and adding the sigmoid turns it into logistic regression. Why doesn't it work? Let me try an intuitive explanation: the sigmoid at the output of the single layer is there to squash the output to [0,1], so that it can be interpreted as a class membership probability. In short, the sigmoid acts as a differentiable approximation to a hard step function. Our learning procedure relies on this smoothness (a well-behaved gradient is available everywhere), and using a step function would break e.g. gradient descent. This doesn't change the fact that the decision boundary of the model is linear, because the final class decision is still taken from the value of sum(w_i*x_i). This is probably not really convincing, so let's illustrate instead using the Tensorflow Playground. Note that the learning rule does not matter here, because the family of functions you're optimising over consists only of functions that are linear in their input, so you will never learn a non-linear one!
If you drop the sigmoid activation, you're left with simple linear regression. You don't even project your result back to [0,1], so the output will not be easy to interpret as a class probability, but the final result will be the same. See the Playground for a visual proof, or the small XOR check below.
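If you don't have the Playground at hand, here is a minimal, hedged stand-in using scikit-learn: logistic regression (single layer + sigmoid) on the XOR truth table. Because its decision boundary is a straight line, no choice of weights classifies all four points correctly, whatever learning rule found them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# XOR: the classic non-linearly-separable toy problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Logistic regression = single layer + sigmoid output. A line can put at most
# three of these four points on the correct side, so accuracy stays below 1.0.
clf = LogisticRegression().fit(X, y)
print("single layer + sigmoid accuracy:", clf.score(X, y))
```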
What is needed then?
To learn a non-linearly separable problem, you have several solutions (both are sketched in code after this list):
Preprocess the input x into x', so that taking x' as an input makes the problem linearly separable. This is only possible if you know the shape that the decision boundary should take, so it is generally only applicable to very simple problems. In the Playground problem, since we're working with a circle, we can add the squares of x1 and x2 to the input. Although our model is linear in its input, an appropriate non-linear transformation of the input has been carefully selected, so we get an excellent fit.
We could instead try to automatically learn the right representation of the data, by adding one or more hidden layers, which will work to extract a good non-linear transformation. It can be proven that a single hidden layer is enough to approximate anything, as long as you make the number of hidden neurons high enough. For our example, we get a good fit using only a few hidden neurons with ReLU activations. Intuitively, the more neurons you add, the more "flexible" the decision boundary can become. People in deep learning have been adding depth rather than width because it can be shown that making the network deeper makes it require fewer neurons overall, even though it makes training more complex.
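Continuing the XOR stand-in from above (a hedged illustration of my own, not the Playground circle): adding a hand-crafted product feature makes the problem linearly separable for the same linear model, while a small MLP with one hidden layer learns a suitable transformation on its own (for most random seeds). The layer size and solver are arbitrary choices for this toy example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR again

# Solution 1: hand-crafted preprocessing. Adding the product x1*x2 as a third
# feature makes XOR linearly separable, so the same linear model can now fit it.
X_aug = np.hstack([X, X[:, [0]] * X[:, [1]]])
print("linear model + engineered feature:",
      LogisticRegression().fit(X_aug, y).score(X_aug, y))

# Solution 2: let a hidden layer learn the transformation itself.
mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="relu",
                    solver="lbfgs", max_iter=2000, random_state=0)
print("one hidden layer of ReLUs:", mlp.fit(X, y).score(X, y))
```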
Yes, gradient descent is quite capable of solving a non-linear problem. The method works as long as the various transformations are roughly linear within a "delta" of the adjustments. This is why we tune the learning rate: to keep the steps within the range in which that local linear assumption is reasonably accurate.
Non-linear transformations give us a better way to separate "this is boring" from "this is exactly what I'm looking for!". As long as these functions are smooth, or have only a small number of jumps, we can apply our usual approximations and iterations to solve the overall system.
Determining the useful operating ranges is not a closed-form computation, by any means; as with much of AI research, it requires experimentation and refinement. The direct answer to your question is that you've asked the wrong entity -- try the choices you've listed, and see which works best for your application.

How to find the best combination of features which will maximise the probability of a particular class?

Suppose we have a classifier with two output classes C1 and C2 and 8 features X1, X2, ..., X8. How do you find the combination of features (possibly a subset) such that the likelihood of class C1 is maximized?
You could look at something called "backward elimination" to find the combination of features that most influences the outcome for each class. From what I can interpret of your question, you want to maximize the likelihood of class C1 (i.e. be biased towards it); you could also consider a weighted approach for the same goal (higher weights for the features that push the outcome towards class C1). A rough sketch of backward elimination follows.
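Here is a hedged sketch of backward elimination using scikit-learn's SequentialFeatureSelector on made-up data standing in for the 8-feature problem; the target subset size and the choice of logistic regression as the base classifier are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Made-up stand-in for the 8-feature, 2-class problem in the question.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           n_redundant=2, random_state=0)

# Backward elimination: start from all 8 features and repeatedly drop the one
# whose removal hurts cross-validated performance the least.
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=4,
                                     direction="backward", cv=5)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```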

Inverse prediction in Machine Learning

I have a question about inverse prediction in Machine Learning/Data Science. Here is an example to illustrate it: I have 20 input features X = (x0, x1, ... x19) and 3 output variables Y = (y0, y1, y2). The amount of training/test data is usually small, e.g. <1000 items or even <100 in the training set.
In general, by using a machine learning toolbox (such as scikit-learn), I can train models (such as random forest, linear/polynomial regression, or a neural network) from X --> Y. But what I actually want to know is, for example, how I should set X so that I get y1 values in a specific range (for example y1 > 100).
Does anyone know how to solve this kind of "inverse prediction"? There are two ways in my mind:
Train the model in the normal way, X --> Y, then set up a dense mesh in the high-dimensional X space (20 dimensions in this example). Use all the points in this mesh as input data and feed them to the trained model. Select all the input points where the predicted y1 > 100. Finally, use some method, such as clustering, to look for patterns in the selected data points.
Directly learn a model from Y to X. Then set up a dense mesh in the Y space, restricted to y1 > 100, and use the trained model to calculate the corresponding X data points.
The second method might be OK when Y is also high-dimensional. But usually, in my application, Y is very low-dimensional and X is very high-dimensional, which makes me think method 2 is not very practical.
Does anyone have any new thoughts? I think this should be fairly common in industry, and maybe some people have met a similar situation before.
Thank you!
From what I understand of your needs, #1 is an excellent fit for this problem. I recommend that you use a simple binary-classifier SVM to discriminate good/bad X vectors. SVMs work well with high-dimensional spaces, and reading out the coefficients is easy in most SVM interfaces. A rough sketch of that workflow follows.
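Below is a hedged sketch of that workflow with scikit-learn: fit a forward model X --> y1 on synthetic data, sample candidate X points at random (a full 20-dimensional mesh would be far too large), label each candidate by whether the predicted y1 exceeds the target, and fit a linear SVM to separate good from bad candidates. All the data, model choices, and thresholds here are illustrative assumptions, not part of the original question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 input features and one target y1 with a made-up relationship.
X_train = rng.uniform(-1, 1, size=(200, 20))
y1_train = 50 + 80 * X_train[:, 0] - 40 * X_train[:, 1] + rng.normal(0, 5, size=200)

# Step 1: forward model X --> y1.
forward = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y1_train)

# Step 2: sample many candidate X points and label them by the forward prediction.
candidates = rng.uniform(-1, 1, size=(20000, 20))
good = forward.predict(candidates) > 100

# Step 3: a linear SVM separating "good" from "bad" candidates; its largest
# coefficients point at the input directions that push y1 above the threshold.
svm = LinearSVC(C=1.0, max_iter=10000).fit(candidates, good)
print("most influential features:", np.argsort(-np.abs(svm.coef_[0]))[:5])
```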
A similar note that may be useful:
In inverse/backward prediction, you can often predict in the backward Y ---> X direction with accuracy similar to the forward X ---> Y direction simply by solving the system of equations relating X and Y, given the fitted weights and intercepts. This works best for linear problems of the form AX = B. Note that naive Python code for inverse prediction can carry considerable error, so directly solving the (n x n) system of equations is often the better choice when accuracy matters.
Regards

Output of the logit function as additional input to a Neural Network

This is with respect to a hybrid of an ANN and logistic regression in a binary classification problem. For example, one of the papers I came across states that "A hybrid model type is constructed by using the logistic regression model to calculate the probability of failure and then adding that value as an additional input variable into the ANN. This type of model is defined as a Plogit-ANN model".
So, for n input variables, I'm trying to understand how the additional input n+1 to the ANN is treated by the activation function (e.g. a logit function) and in the summation of weights multiplied by inputs. Do we treat this probability variable n+1 as a standalone term, like a special kind of bias b0 added to the sum of weights times inputs, e.g. summation for each neuron = (Sum(Wi*Xi)) + additional variable?
Thank you for your assistance.
According to the description provided, the easiest way is to treat this as an additional feature of your data. So you have a model that predicts something about your original dataset (the probability of some additional thing), thus you get x -> f(x). You simply concatenate it to your feature vector, so x' = [x1 x2 ... xk f(x)], and push it through the network. It is then multiplied by its own learned weight inside each neuron's sum, just like any other input, rather than being treated as a special bias term; a small sketch of this is below.
However, the described approach is quite naive, since you are doing these two things (training f and training the neural net) completely independently; what might be more beneficial is to instead treat fitting f as an auxiliary loss and train your model jointly.
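Here is a hedged, minimal sketch of the simple concatenation variant with scikit-learn; the dataset, layer sizes, and iteration counts are arbitrary stand-ins for whatever the paper used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Made-up binary classification data standing in for the paper's setting.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1: logistic regression produces a probability of the positive class.
logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_tr = logit.predict_proba(X_tr)[:, [1]]
p_te = logit.predict_proba(X_te)[:, [1]]

# Stage 2: that probability is concatenated as feature n+1 -- an ordinary input
# with its own learned weight in each neuron's weighted sum, not a special bias.
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
ann.fit(np.hstack([X_tr, p_tr]), y_tr)
print("hybrid Plogit-ANN test accuracy:", ann.score(np.hstack([X_te, p_te]), y_te))
```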

Reshaping Inputs that contain continuous and discrete values

The inputs I am using are 2xN, where the first 1xN row are continuous numbers, and the second 1xN row are discrete numbers (that encodes a specific class out of 7 possible classes). I expect there to be a relation between vertically adjacent pairs.
I am looking to use a neural net for a multi-class classifier on this input, but am unsure of how to reshape my data for forward propagation in a way that makes sense.
What is a feasible way to reshape my data into 1x2N for forward propagation that makes sense?
edit:
Example input:
input_features = [[99.3, 22.1, 41.7], [1, 3, 4]]
Unless you know something more than "there might be some kind of relation", you should just flatten the array and pass it as a vector - an NN can (in theory) find such relations on its own (given enough data).
What are the other options? If you suspect that there is a single relation that holds for every column, then you might want to construct a specific neural net. One option is to have a convolution of size 2x1 (a single column) in the input layer. On the other hand, if you create a large enough set of kernels, this will be able to model more complex relations too. In that case, leave the input as a matrix (think of it as an image). There is nothing wrong with discrete values, as long as they are on a reasonable scale.
In general, you will actually just work with a specific wiring of the net, not with reshaping an array (however, implementations of conv nets do use the shape to do the work for you, as described). A small reshaping sketch follows.
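Here is a minimal sketch of the flattening option using the example input from the question; the interleaved variant is an extra illustration of keeping each (continuous, discrete) pair adjacent, which is an assumption on my part about how a 2x1 convolution-style wiring would be easiest to reason about.

```python
import numpy as np

# The example input from the question: row 0 is continuous, row 1 is a class id (0-6).
input_features = np.array([[99.3, 22.1, 41.7],
                           [1.0,   3.0,  4.0]])

# Option from the answer: flatten the 2xN matrix into a 1x2N vector and let the
# network discover any relation between vertically adjacent pairs on its own.
flat = input_features.reshape(1, -1)           # [[99.3, 22.1, 41.7, 1.0, 3.0, 4.0]]
print(flat)

# Alternative layout: interleave, so each (continuous, discrete) pair stays adjacent.
interleaved = input_features.T.reshape(1, -1)  # [[99.3, 1.0, 22.1, 3.0, 41.7, 4.0]]
print(interleaved)
```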
