Inverting feedforward neural network - machine-learning

I have been searching online for papers on inverting feedforward neural networks, and it turns out there are NLP and LP algorithms for inverting them. But most of the papers are interested in recovering an inverse mapping that is one-to-many. I am wondering about this kind of problem:
Say I have a function z = x + y and I teach my FFNN to approximate this function. Once it has been trained, I would like to take, for example, x and z as inputs and get y as output. So it is not exactly a one-to-many mapping, but is it any easier than the problem of having just z and wanting to compute x and y? Are there any algorithms for performing such a task?

To the best of my knowledge, methods that invert a network are usually adversarial methods (GANs) or do so by optimizing over the network's input (say, minimizing |f(x, y') - z|, where y' stands for the output you want in your problem). The first method is more popular.
Let's talk about the first method a bit more. Call the network that you trained to learn x + y = z network D. You then teach another network (call it G) to take x and z and produce y, and D checks whether that is the correct answer (i.e. whether x + y = z); we continue this until G learns to satisfy D (I have left some details out, you can learn more by studying GANs). However, this is more like reformulating our problem.
If you're familiar with how NNs work, you'll know that it's hard to train a network when you fix its desired output and only part of its input, since plain backpropagation is set up to update the weights, not a missing input.
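For the second, optimization-based approach, here is a minimal sketch (using PyTorch purely for illustration; the tiny random network below is only a stand-in for your trained model). The idea is to freeze the weights, treat the unknown input y' as a learnable tensor, and minimize |f(x, y') - z| by gradient descent:

import torch

# Stand-in for the trained network f(x, y) -> z; in practice, load your own trained model.
f = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.Tanh(),
                        torch.nn.Linear(16, 1))
for p in f.parameters():
    p.requires_grad_(False)               # freeze the weights; we only optimize y'

x = torch.tensor([2.0])                   # known input
y_true = torch.tensor([3.0])              # only used here to fabricate a reachable target z
z = f(torch.cat([x, y_true]).unsqueeze(0)).squeeze()

y = torch.zeros(1, requires_grad=True)    # unknown input y' we solve for
opt = torch.optim.Adam([y], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    z_hat = f(torch.cat([x, y]).unsqueeze(0)).squeeze()
    loss = (z_hat - z).abs()              # |f(x, y') - z|
    loss.backward()                       # gradient flows only into y
    opt.step()

print(y.item(), y_true.item())            # a y' that reproduces z (may differ from y_true if f is not injective)

With a real trained model you would load it in place of the stand-in; if the forward mapping is not injective, the recovered y' is just one input consistent with (x, z).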
Finally, you might want to check this paper out. There are not many technical details, but it proposes precisely what you asked:
https://openreview.net/forum?id=BJxMQbQ3wm

Related

Do we understand the mathematics behind Neural Networks?

So I read somewhere that we as humans don't understand what exactly happens in a neural network; we just know that a neuron does something using the biases and the inputs given to it, and that this leads us to a specific output.
My question here is: do we understand (mathematically speaking) how input X leads the computer to give output Y? If we don't, then why don't we understand it?
Let X be an input matrix and Y be the associated output vector (our target). Let theta be the parameters of our model, representing the weights and the bias of each neuron.
Mathematically, a neural network can be represented as a function f such that f(X, theta) = Y + epsilon, where epsilon is the error of the model. The goal is to find the value of theta that minimizes epsilon. To do so, we just have to find the global minimum of the multivariate function epsilon(theta) = f(X, theta) - Y. This is an optimization problem that can be solved with gradient descent. So yes, mathematically, we understand how input X leads the computer to give output Y: it is just a matter of finding the minimum of a function. Additionally, as the structure of a neural network is quite simple (linear layers + activation functions), we can easily compute the derivatives of epsilon() and propagate them through the network.
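To make that concrete, here is a minimal NumPy sketch (a single linear neuron on toy data, purely illustrative) of gradient descent adjusting theta to shrink the error between f(X, theta) and Y:

import numpy as np

# Toy data: y = 2*x + 1 plus a little noise (purely illustrative)
X = np.random.rand(100, 1)
Y = 2 * X + 1 + 0.01 * np.random.randn(100, 1)

w, b = 0.0, 0.0                      # theta = (w, b)
lr = 0.5
for _ in range(1000):
    pred = w * X + b                 # f(X, theta)
    err = pred - Y                   # epsilon
    grad_w = 2 * np.mean(err * X)    # d(mean squared error)/dw
    grad_b = 2 * np.mean(err)        # d(mean squared error)/db
    w -= lr * grad_w                 # gradient descent step
    b -= lr * grad_b

print(w, b)                          # should approach 2 and 1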
However, being able to describe a neural network mathematically does not mean we can interpret it. It is very difficult to know the specific role played by each neuron in the prediction. By contrast, decision trees are much more interpretable, as we know which feature was used to make a split at each node of the tree.

Inverse prediction in Machine Learning

I have a question on inverse prediction in Machine Learning/Data Science. Here is an example to illustrate it: I have 20 input features X = (x0, x1, ... x19) and 3 output variables Y = (y0, y1, y2). The number of training/test examples is usually small, such as <1000 items or even <100 in the training set.
In general, using a machine learning toolbox (such as scikit-learn), I can train models (such as random forests, linear/polynomial regression and neural networks) from X --> Y. But what I actually want to know is, for example, how I should set X so that I get y1 values in a specific range (for example y1 > 100).
Does anyone know how to solve this kind of "inverse prediction"? Two approaches come to my mind:
Method 1: Train the model in the normal way, X --> Y, then set a dense mesh in the high-dimensional X space (20 dimensions in this example). Feed all the points of this mesh to the trained model and select the input points where the predicted y1 > 100. Finally, use some method, such as clustering, to look for patterns in the selected points.
Method 2: Learn models directly from Y to X. Then set a dense mesh in the high-dimensional Y space, requiring y1 > 100, and use the trained models to calculate the corresponding X points.
The second method might be OK when Y is also high-dimensional. But usually, in my application, Y is very low-dimensional and X is very high-dimensional, which makes me think method 2 is not very practical.
Does anyone have any new thoughts? I think this must be fairly common in industry, and maybe some people have encountered a similar situation before.
Thank you!
From what I understand of your needs, method 1 is an excellent fit for this problem. I recommend that you use a simple binary SVM classifier to discriminate good/bad X vectors. SVMs work well with high-dimensional spaces, and reading out the coefficients is easy in most SVM interfaces.
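A rough sketch of that pipeline (everything below is a placeholder: the random-forest stand-in plays the role of your already-trained forward model, and y1 > 100 is the target condition). Sample candidate points in X space, label them with the forward model, then fit an SVM to separate the good region from the bad one:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import LinearSVC

# Stand-in forward model X -> Y; in practice this is your already-trained model
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(300, 20))
Y_train = rng.uniform(size=(300, 3)) * 200
model = RandomForestRegressor(n_estimators=50).fit(X_train, Y_train)

# 1) sample candidate points in the 20-dimensional X space
X_candidates = rng.uniform(size=(5000, 20))
# 2) label them with the forward model: "good" means predicted y1 > 100
labels = (model.predict(X_candidates)[:, 1] > 100).astype(int)
# 3) fit a linear SVM to separate good from bad regions of X space
svm = LinearSVC(max_iter=5000).fit(X_candidates, labels)
print(svm.coef_)   # coefficients hint at which X dimensions push y1 above 100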
A similar note that may be useful:
For linear problems of the form AX = B, inverse/backward prediction (Y ---> X) can achieve accuracy similar to that of direct/forward prediction (X ---> Y) simply by solving the system of equations relating X and Y, using the fitted weights and intercepts. Note that generic Python code for inverse prediction can carry a considerable error; solving the (n*n) system of equations directly is usually the better choice and gives suitable accuracy.
Regards

Machine Learning - Features selection and Modeling (Connection between samples)

I am new to Machine Learning, and I have several questions about my data.
Let's say I have X samples and Y features, and I also have the connection between pairs of samples, e.g. between x1 and x2 (say, their interaction count).
Most Machine Learning tutorials start with labels attached to each sample itself, so I would like to ask how I should build the model. I want a model that, given two specific samples, predicts how high their interaction count would be.
Giving me a direction/ keywords to learn would be good enough, thanks!
I have received another suggestion for the approach:
Formulate the problem as z = f(x1, x2), i.e. the label depends on a tuple of samples. If a dataset of ((x1, x2) => z) pairs is prepared, it can then be used to train regression models, decision trees, or networks.
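A minimal sketch of that suggestion (the feature values, interaction counts and the choice of regressor below are all invented placeholders): build one training row per known pair (x1, x2) by concatenating their feature vectors, with the interaction count z as the target.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins: 100 samples with 5 features each, plus some known (i, j, count) interactions
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 5))
interactions = [(0, 1, 12.0), (0, 2, 3.0), (3, 7, 25.0)]   # (sample i, sample j, interaction count)

# One training row per pair: concatenate the two samples' feature vectors
X_pairs = np.array([np.concatenate([features[i], features[j]]) for i, j, _ in interactions])
z = np.array([c for _, _, c in interactions])

reg = RandomForestRegressor().fit(X_pairs, z)
# Predict the interaction count for a new pair, e.g. samples 4 and 5
print(reg.predict(np.concatenate([features[4], features[5]]).reshape(1, -1)))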

Logistic Regression Usecase

Problem: Given a list of movies watched by a specific user, calculate the probability that he will watch any specific movie.
Approach: It seems that it's a typical logistic regression use case (please correct me if I'm wrong)
Initial Logistic Regression code (please correct if something is wrong):
import numpy as np

def sigmoid(x):
    # np.exp works element-wise on arrays, unlike math.exp
    return 1.0 / (1.0 + np.exp(-x))

def gradientDescentLogistic(x, y, theta, alpha, m, numIterations):
    xTrans = x.transpose()
    for i in range(numIterations):
        # The ONLY difference between linear and logistic regression
        # is the definition of the hypothesis
        hypothesis = sigmoid(np.dot(x, theta))
        loss = hypothesis - y
        gradient = np.dot(xTrans, loss) / m
        theta = theta - alpha * gradient
    return theta
Now the parameters here can be different actors, different genres, etc.
I'm unable to figure out how to fit these kinds of parameters into the above code.
Why is it not a use case of LR?
I would say that this is not a typical use case for Logistic Regression. Why? Because you only know what someone watched; you only have positive samples, and you do not know what someone deliberately did not watch. Obviously if I watched movies {m1,m2,m3} then I did not watch M\{m1,m2,m3}, where M is the set of all movies in the history of mankind. But this is not a good assumption: I did not watch most of them because I do not own them, do not know about them, or simply have not yet had time for them. In such a case you can only model this as a one-class problem or a kind of density estimation (I assume you do not have access to any knowledge other than the list of movies seen, so we cannot, for example, do collaborative filtering or other crowd-based analysis).
Why not generate negative samples by hand?
Obviously you could, for example, randomly select movies from some database that were not seen by the user and assume that (s)he does not want to see them. But this is just an arbitrary, abstract assumption, and your model will be heavily biased towards this procedure. For example, if you took all unseen movies as negative samples, then a correct model would simply learn to say "Yes" only for the training set and "No" for everything else. If you randomly sample m movies, it will only learn to distinguish your taste from these m movies, but they could represent anything, in particular movies the user would love to see. To sum up: you can do this, and to be honest it could even work in some particular applications; but from a probabilistic perspective it is not a valid approach, as you build unjustifiable assumptions into the model.
How could I approach this?
So what can you do to handle it in a probabilistic manner? You can, for example, represent your movies as numerical features (some characteristics), and consequently obtain a cloud of points in some space R^d (where d is the number of features extracted). Then you can fit any distribution to it, such as a Gaussian (perhaps a radial one if d is big), a GMM, or any other. This gives you a clear (easy to understand and "defend") model of P(user will watch | x).
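A minimal sketch of that density-estimation idea (the numeric movie features below are invented placeholders): describe each watched movie by a few numeric characteristics, fit a Gaussian mixture to the resulting point cloud, and score an unseen movie by its likelihood under the fitted density.

import numpy as np
from sklearn.mixture import GaussianMixture

# Watched movies as points in R^d (columns: e.g. year, length, some rating -- invented features)
watched = np.array([
    [1999, 136, 8.7],
    [2003, 201, 8.9],
    [2001, 178, 8.8],
    [1994, 142, 9.3],
])
gmm = GaussianMixture(n_components=1, covariance_type="full").fit(watched)

candidate = np.array([[2010, 148, 8.8]])
print(gmm.score_samples(candidate))   # log-density: higher = more like the user's taste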

Can neural networks approximate any function given enough hidden neurons?

I understand that neural networks with any number of hidden layers can approximate nonlinear functions; however, can they approximate:
f(x) = x^2
I can't think of how they could. It seems like a very obvious limitation of neural networks that can potentially limit what they can do. For example, because of this limitation, neural networks probably can't properly approximate many functions used in statistics, like the Exponential Moving Average, or even variance.
Speaking of moving averages, can recurrent neural networks properly approximate them? I understand how a feedforward neural network, or even a single linear neuron, can output a moving average using the sliding-window technique, but how would a recurrent neural network do it without X hidden layers (X being the moving-average size)?
Also, let us assume we don't know the original function f, which happens to take the average of the last 500 inputs and then output a 1 if it's higher than 3, and 0 if it's not. But for a second, pretend we don't know that; it's a black box.
How would a recurrent neural network approximate that? We would first need to know how many timesteps it should have, which we don't. Perhaps an LSTM network could, but even then, what if it's not a simple moving average but an exponential moving average? I don't think even an LSTM can do it.
Even worse, what if the f(x, x1) that we are trying to learn is simply
f(x,x1) = x * x1
That seems very simple and straightforward. Can a neural network learn it? I don't see how.
Am I missing something huge here, or are machine learning algorithms extremely limited? Are there other learning techniques besides neural networks that can actually do any of this?
The key point to understand is compactness:
Neural networks (like any other approximation structure, e.g. polynomials, splines, or Radial Basis Functions) can approximate any continuous function only within a compact set.
In other words the theory states that, given:
A continuous function f(x),
A finite range for the input x, [a,b], and
A desired approximation accuracy ε>0,
then there exists a neural network that approximates f(x) with an approximation error less than ε, everywhere within [a,b].
Regarding your example of f(x) = x^2, yes, you can approximate it with a neural network within any finite range: [-1,1], [0, 1000], etc. To visualise this, imagine approximating f(x) within [-1,1] with a step function. Can you do it on paper? Note that if you make the steps narrow enough, you can achieve any desired accuracy. The way neural networks approximate f(x) is not much different from this.
But again, there is no neural network (or any other approximation structure) with a finite number of parameters that can approximate f(x) = x^2 for all x in [-∞, +∞].
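A quick experiment illustrating both points (a sketch; the network size and training range are arbitrary): fit a small MLP to x^2 on [-1, 1], then evaluate it inside and outside that range.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(2000, 1))
y_train = (x_train ** 2).ravel()

net = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=5000).fit(x_train, y_train)

x_test = np.array([[-0.5], [0.0], [0.5], [3.0], [10.0]])
print(net.predict(x_test))     # close to x^2 inside [-1, 1], badly off at 3 and 10
print((x_test ** 2).ravel())   # true values for comparison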
The question is very legitimate and unfortunately many of the answers show how little practitioners seem to know about the theory of neural networks. The only rigorous theorem that exists about the ability of neural networks to approximate different kinds of functions is the Universal Approximation Theorem.
The UAT states that any continuous function on a compact domain can be approximated by a neural network with only one hidden layer provided the activation functions used are BOUNDED, continuous and monotonically increasing. Now, a finite sum of bounded functions is bounded by definition.
A polynomial is not bounded so the best we can do is provide a neural network approximation of that polynomial over a compact subset of R^n. Outside of this compact subset, the approximation will fail miserably as the polynomial will grow without bound. In other words, the neural network will work well on the training set but will not generalize!
The question is neither off-topic nor does it represent the OP's opinion.
I am not sure why there is such a visceral reaction; I think it is a legitimate question whose answer is hard to find by googling, even though I think it is widely appreciated and repeated out loud. I think in this case you are looking for actual citations showing that a neural net can approximate any function. This recent paper explains it nicely, in my opinion. They also cite the original paper by Barron from 1993 that proved a less general result. The conclusion: a two-layer neural network can represent any bounded-degree polynomial, under certain (seemingly non-restrictive) conditions.
Just in case the link does not work, it is called "Learning Polynomials with Neural Networks" by Andoni et al., 2014.
I understand that neural networks with any number of hidden layers can approximate nonlinear functions; however, can they approximate:
f(x) = x^2
The only way I can make sense of that question is that you're talking about extrapolation. So e.g. given training samples in the range -1 < x < +1 can a neural network learn the right values for x > 100? Is that what you mean?
If you had prior knowledge, that the functions you're trying to approximate are likely to be low-order polynomials (or any other set of functions), then you could surely build a neural network that can represent these functions, and extrapolate x^2 everywhere.
If you don't have prior knowledge, things are a bit more difficult: There are infinitely many smooth functions that fit x^2 in the range -1..+1 perfectly, and there's no good reason why we would expect x^2 to give better predictions than any other function. In other words: If we had no prior knowledge about the function we're trying to learn, why would we want to learn x -> x^2? In the realm of artificial training sets, x^2 might be a likely function, but in the real world, it probably isn't.
To give an example: Let's say the temperature on Monday (t=0) is 0°, on Tuesday it's 1°, on Wednesday it's 4°. We have no reason to believe temperatures behave like low-order polynomials, so we wouldn't want to infer from that data that the temperature next Monday will probably be around 49°.
Also, let us assume we don't know the original function f, which happens to take the average of the last 500 inputs and then output a 1 if it's higher than 3, and 0 if it's not. But for a second, pretend we don't know that; it's a black box.
How would a recurrent neural network approximate that?
I think that's two questions: First, can a neural network represent that function? I.e. is there a set of weights that would give exactly that behavior? It obviously depends on the network architecture, but I think we can come up with architectures that can represent (or at least closely approximate) this kind of function.
Question two: Can it learn this function, given enough training samples? Well, if your learning algorithm doesn't get stuck in a local minimum, sure: if you have enough training samples, any set of weights that doesn't approximate your function gives a training error greater than 0, while a set of weights that fits the function you're trying to learn has a training error of 0. So if you find a global optimum, the network must fit the function.
A network can learn x |-> x * x if it has a neuron that calculates x * x, or, more generally, a node that calculates x**p and learns p. These aren't commonly used, but the statement that "no neural network can learn..." is too strong.
A network with ReLUs and a linear output layer can learn x |-> 2*x, even on an unbounded range of x values. The absolute error will be unbounded, but the proportional error will be bounded. Any function learnt by such a network is piecewise linear, and in particular asymptotically linear.
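For instance (a hand-constructed example of the representability claim, not a trained network; training from data would only approximate these weights), two ReLU units feeding a linear output already represent x |-> 2*x exactly on the whole real line, since 2*relu(x) - 2*relu(-x) = 2x:

import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def net(x):
    # hidden layer: relu(x) and relu(-x); linear output: 2*h1 - 2*h2
    return 2.0 * relu(x) - 2.0 * relu(-x)

for x in [-1000.0, -3.5, 0.0, 7.0, 1e6]:
    print(x, net(x))   # exactly 2*x, for any x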
However, there is a risk with ReLUs: once a ReLU is off for all training examples, it ceases learning. With a large domain, it may turn on for some possible test examples and give an erroneous result. So ReLUs are only a good choice if the test cases are likely to lie within the convex hull of the training set. This is easier to guarantee if the dimensionality is low. One workaround is to prefer LeakyReLU.
One other issue: how many neurons do you need to achieve the approximation you want? Each ReLU or LeakyReLU implements a single change of gradient, so the number needed depends on the maximum absolute value of the second derivative of the objective function, divided by the maximum error to be tolerated.
There are theoretical limitations of neural networks. No neural network can ever learn the function f(x) = x*x, nor can it learn an infinite number of other functions, unless you assume the impractical:
1- an infinite number of training examples
2- an infinite number of units
3- an infinite amount of time to converge
NNs are good at learning low-level pattern-recognition problems (signals that in the end have some statistical pattern that can be represented by some "continuous" function!), but that's it!
No more!
Here's a hint:
Try to build an NN that takes n+1 data inputs (x0, x1, x2, ... xn) and returns true (or 1) if (2 * x0) appears in the rest of the sequence. And good luck.
Functions over infinite domains, especially recursive ones, cannot be learned. They just are!
