I have been reading about the concept of ANNs (artificial neural networks) in order to apply one to my project (credit card fraud detection). Given a set of inputs to the network, say:
A1 - Time to input PIN
A2 - Amount to be withdrawn
A3 - ATM location
A4 - Global behavior (time & date, and the sequence in which a transaction is performed)
The more any of these inputs deviates from the "norm", the greater the weight of that input to the network. Here comes my question: how does the neural network treat a situation where one input's weight, say A1's, is high whilst all the other weights are low?
The input probability density functions combine to form a multidimensional probability distribution (usually an ellipsoid in that many dimensions). The combination of the inputs is a vector, and the probability value at that point in the N-space tells you how likely it is to be real or fake. This works along each of the axes, where all but one input would be zero, as well as out where all the variables have significant values. If all of your inputs have smooth gaussian probability distributions your resulting probability distribution is a hyperellipsoid and you don't really need a neural net.
Using a neural net gets economical when you have a complicated probability density in one or more of the variables, or if combining the variables creates unexpected features (holes and bumps) in the probability density. Then the training of the neural net over a large number of real input combinations and known results tells it what regions of the input space are interesting and what regions are mundane. Again, you could just map them yourself in a big N-dimensional array with high resolution, if you have enough memory, but where's the fun in that? The neural net will also interpolate smoothly between regions, which may make its decisions more fuzzy than the actual probability space (i.e., that's where the accuracy metric drops below 100%).
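As a concrete illustration of the Gaussian/hyperellipsoid case above, here is a minimal sketch that scores a transaction against a multivariate normal model of "normal" behaviour. The feature order, means, and covariances are made-up placeholders, not values from the question:

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical "normal behaviour" statistics, e.g. estimated from past legitimate
# transactions: [A1 PIN time, A2 amount, A3 location score, A4 behaviour score].
mu = np.array([5.0, 120.0, 0.2, 0.1])
cov = np.diag([1.0, 900.0, 0.05, 0.05])

normal_model = multivariate_normal(mean=mu, cov=cov)

def anomaly_score(transaction):
    """Higher score = further from the centre of the 'normal' hyperellipsoid."""
    # The negative log-density grows as the point moves away from the mean,
    # whether one input deviates strongly or several deviate moderately.
    return -normal_model.logpdf(transaction)

print(anomaly_score([25.0, 115.0, 0.20, 0.10]))  # only A1 is unusual
print(anomaly_score([5.2, 118.0, 0.21, 0.09]))   # everything typical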
I am stuck with a supposedly simple problem. I have a mixture of experts system, consisting of multiple neural networks for classification, whose mixture weights are determined by the data as well. For the posterior probability of label y, given data x and K different expert networks, we have:

p(y|x) = sum_{k=1}^{K} p(z_k|x) * p(y|z_k, x)
In this scheme, p(y|z_k,x) are the expert network posterior probabilities, which are softmax functions applied to the network outputs, and p(z_k|x) are the mixture weights.
My problem is the following. Usually, in PyTorch we feed the outputs of the last layer (logits) into the cross-entropy loss function, and PyTorch handles the numerical stability issues with the log-sum-exp trick (How is log_softmax() implemented to compute its value (and gradient) with better speed and numerical stability?). In my case, however, the model output is not the logits but the probabilities directly, due to the nature of the mixture model. Taking the logarithm of the mixture probabilities and feeding it into the NLL loss crashes after a couple of iterations, since some probabilities quickly become very close to 0 and underflow/overflow issues appear. The calculation is numerically very unstable.
In this particular case, what would be the correct way to calculate the CE (or NLL) loss without losing numerical stability?
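One common remedy (a sketch, not the only correct approach) is to never materialise the mixture probabilities at all: keep both the gate and the experts in log space and combine them with torch.logsumexp, so that log p(y|x) = logsumexp_k(log p(z_k|x) + log p(y|z_k,x)) is computed without ever exponentiating small numbers. This assumes you can access the raw (pre-softmax) outputs of both the experts and the gating network:

import torch
import torch.nn.functional as F

def mixture_nll(expert_logits, gate_logits, targets):
    """
    expert_logits: (batch, K, num_classes) raw outputs of the K expert networks
    gate_logits:   (batch, K)              raw outputs of the gating network
    targets:       (batch,)                integer class labels
    """
    # log p(y | z_k, x): log-softmax over classes, per expert
    log_expert = F.log_softmax(expert_logits, dim=-1)           # (batch, K, C)
    # log p(z_k | x): log-softmax over the K experts
    log_gate = F.log_softmax(gate_logits, dim=-1)               # (batch, K)

    # Pick the log-probability of the true label for every expert
    idx = targets.view(-1, 1, 1).expand(-1, log_expert.size(1), 1)
    log_expert_y = log_expert.gather(2, idx).squeeze(-1)        # (batch, K)

    # log p(y | x) = logsumexp_k( log p(z_k|x) + log p(y|z_k,x) )
    log_mixture = torch.logsumexp(log_gate + log_expert_y, dim=-1)  # (batch,)
    return -log_mixture.mean()

All normalisation then happens inside log_softmax and logsumexp, which are exactly the numerically stable primitives PyTorch already uses for its ordinary cross-entropy loss.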
I have just started studying neural networks and I managed to figure out how to derive the equations necessary for backpropagation. I've spent nearly 3 days asking all of my professors and googling everything I can find. My math skills are admittedly poor, but I really want to understand how this particular formula makes sense mathematically. The formula is used to update the weight after the gradient has already been found.
W1 = W0 - L * (dC/dw)
Where:
W1 = new weight
W0 = old weight
L = learning rate
dC/dw = the partial derivative of the cost (error) function with respect to the weight, i.e. one component of the gradient vector of the cost function
What I know so far:
The gradient is a vector of partial derivatives, and the direction of maximum rate of increase is given by the gradient itself. Each partial derivative gives the rate of change in the direction of the variable it is taken with respect to.
dC/dW is one of these partial derivatives.
dC/dW evaluates to a rate of change. Its sign tells us the direction of change. The value itself is the ratio between the change in cost and the change in weight at a particular weight.
Somehow, multiplying dC/dW by the learning rate only takes a small portion of this rate as the change in weight.
What I can't reconcile:
The learning rate is just a scalar without units. How is it possible to just multiply a scalar by a rate and end up with a measurable change in weight? What am I failing to understand here?
Artificial neural networks (ANNs) are based on a concept taken from the human nervous system. The basic unit of the human nervous system is the neuron. To sense a stimulus, these neurons are present throughout the whole body, and each neuron is connected to other neurons in order to transmit messages from that part of the body to the brain. Signal transmission by the neurons is controlled by the concentration of certain chemicals present in the neuron. Normally, the concentration of these chemicals remains in a balanced state and is not disturbed until a stimulus is sensed; hence, a neuron does not transmit a signal to other neurons unless there is a stimulus. However, when a stimulus is sensed (e.g. a person cuts the tip of his finger; a stimulus is sensed at the fingertip by the neurons present there), the concentration of chemicals on the surface of a neuron increases and a signal is transmitted to the next neuron. The nature of the signal and the message encoded inside depend on the change in chemical concentration.
In an ANN, a neuron is a mathematical function or formula, and the weights of a neuron are analogous to the level of chemical concentration in a human neuron. The weights must be adjusted so that a fixed formula can encode all the information needed to make the desired predictions, just as information is encoded in a human neuron through chemical concentration. In order to find the correct weights, the ANN is trained on a large amount of data for the problem it is being built for.
The learning rate is a scalar that usually varies between 0 and 1, both inclusive. Simply put, the learning rate defines the pace at which the weights are updated. The derivative is the rate of change between two values; here, the two points are the predicted values and the real values. For dC/dw you can simply use a cost function as well; it is also known as the responsibility of that particular neuron for the error in the whole network. The formula may vary from layer to layer, and from text to text as well. Here is a link that explains the feed-forward neural network structure in detail; I hope you will understand it. If you are still confused, you may ask further.
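To see that nothing dimensional is going on, it can help to run the update rule on a toy cost whose derivative is known in closed form, e.g. C(w) = (w - 3)^2 with dC/dw = 2(w - 3); the weight, the gradient, and the learning rate are all just numbers, so the product is simply a scaled-down number added to the weight. The values below are arbitrary:

# Toy example: C(w) = (w - 3)**2, so dC/dw = 2 * (w - 3).
def dC_dw(w):
    return 2.0 * (w - 3.0)

w = 10.0            # W0, the old weight
learning_rate = 0.1  # L

for step in range(5):
    grad = dC_dw(w)               # e.g. 14.0 on the first step
    w = w - learning_rate * grad  # W1 = W0 - L * (dC/dw)
    print(step, round(w, 4))

# The weight moves by a fraction (L) of the gradient at each step and
# approaches 3.0, the minimiser of the toy cost.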
I am using the word2vec model to train a neural network and build neural embeddings for finding similar words in the vector space. But my question is about the dimensions of the word and context embeddings (matrices), which we initialise with random numbers (vectors) at the beginning of training, as described here: https://iksinc.wordpress.com/2015/04/13/words-as-vectors/
Let's say we want to display the words {book, paper, notebook, novel} on a graph. First of all we should build a matrix with dimensions 4x2 or 4x3 or 4x4, etc. I know the first dimension of the matrix is the size of our vocabulary |v|. But what about the second dimension of the matrix (the number of the vector's dimensions)? For example, if this is the vector for the word "book", [0.3, 0.01, 0.04], what are these numbers? Do they have any meaning? For example, is the 0.3 related to the relation between the words "book" and "paper" in the vocabulary, the 0.01 to the relation between "book" and "notebook", etc.?
Just like TF-IDF or co-occurrence matrices, where each dimension (column) Y has a meaning: it's a word or document related to the word in row X.
The word2vec model uses a network architecture to represent the input word(s) and most likely associated output word(s).
Assuming there is one hidden layer (as in the example linked in the question), the two matrices introduced represent the weights and biases that allow the network to compute its internal representation of the function mapping the input vector (e.g. “cat” in the linked example) to the output vector (e.g. “climbed”).
The weights of the network are a sub-symbolic representation of the mapping between the input and the output – any single weight doesn’t necessarily represent anything meaningful on its own. It’s the connection weights between all units (i.e. the interactions of all the weights) in the network that gives rise to the network’s representation of the function mapping. This is why neural networks are often referred to as “black box” models – it can be very difficult to interpret why they make particular decisions and how they learn. As such, it's very difficult to say what the vector [0.3,0.01,0.04] represents exactly.
Network weights are traditionally initialised to random values for two main reasons:
It prevents a bias being introduced to the model before training begins
It allows the network to start from different points in the search space after initialisation (helping reduce the impact of local minima)
A network’s ability to learn can be very sensitive to the way its weights are initialised. There are more advanced ways of initialising weights today e.g. this paper (see section: Weights initialization scaling coefficient).
The way in which weights are initialised and the dimension of the hidden layer are often referred to as hyper-parameters and are typically chosen according to heuristics and prior knowledge of the problem space.
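As a small illustration of the random-starting-point idea, here is a sketch of a common initialisation heuristic (Xavier/Glorot-style scaling; this is not necessarily the scheme used by the linked paper or by word2vec implementations):

import numpy as np

def init_layer(n_in, n_out, seed=0):
    # Xavier/Glorot-style scaling keeps the variance of the activations roughly
    # constant from layer to layer; the exact coefficient is a heuristic.
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

# e.g. a vocabulary of 4 words embedded into 3 hidden units,
# like the 4x3 matrix discussed in the question:
W_input_hidden = init_layer(4, 3)
print(W_input_hidden)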
I have wondered the same thing and put in a vector like (1 0 0 0 0 0...) to see what terms it was nearest to. The answer is that the results returned didn't seem to cluster around any particular meaning, but were just kind of random. This was using Mikolov's 300-dimensional vectors trained on Google News.
Look up NNSE semantic vectors for a vector space where the individual dimensions do seem to carry specific human-graspable meanings.
I'm reading the paper below and I have some trouble understanding the concept of negative sampling.
http://arxiv.org/pdf/1402.3722v1.pdf
Can anyone help, please?
The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have
v_c . v_w
-------------------
sum_i(v_ci . v_w)
The numerator is basically the similarity between the words c (the context) and w (the target word). The denominator computes the similarity of all other contexts ci and the target word w. Maximising this ratio ensures that words which appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts ci. Negative sampling is one of the ways of addressing this problem: just select a couple of contexts ci at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measured by their dot product) than the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), instead of all other words in the language. This makes word2vec much, much faster to train.
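A rough numpy sketch of that counting argument (toy random vectors and a hypothetical vocabulary size, not trained embeddings): the full ratio needs one dot product per vocabulary word for the denominator, while the sampled version only touches the observed context plus a handful of random ones.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 50

# Toy embedding matrices for target words and context words.
W_target = rng.normal(scale=0.1, size=(vocab_size, dim))
W_context = rng.normal(scale=0.1, size=(vocab_size, dim))

w = W_target[42]      # target word, e.g. "cat"
c = W_context[7]      # observed context word, e.g. "food"

# Full denominator: one dot product per word in the vocabulary.
full_denominator = np.sum(np.exp(W_context @ w))        # 10,000 dot products

# Negative sampling flavour: the observed context plus a few random contexts.
k = 5
neg_ids = rng.integers(0, vocab_size, size=k)
sampled_denominator = np.exp(c @ w) + np.sum(np.exp(W_context[neg_ids] @ w))

print(full_denominator, sampled_denominator)

The actual SGNS objective replaces the ratio with sigmoid terms over the positive pair and the k negatives, but the saving is the same: k + 1 dot products instead of |V|.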
Computing the softmax (the function used to determine which words are similar to the current target word) is expensive, since it requires summing over all words in V (the denominator), which is generally very large.
What can be done?
Different strategies have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact but modify its architecture to improve its efficiency (e.g. hierarchical softmax). Sampling-based approaches, on the other hand, completely do away with the softmax layer and instead optimise some other loss function that approximates the softmax (they do this by approximating the normalization in the denominator of the softmax with some other loss that is cheap to compute, like negative sampling).
The loss function in word2vec, for a target word w_I and an observed context word w_O, is something like:

J = -log p(w_O | w_I) = -log( exp(u_wO . v_wI) / sum over all words w in V of exp(u_w . v_wI) )

where the v are input (word) vectors and the u are output (context) vectors. The logarithm lets this decompose into:

J = -u_wO . v_wI + log( sum over all words w in V of exp(u_w . v_wI) )

With some math and gradient formulas (see more details at 6) it can be converted to:

J = -log sigma(u_wO . v_wI) - sum over the k sampled words w_neg of log sigma(-u_wneg . v_wI)
As you can see, it has been converted to a binary classification task (y=1: positive class, y=0: negative class). As we need labels to perform our binary classification task, we designate all context words c as true labels (y=1, positive samples), and k words randomly selected from the corpus as false labels (y=0, negative samples).
Look at the following paragraph. Assume our target word is "Word2vec". With a window of 3, our context words are: The, widely, popular, algorithm, was, developed. These context words are considered positive labels. We also need some negative labels. We randomly pick some words from the corpus (produce, software, Collobert, margin-based, probabilistic) and consider them negative samples. This technique of picking random examples from the corpus is called negative sampling.
References:
(1) C. Dyer, "Notes on Noise Contrastive Estimation and Negative Sampling", 2014
(2) http://sebastianruder.com/word-embeddings-softmax/
I wrote a tutorial article about negative sampling here.
Why do we use negative sampling? -> to reduce computational cost
The cost function for vanilla Skip-Gram (SG) and Skip-Gram negative sampling (SGNS) looks like this:
Note that T is the number of all vocabs. It is equivalent to V. In other words, T = V.
The probability distribution p(w_t+j|w_t) in SG is computed for all V vocabs in the corpus with a softmax over dot products:

p(w_t+j | w_t) = exp(c_{w_t+j} . w_t) / sum over all V context vectors c_i of exp(c_i . w_t)
V can easily exceed tens of thousands when training a Skip-Gram model. The probability needs to be computed V times, making it computationally expensive. Furthermore, the normalization factor in the denominator requires an extra V computations.
On the other hand, the probability distribution in SGNS is computed with:
c_pos is the word vector for the positive word, and W_neg holds the word vectors for all K negative samples in the output weight matrix. With SGNS, the probability needs to be computed only K + 1 times, where K is typically between 5 and 20. Furthermore, no extra iterations are necessary to compute the normalization factor in the denominator.
With SGNS, only a fraction of the weights are updated for each training sample, whereas SG updates all of the millions of weights for each training sample.
How does SGNS achieve this? -> by transforming the multi-class classification task into a binary classification task.
With SGNS, word vectors are no longer learned by predicting the context words of a center word. Instead, the model learns to differentiate the actual context words (positive) from randomly drawn words (negative) taken from the noise distribution.
In real life, you don't usually observe regression in the context of random words like Gangnam-Style or pimples. The idea is that if the model can distinguish between likely (positive) pairs and unlikely (negative) pairs, good word vectors will be learned.
In the above figure, the current positive word-context pair is (drilling, engineer). K=5 negative samples are randomly drawn from the noise distribution: minimized, primary, concerns, led, page. As the model iterates through the training samples, the weights are optimized so that the probability for the positive pair outputs p(D=1|w,c_pos) ≈ 1, and the probabilities for the negative pairs output p(D=1|w,c_neg) ≈ 0.
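A small sketch of that binary-classification view (random placeholder vectors, not trained embeddings): each (word, context) pair is scored with a sigmoid of the dot product, and training pushes the positive pair towards 1 and the K sampled negatives towards 0.

import numpy as np

rng = np.random.default_rng(1)
dim = 50

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w = rng.normal(scale=0.1, size=dim)            # centre word, e.g. "drilling"
c_pos = rng.normal(scale=0.1, size=dim)        # positive context, e.g. "engineer"
C_neg = rng.normal(scale=0.1, size=(5, dim))   # K=5 negative samples

p_pos = sigmoid(c_pos @ w)    # p(D=1 | w, c_pos), pushed towards 1 during training
p_neg = sigmoid(C_neg @ w)    # p(D=1 | w, c_neg), pushed towards 0 during training

# SGNS loss for this one training example: K + 1 = 6 dot products, not V.
loss = -np.log(p_pos) - np.sum(np.log(1.0 - p_neg))
print(p_pos, p_neg, loss)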
I understand that neural networks with any number of hidden layers can approximate nonlinear functions. However, can they approximate:
f(x) = x^2
I can't think of how they could. It seems like a very obvious limitation of neural networks that can potentially limit what they can do. For example, because of this limitation, neural networks probably can't properly approximate many functions used in statistics, like the exponential moving average, or even the variance.
Speaking of moving average, can recurrent neural networks properly approximate that? I understand how a feedforward neural network or even a single linear neuron can output a moving average using the sliding window technique, but how would recurrent neural networks do it without X amount of hidden layers (X being the moving average size)?
Also, let us assume we don't know the original function f, which happens to get the average of the last 500 inputs, and then output a 1 if it's higher than 3, and 0 if it's not. But for a second, pretend we don't know that, it's a black box.
How would a recurrent neural network approximate that? We would first need to know how many timesteps it should have, which we don't. Perhaps an LSTM network could, but even then, what if it's not a simple moving average but an exponential moving average? I don't think even an LSTM can do it.
Even worse still, what if f(x,x1) that we are trying to learn is simply
f(x,x1) = x * x1
That seems very simple and straightforward. Can a neural network learn it? I don't see how.
Am I missing something huge here or are machine learning algorithms extremely limited? Are there other learning techniques besides neural networks that can actually do any of this?
The key point to understand is the word compact:
Neural networks (like any other approximation structure, e.g. polynomials, splines, or radial basis functions) can approximate any continuous function only within a compact set.
In other words the theory states that, given:
A continuous function f(x),
A finite range for the input x, [a,b], and
A desired approximation accuracy ε>0,
then there exists a neural network that approximates f(x) with an approximation error less than ε, everywhere within [a,b].
Regarding your example of f(x) = x^2, yes, you can approximate it with a neural network within any finite range: [-1, 1], [0, 1000], etc. To visualise this, imagine that you approximate f(x) within [-1, 1] with a step function. Can you do it on paper? Note that if you make the steps narrow enough, you can achieve any desired accuracy. The way neural networks approximate f(x) is not much different from this.
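If you'd rather do it on a computer than on paper, here is the same thought experiment as a sketch (a plain step function, not an actual neural network): approximate x^2 on [-1, 1] with N equal-width steps and watch the worst-case error shrink as N grows.

import numpy as np

def step_approximation(x, n_steps):
    """Piecewise-constant approximation of x**2 on [-1, 1]."""
    edges = np.linspace(-1.0, 1.0, n_steps + 1)
    centres = (edges[:-1] + edges[1:]) / 2.0
    # Assign each x to its step and return the value of x**2 at the step centre.
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_steps - 1)
    return centres[idx] ** 2

x = np.linspace(-1.0, 1.0, 10_001)
for n in (10, 100, 1000):
    err = np.max(np.abs(step_approximation(x, n) - x**2))
    print(n, err)

# The maximum error keeps falling as the steps get narrower, which is the same
# mechanism a (large enough) network exploits on a compact interval.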
But again, there is no neural network (or any other approximation structure) with a finite number of parameters that can approximate f(x) = x^2 for all x in [-∞, +∞].
The question is very legitimate and unfortunately many of the answers show how little practitioners seem to know about the theory of neural networks. The only rigorous theorem that exists about the ability of neural networks to approximate different kinds of functions is the Universal Approximation Theorem.
The UAT states that any continuous function on a compact domain can be approximated by a neural network with only one hidden layer provided the activation functions used are BOUNDED, continuous and monotonically increasing. Now, a finite sum of bounded functions is bounded by definition.
A polynomial is not bounded so the best we can do is provide a neural network approximation of that polynomial over a compact subset of R^n. Outside of this compact subset, the approximation will fail miserably as the polynomial will grow without bound. In other words, the neural network will work well on the training set but will not generalize!
The question is neither off-topic nor does it represent the OP's opinion.
I am not sure why there is such a visceral reaction; I think it is a legitimate question whose answer is hard to find by googling, even though I think it is widely appreciated and repeated out loud. I think in this case you are looking for actual citations showing that a neural net can approximate any function. This recent paper explains it nicely, in my opinion. They also cite the original paper by Barron from 1993, which proved a less general result. The conclusion: a two-layer neural network can represent any bounded-degree polynomial, under certain (seemingly non-restrictive) conditions.
Just in case the link does not work, it is called "Learning Polynomials with Neural Networks" by Andoni et al., 2014.
I understand neural networks with any number of hidden layers can approximate nonlinear functions, however, can it approximate:
f(x) = x^2
The only way I can make sense of that question is that you're talking about extrapolation. So e.g. given training samples in the range -1 < x < +1 can a neural network learn the right values for x > 100? Is that what you mean?
If you had prior knowledge that the functions you're trying to approximate are likely to be low-order polynomials (or any other set of functions), then you could surely build a neural network that can represent these functions and extrapolate x^2 everywhere.
If you don't have prior knowledge, things are a bit more difficult: There are infinitely many smooth functions that fit x^2 in the range -1..+1 perfectly, and there's no good reason why we would expect x^2 to give better predictions than any other function. In other words: If we had no prior knowledge about the function we're trying to learn, why would we want to learn x -> x^2? In the realm of artificial training sets, x^2 might be a likely function, but in the real world, it probably isn't.
To give an example: Let's say the temperature on Monday (t=0) is 0°, on Tuesday it's 1°, on Wednesday it's 4°. We have no reason to believe temperatures behave like low-order polynomials, so we wouldn't want to infer from that data that the temperature next Monday will probably be around 49°.
Also, let us assume we don't know the original function f, which happens to get the average of the last 500 inputs, and then output a 1 if it's higher than 3, and 0 if it's not. But for a second, pretend we don't know that, it's a black box.
How would a recurrent neural network approximate that?
I think that's two questions: First, can a neural network represent that function? I.e. is there a set of weights that would give exactly that behavior? It obviously depends on the network architecture, but I think we can come up with architectures that can represent (or at least closely approximate) this kind of function.
Question two: Can it learn this function, given enough training samples? Well, if your learning algorithm doesn't get stuck in a local minimum, sure: if you have enough training samples, any set of weights that doesn't approximate your function gives a training error greater than 0, while a set of weights that fits the function you're trying to learn has a training error of 0. So if you find a global optimum, the network must fit the function.
A network can learn x|->x * x if it has a neuron that calculates x * x. Or more generally, a node that calculates x**p and learns p. These aren't commonly used, but the statement that "no neural network can learn..." is too strong.
A network with ReLUs and a linear output layer can learn x|->2*x, even on an unbounded range of x values. The error will be unbounded, but the proportional error will be bounded. Any function learnt by such a network is piecewise linear, and in particular asymptotically linear.
However, there is a risk with ReLUs: once a ReLU is off for all training examples, it ceases learning. With a large domain, it will turn on for some possible test examples and give an erroneous result. So ReLUs are only a good choice if test cases are likely to be within the convex hull of the training set. This is easier to guarantee if the dimensionality is low. One workaround is to prefer LeakyReLU.
One other issue: how many neurons do you need to achieve the approximation you want? Each ReLU or LeakyReLU implements a single change of gradient, so the number needed depends on the maximum absolute value of the second derivative of the objective function, divided by the maximum error to be tolerated.
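To make that estimate concrete under a simple assumption (uniform segments and piecewise-linear interpolation, where the worst-case error per segment is max|f''| * h^2 / 8), here is a rough count of the gradient changes, and hence ReLUs, that f(x) = x^2 would need on [-1, 1]:

import math

def relus_needed(second_derivative_max, a, b, tolerance):
    # Piecewise-linear interpolation error on a segment of width h is at most
    # second_derivative_max * h**2 / 8, so solve for the segment width h and
    # count how many segments (≈ gradient changes) are needed to cover [a, b].
    h = math.sqrt(8.0 * tolerance / second_derivative_max)
    return math.ceil((b - a) / h)

# f(x) = x**2 on [-1, 1]: f''(x) = 2 everywhere.
for tol in (1e-1, 1e-2, 1e-3):
    print(tol, relus_needed(2.0, -1.0, 1.0, tol))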
There are theoretical limitations of Neural Networks. No neural network can ever learn the function f(x) = x*x
Nor can it learn an infinite number of other functions, unless you assume the impractical:
1- an infinite number of training examples
2- an infinite number of units
3- an infinite amount of time to converge
NNs are good at learning low-level pattern-recognition problems (signals that in the end have some statistical pattern that can be represented by some "continuous" function), but that's it!
No more!
Here's a hint:
Try to build a NN that takes n+1 data inputs (x0, x1, x2, ..., xn) and returns true (or 1) if (2 * x0) is in the rest of the sequence. And good luck.
Infinite functions, especially those that are recursive, cannot be learned. They just are!