I've recently learned about auto-encoders and plan to construct one to use as part of a recommender system with implicit feedback.
Based on how classic autoencoders work, it seems they can be used to reconstruct vectors whose components are not necessarily 0 or 1. However, all the introductory materials out there seem to suggest that autoencoders operate on binary vectors, x ∈ [0,1]^d, as in here, or section 2.2 in this paper.
To use autoencoders for non-binary vectors, it seems to me that the only difference is that an L2 error function should be used instead of cross-entropy, which is suited to the binary case.
I'd appreciate it if someone could clarify this for me.
You are confusing the notation.
x ∈ [0, 1]^d
means "x belongs to the d-dimensional unit hypercube". To say "x is a binary vector of length d" you would write
x ∈ {0, 1}^d
Notice the different brackets: [0, 1] is an interval, not a set of 2 elements.
Thus no one is claiming that an autoencoder requires binary input, and it does not: it is defined on the whole R^d space (however, for various reasons it is easier to work with values from some limited subset, such as the [0, 1] hypercube, for which we have quite good initialization heuristics).
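For what it's worth, here is a minimal sketch of that idea (using PyTorch, with made-up layer sizes and dummy data): an autoencoder for real-valued vectors trained with an MSE (L2) reconstruction loss rather than cross-entropy, and with no sigmoid on the output so reconstructions can live anywhere in R^d.

    import torch
    import torch.nn as nn

    d, hidden = 32, 8  # hypothetical input and bottleneck sizes

    autoencoder = nn.Sequential(
        nn.Linear(d, hidden),   # encoder
        nn.ReLU(),
        nn.Linear(hidden, d),   # decoder; no sigmoid, outputs are unconstrained reals
    )

    loss_fn = nn.MSELoss()      # L2 reconstruction error for real-valued targets
    optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

    x = torch.randn(64, d)      # a batch of real-valued vectors (dummy data)
    for _ in range(100):
        x_hat = autoencoder(x)
        loss = loss_fn(x_hat, x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()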
I've been trying to understand self-attention, but nothing I've found explains the concept well at a high level.
Let's say we use self-attention in an NLP task, so our input is a sentence.
Then self-attention can be used to measure how "important" each word in the sentence is for every other word.
The problem is that I do not understand how that "importance" is measured. Important for what?
What exactly is the goal vector the weights in the self-attention algorithm are trained against?
Connecting language with its underlying meaning is called grounding. A sentence like "The ball is on the table" results in an image which can be reproduced with multimodal learning. Multimodal means that different kinds of words are available, for example events, action words, subjects and so on. A self-attention mechanism works by mapping input vectors to output vectors, with a neural network between them. The output vector of the neural network refers to the grounded situation.
Let us make a short example. We need a 300x200 pixel image, we need a sentence in natural language, and we need a parser. The parser works in both directions: it can convert text to an image, meaning the sentence "The ball is on the table" gets converted into the 300x200 image, but it is also possible to parse a given image and extract the natural-language sentence back. Self-attention learning is a bootstrapping technique to learn and use this grounded relationship, that is, to verify existing language models, to learn new ones, and to predict future system states.
This question is old now but I came across it so I figured I should update others as my own understanding has increased.
Attention simply refers to some operation that takes the output and combines it with some other information. Typically this just happens by taking the dot product of the output with some other vector so it can "attend" to it in some way.
Self-attention combines the output with other parts of the input (hence the "self" part). Again, the combination usually occurs via the dot product between the vectors.
Finally how is attention (or self-attention) trained?
Let's call Z our output, W our weight matrix and X our input (we'll use # as the matrix-multiplication symbol).
Z = X^T # W^T # X
In NLP we compare Z to whatever we want the resulting output to be. In machine translation, for example, that is the sentence in the other language. We can compare the two with the average cross-entropy loss over each predicted word. Finally we can update W with backpropagation.
How do we see what is important? We can look at the magnitudes of Z to see after the attention what words were most "attended" to.
This is a slightly simplified example, as it only has one weight matrix and typically the inputs are embedded, but I think it still highlights some of the necessary details concerning attention.
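To make that toy formulation concrete, here is a minimal NumPy sketch of the single-weight-matrix version described above; the sizes and data are made up, and a softmax is added to turn the raw scores into weights, which the short formula above glosses over.

    import numpy as np

    def softmax(a, axis=-1):
        a = a - a.max(axis=axis, keepdims=True)
        e = np.exp(a)
        return e / e.sum(axis=axis, keepdims=True)

    n_words, d = 4, 6                   # toy sentence: 4 "words", 6-dim embeddings
    X = np.random.randn(n_words, d)     # input, one row per word
    W = np.random.randn(d, d)           # the single trainable weight matrix

    scores = X @ W @ X.T                # how much each word relates to every other word
    weights = softmax(scores, axis=-1)  # normalized attention weights per word
    Z = weights @ X                     # output: each word as a weighted mix of all words

    print(weights.round(2))             # inspect which words were most "attended" to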
Here is a useful resource with visualizations for more information about attention.
Here is another resource with visualizations for more about attention in transformers, specifically self-attention.
I was going through a video in Udacity's Intro to AI class and I can't seem to wrap my head around one idea.
It is stated that for a string of length n, 2^(n-1) segmentations are possible. When we take the Naive Bayes assumption, the best segmentation s* can be defined as the one that maximizes
∏_i P(w_i)
It is possible to write the best segmentation as:
s* = argmax_s P(first_word) * s*(rest_of_words)
I understand why the above is true. The instructor said that due to the above equation we do not have to enumerate all 2^(n-1) cases. I am not able to understand the reason for this.
I also understand that finding P(single_word) is simpler than learning the same probability for n-grams, and that this helps computationally too.
Since we are working with single words, we have to choose one word at a time and not all their combinations, thus reducing the search space. Consider the string:
"Iliketennis"
The string has 11 chars, thus 2^10 = 1024 possible segmentations. If we start looking at the most probable first word, it could be:
"I", "Il", "Ili", "Ilik" and so on. 11 possible cases. Now that we have all the possible first words, we look for the most probable:
P("I")=0.4,
P("Il")=0.0001,
P("Ili")=0.002,
P("Ilik")=0.00003
...
and so on.
Finding out that the most probable is "I", we take it as the first word and now we can focus on the remaining 10 chars/cases:
"liketennis"
Repeating the same process, you now have 10 possible cases for the next word, with probabilities:
P("l")=0.05,
P("lI")=0.0001,
P("lik")=0.0002,
P("lik")=0.00003
P("like")=0.3
...
and so on.
So we pick "like". Now the search is repeated for the last 6 chars. Without writing again the process, "tennis" is picked up and no chars are left, so the segmentation is ended.
Since we have made an analysis word-wise, the possibilities we have considered are
11+10+6=27
much, much less than spanning all 1024 possible splits.
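As an illustration of that word-by-word search, here is a minimal Python sketch; the unigram probabilities are made up for the example, and it commits greedily to the most probable first word exactly as described above (a full segmenter would also score the remainder rather than committing immediately).

    # Toy unigram probabilities (made up); unknown strings get a tiny probability.
    P = {"i": 0.4, "like": 0.3, "tennis": 0.2, "l": 0.05, "ten": 0.01}

    def prob(word):
        return P.get(word.lower(), 1e-9)

    def greedy_segment(text):
        """Repeatedly pick the most probable first word, then continue on the rest."""
        words = []
        while text:
            candidates = [text[:i] for i in range(1, len(text) + 1)]  # all prefixes
            best = max(candidates, key=prob)
            words.append(best)
            text = text[len(best):]
        return words

    print(greedy_segment("Iliketennis"))  # ['I', 'like', 'tennis']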
I suggest this video by mathematicalmonk: https://youtu.be/qX7n53NWYI4?t=9m43s
He explains that without the conditional independence assumption (Naive Bayes), you need many more samples to estimate the probabilities when you learn from data. But if you assume independence between the features (even if the assumption is incorrect), you can estimate the probability distribution with less training data.
Why? Let's make it simple. Without the naive assumption, the probability of a 2-dimensional feature vector for a prediction y would be:
P(x_1, x_2 | y) = P(x_1 | y) * P(x_2 | x_1, y)
By assuming only binary values for the features x_1 and x_2, you need to store these values per y, learnt from sample data:
P(x_1=0|y), P(x_1=1|y), P(x_2=0|x_1=0,y), P(x_2=0|x_1=1,y), P(x_2=1|x_1=0,y), P(x_2=1|x_1=1,y)
In other words, you need to store 6 parameters per y. You can generalize it to a d-dimensional binary feature vector:
P(x_1, ..., x_d | y) = P(x_1 | y) * P(x_2 | x_1, y) * ... * P(x_d | x_1, ..., x_{d-1}, y), which needs on the order of 2^d parameters per y.
If you take the naive assumption and assume these features are independent given y, you get this formula instead:
P(x_1, x_2 | y) = P(x_1 | y) * P(x_2 | y)
which means you only need to store these parameters per y, in order to predict all possible X:
P(x_1=0|y), P(x_1=1|y), P(x_2=0|y), P(x_2=1|y)
Or, generalizing to d dimensions:
P(x_1, ..., x_d | y) = ∏_{i=1}^{d} P(x_i | y), which needs only 2d parameters per y.
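A tiny Python sketch of that parameter-count gap, using the counts derived above (the chosen values of d are arbitrary):

    def full_joint_params(d):
        # 2 + 4 + ... + 2^d conditional probabilities (chain rule, no independence)
        return sum(2 ** i for i in range(1, d + 1))

    def naive_bayes_params(d):
        # two values P(x_i=0|y) and P(x_i=1|y) per feature, assuming independence
        return 2 * d

    for d in (2, 5, 10, 20):
        print(d, full_joint_params(d), naive_bayes_params(d))
    # d=2: 6 vs 4;  d=5: 62 vs 10;  d=10: 2046 vs 20;  d=20: 2097150 vs 40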
I am new to machine learning and AI and started with NNs recently.
I already got some information here on Stack Overflow, but at the moment I don't understand the logic behind all the information I've gathered.
Let's take 4 nominal (but not ordinal) values [A, B, C, D] and 2 numerical values that are already normalized [0.35, 0.55] - so 2 input neurons, one for the nominal value and one for the numerical value.
In the NN literature I mostly see that you have to use 4 input neurons for the encoding. But I don't need to predict those nominal values; I have only one output neuron that represents, at most, the kind of relationship I would get from an expert system with rules.
If I normalized them to [0.2, 0.4, 0.6, 0.8], for example, wouldn't the NN be able to distinguish between them? For the NN it's only a number, isn't it?
Naive approach and thinking:
A with 0.35 numerical leads to ideal 1.
B with 0.55 numerical leads to ideal 0.
C with 0.35 numerical leads to ideal 0.
D with 0.55 numerical leads to ideal 1.
Is there a mistake in my way of thinking about this approach?
Additional info (edit):
Those nominal values are included in the decision making (their significance, if measured with statistics tools, comes from combining them with the numerical values), depending on whether they are true or not. I know they can be encoded as binary, but the list of nominal values is a little bit larger.
Other example:
Symptom A with blood test 1 leads to diagnosis X (the ideal)
Symptom B with blood test 1 leads to diagnosis Y (the ideal)
Actually, expert systems are used for this. Symptoms are nominal values, but in combination with the blood test value you get the diagnosis. The main question, finally: do I have to encode the symptoms in a binary way, or can I replace the symptoms with numbers? And if I can't replace them with numbers, why is the binary representation the only way to use them in an NN?
INPUTS
Theoretically it doesn't really matter how you encode your inputs. As long as different samples are represented by different points in the input space, it is possible to separate them with a line - and that's what the input layer (if it's linear) is doing: it combines the inputs linearly. However, the way the data is laid out in the input space can have a huge impact on convergence time during learning. A simple way to see this: imagine a set of lines crossing the origin in 2D space. If your data is scattered around the origin, then it is likely that some of these lines will separate the data into parts, and few "moves" will be required, especially if the data is linearly separable. On the other hand, if your input data is dense and far from the origin, then most of the initial discrimination lines won't even "hit" the data, so a large number of weight updates will be required to reach the data, and then a large number of precise steps to "cut" it into the initial categories.
OUTPUTS
If you have categories, then encoding them as binary is quite important. Imagine that you have three categories: A, B and C. If you encode them with three neurons as 1;0;0, 0;1;0 and 0;0;1, then during learning, and later with noisy data, a point about which the network is "not sure" can end up as 0.5;0.0;0.5 on the output layer. That makes sense if it is really something conceptually between A and C, but surely not B. If instead you chose one output neuron and encoded A, B and C as 1, 2 and 3, then in the same situation the network would output the average of 1 and 3, which gives you 2! So the answer would be "definitely B" - clearly wrong!
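As a small illustration of the one-hot output encoding described above (the category list and the helper function are made up for the example):

    import numpy as np

    categories = ["A", "B", "C"]

    def one_hot(label, categories):
        """Encode a category as a binary vector with a single 1."""
        vec = np.zeros(len(categories))
        vec[categories.index(label)] = 1.0
        return vec

    print(one_hot("A", categories))  # [1. 0. 0.]
    print(one_hot("C", categories))  # [0. 0. 1.]

    # An ambiguous network output like [0.5, 0.0, 0.5] still reads as
    # "somewhere between A and C", whereas the scalar encoding 1/2/3 would
    # average out to 2 and wrongly claim "definitely B".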
Reference:
ftp://ftp.sas.com/pub/neural/FAQ.html
So I read a paper that said that processing your dataset correctly can increase LibSVM classification accuracy dramatically. I'm using the Weka implementation and would like some help making sure my dataset is optimal.
Here are my (example) attributes:
Power Numeric (real numbers, range is from 0 to 1.5132, 9000+ unique values)
Voltage Numeric (similar to Power)
Light Numeric (0 and 1 are the only 2 possible values)
Day Numeric (1 through 20 are the possible values, equal number of each value)
Range Nominal {1,2,3,4,5} <----these are the classes
My question is: which Weka pre-processing filters should I apply to make this dataset more effective for LibSVM?
Should I normalize and/or standardize the Power and Voltage data values?
Should I use a Discretization filter on anything?
Should I be binning the Power/Voltage values into a much smaller number of bins?
Should I make the Light value Binary instead of numeric?
Should I normalize the Day values? Does it even make sense to do that?
Should I be using the Nominal to Binary filter, or a Nominal-to-something-else filter, for the class attribute "Range"?
Please advise on these questions and anything else you think I might have missed...
Thanks in advance!!
Normalization is very important, as it influences the concept of distance which is used by SVM. The two main approaches to normalization are:
Scale each input dimension to the same interval, for example [0, 1]. This is by far the most common approach. It is necessary to prevent some input dimensions from completely dominating others. It is recommended by the LIBSVM authors in their beginner's guide (Appendix B has examples).
Scale each instance to a given length. This is common in text mining / computer vision.
As to handling types of inputs:
Continuous: no work needed, SVM works on these implicitly.
Ordinal: treat as continuous variables. For example cold, lukewarm, hot could be modeled as 1, 2, 3 without implicitly defining an unnatural structure.
Nominal: perform one-hot encoding, e.g. for an input with N levels, generate N new binary input dimensions. This is necessary because you must avoid implicitly defining a varying distance between nominal levels. For example, modelling cat, dog, bird as 1, 2 and 3 implies that a dog and bird are more similar than a cat and bird which is nonsense.
Normalization must be done after substituting inputs where necessary.
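For what it's worth, here is a minimal plain-NumPy sketch of that recipe (the column names and data are made up): scale each continuous dimension to [0, 1] and one-hot encode a nominal input, so no artificial ordering or distance is introduced.

    import numpy as np

    power   = np.array([0.20, 1.10, 0.75, 1.40])       # continuous
    voltage = np.array([231.0, 229.5, 230.2, 228.9])   # continuous
    animal  = np.array(["cat", "dog", "bird", "cat"])  # nominal, 3 levels

    def min_max(v):
        """Scale a continuous dimension to [0, 1]."""
        return (v - v.min()) / (v.max() - v.min())

    def one_hot(values):
        """Nominal -> N binary dimensions, one per level (no artificial ordering)."""
        levels = sorted(set(values))
        return np.array([[1.0 if v == lvl else 0.0 for lvl in levels] for v in values])

    # Final input matrix: every dimension (dummies included) now lives in [0, 1].
    X = np.column_stack([min_max(power), min_max(voltage), one_hot(animal)])
    print(X)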
To answer your questions:
Should I normalize and/or standardize the Power and Voltage data values?
Yes, scale all (final) input dimensions to the same interval (including the dummies!).
Should I use a Discretization filter on anything?
No.
Should I be binning the Power/Voltage values into a much smaller number of bins?
No. Treat them as continuous variables (e.g. one input each).
Should I make the Light value Binary instead of numeric?
No, SVM has no concept of binary variables and treats everything as numeric. So converting it will just lead to an extra type-cast internally.
Should I normalize the Day values? Does it even make sense to do that?
If you want to use 1 input dimension, you must normalize it just like all others.
Should I be using the Nominal to Binary or Nominal to some thing else filter for the classes "Range"?
Nominal to binary, using one-hot encoding.
I have read through a lot of papers and understand the basic concept of a support vector machine at a very high level. You give it a training input vector with a set of features (let's call it x, and let's say we're talking about text classification), and based on how the "optimization function" evaluates this input vector, the text associated with x is classified into one of two pre-defined classes (this is only the case for binary classification).
So my first question is: all the papers say that this training input vector x is first mapped to a higher (maybe infinite) dimensional space. What does this mapping achieve, and why is it required? Let's say the input vector x has 5 features; who decides which "higher dimension" x is going to be mapped to?
Second question is about the following optimization equation:
min (1/2) w^T w + C Σ_{i=1..n} ξ_i
So I understand that w has something to do with the margin of the hyperplane from the support vectors in the graph, and I know that C is some sort of penalty, but I don't know what it is a penalty for. Also, what does ξ_i represent in this case?
A simple explanation of the second question would be much appreciated as I have not had much luck understanding it by reading technical papers.
When they talk about mapping to a higher-dimensional space, they mean that the kernel accomplishes the same thing as mapping the points to a higher-dimensional space and then taking dot products there. SVMs are fundamentally a linear classifier, but if you use kernels, they're linear in a space that's different from the original data space.
To be concrete, let's talk about the kernel
K(x, y) = (xy + 1)^2 = (xy)^2 + 2xy + 1,
where x and y are each real numbers (one-dimensional). Note that
(x^2, sqrt(2) x, 1) • (y^2, sqrt(2) y, 1) = x^2 y^2 + 2 x y + 1
has the same value. So K(x, y) = phi(x) • phi(y), where phi(a) = (a^2, sqrt(2) a, 1), and doing an SVM with this kernel (the inhomogeneous polynomial kernel of degree 2) is the same as if you first mapped your 1d points into this 3d space and then used a linear kernel.
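A quick numeric check of that kernel / feature-map equivalence (plain Python, one-dimensional x and y as in the example; the chosen values are arbitrary):

    import math

    def K(x, y):
        """Inhomogeneous polynomial kernel of degree 2 on 1-d inputs."""
        return (x * y + 1) ** 2

    def phi(a):
        """The explicit 3-d feature map this kernel corresponds to."""
        return (a ** 2, math.sqrt(2) * a, 1.0)

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x, y = 0.7, -1.3
    print(K(x, y))              # kernel value computed directly
    print(dot(phi(x), phi(y)))  # identical value via the explicit mapping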
The popular Gaussian RBF kernel function is equivalent to mapping your points into an infinite-dimensional Hilbert space.
You're the one who decides what feature space it's mapped into, when you pick a kernel. You don't necessarily need to think about the explicit mapping when you do that, though, and it's important to note that the data is never actually transformed into that high-dimensional space explicitly - then infinite-dimensional points would be hard to represent. :)
The ξ_i are the "slack variables". Without them, SVMs would never be able to account for training sets that aren't linearly separable -- which most real-world datasets aren't. The ξ in some sense are the amount you need to push data points on the wrong side of the margin over to the correct side. C is a parameter that determines how much it costs you to increase the ξ (that's why it's multiplied there).
1) The mapping to the higher-dimensional space happens through the kernel mechanism. However, when evaluating a test sample, the higher-dimensional space need not be explicitly computed. (Clearly this must be the case, because we cannot represent infinite dimensions on a computer.) For instance, radial basis function kernels imply infinite-dimensional spaces, yet we never need to map into this infinite-dimensional space explicitly. We only need to compute K(x_sv, x_test), where x_sv is one of the support vectors and x_test is the test sample.
The specific higher dimensional space is chosen by the training procedure and parameters, which choose a set of support vectors and their corresponding weights.
2) C is the weight associated with the cost of not being able to classify the training set perfectly. The optimization equation says to trade-off between the two undesirable cases of non-perfect classification and low margin. The ξi variables represent by how much we're unable to classify instance i of the training set, i.e., the training error of instance i.
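To make that trade-off concrete: at the optimum each ξ_i equals the hinge loss max(0, 1 - y_i(w·x_i + b)), so for any fixed w and b you can evaluate the objective directly. A tiny sketch with made-up data and a made-up candidate w:

    import numpy as np

    # Made-up 2-d training points and labels in {-1, +1}.
    X = np.array([[2.0, 1.0], [0.2, -0.5], [-1.0, -1.0], [-2.0, 0.5]])
    y = np.array([1, 1, -1, -1])

    w = np.array([1.0, 0.5])   # a candidate weight vector (not the optimum)
    b = 0.0
    C = 1.0

    # Slack: how far each point sits on the wrong side of its margin.
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))

    objective = 0.5 * w @ w + C * xi.sum()
    print(xi)         # per-instance training "errors" (the ξ_i)
    print(objective)  # margin term plus C times total slack, the quantity minimized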
See Chris Burges' tutorial on SVM's for about the most intuitive explanation you're going to get of this stuff anywhere (IMO).