What does the multiple outputs in skip-gram mean? - machine-learning

I've been trying to understand the process of skip-gram learning algorithm. There's this small detail that confuses me.
In the following graph(which is used in many articles and blogs to explain skip-gram), what does the multiple outputs mean? I mean, the input word is the same, the output matrix is the same. Then when you calculate the output vector, which I believe is the probability set of all words appearing near the input word, it should be the same all the time.
skipgram model
Hope someone can help me with this~

This article seems to explain it adequately — each "chunk" of the output represents the prediction of a word at one position in the context (the window of words before and after the input word in the text). The output is "really" a single vector, but the diagram is trying to make it clear that it corresponds to C instances of a word-vector where C is the size of the context.

It's kind of a prone-to-misinterpretation diagram. Each of the three outputs in that diagram should be considered the results for a different input (context) word.
Feed it word 1, and through the hidden layer, to the output layer, you'll get (size-of-vocabulary) V output values (at each node, assuming the easier-to-think-about negative-sampling mode) – the top results in the diagram. Feed it word 2, and you'll get the middle results. Feed it word 3, and you'll get the bottom results.

Related

Doc2Vec : Paragraph matrix (D) in the structure of the PV-DBOW model

I am confused about the meaning of paragraph matrix (D) in the structure of the PV-DBOW model (Doc2Vec).
Is the paragraph matrix the result of one-hot encoding of n input paragraph IDs?
Or is the paragraph matrix a randomly initialized weight to generate the shape(nxp), where n is the number of paragraph ID inputs and p is the vector dimension?
I've never found the original paper's diagrams very clear or helpful. (And, no one has been able to reproduce their claimed results in the 'concatenation' mode.)
But I can say, from familiarity with code implementations:
At the outset, every paragraph-ID gets a randomly-initialized vector. Thus, there is a matrix in the model with all of these vectors, of shape number_of_paragraph_ids x number_of_dimensions
For backprop-training in PV-DBOW, each individual paragraph-vector (one row from the above matrix) is adjusted to better predict the matching-paragraph's individual constituent words.
While figuratively, that's sort of a 1-hot selection of a single paragraph choice, in the code it's just a lookup of the single correct row using a paragraph-ID key.

String classification, how to encode character-by-character and train?

I am trying to build a classifier to classify some files into 150 categories based on the name of those files. Here are some examples of file names in my dataset (~700k files):
104932489 - urgent - contract validation for xyz limited.msg
treatment - an I l - contract n°4934283 received by partner.pdf
- invoice_8843238_1_europe services_business 8592342sid paris.xls
140159498736656.txt
140159498736843.txt
fsk_000000001090296_sdiselacrusefeyre_2000912.xls
fsk_000000001091293_lidlsnd1753mdeas_2009316.xls
You can see that the filenames can really be anything, but that however there is always some pattern that is respected for the same categories. It can be in the numbers (that are sometimes close), in the special characters (spaces, -, °), sometimes the length, etc.
Extracting all those patterns one by one will take ages because I have approximately 700k documents. Also, I am not interested in 100% accuracy, 70% can be good enough.
The real problem is that I don't know how to encode this data. I have tried many methods:
Tokenizing character by character and feeding them to an LSTM model with an embedding layer. However, I wasn't able to implement it and got dimension errors.
Adapting Word2Vec to convert the characters into vectors. However, this automatically drops all punctuation and space characters, also, I lose the numeric data. Another problem is that it creates more useless dimensions: if the size is 20, I will have my data in 20 dimensions but if I look closely, there are always the same 150 vectors in those 20 dimensions so it's really useless. I could use a 2 dimensions size but still, I need the numeric data and the special characters.
Generating n-grams from each path, in the range 1-4, then using a CountVectorizer to compute the frequencies. I checked and special characters were not dropped but it gave me like 400,000 features! I am running a dimensionality reduction using UMAP (n_components=5, metric='hellinger') but the reduction runs for 2 hours and then the kernel crashes.
Any ideas?
I am currently also working on a character level lstm. And it works exactly the same like when you would use words. You need a vocabulary, for example a - z and then you just take the index of the letter as its integer representation. For example:
"bad" -> "b", "a", "d" -> [1, 0, 3]
Now you could create an embedding lookup table (for example using pytorchs nn.Embedding function). You just have to create a random vector for every index of your vocab. For example:
"a" -> 0 > [-0.93, 0.024, -.0.73, ..., -0.12]
You said that you tried this but encountered dimension errors? Maybe show us the code!
Or you could create non-random embedding using word2vec using the Gensim libary:
from gensim.models import Word2Vec
# 'total_words' is a list containing every word of your dataset split into its characters
total_words = [...]
model = Word2Vec(total_words , min_count=1, size=32)
model.save(save_model_file)
# lets test it for the character 'a'
embedder = Word2Vec.load(save_model_file)
v = embedder["a"]
# v now will be a the embedding vector of a with size 32x1
I hope I could make clear how to create embeddings for characters.
You can treat characters in single-word-classification the exact same way you would treat words in sentence-classification.

is ORB+BFMatcher a good match for recognizing repetitive images (with slight variations?)

I need to recognize images with hand-written numerals with known values. Physical objects with the number are always identical but come in slight variations of positions/scale/lighting. They are about 100 in number, having about 100x500 px in size.
In the first pass, the code should "learn" possible inputs, and then recognize them (classify them as being close to one of the "training" images) when they come again.
I was mostly following the Feature Matching Python-OpenCV tutorial
Input images are analyzed first, keypoints & descriptors are remembered in the orbTrained list:
import cv2
import collections
ORBTrained=collections.namedtuple('ORBTrained',['kp','des','img'])
orbTrained=[]
for img in trainingImgs:
z2=preprocessImg(img)
orb=cv2.ORB_create(nfeatures=400,patchSize=30,edgeThreshold=0)
kp,des=orb.detectAndCompute(z2,None)
orbTrained.append(ORBTrained(kp=kp,des=des,img=z2))
z3=cv2.drawKeypoints(z2,kp,None,color=(0,255,0),flags=0)
A typical result of this first stage looks like this:
Then in the next loop, for each real input image, cycle through all training images to see which is matching the best:
ORBMatch=collections.namedtuple('ORBMatch',['dist','match','train'])
for img in inputImgs:
z2=preprocessNum(img)
orb=cv2.ORB_create(nfeatures=400,patchSize=30,edgeThreshold=0)
kp,des=orb.detectAndCompute(z2,None)
bf=cv2.BFMatcher(cv2.NORM_HAMMING,crossCheck=True)
mm=[]
for train in orbTrained:
m=bf.match(des,train.des)
dist=sum([m_.distance for m_ in m])
mm.append(ORBMatch(dist=dist,match=m,train=train))
# sort matching images based on score
mm.sort(key=lambda m: m.dist)
print([m.dist for m in mm[:5]])
best=mm[0]
best.match.sort(key=lambda x:x.distance) # sort matches in the best match
z3=cv2.drawMatches(z2,kp,best.train.img,best.train.kp,best.match[:50],None,flags=2)
The result I get is nonsensical, and consistently so (only when I run with pixel-identical input, the result is correct):
What is the problem? Am I completely misunderstanding what to do, or do I just need to tune some parameters?
First, are you sure you're not reinventing the wheel by creating your own OCR library? There are many free frameworks, some of which support training with custom character sets.
Second, you should understand what feature matching is. It will find similar small areas, but isn't aware of other feature pairs. It will match similar corners of characters, not the characters itself. You might experiment with larger patchSize so that it covers at least half of the digit.
You can minimize false pairs by running feature detection only on a single digit at a time using thresholding and contours to find character bounds.
If the text isn't rotated, using roation-invariant feature descriptor such as ORB isn't the best option, try rotation-variant descriptor, such as FAST.
According to authors of papers (ORB and ORB-SLAM) ORB is invariant to rotation and scale "in a certain range". May be you should first match for small scale or rotation change.

How to train a neural network in forward manner and using it in backward manner

I have a neural network with an input layer having 10 nodes, some hidden layers and an output layer with only 1 node. Then I put a pattern in the input layer, and after some processing, it outputs the value in the output neuron which is a number from 1 to 10. After the training this model is able to get the output , provided the input pattern.
Now, my question is, if it is possible to calculate the inverse model: This means, that I provide a number from output side, (i.e. using output side as input) and then getting the random pattern from those 10 input neurons (i.e. using input as output side).
I want to do this because I will first train a network on basis of difficulty of pattern (input is the pattern and output is difficulty to understand the pattern). Then I want to feed the network with a number so it creates the random patterns on basis of difficulty.
I hope I understood your problem correctly, so I will summarize it in my own words: You have a given model, and want to determine the input which yields a given output.
Supposed, that this is correct, there is at least one way I know of, how you can do this approximately. This way is very easy to implement, but might take a while to calculate a value - probably there are better ways to do this, but I am not sure. (I needed this technique some weeks ago in the topic of reinforcement learning, and did not find anything better, compared to this): Lets assume that your Model maps an input to an output . We now have to create a new model, which we will call : This model will later on calculate the inverse of the model , so that it gives you the input which yields a specific output. To construct we will create a new model, which consists of one plain Dense layer which has the same dimension m as the input. This layer will be connected to the input of the model now. Next, you make all weights of non-trainable (this is very important!).
Now we are setup to find an inverse value already: Assuming you want to find the input corresponding (corresponding means here: it creates the output, but is not unique) to the output y. You have to create a new input vector v which is the unity of . Then you create a input-output data pair consisting of (v, y). Now you use any optimizer you wish to let the input-output-trainingdata propagate through your network, until the error converges to zero. Once this has happend, you can calculate the real input, which gives the output y by doing this: Supposed, that the weights if the new input layer are called w, and the bias is b, the desired input u is u = w*1 + b (whereby 1 )
You might be asking for the reason why this equation holds, so let me try to answer it: You model will try to learn the weights of your new input layer, so that the unity as an input will create the given output. As only the newly added input layer is trainable, only this weights will be changed. Therefore, each weight in this vector will represent the corresponding component of the desired input vector. By using an optimizer and minimizing the l^2 distance between the wanted output and the output of our inverse-model , we will finally determine a set of weights, which will give you a good approximation for the input vector.

Artificial Neural Network for formula classification/calculation

I am trying to create an ANN for calculating/classifying a/any formula.
I initially tried to replicate Fibonacci Sequence. I using the inputs:
[1,2] output [3]
[2,3] output [5]
[3,5] output [8]
etc...
The issue I am trying to overcome is how to normalize the data that could be potentially infinite or scale exponentially? I then tried to create an ANN to calculate the slope-intercept formula y = mx+b (2x+2) with inputs
[1] output [4]
[2] output [6]
etc...
Again I do not know how to normalize the data. If I normalize only the training data how would the network be able to calculate or classify with inputs outside of what was used for normalization?
So would it be possible to create an ANN to calculate/classify the formula ((a+2b+c^2+3d-5e) modulo 2), where the formula is unknown, but the inputs (some) a,b,c,d,and e are given as well as the output? Essentially classifying whether the calculations output is odd or even and the inputs are between -+infinity...
Okay, I think I understand what you're trying to do now. Basically, you are going to have a set of inputs representing the coefficients of a function. You want the ANN to tell you whether the function, with those coefficients, will produce an even or an odd output. Let me know if that's wrong. There are a few potential issues here:
First, while it is possible to use a neural network to do addition, it is not generally very efficient. You also need to set your ANN up in a very specific way, either by using a different node type than is usually used, or by setting up complicated recurrent topologies. This would explain your lack of success with the Fibonacci sequence and the line equation.
But there's a more fundamental problem. You might have heard that ANNs are general function approximators. However, in this case, the function that the ANN is learning won't be your formula. When you have an ANN that is learning to output either 0 or 1 in response to a set of inputs, it's actually trying to learn a function for a line (or set of lines, or hyperplane, depending on the topology) that separates all of the inputs for which the output should be 0 from all of the inputs for which the output should be 1. (see the answers to this question for a more thorough explanation, with pictures). So the question, then, is whether or not there is a hyperplane that separates coefficients that will result in an even output from coefficients that will result in an odd output.
I'm inclined to say that the answer to that question is no. If you consider the a coefficient in your example, for instance, you will see that every time you increment or decrement it by 1, the correct output switches. The same is true for the c, d, and e terms. This means that there aren't big clumps of relatively similar inputs that all return the same output.
Why do you need to know whether the output of an unknown function is even or odd? There might be other, more appropriate techniques.

Resources