I'm a complete beginner in character recognition as well as machine learning in general.
I want to write a program which is able to process the following input:
A Chinese character (in either pixels of vector format), for example:
The decomposition of the previous character, ie for the example above:
and and the information that they are aligned horizontally.
The decomposition of a Chinese character is always 3 things: 2 other characters and the pattern describing how the 2 character form the initial character (it is called the compoisition kind). In the example above the composition kind is "aligned horizontally".
Given such an input, I want my program to tell which pixels or which contours in the initial character belongs to which subcharacter in its decomposition.
Where to start?
Well, I can't say that I provide a full answer but think about:
1) Reading the papers on how Google Translate app works. You know, when you point your iPhone's camera at text and it instantly translates the text (even preserving the fonts!). It supports the chineese language so it would be interesting for you to see if they solved similar task and how they did it
2) Another big question to answer - how to prepare your input data. You will need to provide at least some input data - i.e. decomposition of at least some characters. Try to do this manually for couple of characters and try to formalize what exactly you are doing - this will help you to better formulate what exactly you want your algorithm to do.
3) Try to use some deep neural net with your data from #2. Use something with convolution layers. Pre-train it with RBM (restricted boltzmann machine). After that - just take a really close look into the resulting neural network. Don't expect to get any good results, but looking into the ANN layers will help you to understand what the net have learned from data and might provide some insight into where to move next
Related
I am studying machine translation right now and I am interested in a question probing a bit more deeply into the internals of sentence representations.
Suppose we train an encoder-decoder Seq2Seq En-Fr translation system on parallel corpora, starting with pre-trained Eng and Fr word vectors. The system can use anything to form the sentence embedding (Transformers, LSTMs, etc). Then the job of the Seq2Seq translation system is to learn to build Eng sentence representations from Eng word vectors and learn to build French sentence representations from French word vectors and by the linking of the encoder and decoder, learn those two sentence representations in the same space.
After training the model, and encoding some English sentence with the model (Say, "This is not a pipe."), the sentence embedding in the joint representation space has some idea of the words 'this', 'is', 'not', 'a', 'pipe', etc and all their associations as well as the sequence in which they appear. (1)
When the decoder is run on the encoding, it is able to take out the aforementioned information due for a load of corpora that was fed to it during training and statistical associations between words, and output, correspondingly, 'Ceci', 'n', ''', 'est', 'pas', 'une', 'pipe', '(EOS)'. At each step, it extracts and outputs the next French word from the decoder hidden state and transforms it so that the heuristically "most prominent" word to be decoded next can be found by the decoder, and so on, until '(EOS)'.
My question is this: Is there any interpretation of the last decoder hidden state after (EOS) is the output? Is it useful for anything else? Of course, an easy answer is "no, the model was trained to capture millions of lines of English text and process them until some word in conjunction with the hidden state produces (EOS) and last decoder hidden state is simply that, everything else not explicitly trained on is just noise and not signal".
But I'm wondering if there's anything more to this? What I'm trying to get at is, if you have a sentence embedding generated in English, and have the meaning dumped out of it in French by the decoder model, does any residual meaning remain that is not translatable from English to French? Certainly, the last hidden state for any particular sentence's translation would be very hard to interpret, but how about in the aggregate (like some aggregation of the last hidden states of every single sentence to be translated that has the words 'French' in it, which means something slightly different in English because it can be paired with 'fries' etc. This is a silly example, but you can probably think of others exploiting cultural ambiguities, etc, that turn up in language.) Might this last embedding capture some statistical "uncertainty" or ambiguity about the translation (maybe of like the English possible "meanings" and associations that could have ended up in French but didn't?) or some other structural aspect of the language that might be used to help us understand, say, how English is different from French?
What category do you think the answer to this fall in?
"There is no signal",
"There probably is some signal but it would be
very hard to extract because of depends on the mechanics of how the
model was trained"
"There is a signal that can be reliably extracted,
even if we have to aggregate over millions of examples"?
I'm not sure if this question is sensical at all but I'm curious about the answer and if any research been done on this front? I ask out of plain simple curiosity.
Notes:
I am aware that the last hidden state exists because it generates (EOS) in conjunction with the last word. That is its purpose, nothing else (?) makes it special. I'm wondering if we can get any more meaning out of it (even if it means transforming it like applying the decoder step one more time to it or something).
(1) (Of course, the ML model has no rich ides of 'concepts' as a human would with all its associations to thoughts and experiences and feelings, to the ML model the 'concept' only has associations with other words seen in the monolingual corpus for the word vector training and the bilingual corpus for translation training.)
Answering my own question but still interested in thoughts. I have a hunch the answer is "no", because the hidden state embedding is generated with only two properties in mind: (1) To be 'closest' by cosine distance to the next output token out of all tokens in French, and (2) to produce the hidden state corresponding to the next word when the decoder transformation is applied to it. To make the last hidden state have an interpretation other than 'it is the point on the 300-d (or whatever dimension embedding we are using) unit circle closes by cosine distance to the French (EOS) token' would mean we would have apply (2) to it. But the training data never had any examples of anything following (EOS) so what we get if we apply the decoder transformation to the last hidden state was never learned and is simply random depending on our model initialisations.
If we wanted to get some sort of idea about how good a 'match' the English and French joint embedding space is, we should be looking and comparing the test loss of various translations, not looking into the last hidden state. But still interested in people's thoughts on the matter if anyone thinks differently.
Take the following sentence:
I'm going to change the light bulb
The meaning of change means replace, as in someone is going to replace the light bulb. This could easily be solved by using a dictionary api or something similar. However, the following sentences
I need to go the bank to change some currency
You need to change your screen brightness
The first sentence does not mean replace anymore, it means Exchangeand the second sentence, change means adjust.
If you were trying to understand the meaning of change in this situation, what techniques would someone use to extract the correct definition based off of the context of the sentence? What is what I'm trying to do called?
Keep in mind, the input would only be one sentence. So something like:
Screen brightness is typically too bright on most peoples computers.
People need to change the brightness to have healthier eyes.
Is not what I'm trying to solve, because you can use the previous sentence to set the context. Also this would be for lots of different words, not just the word change.
Appreciate the suggestions.
Edit: I'm aware that various embedding models can help gain insight on this problem. If this is your answer, how do you interpret the word embedding that is returned? These arrays can be upwards of 500+ in length which isn't practical to interpret.
What you're trying to do is called Word Sense Disambiguation. It's been a subject of research for many years, and while probably not the most popular problem it remains a topic of active research. Even now, just picking the most common sense of a word is a strong baseline.
Word embeddings may be useful but their use is orthogonal to what you're trying to do here.
Here's a bit of example code from pywsd, a Python library with implementations of some classical techniques:
>>> from pywsd.lesk import simple_lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> answer = simple_lesk(sent, ambiguous, pos='n')
>>> print answer
Synset('depository_financial_institution.n.01')
>>> print answer.definition()
'a financial institution that accepts deposits and channels the money into lending activities'
The methods are mostly kind of old and I can't speak for their quality but it's a good starting point at least.
Word senses are usually going to come from WordNet.
I don't know how useful this is but from my POV, word vector embeddings are naturally separated and the position in the sample space is closely related to different uses of the word. However like you said often a word may be used in several contexts.
To Solve this purpose, generally encoding techniques that utilise the context like continuous bag of words, or continous skip gram models are used for classification of the usage of word in a particular context like change for either exchange or adjust. This very idea is applied in LSTM based architectures as well or RNNs where the context is preserved over input sequences.
The interpretation of word-vectors isn't practical from a visualisation point of view, but only from 'relative distance' point of view with other words in the sample space. Another way is to maintain a matrix of the corpus with contextual uses being represented for the words in that matrix.
In fact there's a neural network that utilises bidirectional language model to first predict the upcoming word then at the end of the sentence goes back and tries to predict the previous word. It's called ELMo. You should go through the paper.ELMo Paper and this blog
Naturally the model learns from representative examples. So the better training set you give with the diverse uses of the same word, the better model can learn to utilise context to attach meaning to the word. Often this is what people use to solve their specific cases by using domain centric training data.
I think these could be helpful:
Efficient Estimation of Word Representations in
Vector Space
Pretrained language models like BERT could be useful for this as mentioned in another answer. Those models generate a representation based on the context.
The recent pretrained language models use wordpieces but spaCy has an implementation that aligns those to natural language tokens. There is a possibility then for example to check the similarity of different tokens based on the context. An example from https://explosion.ai/blog/spacy-transformers
import spacy
import torch
import numpy
nlp = spacy.load("en_trf_bertbaseuncased_lg")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")
print(apple1[0].similarity(apple2[0])) # 0.73428553
print(apple1[0].similarity(apple3[0])) # 0.43365782
For some self-studying, I'm trying to implement simple a sequence-to-sequence model using Keras. While I get the basic idea and there are several tutorials available online, I still struggle with some basic concepts when looking these tutorials:
Keras Tutorial: I've tried to adopt this tutorial. Unfortunately, it is for character sequences, but I'm aiming for word sequences. There's is a block to explain the required for word sequences, but this is currently throwing "wrong dimension" errors -- but that's OK, probably some data preparation errors from my side. But more importantly, in this tutorial, I can clearly see the 2 types of input and 1 type of output: encoder_input_data, decoder_input_data, decoder_target_data
MachineLearningMastery Tutorial: Here the network model looks very different, completely sequential with 1 input and 1 output. From what I can tell, here the decoder gets just the output of the encoder.
Is it correct to say that these are indeed two different approaches towards Seq2Seq? Which one is maybe better and why? Or do I read the 2nd tutorial wrongly? I already got an understanding in sequence classification and sequences labeling, but with sequence-to-sequence it hasn't properly clicked yet.
Yes, those two are different approaches and there are other variations as well. MachineLearningMastery simplifies things a bit to make it accessible. I believe Keras method might perform better and is what you will need if you want to advance to seq2seq with attention which is almost always the case.
MachineLearningMastery has a hacky workaround that allows it to work without handing in decoder inputs. It simply repeats the last hidden state and passes that as the input at each timestep. This is not a flexible solution.
model.add(RepeatVector(tar_timesteps))
On the other hand Keras tutorial has several other concepts like teacher forcing (using targets as inputs to the decoder), embeddings(lack of) and a lengthier inference process but it should set you up for attention.
I would also recommend pytorch tutorial which I feel is the most appropriate method.
Edit:
I dont know your task but what you would want for word embedding is
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
Before that, you need to map every word in the vocabulary into an integer, turn every sentence into a sequence of integers and pass that sequence of integers to the model (embedding layer of latent_dim maybe 120). So each of your word is now represented by a vector of size 120. Also your input sentences must be all of the same size. So find an appropriate max sentence length and turn every sentence into that length and pad with zero if sentences are shorter than max len where 0 represents a null word perhaps.
I am working on handwriting recognition and related stuff on visual studio platform and using openCV libraries. Input is in the form of binary scanned .tif images.
Currently I went into a roadblock trying to figure out a way to recognize striked out words as in you strike out (cancel) words using a straight/ curved line. I am not going to do individual character recognition 'coz that will be a waste of computation power.
Is there any way to recognize such occurrences in an alternate way?
Following are two ideas I've come upon but I am not sure -
1> use a mask like < 0 0 0 , 1 1 1, 0 0 0 > that will help in finding all horizontal lines... but this will be a very big assumption. the lines can be wavy and in any orientation.
2> skeletonize the input and look for intersections. this will give me quite a few intersections - including those that occur due to the line used to strike out the word. using some approximation like least squares etc. i can get an approximate line. but there's the problem that intersections can occur at many places - eg. 2 intersections in 'b' etc.
any suggestions?
Have you considered using the Hough transform to detect the strike lines?
Here's an illustration of the use of hough transform in handwriting, that will give you the intuition of the approach:
You can quickly test it with openCV. The function is called cvHoughLines2.
Why not processing contours? you could take advantage of Poly (Ten-Chin) approximation and analyze only the few vectors resulting from the chain reconstruction. If you want to do more, then use a mixed pyramid/contour scheme, in order to get vectors approximations with different Level of Detail, starting from rough resolution up to finest.
Stop the refinement when you get a "reasonable" number of unique segments, apply normalization (see Moments - Hu's Moments) to make a fingerprint of your sample, and finally adopt a strong classification system.
I suggest you to look at ML (Machine Learning) part of OpenCV suite, for better reference on this latter part. For raster data, Haar's wavelets + Hidden Markovian Models work well, for vectors maybe you could use something less hard to setup (SOM, KNN, KMeans).
I would go with the individual character recognition. It may be a waste of computing power but it could give the best results. Just find a way to get a value from the character recognition that shows how good the character was recognized, then find a threshold for things that aren't characters. I think the canceling will destroy the char in a way that the recognition will have it problems finding something and maybe you can use this fact to find the canceled characters. To improve the results look for many characters that are badly recognized in the same region of the text, often whole words are canceled and therefore the bad recognition results will cluster.
If your performance is very bad in the end you can always come back and improve the algorithm later on.
Binarization is the act of transforming colorful features of of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.
If we where to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply replace the word by it's ID giving the vector <3,1,2,3,4>.
Given these IDs we could also create a binary vector by giving each word four possible slots, and setting the slot corresponding to a specific word with to one, giving the vector <0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1>. The latter method is, as far as I know, is commonly referred to as the bag-of-words-method.
Now for my question, what is the best binarization method when it comes to describe features for natural language processing in general, and transition-based dependency parsing (with Nivres algorithm) in particular?
In this context, we do not want to encode the whole sentence, but rather the current state of the parse, for example the top word on the stack en the first word in the input queue. Since order is highly relevant, this rules out the bag-of-words-method.
With best, I am referring to the method that makes the data the most intelligible for the classifier, without using up unnecessary memory. For example I don't want a word bigram to use 400 million features for 20000 unique words, if only 2% the bigrams actually exist.
Since the answer is also depending on the particular classifier, I am mostly interested in maximum entropy models (liblinear), support vector machines (libsvm) and perceptrons, but answers that apply to other models are also welcome.
This is actually a really complex question. The first decision you have to make is whether to lemmatize your input tokens (your words). If you do this, you dramatically decrease your type count, and your syntax parsing gets a lot less complicated. However, it takes a lot of work to lemmatize a token. Now, in a computer language, this task gets greatly reduced, as most languages separate keywords or variable names with a well defined set of symbols, like whitespace or a period or whatnot.
The second crucial decision is what you're going to do with the data post-facto. The "bag-of-words" method, in the binary form you've presented, ignores word order, which is completely fine if you're doing summarization of a text or maybe a Google-style search where you don't care where the words appear, as long as they appear. If, on the other hand, you're building something like a compiler or parser, order is very much important. You can use the token-vector approach (as in your second paragraph), or you can extend the bag-of-words approach such that each non-zero entry in the bag-of-words vector contains the linear index position of the token in the phrase.
Finally, if you're going to be building parse trees, there are obvious reasons why you'd want to go with the token-vector approach, as it's a big hassle to maintain sub-phrase ids for every word in the bag-of-words vector, but very easy to make "sub-vectors" in a token-vector. In fact, Eric Brill used a token-id sequence for his part-of-speech tagger, which is really neat.
Do you mind if I ask what specific task you're working on?
Binarization is the act of
transforming colorful features of
an entity into vectors of numbers,
most often binary vectors, to make
good examples for classifier
algorithms.
I have mostly come across numeric features that take values between 0 and 1 (not binary as you describe), representing the relevance of the particular feature in the vector (between 0% and 100%, where 1 represents 100%). A common example for this are tf-idf vectors: in the vector representing a document (or sentence), you have a value for each term in the entire vocabulary that indicates the relevance of that term for the represented document.
As Mike already said in his reply, this is a complex problem in a wide field. In addition to his pointers, you might find it useful to look into some information retrieval techniques like the vector space model, vector space classification and latent semantic indexing as starting points. Also, the field of word sense disambiguation deals a lot with feature representation issues in NLP.
[Not a direct answer] It all depends on what you are try to parse and then process, but for general short human phrase processing (e.g. IVT) another method is to use neural networks to learn the patterns. This can be very acurate for smallish vocubularies