String Matching Using Recurrent Neural Networks - machine-learning

I have recently started exploring recurrent neural networks. So far I have trained a character-level language model in TensorFlow, following Andrej Karpathy's blog, and it works great.
I couldn't, however, find any study on using RNNs for string matching or keyword spotting. For one of my projects I need to OCR scanned documents and then parse the converted text for key data points. Most string-matching techniques fail to account for OCR conversion mistakes, which leads to significant error.
Is it possible to train an RNN on the variations of converted text I receive and use it to find keywords?

This paper may be what you are looking for:
[1608.02214] Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network
A brief introduction:
The authors of this paper demonstrate a method for recognizing jumbled words such as Cmabrigde Uinervtisy (Cambridge University). By training the neural network on the correct first and last characters plus an encoding of the internal characters that discards their position information, the network learns to recognize and correct such words.
You can easily modify the network structure to suit your own needs, such as the OCR errors you mentioned; a small sketch of the word encoding follows.
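As a rough illustration of that semi-character encoding (a minimal sketch assuming a lowercase a-z alphabet; in the paper these vectors are fed into an LSTM over the sentence, which is not shown here):

import string
import numpy as np

ALPHABET = string.ascii_lowercase
IDX = {c: i for i, c in enumerate(ALPHABET)}

def semi_character_vector(word):
    # one-hot first and last characters, order-free counts for the middle
    word = word.lower()
    first, middle, last = np.zeros(26), np.zeros(26), np.zeros(26)
    first[IDX[word[0]]] = 1
    last[IDX[word[-1]]] = 1
    for c in word[1:-1]:
        middle[IDX[c]] += 1
    return np.concatenate([first, middle, last])   # length 78

# "Cmabrigde" and "Cambridge" get identical representations, which is what
# lets the network map the jumbled form back to the intended word:
assert (semi_character_vector("Cmabrigde") == semi_character_vector("Cambridge")).all()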

Related

Natural language generation evaluation

I was building a natural language generator using LSTM networks, but now I am stuck on how to evaluate my output. Suppose I have an input training dataset that consists of a dialogue-act representation and the correct output for that particular dialogue act. Now suppose I generate an output sentence y from my LSTM network; how do I evaluate that sentence against the one in the dataset? In other words, is there any way to compare the outputs so that I can use gradient descent to train my weights?
As soon as you find the answer, you'll be able to write a nice paper about it, since that's kind of an open research question right now. :)
To the best of my knowledge, your evaluation has to combine syntactic and semantic plausibility of the output, context coherence, personality consistency and the dynamic progression of the discourse. There is no consensus on how to measure these optimally, but there are plenty of current papers on the topic.
A related introductory read by Liu et al.: https://arxiv.org/abs/1603.08023
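If you need a concrete number in the meantime, word-overlap metrics such as BLEU are widely reported (and are exactly what the Liu et al. paper above criticizes). Note that BLEU is not differentiable, so in practice the network is trained with token-level cross-entropy (teacher forcing) and BLEU is used only for evaluation afterwards. A minimal NLTK sketch with made-up sentences:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["i", "would", "like", "a", "cheap", "restaurant"]]   # gold sentence from the dataset
hypothesis = ["i", "want", "a", "cheap", "restaurant"]             # sentence generated by the LSTM

score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(score)   # n-gram overlap score in [0, 1]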

LSTM for vector to character sequence translation

I'm looking to build a sequence-to-sequence model that takes a 2048-long vector of 1s and 0s (e.g. [1,0,1,0,0,1,0,0,0,1,...,1]) as input and translates it to a known output of variable length, 1-20 characters (e.g. GBNMIRN, ILCEQZG, or FPSRABBRF).
My goal is to create a model that can take in a new 2048-long vector of 1s and 0s and predict what the output sequence will look like.
I've looked at some GitHub repositories like this and this, but I'm not sure how to apply them to my problem. Are there any projects that have done something similar, and how could I implement this with the seq2seq models or LSTMs currently out there (Python implementations)?
I am using the Keras library in Python.
Your input is unusual in that it is a binary code; I don't know whether the model will work well.
First of all, you need to add start and end marks to your input and output to indicate the sequence boundaries. Then design the module used at each time step, including how the hidden state is used. You could try simple GRU/LSTM networks, as in the sketch after this answer.
For details, see the encoder and decoder structure diagrams.
In addition, you could take a look at the attention mechanism in the paper Neural Machine Translation by Jointly Learning to Align and Translate; the structure is illustrated there.
Though you are using Keras, I think it will be helpful to read PyTorch code as well, since it is straightforward and easy to understand; the seq2seq tutorial in the PyTorch documentation is a good starting point.
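Since you are using Keras, here is a minimal sketch of the encoder-decoder idea above, under some assumptions (a fixed character vocabulary with start/end marks, a padded maximum output length, and illustrative hyperparameters); it is a starting point, not a definitive implementation:

import numpy as np
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM
from tensorflow.keras.models import Model

n_chars = 28       # e.g. 26 letters plus start and end marks (assumption)
max_len = 21       # longest output sequence including the end mark (assumption)
latent_dim = 256

# "Encoder": the 2048-bit input is a single static vector, so a Dense layer
# mapping it to the decoder's initial states can replace a recurrent encoder.
bits_in = Input(shape=(2048,))
state_h = Dense(latent_dim, activation='tanh')(bits_in)
state_c = Dense(latent_dim, activation='tanh')(bits_in)

# Decoder: predicts the next character given the previous ones (teacher forcing).
dec_in = Input(shape=(max_len,))
dec_emb = Embedding(n_chars, 64)(dec_in)
dec_seq, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
char_probs = Dense(n_chars, activation='softmax')(dec_seq)

model = Model([bits_in, dec_in], char_probs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')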

Character-Word Embeddings from lm_1b in Keras

I would like to use some pre-trained word embeddings in a Keras NN model, which were published by Google in a very well-known article. They have provided the code to train a new model, as well as the embeddings, here.
However, it is not clear from the documentation how to retrieve an embedding vector for a given string of characters (a word) via a simple Python function call. Much of the documentation seems to center on dumping vectors to a file for an entire sentence, presumably for sentiment analysis.
So far, I have seen that you can feed in pretrained embeddings with the following syntax:
embedding_layer = Embedding(input_dim=number_of_words,        # ??
                            output_dim=128,                   # ??
                            weights=[pre_trained_matrix_here],
                            input_length=60,                  # ??
                            trainable=False)
However, how to convert the different files and their structures into pre_trained_matrix_here is not quite clear to me.
They have several softmax outputs, so I am uncertain which one to use, and furthermore how to align the words in my input with the dictionary of words they provide.
Is there a simple way to use these word/char embeddings in Keras, and/or to construct the character/word embedding portion of the model in Keras so that further layers may be added for other NLP tasks?
The Embedding layer only picks up embeddings (rows of the weight matrix) for integer indices of input words; it does not know anything about the strings. This means you need to first convert your input sequence of words into a sequence of indices, using the same vocabulary that was used in the model you take the embeddings from.
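A rough sketch of what that looks like in practice; the file names and the <UNK> token below are placeholders, not the actual lm_1b release format, so adapt them to whatever the download really contains:

import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_list = open('vocab.txt').read().split()       # placeholder: one word per row of the matrix
emb_matrix = np.load('embeddings.npy')              # placeholder: shape (len(vocab_list), emb_dim)
word2idx = {w: i for i, w in enumerate(vocab_list)}

def encode(sentence, max_len=60):
    # map words to row indices; unknown words fall back to an assumed <UNK> entry
    idx = [word2idx.get(w, word2idx['<UNK>']) for w in sentence.split()]
    return pad_sequences([idx], maxlen=max_len)[0]

embedding_layer = Embedding(input_dim=emb_matrix.shape[0],
                            output_dim=emb_matrix.shape[1],
                            weights=[emb_matrix],
                            input_length=60,
                            trainable=False)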
For NLP applications related to word or text encoding I would use CountVectorizer or TfidfVectorizer. Both are briefly described for Python in the following reference: http://www.bogotobogo.com/python/scikit-learn/files/Python_Machine_Learning_Sebastian_Raschka.pdf
CountVectorizer can be used for simple applications such as a spam/ham detector, while TfidfVectorizer gives deeper insight into how relevant each term (word) is, based on its frequency within a document and the number of documents in which it appears; this yields an interesting measure of how discriminative the terms are. These text feature extractors can also include stop-word removal and lemmatization to improve the feature representation.
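A brief scikit-learn example with two toy documents:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["free prize click now click",          # toy spam-like document
        "meeting moved to noon tomorrow"]      # toy ham-like document

counts = CountVectorizer(stop_words='english').fit_transform(docs)
tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

print(counts.toarray())   # raw term counts per document
print(tfidf.toarray())    # tf-idf weights, which down-weight terms common to many documents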

Embeddings with recurrent neural networks

I am working on a research project on text data (it is about supervised classification of search-engine queries). I have already implemented different methods, and I have also used different representations of the text (such as binary vectors of the dimension of my vocabulary, where the i-th entry is 1 if the i-th word appears in the text and 0 otherwise, or word embeddings from the word2vec model).
My advisor told me that maybe we could find another representation of the queries using a recurrent neural network. This representation should take into account the sequential order of the words in the text, thanks to the recurrence relation. I have read some documentation about RNNs but I haven't found anything useful for this goal. I have read a lot about language modelling (which predicts word probabilities), but I don't understand how I could adapt such a model to obtain something like an embedding vector.
Thank you very much!
Usually, if one wants to obtain embeddings for a query or a sentence using an RNN, the logits are used. The logits are simply the output values of the network after a forward pass over the full sentence/query.
The logit values form a vector with the dimensionality of the output layer (i.e. the number of target classes); usually this is the vocabulary size, since the logits are extracted from a language model.
For hints, have a look at these:
http://arxiv.org/abs/1603.07012
How does word2vec give one hot word vector from the embedding vector?
Note that in principle one could also use bidirectional networks, or networks trained on other tasks, to obtain smaller embeddings, although this last option is somewhat exotic and, to my knowledge, has not been explored.
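A minimal Keras sketch of the idea, with a toy LSTM language model standing in for whatever model you actually train (all sizes are illustrative assumptions):

import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size, max_len = 10000, 30                  # illustrative values
tokens_in = Input(shape=(max_len,))
x = Embedding(vocab_size, 128)(tokens_in)
x = LSTM(256)(x)                                 # final hidden state after reading the query
logits = Dense(vocab_size)(x)                    # the "logits" mentioned above

lm = Model(tokens_in, logits)
# After training lm as a language model, the logits for a padded,
# index-encoded query give a fixed-length representation of it:
query_ids = np.zeros((1, max_len))               # stand-in for a real encoded query
query_embedding = lm.predict(query_ids)          # shape (1, vocab_size)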

Unsure about word embeddings, POS re: using a Neural Net for NLP classification.

I'm planning to use an NN for sarcasm detection on a number of tweets. I'm unsure how to prepare the word embeddings I will train the NN on. If I tokenize the strings and tag emoticons, capitalisation, user tags, hashtags etc., how do I then combine the resulting strings with word embeddings? Do I train the word embeddings on the resulting corpus of tweets?
You can start by reading some papers on sarcasm detection in Twitter, e.g. Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon, which uses patterns of content words and high-frequency words, or, closer to your question, Sarcastic or Not: Word Embeddings to Predict the Literal or Sarcastic Meaning of Words, which uses word2vec. The latter views sarcasm detection as a disambiguation problem between the literal and sarcastic meanings of the same word. Perhaps you can employ this approach using the recently published sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings.
Try the techniques used in the papers, and when you encounter a specific problem, ask a question with a minimal working example.
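If you do decide to train your own embeddings on the preprocessed tweets, a minimal gensim (version 4+) word2vec sketch looks like this; the tokenized tweets and special tokens are made up for illustration:

from gensim.models import Word2Vec

tweets = [["so", "happy", "to", "be", "stuck", "in", "traffic", "<emoticon_sad>"],
          ["great", "weather", "today", "<hashtag>", "blessed", "<user>"]]

w2v = Word2Vec(sentences=tweets, vector_size=100, window=5, min_count=1)
vector = w2v.wv["traffic"]     # 100-dimensional embedding for one token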
