Approximate string matching with a letter confusion matrix? - grep

I'm trying to model a phonetic recognizer that has to isolate instances of words (strings of phones) out of a long stream of phones that doesn't have gaps between each word. The stream of phones may have been poorly recognized, with letter substitutions/insertions/deletions, so I will have to do approximate string matching.
However, I want the matching to be phonetically-motivated, e.g. "m" and "n" are phonetically similar, so the substitution cost of "m" for "n" should be small, compared to say, "m" and "k". So, if I'm searching for [mein] "main", it would match the letter sequence [meim] "maim" with, say, cost 0.1, whereas it would match the letter sequence [meik] "make" with, say, cost 0.7. Similarly, there are differing costs for inserting or deleting each letter. I can supply a confusion matrix that, for each letter pair (x,y), gives the cost of substituting x with y, where x and y are any letter or the empty string.
I know that there are tools available that do approximate matching such as agrep, but as far as I can tell, they do not take a confusion matrix as input. That is, the cost of any insertion/substitution/deletion = 1. My question is, are there any open-source tools already available that can do approximate matching with confusion matrices, and if not, what is a good algorithm that I can implement to accomplish this?
EDIT: just to be clear, I'm trying to isolate approximate instances of a word such as [mein] from a longer string, e.g. [aiammeinlimeiking...]. Ideally, the algorithm/tool should report instances such as [mein] with cost 0.0 (exact match), [meik] with cost 0.7 (near match), etc, for all approximate string matches with a cost below a given threshold.

I'm not aware of any phonetic recognizers that use confusion matrices. I know of Soundex and the match rating approach.
I think that the K-nearest neighbour algorithm might be useful for the type of approximations you are interested in.

Peter Kleiweg's Rug/L04 (for computational dialectology) includes an implementation of Levenshtein distance which allows you to specify non-uniform insertion, deletion, and substitution costs.
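If you end up rolling your own, the standard edit-distance dynamic program extends directly to this problem: zero out the first row so a match may begin anywhere in the stream (the substring/"infix" alignment the EDIT describes), and read match costs off the last row. A minimal sketch, where `cost` is a stand-in for your confusion matrix (with the empty string for insertions/deletions):

```python
def weighted_search(pattern, stream, cost, threshold):
    """Find end positions in `stream` where `pattern` matches with
    total edit cost <= threshold.  cost(x, y) is the substitution
    cost for letters x and y; cost(x, '') is deletion of x and
    cost('', y) is insertion of y (the confusion-matrix lookups)."""
    m, n = len(pattern), len(stream)
    # prev[j]: cheapest alignment of the current pattern prefix ending
    # at stream position j.  Row 0 is all zeros so a match may start
    # anywhere in the stream.
    prev = [0.0] * (n + 1)
    for i in range(1, m + 1):
        curr = [prev[0] + cost(pattern[i - 1], '')] + [0.0] * n
        for j in range(1, n + 1):
            curr[j] = min(
                prev[j - 1] + cost(pattern[i - 1], stream[j - 1]),  # substitute/match
                prev[j] + cost(pattern[i - 1], ''),                 # drop a pattern letter
                curr[j - 1] + cost('', stream[j - 1]),              # absorb a stream letter
            )
        prev = curr
    # Every stream position whose final-row cost is under the threshold
    # ends an approximate occurrence of the pattern.
    return [(j, prev[j]) for j in range(n + 1) if prev[j] <= threshold]
```

With a uniform 0/1 cost this reduces to plain Levenshtein substring search; with a phonetic confusion matrix, `cost('m', 'n')` would simply return the small value from the matrix.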

Related

Looking for a method to map Proper nouns into vectors

I'm looking for a method to represent proper nouns as vectors and correct misspellings.
For example, I'd have a database of proper nouns (such as James, Rebecca, Michael, etc...) and would like to map these names into vectors.
I'd also have a set of entries with misspellings of these names (e.g. Rebeca, Mikel etc...) and would like to also map these into vectors.
The objective would be to use a similarity measure between the vector of the misspelled name with each vector of correctly spelled names and identify the correct name.
I cannot find any NLP method which deals with this kind of problem.
Thank you!
So the goal is spelling correction? And you have no context, just the words? I suggest using k-mer distance. That is to say, for suitable values of k, each word is represented by the set of its substrings of length k. The distance between words is then sqrt(1-J), where J is the Jaccard similarity of these sets. Build a nearest-neighbour tree of the words. The suggested correction is then the nearest neighbour of the misspelled word.
You should choose values for k by experiment, but {3,4,5} would be a good starting point.
There are alternatives to the formula sqrt(1-J), but this formula has the advantage that it is the natural metric for the RKHS induced by Jaccard similarity.
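A sketch of the k-mer distance described above (k=3 as a default, per the suggested starting point):

```python
import math

def kmers(word, k):
    # The set of all length-k substrings of the word.
    return {word[i:i + k] for i in range(len(word) - k + 1)}

def kmer_distance(a, b, k=3):
    # sqrt(1 - J), where J is the Jaccard similarity of the k-mer sets.
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0  # both words shorter than k: treat as identical
    j = len(ka & kb) / len(ka | kb)
    return math.sqrt(1.0 - j)
```

For example, "Rebecca" and "Rebeca" share the 3-mers Reb, ebe, bec out of six distinct 3-mers overall, giving J = 0.5 and a distance of sqrt(0.5).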

Do word vectors mean anything on their own?

From my understanding, word vectors are only ever used in terms of relations to other word vectors. For example, the word vector for "king" minus the vector for "man" plus the vector for "woman" should give a vector close to "queen".
Given a vector of some unknown word, can assumptions about the word be made based solely on the values of that vector?
The individual coordinates – such as dimension #7 of a 300-dimensional vector – don't have easily interpretable meanings.
It's primarily the relative distances to other words (neighborhoods), and the relative directions with respect to other constellations of words (orientations, without regard to the perpendicular coordinate axes), that may be vaguely interpretable, because they correlate with natural-language or natural-thinking semantics.
Further, the pre-training initialization of the model, and much of the training itself, uses randomization. So even on the exact same data, words can wind up in different coordinates on repeated training runs.
The resulting word vectors should, after each run, be about as useful with respect to each other in terms of distances and directions, but neighborhoods like "words describing seasons" or "things that are 'hot'" could be in very different places on subsequent runs. Only vectors that were trained together are comparable.
(There are some constrained variants of word2vec that try to force certain dimensions or directions to be more useful for certain purposes, such as answering questions or detecting hypernym/hyponym relationships – but that requires extra constraints or inputs to the training process. Plain vanilla word2vec won't be as cleanly interpretable.)
You cannot make assumptions about the word based on the values of its word vector. A single word vector does not carry information or meaning by itself, but only contains meaning in relation to other word vectors.
Word vectors produced by algorithms such as Word2Vec and GloVe are computed from the co-occurrence of words in a sequence. As an example, Word2Vec employs the dot product of two vectors as the input to a softmax function, which approximates the conditional probability that those two words appear in the same sequence. Word vectors are then determined such that words that frequently occur in the same context are mapped to similar vectors. Word vectors thereby capture both syntactic and semantic information.
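The point that meaning lives in relations rather than coordinates can be illustrated with cosine similarity. The 3-d vectors below are made up for illustration (not real embeddings); rotating all of them together would change every coordinate while leaving every similarity intact:

```python
import math

def cosine(u, v):
    # Cosine similarity: depends only on the angle between vectors,
    # not on the absolute coordinate values.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors -- the individual coordinates mean nothing by themselves.
king = [0.9, 0.1, 0.4]
queen = [0.8, 0.2, 0.5]
apple = [0.1, 0.9, 0.0]
```

Here `cosine(king, queen)` comes out much larger than `cosine(king, apple)`, even though no single coordinate of any vector tells you anything on its own.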

word2vec application to non-human words: implementation allowing to provide my own context and distance

I would like to use word2vec to transform 'words' into numerical vectors and possibly make predictions for new words. I've tried extracting features from words manually and training a linear regression model (using Stochastic Gradient Descent), but this only works to an extent.
The input data I have is:
- Each word is associated with a numerical value. You can think of this value as the word's coordinate in 1D space.
- For each word I can provide the distance to any other word (because I have the words' coordinates).
- Because of this, I can provide the context for each word: given a distance, I can list all the other words within that distance of the target one.
- Words are composed of Latin letters only (e.g. AABCCCDE, BKEDRRS).
- Words almost never repeat, but their structural elements repeat a lot within different words.
- Words can be of different lengths (say 5-50 letters max).
- Words have common features; some subsequences will occur multiple times in different words (e.g. doublets or triplets of letters, their position within a word, etc.).
The question:
Is there an implementation of word2vec which allows provision of your own distances and context for each word?
A big bonus would be if the trained model could spit out the predicted coordinate for any word you feed in after training.
Preferably in Java; Python is also fine, but in general anything will do.
I am also not restricting myself to word2vec; it just seems like a good fit, but my knowledge of machine learning and data mining is very limited, so I might be missing a better way to tackle the problem.
PS: I know about deeplearning4j, but I haven't looked around the code enough to figure out if what I want to do is easy to implement in it.
Example of data: (typical input contains thousands to tens of thousands of words)
ABCD 0.50
ABCDD 0.51
ABAB 0.30
BCDAB 0.60
DABBC 0.59
SPQTYRQ 0.80
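I'm not aware of a word2vec implementation that takes distances directly, but one common workaround is to convert each word's neighborhood into a pseudo-sentence and feed those to any standard word2vec trainer, which only ever sees co-occurrence within a sentence anyway. A sketch (the `radius` parameter is hypothetical and would need tuning):

```python
def build_contexts(words_with_coords, radius):
    """For each (word, coordinate) pair, emit a pseudo-sentence consisting
    of the word followed by every other word whose 1D coordinate lies
    within `radius` of it.  The resulting sentences can be handed to an
    off-the-shelf word2vec trainer as its corpus."""
    sentences = []
    for w, x in words_with_coords:
        neighbors = [v for v, y in words_with_coords
                     if v != w and abs(x - y) <= radius]
        sentences.append([w] + neighbors)
    return sentences
```

On the example data above, with radius 0.05, ABCD (0.50) would share a pseudo-sentence with ABCDD (0.51), while SPQTYRQ (0.80) would stand alone.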

Naive Bayes, not so Naive?

I have a Naive Bayes classifier (implemented with WEKA) that looks for uppercase letters.
contains_A
contains_B
...
contains_Z
For a certain class the word LCD appears in almost every instance of the training data. When I get the probability for "LCD" to belong to that class it is something like 0.988. win.
When I get the probability for "L" I get a plain 0 and for "LC" I get 0.002. Since features are naive, shouldn't the L, C and D contribute to overall probability independently, and as a result "L" have some probability, "LC" some more and "LCD" even more?
At the same time, the same experiment with an MLP, instead of showing the above behavior, gives probabilities of 0.006, 0.5 and 0.8.
So the MLP does what I would expect a Naive Bayes to do, and vice versa. Am I missing something? Can anyone explain these results?
I am not familiar with the internals of WEKA, so please correct me if you think I am not right.
When using a text as a "feature", the text is transformed into a vector of binary values. Each value corresponds to one concrete word. The length of the vector is equal to the size of the dictionary.
If your dictionary contains 4 words: LCD, VHS, HELLO, WORLD
then, for example, the text HELLO LCD will be transformed to [1,0,1,0].
I do not know how WEKA builds its dictionary, but I think it might go over all the words present in the examples. Unless "L" is present in the dictionary (and therefore present in the examples), its probability is logically 0. Actually, it should not even be considered a feature.
So you cannot reason over the probabilities of the features this way, and you cannot add them together; I think there is no such relationship between the features.
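The vectorization described in that answer, as a minimal sketch (the dictionary and text are taken from the example):

```python
def to_binary_vector(text, dictionary):
    # One binary feature per dictionary word: 1 if the word occurs
    # in the (whitespace-tokenized) text, 0 otherwise.
    words = set(text.split())
    return [1 if w in words else 0 for w in dictionary]
```

Note that under this scheme "L" never becomes a feature at all unless it occurs as a standalone token somewhere in the training data, which is exactly why its probability comes out as 0.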
Beware that in text mining, words (letters in your case) may be given weights different from their actual counts if you are using any sort of term weighting and normalization, e.g. tf.idf. In the case of tf.idf, for example, character counts are converted to a logarithmic scale, and characters that appear in every single instance may be penalized by idf normalization.
I am not sure what options you are using to convert your data into Weka features, but you can see here that Weka has parameters to be set for such weighting and normalization options
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html
-T
Transform the word frequencies into log(1+fij)
where fij is the frequency of word i in the jth document (instance).
-I
Transform each word frequency into:
fij*log(num of Documents/num of documents containing word i)
where fij is the frequency of word i in the jth document (instance).
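As a rough sketch of how the two quoted transforms combine (this assumes the TF transform is applied before the IDF factor, which is worth checking against the filter's source):

```python
import math

def tf_idf(docs):
    """docs: list of documents, each a list of tokens.
    Applies the -T then -I transforms quoted above:
      tf  = log(1 + f_ij)
      idf = log(N / n_i), where n_i = number of docs containing word i."""
    n_docs = len(docs)
    df = {}  # document frequency per word
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    weighted = []
    for doc in docs:
        counts = {}
        for w in doc:
            counts[w] = counts.get(w, 0) + 1
        weighted.append({w: math.log(1 + f) * math.log(n_docs / df[w])
                         for w, f in counts.items()})
    return weighted
```

Note that a word appearing in every document gets idf = log(N/N) = 0, i.e. its weight is zeroed out entirely, which illustrates the penalization mentioned above.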
I checked the weka documentation and I didn't see support for extracting letters as features. This implies the weka function may need a space or punctuation to delimit each feature from those adjacent. If so, then the search for "L", "C" and "D" would be interpreted as three separate one-letter-words and would explain why they were not found.
If you think this is it, you could try splitting the text into single characters delimited by \n or space, prior to ingestion.
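That pre-splitting step could look like this (a sketch; the delimiter Weka actually expects depends on your tokenizer settings):

```python
def to_char_tokens(text):
    # Turn "LCD" into "L C D" so a word-based tokenizer sees one token
    # per character; whitespace already in the input is dropped.
    return " ".join(ch for ch in text if not ch.isspace())
```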

Levenshtein Distance Algorithm better than O(n*m)?

I have been looking for an advanced Levenshtein distance algorithm, and the best I have found so far is O(n*m), where n and m are the lengths of the two strings. The reason the algorithm is at this scale is space, not time: it builds an (n+1) x (m+1) matrix over the two strings.
Is there a publicly available Levenshtein algorithm which is better than O(n*m)? I am not averse to looking at advanced computer science papers & research, but I haven't been able to find anything. I have found one company, Exorbyte, which supposedly has built a super-advanced and super-fast Levenshtein algorithm, but of course that is a trade secret. I am building an iPhone app in which I would like to use the Levenshtein distance calculation. There is an Objective-C implementation available, but with the limited amount of memory on iPods and iPhones, I'd like to find a better algorithm if possible.
Are you interested in reducing the time complexity or the space complexity? The average time complexity can be reduced to O(n + d^2), where n is the length of the longer string and d is the edit distance. If you are only interested in the edit distance and not in reconstructing the edit sequence, you only need to keep the last two rows of the matrix in memory, so the space will be O(n).
If you can afford to approximate, there are poly-logarithmic approximations.
For the O(n + d^2) algorithm, look for Ukkonen's optimization or its enhancement, Enhanced Ukkonen. The best approximation that I know of is the one by Andoni, Krauthgamer, and Onak.
If you only want the threshold function - e.g., to test whether the distance is under a certain threshold k - you can reduce the time and space complexity by only calculating the k values either side of the main diagonal in the array. You can also use Levenshtein automata to evaluate many words against a single base word in O(n) time - and the construction of the automata can be done in O(m) time, too.
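The banded/threshold idea can be sketched like this: any cell with |i - j| > k already exceeds the threshold (the edit distance is at least the difference in prefix lengths), so only a band of width 2k+1 around the diagonal ever needs to be filled. A sketch, not a tuned implementation:

```python
def within_distance(a, b, k):
    """Banded Levenshtein: returns True iff edit distance(a, b) <= k.
    Only cells within k of the main diagonal are computed, so time and
    space are O(k * min(m, n)) rather than O(m * n)."""
    if abs(len(a) - len(b)) > k:
        return False  # distance is at least the length difference
    INF = k + 1  # any value > k is as good as infinite here
    prev = {j: j for j in range(min(len(b), k) + 1)}  # row 0 of the band
    for i in range(1, len(a) + 1):
        curr = {}
        for j in range(max(0, i - k), min(len(b), i + k) + 1):
            if j == 0:
                curr[j] = i
                continue
            # Cells outside the band are treated as INF.
            best = prev.get(j - 1, INF) + (a[i - 1] != b[j - 1])  # substitute
            best = min(best,
                       prev.get(j, INF) + 1,       # delete from a
                       curr.get(j - 1, INF) + 1)   # insert into a
            curr[j] = best
        prev = curr
    return prev.get(len(b), INF) <= k
```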
Look at the Wikipedia article - it has some ideas for improving this algorithm to better space complexity:
Wiki link: Levenshtein distance
Quoting:
We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.
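The two-row trick from the quote, as a sketch:

```python
def levenshtein(a, b):
    """Edit distance keeping only the previous and current rows,
    so space is O(min(m, n)) instead of O(m * n)."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string, so rows are short
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to empty prefix of b
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete from a
                            curr[j - 1] + 1,           # insert into a
                            prev[j - 1] + (ca != cb))) # substitute/match
        prev = curr
    return prev[-1]
```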
I found another optimization that claims O(max(m, n)) space:
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#C
(the second C implementation)
