I'm looking for a method to represent proper nouns into vectors and correct misspellings.
For example, I'd have a database of proper nouns (such as James, Rebecca, Michael, etc...) and would like to map these names into vectors.
I'd also have a set of entries with misspellings of these names (e.g. Rebeca, Mikel etc...) and would like to also map these into vectors.
The objective would be to use a similarity measure between the vector of the misspelled name with each vector of correctly spelled names and identify the correct name.
I cannot find any NLP method which deals with this kind of problem.
Thank you!
So the goal is spelling correction? And you have no context, just the words? I suggest using kmer distance. That is to say, for suitable values of k, each word is represented by the set of substrings of length k. The distance between words is the sqrt(1-J) where J is the Jaccard similarity of these sets. Build a nearest neighbour tree of the words. Then the suggested correction is the nearest neighbour of the misspellt word.
You should choose values for k by experiment, but {3,4,5} would be a good starting point.
There are alternatives to the formula sqrt(1-J), but this formula has the advantage that it is the natural metric for the RKHS induced by Jaccard similarity.
Related
I have been trying to implement the Rocchio algorithm and I understand the basic idea behind the algorithm but I struggle to put it into concrete terms. I calculated tf_idf before and that is a vector of length of the number of query terms we search for each document that contains at least one of the query terms. But now, I feel like I cannot represent the document as a vector in the space formed just by the query terms because that will not allow me to "discover" other terms that the relevant documents have in common. Should I then represent the vector of the query and vectors of the documents in a vector space of all the tokens found in the currently returned set of documents?
Blockquote yes the dimension of the vectors (both docs and queries) is the vocabulary size of the collection... so these vectors are extremely sparse (most entries being zeroes)...
Yes, as #Debasis said this was the correct answer.
From my understanding, word vectors are only ever used in terms of relations to other word vectors. For example, the word vector for "king" minus the word vector for "boy" should give a vector close to "queen".
Given a vector of some unknown word, can assumptions about the word be made based solely on the values of that vector?
The individual coordinates – such as dimension #7 of a 300-dimensional vector, etc – don't have easily interpretable meanings.
It's primarily the relative distances to other words (neighborhoods), and relative directions with respect to other constellations of words (orientations without regard to the perpendicular coordinate axes, that may be vaguely interpretable, because they correlate with natural-language or natural-thinking semantics.
Further, the pre-training initialization of the model, and much of the training itself, uses randomization. So even on the exact same data, words can wind up in different coordinates on repeated training runs.
The resulting word-vectors should after each run be about as useful with respect to each other, in terms of distances and directions, but neighborhoods like "words describing seasons" or "things that are 'hot'" could be in very different places in subsequent runs. Only vectors that trained together are comparable.
(There are some constrained variants of word2vec that try to force certain dimensions or directions to be more useful for certain purposes, such as answering questions or detecting hypernym/hyponym relationships – but that requires extra constraints or inputs to the training process. Plain vanilla word2vec won't be as cleanly interpretable.)
You cannot make assumptions about the word based on the values of its word vector. A single word vector does not carry information or meaning by itself, but only contains meaning in relation to other word vectors.
Word vectors using algorithms such as Word2Vec and GloVe are computed and rely on the co-occurrence of words in a sequence. As an example, Word2Vec employs the dot product of two vectors as an input to a softmax function, which approximates the conditional probability that those two words appear in the same sequence. Word vectors are then determined such that words that frequently occur in the same context are mapped to similar vectors. Word vectors thereby capture both syntactic and semantic information.
I have a set (~50k elements) of small text fragments (usually one or two sentences) each one tagged with a set of keywords chosen from a list of ~5k words.
How would I go to implement a system that, learning from this examples can then tag new sentences with the same set of keywords? I don't need code, I'm just looking for some pointers and methods/papers/possible ideas on how to implement this.
If I understood you well, what you need is a measure of similarity for a pair of documents. I have been recently using TF-IDF for clustering of documents and it worked quiet well. I think here you can use TF-IDF values and calculate a cosine similarity for the corresponding TF-IDF values for each of the documents.
TF-IDF computation
TF-IDF stands for Term Frequency - Inverse Document Frequency. Here is a definition how it can be calculated:
Compute TF-IDF values for all words in all documents
- TF-IDF score of a word W in document D is
TF-IDF(W, D) = TF(W, D) * IDF(W)
where TF(W, D) is frequency of word W in document D
IDF(W) = log(N/(2 + #W))
N - number of documents
#W - number of documents that contain word W
- words contained in the title will count twice (means more important)
- normalize TF-IDF values: sum of all TF-IDF(W, D)^2 in a document should be 1.
Depending of the technology You use, this may be achieved in different ways. I have had implemented it in Python using a nested dictionary. Firstly I use document name D as key and then for each document D I have a nested dictionary with word W as key and each word W has a corresponding numeric value which is the calculated TF-IDF.
Similarity computation
Let say You have calculated the TF-IDF values already and You want to compare 2 documents W1 and W2 how similar they are. For that we need to use some similarity metric. There are many choices, each one having pros and cons. In this case, IMO, Jaccard similarity and cosine similarity would work good. Both functions would have TF-IDF and names of the 2 documentsW1 and W2 as its arguments and it would return a numeric value which indicates how similar the 2 documents are.
After computing the similarity between 2 documents you will obtain a numeric value. The greater the value, the more similar 2 documents W1 and W2 are. Now, depending on what You want to achieve, we have 2 scenarios.
If You want for 1 document to assign only the tags of the most similar document, then You compare it with all other documents and assign to the new document the tags of the most similar one.
You can set some threshold and You can assign all tags of documents which have similarity with the document in question greater than the threshold value. If You set threshold = 0.7, than all document W will have the tags of all already tagged documents V for which similarity(W, V) > 0.7.
I hope it helps.
Best of luck :)
Given your description, you are looking for some form of supervised learning. There are many methods in that class, e.g. Naive Bayes classifiers, Support Vector Machines (SVM), k Nearest Neighbours (kNN) and many others.
For a numerical representation of your text, you can chose a bag-of-words or a frequency list (essentially, each text is represented by a vector on a high dimensional vectorspace spanned by all the words).
BTW, it is far easier to tag the texts with one keyword (a classification task) than to assign them up to five ones (the number of possible classes explodes combinatorically)
I have a Naive Bayes classifier (implemented with WEKA) that looks for uppercase letters.
contains_A
contains_B
...
contains_Z
For a certain class the word LCD appears in almost every instance of the training data. When I get the probability for "LCD" to belong to that class it is something like 0.988. win.
When I get the probability for "L" I get a plain 0 and for "LC" I get 0.002. Since features are naive, shouldn't the L, C and D contribute to overall probability independently, and as a result "L" have some probability, "LC" some more and "LCD" even more?
At the same time, the same experiment with an MLP, instead of having the above behavior it gives percentages of 0.006, 0.5 and 0.8
So the MLP does what I would expect a Naive Bayes to do, and vise versa. Am I missing something, can anyone explain these results?
I am not familiar with the internals of WEKA - so please correct me if you think that I am not righth.
When using a text as a "feature" than this text is transformed to a vector of binary values. Each value correponds to one concrete word. The length of the vector is equal to the size of the dictionary.
if your dictionary contains 4 worlds: LCD, VHS, HELLO, WORLD
then for example a text HELLO LCD will be transformed to [1,0,1,0].
I do not know how WEKA builds it's dictionary, but I think it might go over all the words present in the examples. Unless the "L" is present in the dictionary (and therefor is present in the examples) than it's probability is logicaly 0. Actually it should not even be considered as a feature.
Actually you can not reason over the probabilities of the features - and you cannot add them together, I think there is no such a relationship between the features.
Beware that in text mining, words (letters in your case) may be given weights different than their actual counts if you are using any sort of term weighting and normalization, e.g. tf.idf. In the case of tf.idf for example, characters counts are converted into a logarithmic scale, also characters that appear in every single instance may be penalized using idf normalization.
I am not sure what options you are using to convert your data into Weka features, but you can see here that Weka has parameters to be set for such weighting and normalization options
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html
-T
Transform the word frequencies into log(1+fij)
where fij is the frequency of word i in jth document(instance).
-I
Transform each word frequency into:
fij*log(num of Documents/num of documents containing word i)
where fij if frequency of word i in jth document(instance)
I checked the weka documentation and I didn't see support for extracting letters as features. This implies the weka function may need a space or punctuation to delimit each feature from those adjacent. If so, then the search for "L", "C" and "D" would be interpreted as three separate one-letter-words and would explain why they were not found.
If you think this is it, you could try splitting the text into single characters delimited by \n or space, prior to ingestion.
I'm trying to model a phonetic recognizer that has to isolate instances of words (strings of phones) out of a long stream of phones that doesn't have gaps between each word. The stream of phones may have been poorly recognized, with letter substitutions/insertions/deletions, so I will have to do approximate string matching.
However, I want the matching to be phonetically-motivated, e.g. "m" and "n" are phonetically similar, so the substitution cost of "m" for "n" should be small, compared to say, "m" and "k". So, if I'm searching for [mein] "main", it would match the letter sequence [meim] "maim" with, say, cost 0.1, whereas it would match the letter sequence [meik] "make" with, say, cost 0.7. Similarly, there are differing costs for inserting or deleting each letter. I can supply a confusion matrix that, for each letter pair (x,y), gives the cost of substituting x with y, where x and y are any letter or the empty string.
I know that there are tools available that do approximate matching such as agrep, but as far as I can tell, they do not take a confusion matrix as input. That is, the cost of any insertion/substitution/deletion = 1. My question is, are there any open-source tools already available that can do approximate matching with confusion matrices, and if not, what is a good algorithm that I can implement to accomplish this?
EDIT: just to be clear, I'm trying to isolate approximate instances of a word such as [mein] from a longer string, e.g. [aiammeinlimeiking...]. Ideally, the algorithm/tool should report instances such as [mein] with cost 0.0 (exact match), [meik] with cost 0.7 (near match), etc, for all approximate string matches with a cost below a given threshold.
I'm not aware of any phonetic recognizers that use confusion matrices. I know of Soundex, and match rating.
I think that the K-nearest neighbour algorithm might be useful for the type of approximations you are interested in.
Peter Kleiweg's Rug/L04 (for computational dialectology) includes an implementation of Levenshtein distance which allows you to specify non-uniform insertion, deletion, and substitution costs.