Given a sentence like:
Complimentary gym access for two for the length of stay ($12 value per person per day)
What general approach can I take to identify the word gym or gym access?
Is this a POS tagger for nouns?
One of the most widely used techniques for extracting keywords from text is TF-IDF of the terms. A higher TF-IDF score indicates that a word is both important to the document, as well as relatively uncommon across the document corpus. This is often interpreted to mean that the word is significant to the document.
One other method is using lexical chains. I refer you to this paper for full description.
There are many other approaches out there that you can explore to use depending on your domain. A short survey can be found here.
Noun POS tags are not sufficient. For your example, "length of stay" is also a noun phrase, but might not be a key phrase.
Related
I am new to Data Science. This could be a dumb question, but just want to know opinions and confirm if I could enhance it well.
I have a question getting the most common/frequent 5 sentences from the database. I know I could gather all the data (sentences) into a list and using the Counter library - I could fetch the most occurring 5 sentences, but I am interested to know if any algorithm (ML/DL/NLP) is present for such a requirement. All the sentences are given by the user. I need to know his top 5 (most occurring/frequent) sentences (not phrases please)!!
Examples of sentences -
"Welcome to the world of Geeks"
"This portal has been created to provide well written subject"
"If you like Geeks for Geeks and would like to contribute"
"to contribute at geeksforgeeks org See your article appearing on "
"to contribute at geeksforgeeks org See your article appearing on " (occurring for the second time)
"the Geeks for Geeks main page and help thousands of other Geeks."
Note: All my sentences in my database are distinct (contextual wise and no duplicates too). This is just an example for my requirement.
Thanks in Advance.
I'd suggest you to start with sentence embeddings. Briefly, it returns a vector for a given sentence and it roughly represents the meaning of the sentence.
Let's say you have n sentences in your database and you found the sentence embeddings for each sentence so now you have n vectors.
Once you have the vectors, you can use dimensionality reduction techniques such as t-sne to visualize your sentences in 2 or 3 dimensions. In this visualization, sentences that have similar meanings should ideally be close to each other. That may help you pinpoint the most-frequent sentences that are also close in meaning.
I think one problem is that it's still hard to draw boundaries to the meanings of sentences since meaning is intrinsically subjective. You may have to add some heuristics to the process I described above.
Adding to MGoksu's answer, Once you get sentence embeddings, you can apply LSH(Locality Sensitive Hashing) to group the embeddings into clusters.
Once you get the clusters of embeddings. It would be a trivial to get the clusters with highest number of vectors.
I have 4 different categories and I also have around 3000 words which belong to each of these categories. Now if a new sentence comes, I am able to break the sentence into words and get more words related to it. So say for each new sentence I can get 20-30 words generated from the sentence.
Now what is the best way to classify this sentence in above mentioned category? I know bag of words works well.
I also looked at LDA, but it works with documents, where as I have a list of words as a training corpus. In LDA it looks at the position of word in document. So I could not get meaningful results from LDA.
I'm not sure if I fully understand what your question is exactly.
Bag of words works well for some purposes, but in a lot of cases it throws away a lot of potentially useful information (which could be taken from word order, for example).
And assuming that you get a grammatical sentence as input, why not use your sentence as document and still use LDA? The position of a word in your sentence can still be verymeaningful.
There are plenty of classification methods available. Which one is best depens largely on your purpose. If you're neew to this area, this may be interesting to have a look at: https://www.coursera.org/course/ml
Like, Igor, I am also a bit confused regarding your problem. Be it a document or a sentence, the terms will be part of the feature set for categorization, in some form. You can find out the most relevant terms of each category and using this knowledge, do a better classification of the new sentences. For example, if your sentence is as follows-" There is a stray dog near our layout which bites everyone who goes near to it". If you take the useful keywords from this sentence, removing stopwords, they are a few in number ( stray, dog, layout, bites, near ). You can categorize it into a bucket, "animals_issue". If you train your system with a larger set of example, this bag of words model can help. Otherwise, you can go for LDA/ other topic modelling approaches.
Usually one wants to get a feature from a text by using the bag of words approach, counting the words and calculate different measures, for example tf-idf values, like this: How to include words as numerical feature in classification
But my problem is different, I want to extract a feature vector from a single word. I want to know for example that potatoes and french fries are close to each other in the vector space, since they are both made of potatoes. I want to know that milk and cream also are close, hot and warm, stone and hard and so on.
What is this problem called? Can I learn the similarities and features of words by just looking at a large number documents?
I will not make the implementation in English, so I can't use databases.
hmm,feature extraction (e.g. tf-idf) on text data are based on statistics. On the other hand, you are looking for sense (semantics). Therefore no such a method like tf-idef will work for you.
In NLP exists 3 basic levels:
morphological analyses
syntactic analyses
semantic analyses
(higher number represents bigger problems :)). Morphology is known for majority languages. Syntactic analyses is a bigger problem (it deals with things like what is verb, noun in some sentence,...). Semantic analyses has the most challenges, since it deals with meaning which is quite difficult to represent in machines, have many exceptions and are language-specific.
As far as I understand you want to know some relationships between words, this can be done via so-called dependency tree banks, (or just treebank): http://en.wikipedia.org/wiki/Treebank . It is a database/graph of sentences where a word can be considered as a node and relationship as arc. There is good treebank for czech language and for english there will be also some, but for many 'less-covered' languages it can be a problem to find one ...
user1506145,
Here is a simple idea that I have used in the past. Collect a large number of short documents like Wikipedia articles. Do a word count on each document. For the ith document and the jth word let
I = the number of documents,
J = the number of words,
x_ij = the number of times the jth word appears in the ith document, and
y_ij = ln( 1+ x_ij).
Let [U, D, V] = svd(Y) be the singular value decomposition of Y. So Y = U*D*transpose(V)), U is IxI, D is diagonal IxJ, and V is JxJ.
You can use (V_1j, V_2j, V_3j, V_4j) as a feature vector in R^4 for the jth word.
I am surprised the previous answers haven't mentioned word embedding. Word embedding algorithm can produce word vectors for each word a given dataset. These algorithms can nfer word vectors from the context. For instance, by looking at the context of the following sentences we can say that "clever" and "smart" is somehow related. Because the context is almost the same.
He is a clever guy
He is a smart guy
A co-occurrence matrix can be constructed to do this. However, it is too inefficient. A famous technique designed for this purpose is called Word2Vec. It can be studied from the following papers.
https://arxiv.org/pdf/1411.2738.pdf
https://arxiv.org/pdf/1402.3722.pdf
I have been using it for Swedish. It is quite effective in detecting similar words and completely unsupervised.
A package could be find in gensim and tensorflow.
I want to classify sentences with Weka. My features are sentence terms (words) and a Part of Speech tag of each terms. I don't know how figure attributes, because if each term is presented as one feature, number of feature for each instance (sentence) has become different. And, if all words in sentence is presented as one feature, how relate words and their POS tag.
Any ideas how I should proceed?
If I understand the question correctly, the answer is as follows: It is most common to treat words independently of their position in the sentence and represent a sentence in the feature space by the number of times each of the known words occurs in that sentence. I.e. there is usually a separate numerical feature for each word present in the training data. Or, if you're willing to use n-grams, a separate feature for every n-gram in the training data (possibly with some frequency threshold).
As for the POS tags, it might make sense to use them as separate features, but only if the classification you're interested in has to do with sentence structure (syntax). Otherwise you might want to just append the POS tag to the word, which would partly disambiguate those words that can represent different parts of speech.
Binarization is the act of transforming colorful features of of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.
If we where to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply replace the word by it's ID giving the vector <3,1,2,3,4>.
Given these IDs we could also create a binary vector by giving each word four possible slots, and setting the slot corresponding to a specific word with to one, giving the vector <0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1>. The latter method is, as far as I know, is commonly referred to as the bag-of-words-method.
Now for my question, what is the best binarization method when it comes to describe features for natural language processing in general, and transition-based dependency parsing (with Nivres algorithm) in particular?
In this context, we do not want to encode the whole sentence, but rather the current state of the parse, for example the top word on the stack en the first word in the input queue. Since order is highly relevant, this rules out the bag-of-words-method.
With best, I am referring to the method that makes the data the most intelligible for the classifier, without using up unnecessary memory. For example I don't want a word bigram to use 400 million features for 20000 unique words, if only 2% the bigrams actually exist.
Since the answer is also depending on the particular classifier, I am mostly interested in maximum entropy models (liblinear), support vector machines (libsvm) and perceptrons, but answers that apply to other models are also welcome.
This is actually a really complex question. The first decision you have to make is whether to lemmatize your input tokens (your words). If you do this, you dramatically decrease your type count, and your syntax parsing gets a lot less complicated. However, it takes a lot of work to lemmatize a token. Now, in a computer language, this task gets greatly reduced, as most languages separate keywords or variable names with a well defined set of symbols, like whitespace or a period or whatnot.
The second crucial decision is what you're going to do with the data post-facto. The "bag-of-words" method, in the binary form you've presented, ignores word order, which is completely fine if you're doing summarization of a text or maybe a Google-style search where you don't care where the words appear, as long as they appear. If, on the other hand, you're building something like a compiler or parser, order is very much important. You can use the token-vector approach (as in your second paragraph), or you can extend the bag-of-words approach such that each non-zero entry in the bag-of-words vector contains the linear index position of the token in the phrase.
Finally, if you're going to be building parse trees, there are obvious reasons why you'd want to go with the token-vector approach, as it's a big hassle to maintain sub-phrase ids for every word in the bag-of-words vector, but very easy to make "sub-vectors" in a token-vector. In fact, Eric Brill used a token-id sequence for his part-of-speech tagger, which is really neat.
Do you mind if I ask what specific task you're working on?
Binarization is the act of
transforming colorful features of
an entity into vectors of numbers,
most often binary vectors, to make
good examples for classifier
algorithms.
I have mostly come across numeric features that take values between 0 and 1 (not binary as you describe), representing the relevance of the particular feature in the vector (between 0% and 100%, where 1 represents 100%). A common example for this are tf-idf vectors: in the vector representing a document (or sentence), you have a value for each term in the entire vocabulary that indicates the relevance of that term for the represented document.
As Mike already said in his reply, this is a complex problem in a wide field. In addition to his pointers, you might find it useful to look into some information retrieval techniques like the vector space model, vector space classification and latent semantic indexing as starting points. Also, the field of word sense disambiguation deals a lot with feature representation issues in NLP.
[Not a direct answer] It all depends on what you are try to parse and then process, but for general short human phrase processing (e.g. IVT) another method is to use neural networks to learn the patterns. This can be very acurate for smallish vocubularies