Natural language parsers (e.g. the Stanford parser) output the syntactic tree of a sentence, and perhaps also a list of POS-tagged words and a list of dependencies. For example, for the sentence:
Bell, based in Los Angeles, makes and distributes electronic, computer and building products.
The Stanford parser correctly infers that the subject of "makes" is "Bell" and its object is "products", while "electronic", "computer" and "building" are all modifiers of "products"; similarly for "distributes". A natural next step would be to use this information to construct a list of relations like this:
Bell makes electronic products
Bell makes computer products
Bell makes building products
Bell distributes electronic products
... and a couple more. More generally, I want a list of relations of the form:
subject - action - object
For the sake of concreteness, let's assume that the action can only be a single verb, the subject a noun, and the object a noun with optional adjectives prepended. Obviously the transformation from a parsed sentence to a set of such relations will be lossy, but the result lends itself much more easily to further machine processing than a raw syntactic tree does. I know how to extract these relations by hand when I see a parsed sentence; what this question asks is: how do I do this automatically? Is there a known algorithm that does something similar? If not, how should I approach building one?
I'm not asking for a parser recommendation; this is a task on top of parsing. This task seems to me
very useful
not quite trivial
much simpler than parsing itself
and as such I would imagine people have done it many times already. Unfortunately, I wasn't able to find anything even close.
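To make the task concrete, here is the kind of rule-based traversal I can do by hand, sketched in Python with spaCy's dependency parse (any dependency parser would do; the dependency labels and the conjunction handling below are assumptions that vary by parser and model, so treat this as illustrative, not authoritative):

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_relations(text):
    doc = nlp(text)
    relations = []
    for verb in doc:
        if verb.pos_ != "VERB":
            continue
        # subjects attach to the verb as nsubj; a conjoined verb
        # ("makes and distributes") usually inherits its head's subject
        subjects = [t for t in verb.children if t.dep_ == "nsubj"]
        if not subjects and verb.dep_ == "conj":
            subjects = [t for t in verb.head.children if t.dep_ == "nsubj"]
        objects = [t for t in verb.children if t.dep_ == "dobj"]
        if not objects and verb.dep_ == "conj":
            objects = [t for t in verb.head.children if t.dep_ == "dobj"]
        for subj in subjects:
            for obj in objects:
                # fan out coordinated modifiers ("electronic, computer
                # and building") into one relation per modifier
                mods = [t for t in obj.subtree
                        if t.dep_ in ("amod", "compound")
                        or (t.dep_ == "conj"
                            and t.head.dep_ in ("amod", "compound"))]
                for mod in mods:
                    relations.append((subj.text, verb.lemma_,
                                      mod.text + " " + obj.text))
                if not mods:
                    relations.append((subj.text, verb.lemma_, obj.text))
    return relations

for rel in extract_relations("Bell, based in Los Angeles, makes and "
                             "distributes electronic, computer and "
                             "building products."):
    print(rel)

What I can't find is whether there is an established, more principled version of this traversal.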
I am working on an ML project where the data come from social media and the topic of the data is supposed to be depression under Covid-19. However, when I read some of the retrieved data, I noticed that although some texts (around 1-5%) mention covid-related keywords, their context is not actually about the pandemic: they tell a life story (from age 5 to age 27) instead of describing how covid affects the writers' lives.
The data I am looking for are texts about how covid makes depression worse, and the like.
Is there a general way to clean out such irrelevant data, whose context is not covid-related (i.e. outliers)? Or is it OK to keep them in the dataset, since they account for only 1-5%?
I think what you want is topic modeling, or perhaps the TextRank algorithm, or certainly something along those lines. Check out the link below for some ideas of where to go with this.
https://monkeylearn.com/keyword-extraction/
Bag of words simply refers to a matrix in which the rows are documents and the columns are words; the value matching a document with a word can be a count of that word's occurrences within the document, or a tf-idf weight. The bag-of-words matrix is then provided to a machine learning algorithm. Using word counts or tf-idf alone, we are only able to identify key single-word terms in a document, and all word-order information is discarded. Graph ranking algorithms such as TextRank address several of these weaknesses when applied to natural language processing tasks; in particular, TextRank is able to incorporate word-sequence information.
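As a tiny illustration of that matrix (using scikit-learn, one common implementation; the two example documents are made up):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["covid made my anxiety and depression worse",
        "my whole life story from age five to twenty seven"]

counts = CountVectorizer().fit_transform(docs)  # rows: documents, columns: words
tfidf = TfidfVectorizer().fit_transform(docs)   # same shape, tf-idf weights
print(counts.toarray())
print(tfidf.toarray().round(2))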
Also, see the link below.
https://towardsdatascience.com/topic-modeling-quora-questions-with-lda-nmf-aff8dce5e1dd
You can find the accompanying sample data used in the example in that link, directly below.
https://raw.githubusercontent.com/susanli2016/NLP-with-Python/master/data/quora_sample.csv
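Putting it together, here is a minimal sketch of the filtering idea, using scikit-learn's NMF as the topic model (the tiny corpus, the two-topic setting and the chosen topic index are all illustrative; with real data you would use more topics and inspect them before deciding which ones are on-topic):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

posts = [
    "lockdown made my depression so much worse this year",
    "ever since covid started i cannot sleep and feel hopeless",
    "my whole life story from age five to twenty seven",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)

nmf = NMF(n_components=2, random_state=0)
doc_topics = nmf.fit_transform(X)  # shape: (n_posts, n_topics)

# print the top words of each topic, then decide manually which
# topic indices are actually covid/depression related
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(nmf.components_):
    print(k, [terms[i] for i in weights.argsort()[-5:][::-1]])

on_topic = {0}  # hypothetical: the index you judged covid-related
kept = [p for p, dist in zip(posts, doc_topics)
        if dist.argmax() in on_topic]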
Given any two words I'd like to understand whether there's some sort of taxonomy/semantic-field relationship between them. For example, given the words "Dog" and "Cat" I'd like to have a model which can return the words under which both "Dog" and "Cat" fall; some words that this model could return in this case are "Animal", "Mammal", "Pet", et cetera.
Is there an open source pre-trained model that can do this out of the box requiring no training dataset beforehand?
Sounds like WordNet would be a good fit for this task. WordNet is a lexical database that organises words in a hierarchical tree structure, like a taxonomy, and contains additional semantic information for many words (see the browser-based WordNet demo for "cat", for example). A word that's one hierarchy level above another word is a so-called 'hypernym'; the hypernym of 'cat', for instance, is 'feline'. With WordNet in NLTK you can walk up the hypernyms of two words until you reach one they share.
For 'cat' and 'dog' the common hypernym is 'animal'. See example code here:
from nltk.corpus import wordnet as wn  # requires a one-time nltk.download('wordnet')

wn.synsets('cat')
# output: [Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset("cat-o'-nine-tails.n.01"), Synset('caterpillar.n.02'), ...]
wn.synset('cat.n.01').hypernyms()
# output: [Synset('feline.n.01')]
wn.synset('feline.n.01').hypernyms()
# output: [Synset('carnivore.n.01')]
wn.synset('carnivore.n.01').hypernyms()
# output: [Synset('placental.n.01')]
wn.synset('placental.n.01').hypernyms()
# output: [Synset('mammal.n.01')]
wn.synset('mammal.n.01').hypernyms()
# output: [Synset('vertebrate.n.01')]
wn.synset('vertebrate.n.01').hypernyms()
# output: [Synset('chordate.n.01')]
wn.synset('chordate.n.01').hypernyms()
# output: [Synset('animal.n.01')]

wn.synsets('dog')
# output: [Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('pawl.n.01'), Synset('chase.v.01')]
wn.synset('dog.n.01').hypernyms()
# output: [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
wn.synset('domestic_animal.n.01').hypernyms()
# output: [Synset('animal.n.01')]
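NLTK can also find the nearest shared ancestor directly, which saves the manual walk. Note that because a synset can have more than one hypernym (a dog is both a canine and a domestic animal), this route goes via 'canine'/'feline' and stops at 'carnivore' rather than 'animal':

wn.synset('cat.n.01').lowest_common_hypernyms(wn.synset('dog.n.01'))
# output: [Synset('carnivore.n.01')]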
You ask for a machine learning model in your question. A classical approach would be word vectors via Gensim, but these will not give you a clear common category backed by an expert-built database (like WordNet); they just give you words that often occur next to your target words ("cat", "dog") in the training data. I think machine learning is not necessarily the best tool here.
See example:
import gensim.downloader as api

model_glove = api.load("glove-wiki-gigaword-100")
model_glove.most_similar(positive=["dog", "cat"], negative=None, topn=10)
# output: [('dogs', 0.7998143434524536),
#          ('pet', 0.7550237774848938),
#          ('puppy', 0.7239114046096802),
#          ('rabbit', 0.7165164351463318),
#          ('cats', 0.7114559412002563),
#          ('monkey', 0.6967265605926514),
#          ('horse', 0.6890867948532104),
#          ('animal', 0.6713783740997314),
#          ('mouse', 0.6644925475120544),
#          ('boy', 0.6607726812362671)]
I'm developing an iOS app that uses speech recognition. Since the state of the art does not provide good accuracy on recognizing single letters (spoken at random, not as part of spelling a word), I was thinking of using a set of words, one per alphabet letter, and recognizing those words instead (which gives hugely improved accuracy).
In Italy, for instance, a set of city names is widely used for spelling:
A - Ancona
B - Bari
C - Como
... and so on
My question is: what set of words would an average person in the USA use? Is it, for instance, the NATO alphabet? Or is there another set, or several (I could always work with a mix)? The only thing I cannot do is work with the complete English corpus ;)
Thanks in advance.
As a pilot I would recommend the standard phonetic alphabet:
A - Alpha
B - Bravo
C - Charlie
etc.
So yes, the NATO Phonetic Alphabet.
Keep in mind, though, that the "average" person in the USA doesn't know this alphabet, although most would understand what you meant if you used it. On the occasions I've run into a non-pilot trying to clarify a letter, they just made up a word that starts with the letter. There is no "standard" in the USA that non-pilots know.
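If it helps, the decoding side is then trivial; a minimal sketch in Python, assuming your recognizer's vocabulary is constrained to the NATO words (the function name is made up, and spellings vary slightly, e.g. Alfa/Alpha):

# map recognized NATO-alphabet words back to letters
NATO = {
    "alpha": "A", "bravo": "B", "charlie": "C", "delta": "D",
    "echo": "E", "foxtrot": "F", "golf": "G", "hotel": "H",
    "india": "I", "juliett": "J", "kilo": "K", "lima": "L",
    "mike": "M", "november": "N", "oscar": "O", "papa": "P",
    "quebec": "Q", "romeo": "R", "sierra": "S", "tango": "T",
    "uniform": "U", "victor": "V", "whiskey": "W", "xray": "X",
    "yankee": "Y", "zulu": "Z",
}

def spell(recognized_words):
    # unknown words map to '?' rather than raising
    return "".join(NATO.get(w.lower(), "?") for w in recognized_words)

print(spell(["Bravo", "Echo", "Lima", "Lima"]))  # prints: BELL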
I'm browsing the web searching for an English language grammar, but I have found only a few simple examples like:
s -> np vp
np -> det n
vp -> v | v np
det -> 'a' | 'the'
n -> 'woman' | 'man'
v -> 'shoots'
Maybe I don't realise how big this problem is, because I thought that English grammar had been formalised. Can somebody point me to a source for an expanded formal English grammar?
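For reference, toy rules like the ones above are directly runnable; here is my transcription into NLTK's CFG format, just to show the formalism I mean:

import nltk

toy_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V | V NP
    Det -> 'a' | 'the'
    N -> 'woman' | 'man'
    V -> 'shoots'
""")

parser = nltk.ChartParser(toy_grammar)
for tree in parser.parse("the woman shoots a man".split()):
    print(tree)
# output: (S (NP (Det the) (N woman)) (VP (V shoots) (NP (Det a) (N man))))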
Have a look at the English Resource Grammar, which you can use with the LKB or PET.
Not perfect, surely not complete, not ideal from the theoretical point of view. But the nicest one I found: http://www.scientificpsychic.com/grammar/enggram1.html
It would be huge. Probably not possible.
Human languages are interpreted by "analog" creatures in (generally) a very forgiving way, not by dumb digital machines that can insist that rules be followed. They do tend to have some kind of underlying structure, but exceptions abound. Really the only "rule" is to make it possible for others to understand you.
Even among biological languages, English would be about the worst possible choice, because of its history. It probably started as a pidgin of various Germanic languages (with attendant simplifications), then had a large amount of French overlaid onto it after the Norman Conquest, then had bits and pieces of nearly every language in the world grafted onto it.
To give you some idea of the scale we are talking about, let's assume we can consider dictionaries to be your list of terminals for a human language. The only major work that makes a passable stab at being comprehensive for English is the Oxford English Dictionary, which contains more than half a million entries. However, very few people probably know more than 1/10th of them. That means that if you picked out random words from the OED and built sentences out of them, most English speakers would have trouble even recognizing the result as English.
Different groups of speakers tend to know different sets of words, too. So just about every user of the language learns to tailor their vocabulary (list of used terminals) to their audience. I speak very differently to my buddies from the "wrong side of the tracks" than I do with my family, and differently still here on SO.
Look at Attempto Controlled English: http://attempto.ifi.uzh.ch/site/
You may want to examine the work of Noam Chomsky and the people who followed him. I believe much of his work was on the generative properties of language. See the Generative Grammar article on Wikipedia for more details.
The standard non-electronic resource is The Cambridge Grammar of the English Language.
I am trying to design an EDIFACT parser. I was planning on having one class to read the file, one class to map the data, and one class to deal with data storage. The part where I am having a major problem is deciding how instances of those classes should communicate with each other. Any advice would be appreciated.
I don't see a need for the classes to communicate (pass messages), but I would suggest that something like the Strategy pattern be used.
You'll have a class to read the file and make sense of its syntax: for example, something which can handle whitespace and return formatted information like 'token', 'word', etc.
The class which reads and parses syntax is passed into the semantic parser. The semantic parser makes sense of the meaning: for example, you might expect "Id, command, token, string" in that order. The semantic parser might use a Command pattern.
The semantic parser outputs structured data, so it is passed into your structure builder (Builder pattern).
So your code might look like:
MyDataStructure = DataBuilder(SemanticParser(SyntaxParser(FileReader(filename))));
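Since the question names no language, here is a minimal Python sketch of that pipeline (the class names, segment terminator and separator are illustrative; real EDIFACT also has release characters and component separators that this ignores). Each stage depends only on the previous stage's output, which is why the classes never need to message each other:

class FileReader:
    """Yields raw lines from the interchange file."""
    def __init__(self, filename):
        self.filename = filename

    def lines(self):
        with open(self.filename, encoding="utf-8") as f:
            yield from f

class SyntaxParser:
    """Splits raw text into segments and data elements."""
    def __init__(self, reader, terminator="'", separator="+"):
        self.reader = reader
        self.terminator = terminator
        self.separator = separator

    def segments(self):
        text = "".join(self.reader.lines())
        for raw in text.split(self.terminator):
            raw = raw.strip()
            if raw:
                yield raw.split(self.separator)

class SemanticParser:
    """Interprets segments, e.g. ['NAD', 'BY', ...], as records."""
    def __init__(self, syntax):
        self.syntax = syntax

    def records(self):
        for tag, *elements in self.syntax.segments():
            yield {"tag": tag, "elements": elements}

class DataBuilder:
    """Collects records into the final data structure."""
    def __init__(self, semantic):
        self.semantic = semantic

    def build(self):
        return list(self.semantic.records())

# usage:
# data = DataBuilder(SemanticParser(SyntaxParser(FileReader("order.edi")))).build()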
HTH